A sophisticated multi-agent reasoning system designed for challenging mathematical problems, inspired by systems like Gemini 2.5 Pro Deep Think and the successful IMO25 solver.
MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through:
- Multi-agent exploration with diverse reasoning approaches (3 agents by default, configurable)
- Rigorous verification using a 2-pass consensus threshold (configurable)
- Iterative improvement based on verification feedback
- OpenRouter reasoning API for deep mathematical thinking
- RSA-inspired aggregation for solution refinement
- Strategy network for cross-agent insight sharing
- Shared workspace for agent collaboration
- 3 parallel agents by default (configurable: 2 for lightweight, 3+ for advanced)
- Temperature diversity (0.3, 0.6, 1.0) ensures varied exploration strategies
- Independent reasoning followed by collaborative verification
- Effort-based reasoning: "low", "medium", "high" effort levels via OpenRouter API
- Adaptive allocation: Low effort (temp ≤ 0.4), Medium (0.4 < temp ≤ 0.8), High (temp > 0.8)
- Configurable token budgets: 4K for lightweight coding, 64K for complex reasoning
- 2-pass threshold by default (configurable: 1 for lightweight, 2+ for advanced)
- Cross-agent verification: Agents verify each other's solutions
- Mathematical rigor: Focus on complete proofs, not just correct answers
- Consensus building: Multiple verified solutions required
- Feedback-driven: Solutions improved based on verification feedback
- Error correction: Automatic identification and fixing of mathematical errors
- Logical gap filling: Strengthening incomplete reasoning steps
```
optillm/mars/
├── __init__.py           # Package exports
├── mars.py               # Main orchestration with parallel execution
├── agent.py              # Individual agent implementation
├── workspace.py          # Shared collaboration workspace
├── verifier.py           # Multi-pass verification system
├── aggregator.py         # RSA-inspired solution aggregation
├── strategy_network.py   # Cross-agent insight sharing
├── answer_extraction.py  # Clean answer extraction with thinking tags
├── prompts.py            # Mathematical reasoning prompts
└── README.md             # This documentation
```
```python
DEFAULT_CONFIG = {
    'num_agents': 3,                      # Number of parallel agents
    'max_iterations': 5,                  # Maximum improvement iterations
    'verification_passes_required': 2,    # Consecutive passes needed
    'consensus_threshold': 2,             # Verified solutions for consensus
    'min_verified_solutions': 1,          # Minimum to proceed
    'max_tokens': 64000,                  # Token budget for complex reasoning
    'max_verification_attempts': 3,       # Max verification retries
    'early_termination': True,            # Stop on consensus
    'use_reasoning_api': True,            # Enable OpenRouter reasoning

    # RSA-inspired aggregation
    'enable_aggregation': True,           # Enable solution aggregation
    'population_size': 6,                 # Population for diversity
    'aggregation_size': 3,                # Solutions per aggregation
    'aggregation_loops': 3,               # Aggregation iterations

    # Strategy Network
    'enable_strategy_network': True,      # Cross-agent insight sharing
    'strategy_extraction_enabled': True,  # Extract reasoning strategies
    'cross_agent_enhancement': True,      # Enhanced solutions via peer strategies

    # Thinking tags and answer extraction
    'use_thinking_tags': True,            # Wrap reasoning in <think> tags
    'answer_extraction_mode': 'auto',     # 'auto', 'code', 'math', or 'none'
}
```

```python
LIGHTWEIGHT_CONFIG = {
    'num_agents': 2,                      # Reduced agent count
    'max_iterations': 2,                  # Faster iteration limit
    'verification_passes_required': 1,    # Single-pass verification
    'consensus_threshold': 1,             # Lower threshold for 2 agents
    'min_verified_solutions': 1,
    'max_tokens': 4000,                   # Smaller token budget
    'max_verification_attempts': 2,
    'early_termination': True,
    'use_reasoning_api': True,

    # Disable expensive features for speed
    'enable_aggregation': False,          # Skip RSA aggregation
    'enable_strategy_network': False,     # Skip strategy network
    'strategy_extraction_enabled': False,
    'cross_agent_enhancement': False,

    # Thinking tags still enabled
    'use_thinking_tags': True,
    'answer_extraction_mode': 'auto',
}
```

Note: MARS automatically uses the lightweight config when `max_tokens` ≤ 4000 in the request.
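The automatic fallback can be sketched as follows. This is illustrative: `select_config` is a hypothetical helper name, and the two dicts are abbreviated stand-ins for the full configs above.

```python
# Abbreviated stand-ins for the full configs documented above.
DEFAULT_CONFIG = {'num_agents': 3, 'max_tokens': 64000}
LIGHTWEIGHT_CONFIG = {'num_agents': 2, 'max_tokens': 4000}

def select_config(request_max_tokens):
    """Fall back to the lightweight profile for small token budgets."""
    if request_max_tokens is not None and request_max_tokens <= 4000:
        return dict(LIGHTWEIGHT_CONFIG)
    return dict(DEFAULT_CONFIG)

print(select_config(4000)['num_agents'])   # → 2
print(select_config(64000)['num_agents'])  # → 3
```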
```bash
# Start OptiLLM with MARS support
python optillm.py --model google/gemini-2.5-flash-lite --approach mars

# Make API call
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mars-google/gemini-2.5-flash-lite",
    "messages": [
      {
        "role": "user",
        "content": "Solve this IMO problem: Find all positive integers n such that..."
      }
    ]
  }'
```

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="anything")
response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    messages=[
        {"role": "user", "content": "Mathematical problem here"}
    ],
    extra_body={"optillm_approach": "mars"}
)
```

```python
response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    messages=[
        {"role": "system", "content": "<optillm_approach>mars</optillm_approach>"},
        {"role": "user", "content": "Mathematical problem here"}
    ]
)
```

- Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0)
- Each agent independently analyzes the problem
- Generate initial solutions using OpenRouter reasoning API with effort levels
- All agent API calls executed in parallel via ThreadPoolExecutor
- Solutions stored in shared workspace
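The fan-out above can be sketched with `ThreadPoolExecutor`. This is a minimal illustration, not the actual orchestration code: `generate_solution` is a placeholder for each agent's OpenRouter reasoning call.

```python
from concurrent.futures import ThreadPoolExecutor

TEMPERATURES = [0.3, 0.6, 1.0]  # diverse exploration temperatures

def generate_solution(agent_id, temperature, problem):
    # Placeholder for the real per-agent reasoning API call.
    return {"agent": agent_id, "temperature": temperature,
            "solution": f"draft solution for {problem}"}

def run_agents(problem, num_agents=3):
    """Run all agents' initial attempts in parallel, preserving agent order."""
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        futures = [
            pool.submit(generate_solution, i,
                        TEMPERATURES[i % len(TEMPERATURES)], problem)
            for i in range(num_agents)
        ]
        return [f.result() for f in futures]
```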
- Maintain population of N=6 solutions for diversity
- Select K=3 solutions for aggregation
- Run T=3 aggregation loops to refine solutions
- Parallel execution of aggregation API calls
- Enhanced solutions added back to workspace
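A hedged sketch of the RSA-inspired loop (N=6 population, K=3 selected, T=3 loops). `aggregate` stands in for the LLM call that merges K candidates into one refined solution, and the "keep most recent N" trimming policy is an assumption for illustration, not necessarily what `aggregator.py` does.

```python
import random

def aggregate(candidates):
    # Placeholder for the LLM call that merges candidate solutions.
    return "+".join(candidates)

def rsa_refine(population, k=3, loops=3, population_size=6):
    """Iteratively sample K solutions, aggregate them, and fold the result back."""
    population = list(population)
    for _ in range(loops):
        selected = random.sample(population, min(k, len(population)))
        population.append(aggregate(selected))
        # Assumed trimming policy: keep the most recent N for diversity.
        population = population[-population_size:]
    return population
```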
- Extract reasoning strategies from agent solutions
- Identify successful patterns and techniques
- Share strategies across agents
- Generate enhanced solutions using peer insights
- Parallel execution of strategy extraction and enhancement
- Cross-agent verification of all solutions
- Each solution requires 2 consecutive "CORRECT" assessments (configurable)
- Verification feedback captured for improvement
- Solutions marked as verified/unverified
- Parallel execution of verification calls
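The consecutive-pass rule can be expressed as a small check: a solution is verified only after `passes_required` "CORRECT" assessments in a row, and any other verdict resets the streak. A minimal sketch, assuming string verdicts; the real verifier's data types may differ.

```python
def is_verified(assessments, passes_required=2):
    """Return True once the assessments contain the required consecutive passes."""
    streak = 0
    for verdict in assessments:
        streak = streak + 1 if verdict == "CORRECT" else 0  # reset on any failure
        if streak >= passes_required:
            return True
    return False
```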
- Unverified solutions improved based on feedback
- Agents address specific issues identified in verification
- Re-verification of improved solutions
- Process continues until consensus or max iterations (5 default)
- Parallel execution of improvement and verification
- Numerical voting: If 2+ agents agree on same numerical answer, use that solution
- Best verified solution: Otherwise, select highest-scoring verified solution
- Synthesis: If no verified solution, synthesize from top 3 solutions
- Answer extraction: Apply thinking tags and extract clean answer (if enabled)
- Complete solution with mathematical rigor
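The selection order above can be sketched as a cascade: numerical voting first, then the best verified solution, then synthesis as a fallback. `select_final` and the solution-dict fields are illustrative assumptions, not the actual API of `mars.py`.

```python
from collections import Counter

def select_final(solutions):
    """solutions: list of dicts with 'answer', 'verified', and 'score' keys."""
    # 1) Numerical voting: 2+ agents agreeing on the same answer wins.
    votes = Counter(s["answer"] for s in solutions if s["answer"] is not None)
    if votes:
        answer, count = votes.most_common(1)[0]
        if count >= 2:
            return next(s for s in solutions if s["answer"] == answer)
    # 2) Otherwise, the highest-scoring verified solution.
    verified = [s for s in solutions if s["verified"]]
    if verified:
        return max(verified, key=lambda s: s["score"])
    # 3) No verified solution: caller synthesizes from the top 3 instead.
    return None
```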
MARS is designed to excel on challenging mathematical benchmarks:
- IMO (International Mathematical Olympiad): Complex proof-based problems
- AIME (American Invitational Mathematics Examination): Numerical competition problems
- LiveCodeBench: Competitive programming challenges
- Mathematical reasoning tasks: General problem-solving capabilities
- Accuracy: Percentage of correctly solved problems
- Verification Rate: Percentage of solutions passing the consecutive-pass verification threshold (2 passes by default)
- Reasoning Efficiency: Tokens used per correct solution
- Consensus Quality: Agreement between verified solutions
Evaluation results using google/gemini-2.5-flash-lite-preview-09-2025 via OpenRouter:
| Benchmark | Approach | Problems | Correct | Accuracy | Notes |
|---|---|---|---|---|---|
| AIME 2025 | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
| AIME 2025 | MARS | 30 | 22 | 73.3% | +9 problems (+30pp) |
| IMO 2025 | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct |
| IMO 2025 | MARS (lite) | 6 | 2 | 33.3% | +1 problem (+16.6pp) |
| LiveCodeBench v5/v6 | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
| LiveCodeBench v5/v6 | MARS + Thinking | 105 | 53 | 50.48% | +12 problems (+29.3%) |
- Results: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
- Improvement: +9 problems (+69.2% relative improvement), +30.0 percentage points
- Key Success Factor: Multi-agent collaboration with verification effectively solves numerical competition problems
- Approach: 3 agents with diverse temperatures, iterative verification and refinement
- Results: 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%)
- Improvement: +1 problem (+100% relative improvement), +16.6 percentage points
- Problems Solved: Problem 2 (geometry proof) + Problem 4 (number theory)
- Runtime: ~10 minutes per problem (vs ~40 seconds baseline)
- Key Success Factor: Multi-agent exploration with disabled thinking tags allows full proof visibility
- Configuration: `use_thinking_tags=False`, `answer_extraction_mode="none"` for proof problems
- Results: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
- Improvement: +12 problems (+29.3% relative improvement), +11.43 percentage points
- Code Extraction: 87/105 (82.9%) vs baseline 54/105 (51.4%) - +61.1% improvement
- Key Success Factor: Thinking tags are beneficial for code generation: they let agents reason through the logic before writing code
- Multi-agent benefit: Different temperature agents explore varied solution approaches
- MARS excels at numerical competition problems: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
- MARS improves proof-based problems: +100% relative improvement on IMO 2025 (16.7% → 33.3%)
- Thinking tags are problem-type dependent:
- ✅ Enable for code generation: +29.3% improvement on LiveCodeBench
- ✅ Enable for numerical problems: Multi-agent reasoning effective on AIME
- ❌ Disable for proof problems: IMO proofs need full visibility to evaluators
- Multi-agent diversity provides significant value - different temperature agents explore complementary approaches
- Code extraction rate is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
- ✅ AIME 2025: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) +30pp improvement
- ✅ IMO 2025: Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) +16.6pp improvement
- ✅ LiveCodeBench v5/v6: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) +11.43pp improvement
All evaluations use gemini-2.5-flash-lite-preview-09-2025 model via OpenRouter.
For proof-based problems like IMO, disable thinking tags to ensure full proof visibility:
```python
extra_body = {
    "optillm_approach": "mars",
    "mars_config": {
        "use_thinking_tags": False,        # Full proof visibility
        "answer_extraction_mode": "none"   # Proofs are the answer
    }
}
```

All evaluations use the pass@1 accuracy metric.
- Agent 0: Temperature 0.3 (Conservative, rigorous, low effort)
- Agent 1: Temperature 0.6 (Balanced approach, medium effort)
- Agent 2: Temperature 1.0 (Maximum exploration, high effort)
Note: Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6)
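The cycling rule is a simple modulo over the base schedule; `agent_temperature` is a hypothetical helper name used here only to illustrate it.

```python
BASE_TEMPS = [0.3, 0.6, 1.0]  # base temperature schedule

def agent_temperature(agent_index):
    """Cycle through the base schedule for any number of agents."""
    return BASE_TEMPS[agent_index % len(BASE_TEMPS)]

print([agent_temperature(i) for i in range(5)])  # → [0.3, 0.6, 1.0, 0.3, 0.6]
```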
- Low effort (temp ≤ 0.4): `{"reasoning": {"effort": "low"}}` - Conservative reasoning
- Medium effort (0.4 < temp ≤ 0.8): `{"reasoning": {"effort": "medium"}}` - Balanced reasoning
- High effort (temp > 0.8): `{"reasoning": {"effort": "high"}}` - Maximum reasoning depth
Note: OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities.
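The temperature-to-effort mapping above can be sketched as a small helper that builds the reasoning payload; `reasoning_payload` is an illustrative name, not the actual function in the codebase.

```python
def reasoning_payload(temperature):
    """Map an agent's temperature to an OpenRouter reasoning-effort payload."""
    if temperature <= 0.4:
        effort = "low"       # conservative reasoning
    elif temperature <= 0.8:
        effort = "medium"    # balanced reasoning
    else:
        effort = "high"      # maximum reasoning depth
    return {"reasoning": {"effort": effort}}

print(reasoning_payload(0.3))  # → {'reasoning': {'effort': 'low'}}
```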
Solutions are verified based on:
- Mathematical correctness: Accurate calculations and logic
- Completeness: All problem aspects addressed
- Rigor: Proper justification for each step
- Clarity: Clear mathematical communication
- Format compliance: Proper answer formatting
- IMO25 Solver: 5/6 problems solved with 5-consecutive-pass verification
- Gemini 2.5 Pro Deep Think: Native reasoning tokens and thinking budgets
- OpenRouter Reasoning API: Standardized interface for deep thinking
- CEPO Architecture: Multi-file approach pattern in OptiLLM
- Multi-model support: Different models for different agent roles
- Dynamic temperature adjustment: Adaptive exploration strategies
- Specialized agent roles: Proof-focused, computation-focused, verification-focused
- Knowledge base integration: Access to mathematical theorems and techniques
- Interactive verification: Human-in-the-loop verification for critical problems