Genetic Eval-driven Prompt Algorithm
Automatically evolve and optimize your CLAUDE.md system prompts using AI-powered evolutionary algorithms.
# Start watcher - auto-optimizes when CLAUDE.md changes
node scripts/gepa/hooks/auto-optimize.js watch ./CLAUDE.md
The output shows a before/after comparison:
╔════════════════════════════════════════════════════════════╗
║ BEFORE / AFTER COMPARISON ║
╠════════════════════════════════════════════════════════════╣
║ Metric Before After Change ║
╠════════════════════════════════════════════════════════════╣
║ Lines 125 142 +17 (+14%) ║
║ Est. Tokens 873 920 +47 (+5%) ║
║ MUST rules 1 3 +2 (+200%) ║
║ NEVER rules 3 5 +2 (+67%) ║
╚════════════════════════════════════════════════════════════╝
Section Changes:
Added:
+ Error Handling
+ Performance Guidelines
Summary:
Token budget: +47 tokens
Rule density: +4 explicit rules
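The comparison table can be reproduced with a few lines of Node. This is a minimal sketch, not GEPA's confirmed implementation: `promptMetrics` and `formatChange` are hypothetical names, and the chars/4 token estimate is an assumption. The "Rule density" line in the summary corresponds to the MUST and NEVER deltas added together.

```javascript
// Hypothetical helper: summarize a prompt's text with the metrics from the table.
// The chars/4 token estimate is an assumption, not GEPA's confirmed formula.
function promptMetrics(text) {
  const lines = text.split("\n").length;
  const estTokens = Math.ceil(text.length / 4);
  // Count whole-word, case-sensitive occurrences of a rule keyword.
  const count = (word) => (text.match(new RegExp(`\\b${word}\\b`, "g")) || []).length;
  return { lines, estTokens, must: count("MUST"), never: count("NEVER") };
}

// Format a before/after pair as "+N (+P%)", matching the Change column.
function formatChange(before, after) {
  const delta = after - before;
  const sign = delta >= 0 ? "+" : "";
  if (before === 0) return `${sign}${delta}`; // percentage undefined for a zero baseline
  const pct = Math.round((delta / before) * 100);
  return `${sign}${delta} (${sign}${pct}%)`;
}
```

Calling `formatChange(125, 142)` yields `+17 (+14%)`, the value shown in the Lines row.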
┌─────────────────────────────────────────────────────────────────┐
│ GEPA Loop │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Seed │───►│ Mutate │───►│ Eval │───►│ Select │ │
│ │ Prompt │ │ (AI gen) │ │ (Claude) │ │ (best) │ │
│ └──────────┘ └──────────┘ └──────────┘ └────┬─────┘ │
│ ▲ │ │
│ │ ┌──────────┐ │ │
│ └──────────────┤ Reflect │◄────────────────────┘ │
│ │(insights)│ │
│ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
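In code, the loop in the diagram looks roughly like the sketch below. The function name `runLoop` and the shapes of `mutate`, `evaluate`, and `reflect` are assumptions for illustration; in GEPA the mutate and eval steps call a model, while here they are plain callbacks so the control flow can be read and run in isolation.

```javascript
// Sketch of the Seed -> Mutate -> Eval -> Select -> Reflect cycle.
// All step functions are supplied by the caller (hypothetical signatures).
function runLoop({ seed, generations, mutate, evaluate, reflect }) {
  let best = { prompt: seed, score: evaluate(seed) };
  let insights = [];
  for (let gen = 1; gen <= generations; gen++) {
    // Mutate: derive candidate prompts from the current best, guided by insights.
    const candidates = mutate(best.prompt, insights);
    // Eval: score every candidate.
    const scored = candidates.map((prompt) => ({ prompt, score: evaluate(prompt) }));
    // Select: keep the highest-scoring prompt seen so far.
    for (const c of scored) {
      if (c.score > best.score) best = c;
    }
    // Reflect: turn this generation's results into hints for the next round.
    insights = reflect(scored);
  }
  return best;
}
```

Because the seed is scored first and only replaced by a strictly better candidate, the baseline is never lost to a worse variant.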
# 1. Initialize with your current CLAUDE.md
node scripts/gepa/optimize.js init ./CLAUDE.md
# 2. Run full optimization (10 generations)
node scripts/gepa/optimize.js run
# 3. Apply the best result
cp scripts/gepa/generations/current ./CLAUDE.md

| Command | Description |
|---|---|
| init [path] | Initialize with a CLAUDE.md file |
| mutate | Generate new prompt variants |
| eval [variant] | Run evals on a specific variant |
| score | Score all variants and select best |
| run [N] | Full optimization loop for N generations |
| status | Show current optimization status |
| diff [a] [b] | Compare two variants |

GEPA uses 14 mutation strategies (v2.0), cycling through them. The first 6 are foundational; the remaining 8 are derived from Anthropic's Claude prompting best practices (2026-03).
- rephrase - Reword for clarity without changing meaning
- add_examples - Add concrete examples where abstract (3-5 few-shot)
- remove_redundancy - DRY up repetitive instructions
- restructure - Reorganize for better flow, critical rules early
- add_constraints - Add guardrails for failure modes ("do X instead of Y")
- simplify - Break down complex rules into numbered steps
- add_xml_structure - Wrap sections in descriptive XML tags for unambiguous parsing
- add_role - Add/refine role definition to focus behavior and tone
- add_motivation - Add "why" context so Claude generalizes rules to edge cases
- calibrate_tool_usage - Dial back aggressive tool-triggering language for Opus 4.6
- add_self_check - Add verification criteria at key decision points
- reduce_overengineering - Constrain Claude from adding unnecessary abstractions
- add_guardrails - Explicit failure-mode prevention for agentic workflows
- improve_error_handling - Fallback instructions for incomplete/missing data
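Cycling through the strategies can be sketched as a round-robin index into the list above. The function name `strategyFor` is hypothetical, and continuing the cycle across generations (rather than restarting at `rephrase` each time) is an assumption.

```javascript
// The 14 strategies, in the order listed above.
const STRATEGIES = [
  "rephrase", "add_examples", "remove_redundancy", "restructure",
  "add_constraints", "simplify", "add_xml_structure", "add_role",
  "add_motivation", "calibrate_tool_usage", "add_self_check",
  "reduce_overengineering", "add_guardrails", "improve_error_handling",
];

// Pick a strategy for a variant. `generation` is 1-based, `variantIndex` is
// 0-based (variant-a = 0). Assumes the cycle continues across generations.
function strategyFor(generation, variantIndex, populationSize) {
  const n = (generation - 1) * populationSize + variantIndex;
  return STRATEGIES[n % STRATEGIES.length];
}
```

With a population of 4, generation 1 gets `rephrase`, `add_examples`, `remove_redundancy`, and `restructure`, and under this assumption generation 2 would start at `add_constraints`.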
Before evaluation, each mutation goes through a self-review step:
mutate → self-review → refine → eval → select
The review checks: preservation, coherence, specificity, token budget, no drift, no conflicts, no prompt overengineering. This catches errors before burning eval budget.
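Two of these checks, preservation and token budget, are mechanical enough to sketch without a model. The code below is a crude illustration under stated assumptions: `selfReview` and `maxGrowth` are hypothetical names, verbatim matching of MUST/NEVER lines is stricter than a rewording-tolerant review would be, and character length stands in for token count. Checks such as coherence or specificity need a model's judgment and are not covered.

```javascript
// Hypothetical mechanical subset of the self-review step.
function selfReview(original, mutated, { maxGrowth = 1.25 } = {}) {
  const issues = [];
  // Preservation: every MUST/NEVER line of the original should survive verbatim.
  const rules = original.split("\n").filter((line) => /\b(MUST|NEVER)\b/.test(line));
  for (const rule of rules) {
    if (!mutated.includes(rule.trim())) issues.push(`missing rule: ${rule.trim()}`);
  }
  // Token budget: flag variants that grow past the allowed ratio
  // (character length is used as a rough proxy for tokens).
  if (mutated.length > original.length * maxGrowth) issues.push("over token budget");
  return { ok: issues.length === 0, issues };
}
```

A variant that fails would go back through the refine step instead of consuming an eval run.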
Unlike random mutations, GEPA analyzes why prompts fail:
# Analyze session patterns
node scripts/gepa/hooks/reflect.js analyze
# Generate targeted improvements
node scripts/gepa/hooks/reflect.js reflect
The reflection engine examines:
- Common error patterns
- Tool call success rates
- User feedback (thumbs up/down)
- Performance by variant
It then generates targeted mutation suggestions.
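The aggregation behind signals such as tool call success rate and feedback by variant can be sketched as follows. The session record shape (`variant`, `toolCalls[].ok`, `feedback`) and the name `summarizeSessions` are assumptions for illustration; the actual format written by eval-tracker.js is not documented here.

```javascript
// Hypothetical session record:
//   { variant: "variant-a", toolCalls: [{ ok: true }], feedback: "up" | "down" | null }
function summarizeSessions(sessions) {
  const byVariant = {};
  for (const s of sessions) {
    // Create the per-variant accumulator on first sight.
    const v = (byVariant[s.variant] ??= { calls: 0, ok: 0, up: 0, down: 0 });
    for (const call of s.toolCalls) {
      v.calls += 1;
      if (call.ok) v.ok += 1;
    }
    if (s.feedback === "up") v.up += 1;
    if (s.feedback === "down") v.down += 1;
  }
  return Object.entries(byVariant).map(([variant, v]) => ({
    variant,
    // null when a variant made no tool calls, to avoid dividing by zero.
    toolSuccessRate: v.calls ? v.ok / v.calls : null,
    up: v.up,
    down: v.down,
  }));
}
```

A variant with a low success rate or mostly thumbs-down feedback is a natural target for the next round of mutations.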
To track real usage for evals, add these hooks to ~/.claude/settings.json:
{
"hooks": {
"postToolCall": [
{
"command": "node ~/.claude/gepa/hooks/eval-tracker.js track-tool"
}
],
"postSession": [
{
"command": "node ~/.claude/gepa/hooks/eval-tracker.js save"
}
]
}
}
Or use the StackMemory daemon integration:
# Add to your .env
GEPA_ENABLED=true
GEPA_DIR=~/.claude/gepa
Edit config.json (the // comments below are annotations; strict JSON does not permit them):
{
"evolution": {
"populationSize": 4, // Variants per generation
"generations": 10, // Max generations
"selectionRate": 0.5 // Top 50% survive
},
"evals": {
"minSamplesPerVariant": 5, // Evals per variant
"timeout": 120000 // 2 min per eval
},
"scoring": {
"threshold": 0.8 // Stop when 80% success
}
}
Evals live in evals/*.jsonl, one JSON object per line (pretty-printed here for readability):
{
"id": "eval-001",
"name": "simple_function",
"prompt": "Write a function that checks if a string is a palindrome",
"expected": {
"has_function": true,
"handles_edge_cases": true
},
"weight": 1.0
}

| Check | What It Looks For |
|---|---|
| has_function | Function definition in output |
| handles_edge_cases | Null/empty/edge case handling |
| uses_async | async/await usage |
| bug_fixed | Fix-related language |
| explains_fix | Explanation of changes |
| Custom key | Looks for key as substring |

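A rough sketch of how such checks could be applied to a model output and combined using each eval's `weight`. The detector regexes and the names `CHECKS`, `scoreOutput`, and `weightedScore` are assumptions; only the check names, the substring fallback for custom keys, and the `expected`/`weight` fields come from the format above.

```javascript
// Hypothetical detectors for two of the named checks; the regexes are assumptions.
const CHECKS = {
  has_function: (out) => /\bfunction\b|=>|\bdef\s+\w+\s*\(/.test(out),
  uses_async: (out) => /\basync\b|\bawait\b/.test(out),
};

// Score one eval record against a model output. Returns a fraction in [0, 1].
function scoreOutput(evalCase, output) {
  const keys = Object.keys(evalCase.expected);
  if (keys.length === 0) return 0;
  let passed = 0;
  for (const key of keys) {
    // Custom keys fall back to a substring search for the key itself.
    const detect = CHECKS[key] ?? ((out) => out.includes(key));
    if (detect(output) === Boolean(evalCase.expected[key])) passed += 1;
  }
  return passed / keys.length;
}

// Weighted mean across eval cases, using each record's "weight" field.
function weightedScore(results) {
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  if (total === 0) return 0;
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / total;
}
```

Keyword and substring checks like these are cheap but shallow, which is one reason diverse evals matter more than any single check.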
scripts/gepa/
├── config.json # Settings
├── state.json # Current state
├── optimize.js # Main optimizer
├── hooks/
│ ├── eval-tracker.js # Session tracking hook
│ └── reflect.js # Reflection engine
├── evals/
│ └── coding-tasks.jsonl
├── generations/
│ ├── gen-000/
│ │ └── baseline.md
│ ├── gen-001/
│ │ ├── variant-a.md
│ │ ├── variant-b.md
│ │ └── baseline.md
│ └── current -> gen-001/variant-a.md
└── results/
├── scores.jsonl
└── sessions/
- Start with good evals - Garbage in, garbage out
- Run multiple generations - Improvements compound
- Review diffs - Understand what changed
- Keep baseline - Always compare against original
- Monitor for drift - Watch for unintended changes
$ node optimize.js run 3
============================================================
GENERATION 1/3
============================================================
Generating 4 variants for generation 1...
Creating variant-a using strategy: rephrase
Creating variant-b using strategy: add_examples
Creating variant-c using strategy: remove_redundancy
Creating variant-d using strategy: restructure
Scoring 5 variants in generation 1...
Running evals on baseline... Score: 65.0%
Running evals on variant-a... Score: 72.0%
Running evals on variant-b... Score: 78.0%
Running evals on variant-c... Score: 70.0%
Running evals on variant-d... Score: 68.0%
Results:
1. variant-b: 78.0% <-- BEST
2. variant-a: 72.0%
3. variant-c: 70.0%
4. variant-d: 68.0%
5. baseline: 65.0%
New best: variant-b (78.0%)
============================================================
OPTIMIZATION COMPLETE
============================================================
Best variant: variant-b
Best score: 85.2%
Generations: 3
To apply: cp generations/current /path/to/your/CLAUDE.md
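The ranking in the sample run, combined with `selectionRate` and `threshold` from config.json, can be sketched as below. The name `selectSurvivors` is hypothetical, and rounding the survivor count up is an assumption.

```javascript
// Rank variants by score, keep the top fraction, and report whether the
// best score has reached the stop threshold.
function selectSurvivors(scores, { selectionRate, threshold }) {
  const ranked = Object.entries(scores)
    .map(([name, score]) => ({ name, score }))
    .sort((a, b) => b.score - a.score);
  // Assumption: round up, and always keep at least one variant.
  const keep = Math.max(1, Math.ceil(ranked.length * selectionRate));
  return {
    ranked,
    survivors: ranked.slice(0, keep).map((v) => v.name),
    best: ranked[0].name,
    done: ranked[0].score >= threshold,
  };
}
```

With the generation 1 scores and the default config (`selectionRate: 0.5`, `threshold: 0.8`), variant-b ranks first at 78.0%, which is still below the 80% threshold, so the loop continues.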
"claude CLI not found"
Set ANTHROPIC_API_KEY for API fallback.
Slow evals
Reduce minSamplesPerVariant in config.
Poor results
Add more diverse evals covering failure modes.
Drift from original intent
Add evals that test for desired behaviors explicitly.