Train agent skills like you train neural networks, with epochs, learning rates, and validation gates, but without touching model weights.
Requirements: Python 3.10+
git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .
# For ALFWorld benchmark (optional):
pip install -e ".[alfworld]"
alfworld-download
cp .env.example .env
# Edit .env with your API credentials, then:
source .env
Azure OpenAI (recommended):
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
# Option 1: API key auth
export AZURE_OPENAI_API_KEY="your-key"
# Option 2: Azure CLI auth (no API key needed)
export AZURE_OPENAI_AUTH_MODE="azure_cli"
Note: `AZURE_OPENAI_ENDPOINT` is always required. Without it, all LLM calls will fail.
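Because a missing endpoint makes every LLM call fail, it can help to check the environment before launching a long run. The following is a minimal sketch, not part of SkillOpt: `check_azure_env` is a hypothetical helper, and it assumes that API key auth is the mode used whenever `AZURE_OPENAI_AUTH_MODE` is not set to `azure_cli`.

```python
import os
from typing import Mapping


def check_azure_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems found; an empty list means the setup looks complete.

    Hypothetical helper: it only inspects the variables named in this README.
    """
    problems = []
    # The endpoint is required for both auth options.
    if not env.get("AZURE_OPENAI_ENDPOINT"):
        problems.append("AZURE_OPENAI_ENDPOINT is not set")
    # Assumption: any auth mode other than "azure_cli" needs an API key.
    uses_cli_auth = env.get("AZURE_OPENAI_AUTH_MODE") == "azure_cli"
    if not uses_cli_auth and not env.get("AZURE_OPENAI_API_KEY"):
        problems.append(
            "set AZURE_OPENAI_API_KEY or AZURE_OPENAI_AUTH_MODE=azure_cli"
        )
    return problems
```

Calling `check_azure_env()` with no arguments inspects the current process environment, so it should run after `source .env`.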
OpenAI directly:
export OPENAI_API_KEY="sk-..."
Anthropic Claude:
export ANTHROPIC_API_KEY="sk-ant-..."
Qwen (local vLLM):
export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
SkillOpt expects data in a split directory with `train/`, `val/`, and `test/` subdirectories, each containing a JSON file (e.g., `items.json`).
data/my_split/
├── train/items.json
├── val/items.json
└── test/items.json
Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:
[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]
See `skillopt/envs/<benchmark>/dataloader.py` for the exact format each benchmark expects.
Note: Benchmark datasets are not included in this repository. Prepare your own data following the format above.
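One way to prepare such a split directory is sketched below. This is an illustration only: `validate_items` and `write_split_dir` are hypothetical helpers, the field list is copied from the SearchQA example above and will differ for other benchmarks, and the uniqueness check on `id` is an assumption based on the placeholder value `unique_item_id`.

```python
import json
from pathlib import Path

# Fields taken from the SearchQA example; other benchmarks need other fields.
REQUIRED_FIELDS = ("id", "question", "context", "answers")
SPLITS = ("train", "val", "test")


def validate_items(items: list[dict]) -> None:
    """Raise ValueError if an item lacks a required field or repeats an id."""
    seen_ids = set()
    for index, item in enumerate(items):
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise ValueError(f"item {index} is missing fields: {missing}")
        # Assumption: ids must be unique within one file.
        if item["id"] in seen_ids:
            raise ValueError(f"duplicate id: {item['id']}")
        seen_ids.add(item["id"])


def write_split_dir(root: Path, splits: dict[str, list[dict]]) -> None:
    """Write <root>/<split>/items.json for the train, val, and test splits."""
    for name in SPLITS:
        validate_items(splits[name])
        split_path = root / name
        split_path.mkdir(parents=True, exist_ok=True)
        (split_path / "items.json").write_text(
            json.dumps(splits[name], indent=2, ensure_ascii=False),
            encoding="utf-8",
        )
```

The resulting `root` path is what would be passed as `--split_dir`.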
| Benchmark | Type | Config |
|---|---|---|
| SearchQA | QA | configs/searchqa/default.yaml |
| ALFWorld | Embodied agent | configs/alfworld/default.yaml |
| DocVQA | Document QA | configs/docvqa/default.yaml |
| LiveMathematicianBench | Math | configs/livemathematicianbench/default.yaml |
| SpreadsheetBench | Code generation | configs/spreadsheetbench/default.yaml |
| OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml |
# Minimal example: train on SearchQA
python scripts/train.py \
--config configs/searchqa/default.yaml \
--split_dir /path/to/your/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/ \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5
# Train on LiveMathematicianBench:
python scripts/train.py \
--config configs/livemathematicianbench/default.yaml \
--split_dir /path/to/your/livemath_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/ \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5
# Train on ALFWorld:
python scripts/train.py \
--config configs/alfworld/default.yaml \
--split_dir /path/to/your/alfworld_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/ \
--optimizer_model gpt-5.5 \
--target_model gpt-5.5
Key CLI arguments:
| Argument | Description | Example |
|---|---|---|
| `--config` | Benchmark config YAML | configs/searchqa/default.yaml |
| `--split_dir` | Path to data split directory | /path/to/split |
| `--azure_openai_endpoint` | Azure OpenAI endpoint URL | https://your-resource.openai.azure.com/ |
| `--optimizer_model` | Optimizer model deployment name | gpt-5.5 |
| `--target_model` | Target model deployment name | gpt-5.5 |
| `--num_epochs` | Number of training epochs | 4 |
| `--batch_size` | Batch size per step | 40 |
| `--workers` | Parallel rollout workers | 8 |
| `--out_root` | Output directory | outputs/my_run |
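When launching several runs from a script, the arguments in the table can be assembled programmatically. The sketch below is a hypothetical wrapper, not a SkillOpt API: `build_train_command` only builds an argument list from the flags documented above, and it treats the first five as required because every example command passes them, which is an assumption.

```python
import sys

# Optional flags listed in the table; anything else is rejected.
OPTIONAL_FLAGS = ("num_epochs", "batch_size", "workers", "out_root")


def build_train_command(
    config: str,
    split_dir: str,
    azure_openai_endpoint: str,
    optimizer_model: str,
    target_model: str,
    **overrides: object,
) -> list[str]:
    """Return an argv list for scripts/train.py (hypothetical helper)."""
    argv = [
        sys.executable, "scripts/train.py",
        "--config", config,
        "--split_dir", split_dir,
        "--azure_openai_endpoint", azure_openai_endpoint,
        "--optimizer_model", optimizer_model,
        "--target_model", target_model,
    ]
    for flag, value in overrides.items():
        if flag not in OPTIONAL_FLAGS:
            raise ValueError(f"unknown flag: --{flag}")
        argv += [f"--{flag}", str(value)]
    return argv
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or printed with `shlex.join(cmd)` to inspect the exact command first.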
Evaluate a trained skill on specific data splits without training:
# Evaluate on test set only:
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill outputs/my_run/best_skill.md \
--split valid_unseen \
--split_dir /path/to/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/
# Evaluate on all splits (train + val + test):
python scripts/eval_only.py \
--config configs/searchqa/default.yaml \
--skill outputs/my_run/best_skill.md \
--split all \
--split_dir /path/to/searchqa_split \
--azure_openai_endpoint https://your-resource.openai.azure.com/
| Split | Description |
|---|---|
| `valid_unseen` | Test set |
| `valid_seen` | Validation set |
| `train` | Training set |
| `all` | All splits combined (default) |
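The `--split` names do not match the `train/`, `val/`, `test/` directory names, which is easy to get wrong in scripts. The sketch below shows one way to guard against that. Everything in it is hypothetical: the `test` and `val` aliases are a convenience of this example and are not understood by `eval_only.py` itself.

```python
import sys

# Values accepted by --split, taken from the table above.
VALID_SPLITS = ("valid_unseen", "valid_seen", "train", "all")

# Hypothetical aliases mapping directory names to --split values.
ALIASES = {"test": "valid_unseen", "val": "valid_seen"}


def resolve_split(name: str = "all") -> str:
    """Translate an alias or --split value into a valid --split value."""
    value = ALIASES.get(name, name)
    if value not in VALID_SPLITS:
        raise ValueError(f"unknown split {name!r}; expected one of {VALID_SPLITS}")
    return value


def build_eval_command(
    config: str,
    skill: str,
    split_dir: str,
    azure_openai_endpoint: str,
    split: str = "all",
) -> list[str]:
    """Return an argv list for scripts/eval_only.py (hypothetical helper)."""
    return [
        sys.executable, "scripts/eval_only.py",
        "--config", config,
        "--skill", skill,
        "--split", resolve_split(split),
        "--split_dir", split_dir,
        "--azure_openai_endpoint", azure_openai_endpoint,
    ]
```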
Each run writes to a structured output directory:
outputs/<run_name>/
├── config.json              # Flattened runtime config
├── history.json             # Per-step training history
├── runtime_state.json       # Resume checkpoint
├── best_skill.md            # Best validated skill document
├── skills/skill_vXXXX.md    # Skill snapshot per step
├── steps/step_XXXX/         # Per-step artifacts (patches, evals)
├── slow_update/epoch_XX/    # Slow update logs
└── meta_skill/epoch_XX/     # Meta skill logs
Re-running the same command auto-resumes from the last completed step.
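To see what a run has produced so far, the output directory can be inspected with a few lines of Python. This is a rough sketch under stated assumptions: `summarize_run` is a hypothetical helper, it relies only on the file names shown in the tree above, and it assumes nothing about the contents of `history.json` beyond it being valid JSON (the entry count is reported only if the top level is a list).

```python
import json
import re
from pathlib import Path

# Matches snapshot names such as skill_v0003.md and captures the number.
SNAPSHOT_RE = re.compile(r"^skill_v(\d+)\.md$")


def summarize_run(run_dir: Path) -> dict:
    """Summarize the artifacts present in one outputs/<run_name>/ directory."""
    versions = []
    skills_dir = run_dir / "skills"
    if skills_dir.is_dir():
        for path in skills_dir.iterdir():
            match = SNAPSHOT_RE.match(path.name)
            if match:
                versions.append(int(match.group(1)))

    history = None
    history_path = run_dir / "history.json"
    if history_path.is_file():
        history = json.loads(history_path.read_text(encoding="utf-8"))

    return {
        "has_best_skill": (run_dir / "best_skill.md").is_file(),
        "has_resume_checkpoint": (run_dir / "runtime_state.json").is_file(),
        "snapshot_versions": sorted(versions),
        # Only counted when the file's top level is a JSON array.
        "history_entries": len(history) if isinstance(history, list) else None,
    }
```

For example, `summarize_run(Path("outputs/my_run"))` gives a quick indication of whether a resume checkpoint and a best skill document already exist before re-running the training command.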
Launch the monitoring dashboard (optional):
pip install -e ".[webui]"
python -m skillopt_webui.app
| Flag | Default | Description |
|---|---|---|
| `--port` | 7860 | Server port |
| `--host` | `0.0.0.0` | Bind address |
| `--share` | off | Create a public Gradio share link |
# With public share link (useful for remote servers)
python -m skillopt_webui.app --share
@article{skillopt2026,
  title={SKILLOPT: Executive Strategy for Self-Evolving Agent Skills},
  author={SkillOpt Team},
  year={2026}
}