SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Train agent skills like you train neural networks — with epochs, learning rates, and validation gates — but without touching model weights.

🎬 SkillOpt Demo Video

▶ Watch the full demo on YouTube

Install

Requirements: Python 3.10+

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# For ALFWorld benchmark (optional):
pip install -e ".[alfworld]"
alfworld-download

Configure API Credentials

cp .env.example .env
# Edit .env with your API credentials, then:
source .env

Azure OpenAI (recommended):

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
# Option 1: API key auth
export AZURE_OPENAI_API_KEY="your-key"
# Option 2: Azure CLI auth (no API key needed)
export AZURE_OPENAI_AUTH_MODE="azure_cli"

Note: AZURE_OPENAI_ENDPOINT is always required, whichever auth option you choose. Without it, all LLM calls will fail.

OpenAI directly:

export OPENAI_API_KEY="sk-..."

Anthropic Claude:

export ANTHROPIC_API_KEY="sk-ant-..."

Qwen (local vLLM):

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
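
Before launching a run, it can help to confirm which credentials are actually visible to the Python process. The sketch below is not part of SkillOpt: `configured_providers` is a hypothetical helper that only checks the environment variable names listed above. It assumes that Azure needs the endpoint plus either an API key or `AZURE_OPENAI_AUTH_MODE=azure_cli`, and that the local Qwen setup needs both variables.

```python
import os
from typing import Mapping, Optional


def configured_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the providers whose variables (as named in this README) are set.

    Hypothetical helper for illustration; SkillOpt may resolve credentials differently.
    """
    env = os.environ if env is None else env
    found = []

    # Azure: the endpoint plus an API key or the Azure CLI auth mode (assumed rule).
    has_endpoint = bool(env.get("AZURE_OPENAI_ENDPOINT"))
    has_key = bool(env.get("AZURE_OPENAI_API_KEY"))
    uses_cli = env.get("AZURE_OPENAI_AUTH_MODE") == "azure_cli"
    if has_endpoint and (has_key or uses_cli):
        found.append("azure_openai")

    if env.get("OPENAI_API_KEY"):
        found.append("openai")
    if env.get("ANTHROPIC_API_KEY"):
        found.append("anthropic")
    # Local vLLM: both the base URL and the model name are expected (assumed rule).
    if env.get("QWEN_CHAT_BASE_URL") and env.get("QWEN_CHAT_MODEL"):
        found.append("qwen_vllm")
    return found


print(configured_providers())
```

An empty list after `source .env` usually means the variables were set without `export` and never reached the child process.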

Data Preparation

SkillOpt expects data in a split directory with train/, val/, test/ subdirectories, each containing a JSON file (e.g., items.json).

data/my_split/
├── train/items.json
├── val/items.json
└── test/items.json

Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:

[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]
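
One way to produce this layout from a single list of items is sketched below. `write_split` is a hypothetical helper, not a SkillOpt function. The 80/10/10 ratios and the fixed seed are illustrative assumptions, and the file name `items.json` follows the example above.

```python
import json
import random
from pathlib import Path


def write_split(items, split_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and write train/val/test/items.json under split_dir.

    Hypothetical helper; ratios and seed are illustrative defaults.
    Returns the number of items written per split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducibility

    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    parts = {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # the remainder goes to test
    }

    counts = {}
    for name, part in parts.items():
        out_dir = Path(split_dir) / name
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "items.json", "w", encoding="utf-8") as f:
            json.dump(part, f, ensure_ascii=False, indent=2)
        counts[name] = len(part)
    return counts
```

Calling `write_split(items, "data/my_split")` creates the three subdirectories if they do not exist and overwrites any existing `items.json` files.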

See skillopt/envs/<benchmark>/dataloader.py for the exact format each benchmark expects.

Note: Benchmark datasets are not included in this repository. Prepare your own data following the format above.
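
A quick structural check before training can catch a missing split or a malformed item early. The sketch below assumes the SearchQA fields shown above and the `items.json` file name. `validate_split` is a hypothetical helper, and other benchmarks will need a different `required` set taken from their dataloader.

```python
import json
from pathlib import Path

# Fields from the SearchQA example above; other benchmarks differ.
SEARCHQA_FIELDS = {"id", "question", "context", "answers"}


def validate_split(split_dir, required=SEARCHQA_FIELDS, filename="items.json"):
    """Return a list of readable problems; an empty list means the layout looks fine."""
    problems = []
    for split in ("train", "val", "test"):
        path = Path(split_dir) / split / filename
        if not path.is_file():
            problems.append(f"{split}: missing {filename}")
            continue
        try:
            items = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{split}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(items, list):
            problems.append(f"{split}: top-level value must be an array")
            continue
        for index, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{split}[{index}]: item is not an object")
                continue
            missing = sorted(set(required) - set(item))
            if missing:
                problems.append(f"{split}[{index}]: missing fields {missing}")
    return problems
```

The check is purely structural: it does not verify that answers are correct or that ids are unique.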

Supported Benchmarks

Benchmark	Type	Config
SearchQA	QA	`configs/searchqa/default.yaml`
ALFWorld	Embodied agent	`configs/alfworld/default.yaml`
DocVQA	Document QA	`configs/docvqa/default.yaml`
LiveMathematicianBench	Math	`configs/livemathematicianbench/default.yaml`
SpreadsheetBench	Code generation	`configs/spreadsheetbench/default.yaml`
OfficeQA	Tool-augmented QA	`configs/officeqa/default.yaml`

Quick Start

Training

# Minimal example — train on SearchQA:
python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on LiveMathematicianBench:
python scripts/train.py \
    --config configs/livemathematicianbench/default.yaml \
    --split_dir /path/to/your/livemath_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on ALFWorld:
python scripts/train.py \
    --config configs/alfworld/default.yaml \
    --split_dir /path/to/your/alfworld_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

Key CLI arguments:

Argument	Description	Example
`--config`	Benchmark config YAML	`configs/searchqa/default.yaml`
`--split_dir`	Path to data split directory	`/path/to/split`
`--azure_openai_endpoint`	Azure OpenAI endpoint URL	`https://your-resource.openai.azure.com/`
`--optimizer_model`	Optimizer model deployment name	`gpt-5.5`
`--target_model`	Target model deployment name	`gpt-5.5`
`--num_epochs`	Number of training epochs	`4`
`--batch_size`	Batch size per step	`40`
`--workers`	Parallel rollout workers	`8`
`--out_root`	Output directory	`outputs/my_run`
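
For scripted sweeps it can be convenient to assemble the command in Python. The sketch below is a thin wrapper that only uses the flags from the table above. `build_train_command` is a hypothetical helper, and the values passed in the example are the same placeholders used in the commands above.

```python
import shlex
import sys


def build_train_command(config, split_dir, endpoint, optimizer_model, target_model, **options):
    """Assemble an argv list for scripts/train.py from the documented flags.

    Extra keyword options (e.g. num_epochs=4) become "--num_epochs 4".
    Only the optional flags listed in the table above are accepted.
    """
    allowed = {"num_epochs", "batch_size", "workers", "out_root"}
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"undocumented options: {sorted(unknown)}")

    cmd = [
        sys.executable, "scripts/train.py",
        "--config", config,
        "--split_dir", split_dir,
        "--azure_openai_endpoint", endpoint,
        "--optimizer_model", optimizer_model,
        "--target_model", target_model,
    ]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_train_command(
    "configs/searchqa/default.yaml",
    "/path/to/your/searchqa_split",
    "https://your-resource.openai.azure.com/",
    "gpt-5.5",
    "gpt-5.5",
    num_epochs=4,
    out_root="outputs/my_run",
)
print(shlex.join(cmd))  # inspect first, then run with subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps paths with spaces intact when the command is passed to `subprocess.run`.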

Eval Only

Evaluate a trained skill on specific data splits without training:

# Evaluate on test set only:
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

# Evaluate on all splits (train + val + test):
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split all \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

Split	Description
`valid_unseen`	Test set
`valid_seen`	Validation set
`train`	Training set
`all`	All splits combined (default)
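
The split names differ from the directory names used under Data Preparation. The mapping below restates the table in code. That `valid_unseen` reads from `test/` and `valid_seen` from `val/` is an inference from the table, not something confirmed against `eval_only.py`, and `split_directories` is a hypothetical helper.

```python
from pathlib import Path

# Value accepted by --split -> data subdirectory (assumed mapping, see note above).
SPLIT_TO_SUBDIR = {
    "train": "train",
    "valid_seen": "val",
    "valid_unseen": "test",
}


def split_directories(split_dir, split="all"):
    """Return the data subdirectories that a given --split value would cover."""
    if split == "all":
        names = list(SPLIT_TO_SUBDIR)
    elif split in SPLIT_TO_SUBDIR:
        names = [split]
    else:
        raise ValueError(f"unknown split {split!r}")
    return [Path(split_dir) / SPLIT_TO_SUBDIR[name] for name in names]
```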

Output Structure

Each run writes to a structured output directory:

outputs/<run_name>/
├── config.json              # Flattened runtime config
├── history.json             # Per-step training history
├── runtime_state.json       # Resume checkpoint
├── best_skill.md            # Best validated skill document
├── skills/skill_vXXXX.md    # Skill snapshot per step
├── steps/step_XXXX/         # Per-step artifacts (patches, evals)
├── slow_update/epoch_XX/    # Slow update logs
└── meta_skill/epoch_XX/     # Meta skill logs

Re-running the same command auto-resumes from the last completed step.
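
To check how far a run has progressed before re-running it, the directory layout above can be inspected directly. This sketch relies only on the file names shown in the tree. It assumes the `XXXX` in `skill_vXXXX.md` is a numeric step index, and `summarize_run` is a hypothetical helper. The schemas of `history.json` and `runtime_state.json` are not documented here, so both files are only checked for existence.

```python
import re
from pathlib import Path


def summarize_run(run_dir):
    """Report which documented artifacts exist and the latest skill snapshot number."""
    run_dir = Path(run_dir)
    versions = []
    for path in (run_dir / "skills").glob("skill_v*.md"):
        match = re.fullmatch(r"skill_v(\d+)\.md", path.name)
        if match:  # ignore files that do not follow the assumed naming pattern
            versions.append(int(match.group(1)))
    return {
        "has_best_skill": (run_dir / "best_skill.md").is_file(),
        "has_history": (run_dir / "history.json").is_file(),
        "has_runtime_state": (run_dir / "runtime_state.json").is_file(),
        "num_snapshots": len(versions),
        "latest_snapshot": max(versions) if versions else None,
    }
```

For example, `summarize_run("outputs/my_run")` returns a small dictionary that can be printed or logged before deciding whether to resume.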

WebUI

Launch the monitoring dashboard (optional):

pip install -e ".[webui]"
python -m skillopt_webui.app

Flag	Default	Description
`--port`	7860	Server port
`--host`	`0.0.0.0`	Bind address
`--share`	off	Create a public Gradio share link

# With public share link (useful for remote servers)
python -m skillopt_webui.app --share

Citation

@article{skillopt2026,
  title={SKILLOPT: Executive Strategy for Self-Evolving Agent Skills},
  author={SkillOpt Team},
  year={2026}
}
