This agent is designed to show the basic principles for tackling software engineering problems from two prominent benchmarks: SWE-bench and TerminalBench. It is not meant to be a production ready implementation.
The Agent Starter Pack (ASP) is the recommended way to create a new project from this sample: you get a production-oriented layout, deployment choices, and CI/CD scaffolding. The copy in adk-samples remains the upstream source for browsing and contributions.
| Feature | Description |
|---|---|
| Interaction Type | Autonomous |
| Complexity | Advanced |
| Agent Type | Single Agent |
| Components | Tools: Shell |
| Vertical | Software Engineering |
The SWE Benchmark Agent uses a sophisticated orchestrator pattern:
- Orchestrator: Manages the agent lifecycle and coordinates tool execution
- Environment: Docker-based isolated execution environment (SWEBenchEnvironment or TerminalBenchEnvironment)
- Tools: File operations (read, edit, create), shell commands, and submission
- Agent: LLM-powered agent (Gemini) with built-in planner and thinking capabilities
The agent operates autonomously within the Docker environment, using shell commands and file operations to solve software engineering tasks.
- Python 3.10+
- uv
- Google Cloud SDK (
gcloud) installed and authenticated (for Vertex / Gemini) - Git
- Docker (for SWE-bench and TerminalBench evaluation via
swe_benchmark_agent.main)
The Agent Starter Pack is the recommended way to create and deploy a production-ready version of this agent. Start from a new directory (replace my-swe-agent with your project name):
uvx agent-starter-pack create my-swe-agent -a adk@swe-benchmark-agent
cd my-swe-agentInstall dependencies (including dev tools for tests):
uv sync --group devConfigure Google Cloud (environment variables or a .env file):
export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<your-project-id>
export GOOGLE_CLOUD_LOCATION=globalAuthenticate:
gcloud auth application-default login
gcloud auth application-default set-quota-project $GOOGLE_CLOUD_PROJECTDuring setup, the starter pack will prompt you for deployment options and adds production-oriented tooling (for example automated CI/CD deployment scripts).
Alternative: install Agent Starter Pack with pip
If you prefer not to use uvx, create a virtual environment and install the CLI:
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade agent-starter-pack
agent-starter-pack create my-swe-agent -a adk@swe-benchmark-agent
cd my-swe-agentThen continue with uv sync --group dev and the configuration steps above.
Clone this repository directly (contributors and advanced use)
Use this workflow when working against the upstream repository (for example to open a pull request). New projects should still use the Agent Starter Pack as described above.
git clone https://github.com/google/adk-samples.git
cd adk-samples/python/agents/swe-benchmark-agent
uv sync --group devSet the same GOOGLE_* environment variables and run gcloud auth application-default login as in the recommended path. Running the agent, tests, and evaluations uses the same commands as below.
Talk to the sample agent with the ADK CLI:
uv run adk run swe_benchmark_agentOr use the web UI:
uv run adk webSelect swe_benchmark_agent in the UI if prompted. The interactive agent explains how to run full benchmark evaluations; those use Docker via swe_benchmark_agent.main (see Running evaluations).
uv run pytest tests -vThe SWE Agent can be evaluated on both SWE-bench and TerminalBench benchmarks to measure its performance on real-world software engineering tasks.
To run evaluation on the full SWE-bench Verified dataset:
uv run python -m swe_benchmark_agent.main --full-dataset --evaluate --max-workers 4To evaluate on a specific number of instances (e.g., the first 10):
uv run python -m swe_benchmark_agent.main --instance-id-or-count 10 --evaluateTo evaluate on a single instance:
uv run python -m swe_benchmark_agent.main --instance-id-or-count django__django-12345 --evaluateTo run evaluation on the full TerminalBench core dataset:
uv run python -m swe_benchmark_agent.main --dataset terminalbench --full-dataset --evaluate --max-workers 4To evaluate on a specific number of tasks (e.g., the first 5):
uv run python -m swe_benchmark_agent.main --dataset terminalbench --instance-id-or-count 5 --evaluateTo evaluate on a single task:
uv run python -m swe_benchmark_agent.main --dataset terminalbench --instance-id-or-count blind-maze-explorer-5x5 --evaluateThe SWE Agent can be customized to better suit your requirements. For example:
- Use a different model: Adjust the model in
swe_benchmark_agent/main.py(benchmark orchestration) orswe_benchmark_agent/agent.py(interactiveadk runentry point). - Add more tools: Add tools to the agent to give it more capabilities.
- Support more benchmarks: Add support for more benchmarks by creating a new environment and updating
swe_benchmark_agent/main.py.