SWE Benchmark Agent

Overview

This agent demonstrates the basic principles for tackling software engineering problems from two prominent benchmarks: SWE-bench and TerminalBench. It is not meant to be a production-ready implementation.

The Agent Starter Pack (ASP) is the recommended way to create a new project from this sample: you get a production-oriented layout, deployment choices, and CI/CD scaffolding. The copy in adk-samples remains the upstream source for browsing and contributions.

Agent Details

Feature	Description
Interaction Type	Autonomous
Complexity	Advanced
Agent Type	Single Agent
Components	Tools: Shell
Vertical	Software Engineering

Agent architecture:

The SWE Benchmark Agent uses an orchestrator pattern built from four components:

Orchestrator: Manages the agent lifecycle and coordinates tool execution
Environment: Docker-based isolated execution environment (SWEBenchEnvironment or TerminalBenchEnvironment)
Tools: File operations (read, edit, create), shell commands, and submission
Agent: LLM-powered agent (Gemini) with built-in planner and thinking capabilities

The agent operates autonomously within the Docker environment, using shell commands and file operations to solve software engineering tasks.

Prerequisites

Python 3.10+
uv
Google Cloud SDK (gcloud) installed and authenticated (for Vertex / Gemini)
Git
Docker (for SWE-bench and TerminalBench evaluation via swe_benchmark_agent.main)

Recommended: Using Agent Starter Pack

The Agent Starter Pack is the recommended way to create and deploy a production-ready version of this agent. Start from a new directory (replace my-swe-agent with your project name):

uvx agent-starter-pack create my-swe-agent -a adk@swe-benchmark-agent
cd my-swe-agent

Install dependencies (including dev tools for tests):

uv sync --group dev

Configure Google Cloud (environment variables or a .env file):

export GOOGLE_GENAI_USE_VERTEXAI=true
export GOOGLE_CLOUD_PROJECT=<your-project-id>
export GOOGLE_CLOUD_LOCATION=global
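
Before authenticating, it can help to confirm that all three variables are visible to Python. The following is a small sketch, not part of the sample: `REQUIRED_VARS` and `missing_settings` are hypothetical names, and the check only looks at the process environment, so values kept in a .env file are seen only if something has already loaded that file.

```python
import os
from collections.abc import Mapping

# The three variables set in the configuration step above.
REQUIRED_VARS = (
    "GOOGLE_GENAI_USE_VERTEXAI",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_settings()
print("Missing settings:", ", ".join(missing) or "none")
```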

Authenticate:

gcloud auth application-default login
gcloud auth application-default set-quota-project $GOOGLE_CLOUD_PROJECT

During setup, the starter pack prompts you for deployment options and adds production-oriented tooling (for example, automated CI/CD deployment scripts).

Alternative: install Agent Starter Pack with pip

If you prefer not to use uvx, create a virtual environment and install the CLI:

python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade agent-starter-pack
agent-starter-pack create my-swe-agent -a adk@swe-benchmark-agent
cd my-swe-agent

Then continue with uv sync --group dev and the configuration steps above.

Clone this repository directly (contributors and advanced use)

Use this workflow when working against the upstream repository (for example to open a pull request). New projects should still use the Agent Starter Pack as described above.

git clone https://github.com/google/adk-samples.git
cd adk-samples/python/agents/swe-benchmark-agent
uv sync --group dev

Set the same GOOGLE_* environment variables and run gcloud auth application-default login as in the recommended path. The commands below for running the agent, tests, and evaluations work the same way in a cloned checkout.

Running the Agent

Talk to the sample agent with the ADK CLI:

uv run adk run swe_benchmark_agent

Or use the web UI:

uv run adk web

Select swe_benchmark_agent in the UI if prompted. The interactive agent explains how to run full benchmark evaluations; those use Docker via swe_benchmark_agent.main (see Running evaluations).

Running tests

uv run pytest tests -v

Running evaluations

The SWE Benchmark Agent can be evaluated on both SWE-bench and TerminalBench to measure its performance on real-world software engineering tasks.

SWE-bench Evaluation

To run evaluation on the full SWE-bench Verified dataset:

uv run python -m swe_benchmark_agent.main --full-dataset --evaluate --max-workers 4

To evaluate on a specific number of instances (e.g., the first 10):

uv run python -m swe_benchmark_agent.main --instance-id-or-count 10 --evaluate

To evaluate on a single instance:

uv run python -m swe_benchmark_agent.main --instance-id-or-count django__django-12345 --evaluate

TerminalBench Evaluation

To run evaluation on the full TerminalBench core dataset:

uv run python -m swe_benchmark_agent.main --dataset terminalbench --full-dataset --evaluate --max-workers 4

To evaluate on a specific number of tasks (e.g., the first 5):

uv run python -m swe_benchmark_agent.main --dataset terminalbench --instance-id-or-count 5 --evaluate

To evaluate on a single task:

uv run python -m swe_benchmark_agent.main --dataset terminalbench --instance-id-or-count blind-maze-explorer-5x5 --evaluate

Customization

The SWE Benchmark Agent can be customized to better suit your requirements. For example:

Use a different model: Adjust the model in swe_benchmark_agent/main.py (benchmark orchestration) or swe_benchmark_agent/agent.py (interactive adk run entry point).
Add more tools: Extend the agent beyond its existing file-operation, shell, and submission tools to give it more capabilities.
Support more benchmarks: Add support for more benchmarks by creating a new environment and updating swe_benchmark_agent/main.py.
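
As an illustration of the "Add more tools" option, a new tool could start out as a plain Python function with type hints and a docstring, which ADK can generally wrap as a function tool. The sketch below is an assumption-heavy example: `search_file` and `MAX_MATCHES` are hypothetical, the function reads the local filesystem rather than the Docker environment, and how this sample registers tools and routes them into the container is not shown here.

```python
import re
from pathlib import Path

# Hypothetical cap so a broad pattern cannot flood the model's context.
MAX_MATCHES = 20


def search_file(path: str, pattern: str) -> dict:
    """Search a text file for lines matching a regular expression.

    Args:
        path: Path of the file to search.
        pattern: Regular expression applied to each line.

    Returns:
        A dict with "status" set to "success" and a "matches" list of
        "line_number: text" strings, or "status" set to "error" and an
        "error" message.
    """
    try:
        regex = re.compile(pattern)
        text = Path(path).read_text(encoding="utf-8", errors="replace")
    except (OSError, re.error) as exc:
        return {"status": "error", "error": str(exc)}
    matches = [
        f"{number}: {line}"
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
    return {"status": "success", "matches": matches[:MAX_MATCHES]}
```

Returning a structured result with an explicit status, instead of raising, gives the model something it can read and react to when a path or pattern is wrong.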
