A2UI Evaluation Framework

This folder contains evaluation tests (aka evals) for the A2UI project. An evaluation test verifies that a prompt produces expected results conforming to the A2UI schema and semantic rules.

Design

For a detailed overview of the design, secrets management, and contamination prevention, see the DESIGN.md file.

Running Evaluations

Run the evaluations with the Inspect AI CLI via uv, from this directory (evals/eval).

Prerequisites

Set your Gemini API key:
```
export GEMINI_API_KEY="your_api_key"
```
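A missing or blank key is an easy thing to overlook before a long run. The following is a minimal preflight sketch, not part of the framework: the helper name `missing_env_vars` is hypothetical, and it only assumes that the key is read from the `GEMINI_API_KEY` environment variable as exported above.

```python
import os
from typing import Iterable, Mapping


def missing_env_vars(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`.

    Hypothetical helper for illustration; the eval framework does not ship it.
    """
    return [name for name in required if not env.get(name, "").strip()]


# Report which of the required variables still need to be exported.
missing = missing_env_vars(["GEMINI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("GEMINI_API_KEY is set.")
```

Passing `env` explicitly keeps the check easy to exercise without touching the real environment.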
Decrypt the datasets (first-time setup): The evaluation datasets are encrypted at rest in the repository to prevent base model contamination. To decrypt them in your local clone, initialize Transcrypt with the shared password. From the evals/eval directory, run:
```
bin/transcrypt -p <PASSWORD>
```

You can request the password from any member of the A2UI team (it's not really a secret, but it's also not going on GitHub in plaintext).

After this one-time setup, you will have local plaintext access to the datasets in the datasets/ directory, and git will encrypt and decrypt them transparently.
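If an eval fails while loading data, it can help to confirm that decryption actually happened. The sketch below assumes Transcrypt's default OpenSSL output, where an encrypted file is base64 text beginning with `U2FsdGVkX1` (the encoding of the `Salted__` header); if this repository's Transcrypt setup differs, the prefix will not match. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: encrypted files are salted OpenSSL output in base64, which
# always begins with this prefix (base64 of the ASCII header "Salted__").
ENCRYPTED_PREFIX = b"U2FsdGVkX1"


def looks_encrypted(path: Path) -> bool:
    """Return True if the file starts with the salted-OpenSSL base64 prefix."""
    with path.open("rb") as handle:
        return handle.read(len(ENCRYPTED_PREFIX)) == ENCRYPTED_PREFIX


def still_encrypted(dataset_dir: Path) -> list[Path]:
    """List files under `dataset_dir` that still look encrypted, sorted by path."""
    return sorted(
        path
        for path in dataset_dir.rglob("*")
        if path.is_file() and looks_encrypted(path)
    )
```

Calling `still_encrypted(Path("datasets"))` from the evals/eval directory should return an empty list once the one-time setup has succeeded.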

Run Evals

To run the evaluations with a specific model (e.g., Gemini 3 Flash Preview):

```
uv run inspect eval tasks.py --model google/gemini-3-flash-preview --display plain
```
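To compare several models, the same command can be assembled and launched from a small script. This is only a sketch: `build_eval_command` and `run_eval` are hypothetical helpers, and they use nothing beyond the flags shown in the command above (`--model`, `--display`).

```python
import shlex
import subprocess


def build_eval_command(model: str, display: str = "plain") -> list[str]:
    """Build the argv for the documented `uv run inspect eval` invocation."""
    return [
        "uv", "run", "inspect", "eval", "tasks.py",
        "--model", model,
        "--display", display,
    ]


def run_eval(model: str) -> int:
    """Run the evals for one model and return the process exit code.

    Must be called from the evals/eval directory so that tasks.py resolves.
    """
    command = build_eval_command(model)
    print("Running:", shlex.join(command))
    return subprocess.run(command, check=False).returncode
```

For example, `run_eval("google/gemini-3-flash-preview")` launches the same run as the command above; looping over a list of model names repeats it for each model.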

Viewing Evaluation Results

Inspect AI provides a web-based log viewer to explore the results of your evaluations.

To start the log viewer:

```
uv run inspect view start
```

This will start a local web server (usually at http://localhost:7575) and open the viewer in your browser. It will automatically load logs from the logs/ directory.

Listing Available Models

To list the available Gemini models supported by your API key:

```
uv run inspect eval tasks.py -T list_models=True --model google/gemini-3-flash-preview
```

The --model flag is required here even though its value is ignored.

Running Unit Tests

To run the unit tests for the evaluation framework (dataset loader, solvers, scorers):

```
uv run python -m pytest
```
