A locally running AI agent that autonomously controls a virtual desktop inside a Docker container using the Qwen3-VL vision-language model. The user issues natural language commands; the agent analyzes live screenshots of the VM, plans multi-step actions, and executes mouse/keyboard inputs to accomplish the task — all running on your own hardware with no cloud APIs required.
Most "computer use" demos rely on cloud-hosted models (GPT-4V, Claude, etc.). This project runs the entire pipeline locally:
- A Docker container (`trycua/cua-xfce`) provides a full XFCE Linux desktop accessible via VNC and a REST API.
- A Planner model (text-only GGUF, e.g. Octopus-Planning Q5_K_M) breaks down complex user objectives into atomic, verifiable steps.
- A Qwen3-VL 8B vision-language model (GGUF format, accelerated on your NVIDIA GPU via `llama-cpp-python`) looks at the VM's screen, executes each step, and verifies success.
- A PyQt6 Mission Control UI lets you issue commands, watch the agent work in real time, inspect each step, and intervene when needed.
Hierarchical Agent Loop:

```
User Objective
  → Planner (text-only GGUF) → [Step 1, Step 2, Step 3, ...]
  → For each step:
       Screenshot → Executor (Qwen3-VL) → Action (click/type/scroll)
       Screenshot → Verifier (Qwen3-VL) → Pass/Fail
  → All steps done → Objective complete
```
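The loop above can be sketched in Python. This is a minimal illustration, not the project's actual code; the `plan`, `execute_step`, and `verify_step` callables stand in for the planner, executor, and verifier models, and the retry policy shown here is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    description: str       # atomic instruction, e.g. "click the Firefox icon"
    success_criteria: str  # what the verifier should see afterwards

def run_objective(
    objective: str,
    plan: Callable[[str], List[Step]],      # planner model (text-only GGUF)
    execute_step: Callable[[Step], None],   # executor (Qwen3-VL): screenshot -> action
    verify_step: Callable[[Step], bool],    # verifier (Qwen3-VL): screenshot -> pass/fail
    max_retries: int = 2,
) -> bool:
    """Plan -> execute -> verify each step; give up when a step cannot be verified."""
    for step in plan(objective):
        for _ in range(1 + max_retries):
            execute_step(step)
            if verify_step(step):
                break                        # step verified, move on to the next
        else:
            return False                     # retries exhausted, objective failed
    return True                              # all steps verified
```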
- Qwen3-VL 8B vision-language model (GGUF, runs locally on GPU)
- Docker Sandbox — isolated virtual desktop via the `trycua/cua-xfce` container
- Hierarchical Planning — local GGUF planner decomposes complex objectives into atomic steps, verified after execution
- Local Model File Browser — browse and select local `.gguf` model files directly from the GUI, or download from HuggingFace
- Auto GPU Layer Detection — automatically detects available VRAM and calculates optimal GPU offloading
- Mission Control UI — professional 5-panel PyQt6 interface with planner settings panel
- Live VM Screen — direct mouse/keyboard interaction with the VM
- Agent Trace — step-by-step plan visualization, metrics, structured logs
- Plan Verification — each plan step is verified against success criteria using the vision model
- Safety Guards — repeat detection, coordinate validation, step limit
- Turkish → English Translation — commands are auto-translated (optional)
- JSON Log Export — export structured logs for debugging/analysis
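The repeat-detection safety guard can be illustrated with a small sketch (hypothetical names; the project's actual checks live in `src/guards.py`): if the agent emits the same action several times in a row, it is likely stuck and should be stopped.

```python
from collections import deque

class RepeatGuard:
    """Flags the agent as stuck when the same action repeats `limit` times in a row."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.recent = deque(maxlen=limit)  # sliding window of the last actions

    def check(self, action: tuple) -> bool:
        """Record an action; return True if it is allowed, False if the agent is looping."""
        self.recent.append(action)
        if len(self.recent) == self.limit and len(set(self.recent)) == 1:
            return False  # identical action `limit` times in a row -> stuck
        return True
```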
| Component | Minimum |
|---|---|
| OS | Ubuntu 22.04+ |
| Python | 3.10 |
| GPU | NVIDIA with CUDA support (8 GB+ VRAM recommended) |
| NVIDIA Driver | 535+ |
| Docker | 24.0+ |
| RAM | 16 GB+ recommended |
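A quick way to sanity-check these requirements before installing anything is a short stdlib-only script (illustrative, not part of the repository):

```python
import shutil
import sys

def preflight() -> dict:
    """Best-effort check of the minimum requirements from the table above."""
    return {
        # Python 3.10+ interpreter
        "python_ok": sys.version_info >= (3, 10),
        # docker CLI on PATH (Docker Engine 24+ expected)
        "docker_found": shutil.which("docker") is not None,
        # nvidia-smi on PATH implies an NVIDIA driver is installed
        "nvidia_driver_found": shutil.which("nvidia-smi") is not None,
    }

if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```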
Check if the driver is installed:

```bash
nvidia-smi
```

If not installed:

```bash
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot
```

Install Docker Engine:

```bash
sudo apt update
sudo apt install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
  | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

# Allow running Docker without sudo (re-login required)
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker run --rm hello-world
```

Pull the sandbox image:

```bash
docker pull trycua/cua-xfce:latest
```

If Miniconda is not installed:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Close and reopen your terminal
```

Create and activate the environment:

```bash
conda create -n cua python=3.10 -y
conda activate cua
```

Install PyTorch with CUDA support:

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
```

Note: for a different CUDA version, see https://pytorch.org/get-started/locally/ for the matching index URL.
The standard `pip install llama-cpp-python` does not include CUDA support. To run on an NVIDIA GPU, use a prebuilt wheel from the JamePeng fork:

- Download the `.whl` file matching your system:
  - Python version: `cp310` (Python 3.10)
  - Platform: `linux_x86_64`
  - CUDA version: `cu130` (CUDA 13.0) or `cu124` (CUDA 12.4)
  - Example: `llama_cpp_python-0.3.23+cu130-cp310-cp310-linux_x86_64.whl`
- Install the downloaded wheel:

```bash
conda activate cua
pip install llama_cpp_python-0.3.23+cu130-cp310-cp310-linux_x86_64.whl
```

To check your CUDA version, look at the "CUDA Version" line in the `nvidia-smi` output.
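Picking the right wheel can be mechanized. The helper below is a hypothetical convenience (not shipped with the project) that composes the expected filename from your interpreter version and a CUDA tag:

```python
import sys

def wheel_name(version: str = "0.3.23", cuda: str = "cu130") -> str:
    """Compose the llama-cpp-python wheel filename for this interpreter (Linux x86_64)."""
    tag = f"cp{sys.version_info.major}{sys.version_info.minor}"  # e.g. cp310 for Python 3.10
    return f"llama_cpp_python-{version}+{cuda}-{tag}-{tag}-linux_x86_64.whl"

# Example: on Python 3.10 with CUDA 13.0 this yields
# llama_cpp_python-0.3.23+cu130-cp310-cp310-linux_x86_64.whl
```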
```bash
conda activate cua
pip install -r requirements.txt
```

The hierarchical planner uses a separate text-only GGUF model to decompose complex objectives into steps. You can either:
Option A — Use a local `.gguf` file (recommended):
Download any text-only GGUF model (e.g. Octopus-Planning Q5_K_M) and select it in the GUI via the 📂 Browse button.

Option B — Auto-download from HuggingFace:
Set `PLANNER_GGUF_REPO_ID` and `PLANNER_GGUF_MODEL_FILENAME` in `src/config.py`.
Note: The planner runs on CPU by default when the executor model (Qwen3-VL) is already using the GPU, to avoid CUDA memory conflicts. Auto GPU detection handles this automatically.
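The auto-detection logic can be approximated as follows. This is a sketch under assumptions: free VRAM is read from `nvidia-smi`, and a rough per-layer memory cost is assumed rather than measured; the real implementation in `src/planner_local.py` may differ.

```python
import subprocess

def free_vram_mb() -> int:
    """Query free VRAM on GPU 0 via nvidia-smi (returns 0 if unavailable)."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return 0

def planner_gpu_layers(free_mb: int, total_layers: int = 32,
                       mb_per_layer: int = 150, reserve_mb: int = 1024) -> int:
    """Offload as many planner layers as fit in free VRAM, keeping a safety reserve."""
    usable = max(0, free_mb - reserve_mb)
    return min(total_layers, usable // mb_per_layer)
```

With the executor already occupying most of the GPU, `free_vram_mb()` comes back small and the planner falls back to few or zero GPU layers, i.e. CPU execution.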
To auto-translate Turkish commands to English:

```bash
pip install sentencepiece
# The model (Helsinki-NLP/opus-mt-tc-big-tr-en) downloads automatically on first run.
```

To launch the local-first Mission Control:

```bash
conda activate cua
python gui_mission_control_local.py
```

The local-first variant runs the entire pipeline on your hardware with a hierarchical planning system:
- Planner Settings Panel — select local GGUF models via file browser, configure GPU layers (auto/manual), HuggingFace fallback
- Hierarchical Plans — complex objectives are decomposed into verified atomic steps
- Auto GPU Detection — optimal VRAM utilization calculated automatically
- Top Bar — Docker/Model status, step counter, latency
- Left — Command input, preset commands, agent step trace with plan visualization
- Center — Live VM screen (mouse/keyboard active)
- Right — Last action detail, metrics, sandbox info, config
- Bottom — Structured logs with JSON export
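Under the hood, the VM is driven through the container's REST API, wrapped by `src/sandbox.py`. The stdlib-only sketch below illustrates the idea; the endpoint paths and payloads are assumptions for illustration, not the container's documented API.

```python
import json
import urllib.request

class SandboxClient:
    """Minimal sketch of a REST wrapper around the sandbox container.

    Endpoint paths here are illustrative assumptions, not the documented
    trycua/cua-xfce API.
    """

    def __init__(self, host: str = "localhost", port: int = 8001):
        self.base_url = f"http://{host}:{port}"

    def _url(self, path: str) -> str:
        return f"{self.base_url}/{path.lstrip('/')}"

    def screenshot(self) -> bytes:
        # GET the current framebuffer as an encoded image
        with urllib.request.urlopen(self._url("/screenshot")) as resp:
            return resp.read()

    def click(self, x: int, y: int) -> None:
        # POST a mouse click at VM coordinates
        body = json.dumps({"x": x, "y": y}).encode()
        req = urllib.request.Request(
            self._url("/click"), data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).close()
```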
```bash
conda activate cua
python gui_mission_control.py
```

Opens a professional 5-panel interface without the hierarchical planner.

Classic UI:

```bash
conda activate cua
python gui_main.py
```

Terminal-only agent loop:

```bash
conda activate cua
python main.py
```

| Shortcut | Action |
|---|---|
| `Ctrl+Enter` | Run command |
| `Escape` | Stop running command |
| `F11` | Toggle fullscreen |
| `Ctrl+L` | Clear logs |
```
CuaOS/
│
├── gui_mission_control_local.py   # Local-first Mission Control with hierarchical planner (recommended)
├── gui_mission_control.py         # Standard Mission Control UI
├── gui_mission_control_advance.py # Advanced Mission Control UI
├── gui_main.py                    # Classic UI
├── main.py                        # Terminal-only agent loop
├── setup.py                       # Package setup
├── requirements.txt               # Python dependencies
├── README.md
├── SECURITY.md
│
├── src/                           # Source modules
│   ├── __init__.py
│   ├── config.py                  # All configuration parameters
│   ├── sandbox.py                 # Docker container REST API wrapper
│   ├── llm_client.py              # Qwen3-VL model loading & inference (with VRAM diagnostics)
│   ├── planner.py                 # Plan data models, ABC, system prompt
│   ├── planner_local.py           # Local GGUF planner with auto GPU & text fallback parser
│   ├── verifier.py                # Plan step verification using vision model
│   ├── agent_loop.py              # Hierarchical agent loop (plan → execute → verify)
│   ├── agent_runner_v2.py         # V2 agent runner
│   ├── vision.py                  # Screenshot capture, resize, preview
│   ├── actions.py                 # Action execution (click, type, scroll)
│   ├── guards.py                  # Safety checks (repeat guard, validation)
│   ├── translation.py             # Translation helper
│   ├── design_system.py           # UI design tokens & stylesheet
│   └── panels.py                  # UI panel widgets
│
├── tests/                         # Unit tests
│   ├── test_planner.py
│   ├── test_verifier.py
│   └── test_agent_loop.py
│
├── assets/                        # Demo videos & media
│
└── img/                           # Runtime screenshots (auto-generated)
    └── (click previews, screen captures)
```
All parameters are in `src/config.py`:

| Parameter | Default | Description |
|---|---|---|
| `SANDBOX_IMAGE` | `trycua/cua-xfce:latest` | Docker image for the VM |
| `API_PORT` | `8001` | Container API port (host side) |
| `VNC_RESOLUTION` | `1920x1080` | VM screen resolution |
| `N_GPU_LAYERS` | `-1` (all) | Executor model GPU layers (-1 = all) |
| `N_CTX` | `2048` | Model context length |
| `MAX_STEPS` | `20` | Maximum steps per command |
| `GGUF_REPO_ID` | `mradermacher/Qwen3-VL-8B...` | HuggingFace model repository |
| `PLANNER_GGUF_LOCAL_PATH` | `""` | Direct path to a local `.gguf` planner model |
| `PLANNER_N_GPU_LAYERS` | `-1` (auto) | Planner GPU layers (-1 = auto-detect based on VRAM) |
| `PLANNER_PROVIDER` | `local` | Planner backend (`local` GGUF or `api`) |
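For reference, the defaults above correspond to a `src/config.py` along these lines (a simplified sketch; the real file contains more parameters):

```python
# src/config.py (excerpt, simplified sketch)

# Sandbox
SANDBOX_IMAGE = "trycua/cua-xfce:latest"  # Docker image for the VM
API_PORT = 8001                           # container API port (host side)
VNC_RESOLUTION = "1920x1080"              # VM screen resolution

# Executor model (Qwen3-VL)
N_GPU_LAYERS = -1                         # -1 = offload all layers to GPU
N_CTX = 2048                              # model context length
MAX_STEPS = 20                            # maximum steps per command

# Planner model
PLANNER_GGUF_LOCAL_PATH = ""              # direct path to a local .gguf planner
PLANNER_N_GPU_LAYERS = -1                 # -1 = auto-detect from free VRAM
PLANNER_PROVIDER = "local"                # "local" GGUF or "api"
```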
| Problem | Solution |
|---|---|
| `ModuleNotFoundError: PyQt6` | `conda activate cua && pip install PyQt6` |
| Docker permission denied | `sudo usermod -aG docker $USER`, then re-login |
| Sandbox API timeout | Container startup takes 60–120 s; wait for it |
| CUDA out of memory | Reduce `N_GPU_LAYERS` in `src/config.py` |
| llama-cpp CUDA error | Ensure the installed wheel matches your CUDA version |
| Slow model download | The first run downloads a ~5 GB GGUF model; be patient |
Status Legend: ✅ Done · 🔄 In Progress · ⬜ Not Started
| # | Feature | Description | Status |
|---|---|---|---|
| 1 | Project Restructuring | Reorganize files into `src/`, `assets/`, `img/` directories; update all import paths | ✅ |
| 2 | Mission Control UI | Professional 5-panel PyQt6 interface with live VM view, command panel, inspector, and logs | ✅ |
| 3 | README & Documentation | Comprehensive README with installation guide, configuration reference, and troubleshooting | ✅ |
| 4 | Hierarchical Planner | Local GGUF planner decomposes complex objectives into atomic steps with verification | ✅ |
| 5 | Local Model File Browser | GUI file picker for selecting local `.gguf` models with HuggingFace fallback | ✅ |
| 6 | Auto GPU Layer Detection | Automatically detect VRAM and calculate optimal GPU offloading for planner model | ✅ |
| 7 | Plan Verification | Each plan step is verified against success criteria using the vision model | ✅ |
| 8 | Multi-Model Support | Allow switching between different VLMs (Qwen3-VL, LLaVA, InternVL) via config or UI dropdown | ⬜ |
| 9 | Conversation Memory | Persistent chat history so the agent remembers context across multiple commands in a session | ⬜ |
| 10 | Action Undo / Rollback | Snapshot VM state before each action and allow rollback on failure | ⬜ |
| 11 | Multi-Monitor / Multi-VM | Support controlling multiple Docker containers simultaneously from a single UI | ⬜ |
| 12 | Voice Command Input | Accept voice commands via Whisper (local STT) instead of typing | ⬜ |
| 13 | Windows & macOS Support | Cross-platform compatibility with native installers and platform-specific sandboxes | ⬜ |
MIT