Aether-Native

A Linux desktop agent that watches your screen and controls your mouse/keyboard to get stuff done.

I built this because I got tired of writing brittle shell scripts for desktop automation. I wanted something that could actually see the screen, understand what's going on, and figure out the next step without me hardcoding coordinates.

The Pitch

You say: "Open Brave, go to YouTube, search for that Chris Brown song, and play it"

The agent does:

Takes a screenshot
Asks a vision model "what's on screen?"
Asks a planner "what should I click/type next?"
Moves the mouse, clicks, types, waits
Repeats until done

No pre-recorded macros. No hardcoded coordinates. It figures it out each time.

Architecture

┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│   Screenshot │────▶│  Vision Model    │────▶│  UI Elements│
│   (ffmpeg)   │     │  (OpenRouter or  │     │  (buttons,  │
│              │     │   local llava)   │     │   inputs)   │
└─────────────┘     └──────────────────┘     └─────────────┘
                                                    │
                                                    ▼
┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│   ydotool   │◀────│  Planner         │◀────│  RALPH Loop │
│   (mouse/   │     │  (local Ollama   │     │  (perceive  │
│   keyboard) │     │   or OpenRouter) │     │   → plan    │
│             │     │                  │     │   → act)    │
└─────────────┘     └──────────────────┘     └─────────────┘

RALPH = the loop that keeps going: Perceive → Reason → Act → Learn

Quickstart

You need Linux (GNOME/Wayland or X11) and a few things installed:

# Python deps
pip install -e ".[dev]"

# For Wayland mouse/keyboard control
sudo ydotoold --socket-path=/tmp/.ydotool_socket &
sudo chmod 666 /tmp/.ydotool_socket

# Optional: local models (free, slow on CPU)
ollama pull llama3.2:1b
ollama pull llava

# Required: OpenRouter API key for vision (fast, ~$0.01 per screenshot)
export OPENROUTER_API_KEY="sk-or-v1-..."

Demos

All demos live in demos/. Run them from the repo root (so imports work):

# Full agent: cloud vision + cloud planning
# Needs OPENROUTER_API_KEY set
python demos/demo_hybrid_agent.py

# Just vision: see what the model detects on screen
python demos/demo_streaming_capture.py

# Keyboard-only: no vision, just shell commands + keystrokes
# Fastest, but less smart
python demos/demo_keyboard_agent.py

# Mouse demo: move the cursor around
python demos/demo_mouse.py

How It Actually Works

Screenshot — ffmpeg grabs the screen silently. On Wayland it falls back to GNOME portal (which makes a sound, working on fixing that).
Vision — Sends the screenshot to a multimodal model. We use OpenRouter's gpt-4o-mini because it's fast (~10s) and cheap. Local llava:7b works too but takes ~3 minutes per image on CPU.
Plan — The vision model returns a list of UI elements with coordinates. A text LLM (either OpenRouter cloud or local Ollama) decides what to do next: click here, type this, wait, etc.
Act — ydotool executes the action on your actual desktop.
Verify — Takes another screenshot to see if anything changed. If not, retries.

What Actually Works

✅ Taking screenshots via ffmpeg (silent on X11, portal fallback on Wayland)
✅ Cloud vision (OpenRouter gpt-4o-mini) — fast, accurate
✅ Cloud planning (OpenRouter gpt-4o-mini) — ~3s per plan
✅ Local planning (Ollama llama3.2:1b) — ~26s on CPU, but free
✅ Mouse/keyboard control via ydotool on Wayland
✅ Window focus enforcement with xdotool before typing
✅ Retry logic when vision API fails
✅ Coordinate scaling from screenshot space to display space

What's Broken / Annoying

⚠️ Credits burn fast. 50 API calls = ~$2-3. Use local Ollama for planning to save money.
⚠️ Wayland screenshots trigger GNOME sounds. ffmpeg x11grab is silent but captures black for native Wayland apps. Portal works but beeps. No perfect solution yet.
⚠️ Focus is tricky. The agent sometimes types into the wrong window if it's not fast enough to switch. xdotool helps but isn't 100%.
⚠️ Coordinates are approximate. Vision models guess element positions. Usually close enough, but sometimes misses buttons.
⚠️ Task completion detection is basic. Checks if audio is playing and if Brave is focused. Could be smarter.

What I Learned

Single-action planning is non-negotiable. I tried letting the planner generate 3 actions at once. Every time, the second or third action was wrong because the screen had already changed after the first action. Now it plans 1 action, executes it, and re-perceives.

Cloud vision is worth paying for. llava:7b on CPU takes ~3 minutes to analyze a screenshot. OpenRouter's gpt-4o-mini takes ~10 seconds. That's the difference between "this works" and "this is unusable."

Focus is harder than I thought. I assumed typing would go where the mouse was. Nope — it goes to whatever window X11 thinks is focused. And on Wayland that's even messier. xdotool helps but isn't perfect.

Portal screenshots are noisy. Every GNOME portal capture triggers a camera-shutter sound and a visual flash. ffmpeg x11grab is silent but captures black for native Wayland apps. No clean solution yet — it's either noisy or incomplete.

Coordinates lie. The vision model says "button at (500, 300)" but that's relative to the 640px screenshot, not your actual 1920x1080 or 3072x1728 display. You have to scale them, and even then it's approximate.

Credits vanish faster than you expect. 50 API calls burned ~$2-3 in one afternoon of testing. That's fine for occasional use, but for development you want local planning (Ollama) and only pay for vision.

What Would Make This Way Better

GPU for local vision. If I had a GPU, llava:7b would drop from ~180s to ~2s per image. Then I wouldn't need OpenRouter at all. That's the biggest bottleneck.

A smaller vision model. moondream:1.8b is supposedly much faster on CPU than llava:7b. Haven't tested it yet but it's on the list.

Caching vision results. The desktop background and taskbar don't change between screenshots. Why re-analyze them every time? Cache static elements and only look at what changed.

Continuous video recording. Instead of capturing a screenshot every 10 seconds, run ffmpeg in the background recording at 0.5 fps. Grab the latest frame instantly when you need it. No per-capture delay, no portal noise.

Learned action sequences. After the agent successfully opens YouTube and searches a few times, it should remember the sequence. Next time, skip the vision + planning and just execute the cached steps.

Better focus control. Instead of alt+Tab guessing, use wmctrl -a <window_id> or xdotool windowfocus <id> to explicitly focus the target window before typing.

Multiple completion signals. Right now I just check "is audio playing?" A better check would be: is Brave focused? Does the URL contain "youtube.com/watch"? Is there a pause button visible? Combine all three.

PipeWire for audio. pactl is a compatibility layer. Using PipeWire directly might give more accurate per-app audio detection.

Limitations

Linux only. No plans for macOS/Windows.
Wayland needs ydotoold running as root (kinda sketchy, but it's what we have).
Electron apps (Discord, Spotify native) are invisible to AT-SPI, so vision is required.
You need an OpenRouter key for decent speed. Local vision is too slow without a GPU.

File Layout

aether/
├── aether/
│   ├── core/           # Loop, brain, memory, verifier
│   ├── perception/     # Screenshot, vision, AT-SPI fallback
│   ├── action/         # ydotool execution
│   ├── brain/          # Ollama and OpenRouter clients
│   └── macro/          # Macro recorder (basic)
├── demos/              # Demo scripts
├── tests/              # Unit + integration tests
└── README.md           # This file

Internal docs, architecture specs, and detailed progress logs are kept locally (not in git).

License

AGPL-3.0

Contributing

This is a personal project but PRs are welcome. Just:

Run pytest before committing
Don't commit API keys (use env vars)
Keep it simple

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
aether		aether
demos		demos
tests		tests
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Aether-Native

The Pitch

Architecture

Quickstart

Demos

How It Actually Works

What Actually Works

What's Broken / Annoying

What I Learned

What Would Make This Way Better

Limitations

File Layout

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Aether-Native

The Pitch

Architecture

Quickstart

Demos

How It Actually Works

What Actually Works

What's Broken / Annoying

What I Learned

What Would Make This Way Better

Limitations

File Layout

License

Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages