On-device agent-native multimodal generation with memory and skills, ported from GEMS to run fully on Android.
GEMS uses an agent loop to iteratively improve text-to-image generation:
- Decompose — breaks your prompt into verifiable requirements ("Is there a book?", "Is the lighting golden?")
- Generate — creates an image with Stable Diffusion Turbo (Vulkan GPU)
- Verify — checks each requirement against the generated image (Gemma 4)
- Refine — rewrites the prompt to fix failures
- Repeat — generates an improved image with the refined prompt
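The loop above can be sketched in a few lines (illustrative Python mirroring the structure of the original GEMS.py; `llm`, `imager`, and their methods are stand-ins, not the app's actual API):

```python
def gems_loop(prompt, llm, imager, max_iters=2):
    """Iteratively generate and refine until all requirements pass.

    llm and imager are stand-in objects for the LLM and image-generator
    backends; the method names here are illustrative only.
    """
    questions = llm.decompose(prompt)           # yes/no requirements
    current_prompt = prompt
    best = None
    for _ in range(max_iters):
        image = imager.generate(current_prompt)
        results = [llm.verify(image, q) for q in questions]
        score = sum(results) / max(len(questions), 1)
        if best is None or score > best[1]:
            best = (image, score)               # keep the best attempt
        if all(results):                        # every requirement passed
            break
        failed = [q for q, ok in zip(questions, results) if not ok]
        current_prompt = llm.refine(current_prompt, failed)
    return best
```

The returned `(image, score)` pair is the best attempt, so a late regression never discards an earlier, higher-scoring image.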
The app shows a side-by-side comparison of direct generation vs the GEMS agent loop.
| Direct Generation | GEMS Output |
|---|---|
| ![]() | ![]() |
The GEMS agent triggered the landscape skill, which enhanced the prompt with detailed instructions about atmospheric depth, natural lighting, and composition. The GEMS-enhanced output shows a mountain scene with a lake reflection, wildflowers, and dramatic sky — significantly more detailed than the direct generation.
| Direct Generation | GEMS Output |
|---|---|
| ![]() | ![]() |
The GEMS agent triggered the anime (Makoto Shinkai) skill, which enhanced the prompt with Shinkai's signature cinematic details — volumetric god rays, dramatic cumulonimbus clouds transitioning from orange to magenta, lens flare from the setting sun, hyper-detailed station architecture, and wet platform reflections. The GEMS-enhanced output captures the breathtaking photorealistic-meets-anime look of films like Your Name.
| Component | Implementation |
|---|---|
| LLM | Gemma 4 E2B via LiteRT-LM (GPU, ~1-2s/call) |
| Image Gen | SD Turbo via stable-diffusion.cpp + Vulkan GPU (~15-30s) |
| Agent Loop | Kotlin port of GEMS.py (Decompose → Generate → Verify → Refine) |
| UI | Jetpack Compose + Material 3 |
| DI | Hilt |
| DB | Room (agent memory persistence) |
- Hardware: Android device with Vulkan GPU support (tested on Pixel 9 / Tensor G4 / Android 16)
- Storage: ~8GB free on device for model files
The fastest way to try Android GEMS is to download the prebuilt APK from the GitHub Releases page and install it directly on your device. No Android Studio or build setup required.
- Download `app-debug.apk` from the latest release
- Transfer it to your device (or download directly on the phone)
- Open the APK and allow installation from unknown sources if prompted
- Launch GEMS Android
On first launch, tap Download Models on the home screen. The app will download all four models (~7.7 GB total) directly from Hugging Face and store them in app storage:
- SD Turbo (Image Generator) — 1.9 GB
- TAESD (Fast Decoder) — 9 MB
- Gemma 4 E2B (LLM — faster) — 2.4 GB
- Gemma 4 E4B (LLM — smarter) — 3.4 GB
Downloads use Android's DownloadManager and continue in the background even if you leave the app.
- Prompt field — type any text-to-image prompt
- Image gen steps — 1 (fast, ~15s) / 2 (balanced, ~30s) / 4 (quality, ~60s)
- LLM model — E2B (fast) or E4B (smart)
- GEMS iterations slider — how many refine-and-regenerate cycles (1–5)
- Run Android GEMS — runs direct generation and the GEMS agent loop side by side
- Gemma 4 Demo — test the LLM with streaming text and optional image input
- Direct Image Gen Demo — test image generation standalone
- Direct Generation — baseline image from your prompt
- GEMS Rounds — each round's image side by side with verification scores
- GEMS metadata — refined prompt, final score, total iterations, skill used
- Status updates — live progress of each agent loop step
Only needed if you want to build from source. To just try the app, use the prebuilt APK above.
macOS:

```bash
brew install openjdk@17
```

Linux (Ubuntu/Debian):

```bash
sudo apt install openjdk-17-jdk
```

Verify:

```bash
java -version   # Should show 17+
```

- Download from https://developer.android.com/studio
- Install and open Android Studio
- Complete the setup wizard — it will install:
  - Android SDK (API 35)
  - Android SDK Build-Tools
  - Android SDK Platform-Tools (includes `adb`)
After installation, note your SDK path:
- macOS: `~/Library/Android/sdk`
- Linux: `~/Android/Sdk`
Open Android Studio → Settings → Languages & Frameworks → Android SDK → SDK Tools tab:
- Check NDK (Side by side) → Install version 25.1.8937393 or later
- Check CMake → Install
Or via command line:
```bash
$ANDROID_HOME/cmdline-tools/latest/bin/sdkmanager "ndk;25.1.8937393" "cmake;3.22.1"
```

Add to your `~/.zshrc` or `~/.bashrc`:

```bash
export ANDROID_HOME=~/Library/Android/sdk   # macOS
# export ANDROID_HOME=~/Android/Sdk         # Linux
export JAVA_HOME="/Applications/Android Studio.app/Contents/jbr/Contents/Home"  # macOS
# export JAVA_HOME=/usr/lib/jvm/java-17-openjdk                                 # Linux
export PATH=$ANDROID_HOME/platform-tools:$PATH
```

Reload:

```bash
source ~/.zshrc
```

Verify:

```bash
adb --version   # Should work
java -version   # Should show 17+
```

Python 3.10+ is needed for downloading models:

```bash
pip3 install huggingface_hub
```

macOS:

```bash
brew install cmake git
```

Linux:

```bash
sudo apt install cmake git build-essential
```

```bash
git clone <repo-url>
cd android_gems
```

The app needs three model files (~4.3GB total). You can download them directly on your phone (recommended) or via command line.
After installing and launching the app (steps 11-13), tap "Download Models" on the home screen. The built-in model manager downloads all three models directly to the device:
| Model | Size | Source |
|---|---|---|
| SD Turbo (Image Generator) | 1.9GB | Green-Sky/SD-Turbo-GGUF |
| TAESD (Fast Decoder) | 9MB | madebyollin/taesd |
| Gemma 4 E2B (LLM) | 2.4GB | litert-community/gemma-4-E2B-it-litert-lm |
All models are publicly available — no authentication required. Downloads continue in the background if you navigate away.
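Since all three repos are public, the same files can also be fetched from a computer with `huggingface_hub`. A sketch — the filenames below are assumptions that mirror the on-device names, so check each repo on Hugging Face for the exact file names before running:

```python
# Repo IDs come from the model table above; the filenames are
# ASSUMPTIONS (they mirror the on-device names) -- verify them
# against each repo's file listing before running.
MODELS = {
    "sd_turbo.gguf": "Green-Sky/SD-Turbo-GGUF",
    "taesd.safetensors": "madebyollin/taesd",
    "gemma-4-E2B-it.litertlm": "litert-community/gemma-4-E2B-it-litert-lm",
}

def download_all(dest_dir="models"):
    """Fetch every model file into dest_dir (no auth token required)."""
    from huggingface_hub import hf_hub_download  # pip3 install huggingface_hub
    for filename, repo_id in MODELS.items():
        hf_hub_download(repo_id=repo_id, filename=filename, local_dir=dest_dir)
```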
Skip to Step 9 if using this option.
```bash
cd models/
./download_models.sh   # downloads to models/ directory on your computer
```

Then connect your device and push:

```bash
./push_to_device.sh    # pushes all model files to the device
```

- Enable Developer Options on your phone (Settings → About Phone → tap Build Number 7 times)
- Enable USB Debugging (Settings → Developer Options → USB Debugging)
- Connect via USB cable
- Accept the debugging prompt on your phone

Verify:

```bash
adb devices
# Should show your device
```

```bash
cd models/
./push_to_device.sh
cd ..
```

This pushes all model files to `/data/local/tmp/` on the device.
The image generator uses stable-diffusion.cpp with Vulkan GPU acceleration. Build it as a shared library for Android:
```bash
# Set NDK path
export NDK=$ANDROID_HOME/ndk/25.1.8937393

# Clone stable-diffusion.cpp (if not already in libs/)
git clone --recursive https://github.com/leejet/stable-diffusion.cpp.git libs/stable-diffusion.cpp

# Update Vulkan headers for C++ support
git clone --depth 1 https://github.com/KhronosGroup/Vulkan-Headers.git /tmp/Vulkan-Headers
cp /tmp/Vulkan-Headers/include/vulkan/*.hpp \
   $NDK/toolchains/llvm/prebuilt/*/sysroot/usr/include/vulkan/
cp /tmp/Vulkan-Headers/include/vulkan/*.h \
   $NDK/toolchains/llvm/prebuilt/*/sysroot/usr/include/vulkan/
mkdir -p $NDK/toolchains/llvm/prebuilt/*/sysroot/usr/include/vk_video
cp /tmp/Vulkan-Headers/include/vk_video/*.h \
   $NDK/toolchains/llvm/prebuilt/*/sysroot/usr/include/vk_video/

# Configure and build
cd libs/stable-diffusion.cpp
mkdir -p build-android && cd build-android
cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-33 \
  -DSD_VULKAN=ON -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DVulkan_GLSLC_EXECUTABLE=$NDK/shader-tools/*/glslc
cmake --build . -j8 --target stable-diffusion

# Build JNI shared library
CLANG=$NDK/toolchains/llvm/prebuilt/*/bin/aarch64-linux-android33-clang++
OMP_STATIC=$NDK/toolchains/llvm/prebuilt/*/lib64/clang/*/lib/linux/aarch64/libomp.a
$CLANG -shared -fPIC -o libsdcpp.so ../jni_bridge.cpp -I.. \
  -Wl,--whole-archive \
  libstable-diffusion.a ggml/src/libggml.a ggml/src/libggml-base.a \
  ggml/src/libggml-cpu.a ggml/src/ggml-vulkan/libggml-vulkan.a \
  thirdparty/libwebp/libwebp.a thirdparty/libwebp/libsharpyuv.a \
  thirdparty/libwebp/libwebpmux.a $OMP_STATIC \
  -Wl,--no-whole-archive \
  -lvulkan -llog -landroid -lm -lz -ldl -static-libstdc++

# Strip and copy to app
$NDK/toolchains/llvm/prebuilt/*/bin/llvm-strip libsdcpp.so
mkdir -p ../../../app/src/main/jniLibs/arm64-v8a
cp libsdcpp.so ../../../app/src/main/jniLibs/arm64-v8a/
cd ../../..
```

```bash
./gradlew assembleDebug
adb install -t app/build/outputs/apk/debug/app-debug.apk
```

```bash
adb shell am start -n com.gems.android/.ui.MainActivity
```

Or just tap the Android GEMS icon on your phone.
```
LiteRtLmEngine (Gemma 4, GPU) ──────────┐
                                        ▼
SdCppEngine (SD Turbo, Vulkan GPU) ─► AgentOrchestrator ──► ComparisonScreen
                                        ▼
SkillManager (assets/skills/) ──────► AgentMemory (Room DB)
```
GPU memory is managed by closing one engine before loading the other — the LLM and image generator take turns using the GPU.
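The turn-taking can be sketched as a tiny arbiter (illustrative Python; the app does the equivalent in Kotlin between `LiteRtLmEngine` and `SdCppEngine`):

```python
class GpuArbiter:
    """Ensures at most one engine holds the GPU at a time.

    Engines are any objects exposing load()/close(); the names and
    interface here are illustrative, not the app's actual classes.
    """
    def __init__(self):
        self.active = None

    def acquire(self, engine):
        if self.active is engine:
            return engine            # already loaded, reuse it
        if self.active is not None:
            self.active.close()      # free GPU memory before switching
        engine.load()
        self.active = engine
        return engine
```

Calling `acquire` on the engine you need makes the swap explicit: the previously active engine is closed before the next one loads, which is what keeps the ~3GB image generator and the LLM from contending for the same mobile GPU.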
After setup, these files should be at /data/local/tmp/ on the device:
| File | Size | Format | Purpose |
|---|---|---|---|
| `gemma-4-E2B-it.litertlm` | 2.4GB | LiteRT-LM | Gemma 4 E2B multimodal LLM (text + vision) |
| `sd_turbo.gguf` | 1.9GB | GGUF Q8 | SD Turbo image generator (1-4 step distilled, 8-bit quantized) |
| `taesd.safetensors` | 9MB | SafeTensors | Tiny AutoEncoder decoder (10x faster than full VAE) |
- App crashes on launch: Make sure `libsdcpp.so` is in `app/src/main/jniLibs/arm64-v8a/`
- "No SD model found": Run `./models/push_to_device.sh` to push models to the device
- Image gen OOM: Close other apps. The image generator needs ~3GB GPU memory
- LLM returns empty: After image gen, the GPU state may be corrupted. The app auto-retries on CPU
- Second image gen crashes: The native context is reset between runs to avoid Vulkan state corruption
| Component | Original GEMS (Server) | Android GEMS (On-Device) |
|---|---|---|
| LLM (MLLM) | Kimi-K2.5 (cloud API) | Gemma 4 E2B (2.4GB, on-device GPU) |
| Image Generator | Z-Image-Turbo (cloud API) | SD Turbo Q8 GGUF (1.9GB, Vulkan GPU) |
| VAE Decoder | Full VAE (server) | TAESD tiny decoder (9MB, 10x faster) |
| Skill Routing | LLM-based routing | Skipped on mobile (saves ~4s) |
| Max Iterations | 3 | Configurable 1-5 (default 2) |
| Verification | Multimodal (image + text) | Multimodal (Gemma 4 vision input) |
| Runtime | Python, multiple GPU servers | Kotlin, single mobile device |
Original GEMS:
```
Prompt: "You are a strategic Skill Router. Your goal is to determine if the user's
request genuinely requires a specialized skill or if it can be handled by standard
generation. Available Skills: {manifest}. User Request: {prompt}.
Respond ONLY with the SKILL_ID or NONE."
```
If a skill matches, it enhances the prompt using skill-specific instructions.
Android GEMS: Skipped on mobile to save ~4s (2 LLM calls). The original prompt is used directly. Skill routing can be re-enabled for complex prompts.
Original GEMS:
```
Prompt: "Analyze the user's image generation prompt. Break it down into specific
visual requirements. For each requirement, write a question that can be answered
with a simple 'yes' or 'no'. YOU MUST RESPOND ONLY WITH A JSON ARRAY OF STRINGS.
Example format: ["Is there a cat?", "Is the cat black?", "Is it sitting on a rug?"]"
```
Android GEMS (identical logic):
```
System: "You are a requirements agent. Break prompts into yes/no questions. Respond only in JSON."
User: "Analyze the user's image generation prompt. Break it down into specific visual
requirements. For each requirement, write a question answerable with yes or no.
YOU MUST RESPOND ONLY WITH A JSON ARRAY OF STRINGS.
Example: ["Is there a cat?", "Is the cat black?"]
User Prompt: {prompt}"
```
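Because small on-device models sometimes wrap the array in markdown fences or extra prose, the decompose output needs defensive parsing. A sketch (illustrative Python; the Kotlin port's actual parsing may differ):

```python
import json
import re

def parse_requirements(llm_output: str) -> list[str]:
    """Extract the JSON array of yes/no questions from raw LLM output.

    Pulls out the first [...] span so stray markdown fences or
    surrounding prose don't break json.loads.
    """
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if not match:
        raise ValueError("no JSON array in LLM output")
    questions = json.loads(match.group(0))
    return [q for q in questions if isinstance(q, str)]
```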
Original GEMS: Calls Z-Image-Turbo server API → returns image bytes.
Android GEMS: Calls SdCppEngine.generate(prompt) → stable-diffusion.cpp loads SD Turbo on Vulkan GPU → runs 1-4 DDIM steps → TAESD decodes latent → returns 512x512 Bitmap. Takes ~15-30s.
Original GEMS (parallel, multimodal):
```
Prompt per question: "Image: <image>
Answer the following question with only 'yes' or 'no' based on the provided image: {question}"
```
Uses ThreadPoolExecutor to verify all questions in parallel with the MLLM.
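The parallel fan-out can be sketched as (illustrative Python; `ask` stands in for the MLLM call, which is not shown here):

```python
from concurrent.futures import ThreadPoolExecutor

def verify_parallel(ask, image, questions, max_workers=8):
    """Ask every yes/no question about the image concurrently.

    `ask(image, question)` is a stand-in for the MLLM call and must
    return the model's raw text answer.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(lambda q: ask(image, q), questions))
    # Treat anything starting with "yes" (any case) as a pass.
    return [a.strip().lower().startswith("yes") for a in answers]
```

`pool.map` preserves input order, so result `i` always corresponds to question `i` even though the calls complete out of order.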
Android GEMS (sequential, multimodal):
```
System: "You are a verification agent. Answer only 'yes' or 'no'."
User: "Look at this image. Answer with ONLY 'yes' or 'no': {question}"
[Image: PNG bytes of the generated image]
```
Runs sequentially (single model instance). Gemma 4 sees the actual image via vision input.
Original GEMS:
```
Prompt: "Task: Summarize the experience of the current image generation attempt.
--- CURRENT ATTEMPT ---
Prompt used: {current_prompt}
Passed requirements: {passed}
Failed requirements: {failed}
Reasoning/Thought before generation: {current_thought}
Image: <image>
--- PREVIOUS EXPERIENCES ---
{previous_experiences}
--- ANALYSIS ---
Based on the provided image, the verification results, your previous thought process,
and historical experiences, write a concise summary of what worked, what failed, and
what strategy should be adopted in the next attempt. Keep it under 100 words."
```
Android GEMS (similar, without image in summarizer):
```
System: "You are a summarization agent. Be concise, under 100 words."
User: "Task: Summarize the experience of the current image generation attempt.
Prompt used: {currentPrompt}
Passed: {passed}
Failed: {failed}
Previous experiences: {prevExpStr}
Write a concise summary under 100 words of what to improve."
```
Original GEMS:
```
Prompt: "Task: Refine the image generation prompt based on previous failed attempts
and accumulated experiences.
Original Intent: {original_prompt}
--- ATTEMPT HISTORY ---
{history_log with <image> tags}
--- ANALYSIS ---
Review the history above. Rewrite a new, comprehensive prompt. This prompt must:
1. Explicitly reinforce the requirements that failed in the latest attempt.
2. Maintain and protect the requirements that were successfully met to avoid regressions.
3. Adopt the strategies suggested in the 'Experience' section.
4. Use clear, non-conflicting descriptive language.
Return ONLY the prompt text itself."
```
Android GEMS (similar):
```
System: "You are a prompt refinement agent. Rewrite prompts to fix failures."
User: "Refine the image generation prompt based on previous attempts.
Original Intent: {originalPrompt}
--- ATTEMPT HISTORY ---
Attempt {i}: Experience: {exp}, Prompt: {prompt}, Failed: {failed}
Rewrite a comprehensive prompt that:
1. Reinforces failed requirements.
2. Maintains successful requirements.
3. Uses clear, descriptive language.
Return ONLY the prompt text itself."
```
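Assembling the history section of this prompt is plain string building; a sketch (illustrative Python; the attempt-record shape used here is an assumption, not the app's actual data model):

```python
def build_refine_prompt(original_prompt, attempts):
    """Assemble the Refine prompt from attempt history.

    `attempts` is a list of dicts with 'experience', 'prompt', and
    'failed' keys -- an illustrative shape, not the app's data model.
    """
    lines = [
        "Refine the image generation prompt based on previous attempts.",
        f"Original Intent: {original_prompt}",
        "--- ATTEMPT HISTORY ---",
    ]
    for i, a in enumerate(attempts, 1):
        lines.append(
            f"Attempt {i}: Experience: {a['experience']}, "
            f"Prompt: {a['prompt']}, Failed: {a['failed']}"
        )
    lines += [
        "Rewrite a comprehensive prompt that:",
        "1. Reinforces failed requirements.",
        "2. Maintains successful requirements.",
        "3. Uses clear, descriptive language.",
        "Return ONLY the prompt text itself.",
    ]
    return "\n".join(lines)
```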
| Aspect | Original GEMS | Android GEMS |
|---|---|---|
| Verification | Parallel (ThreadPoolExecutor) | Sequential (single model) |
| Vision in Verifier | Yes (MLLM sees image) | Yes (Gemma 4 vision) |
| Vision in Summarizer | Yes (image passed) | No (text-only summary) |
| Vision in Refiner | Yes (history images passed) | No (text-only refinement) |
| Skill Routing | Active | Skipped (saves ~4s) |
| `think_with_thought` | Separate reasoning channel | Not available (stripped `<think>` blocks) |
| GPU Memory | Multiple server GPUs | Single mobile GPU, engines take turns |
| Agent Memory | In-memory trajectory | Room DB persistence + WorkManager compression |
- GEMS — original agent-native multimodal generation paper
- stable-diffusion.cpp — C++ SD inference with Vulkan GPU
- LiteRT-LM — on-device LLM runtime
- AI Edge Gallery — reference for LiteRT-LM integration