feat(opencode): add voice input by heimoshuiyu · Pull Request #29663 · anomalyco/opencode

heimoshuiyu · 2026-05-28T04:09:08Z

Issue for this PR

Type of change

Bug fix
New feature
Refactor / code improvement
Documentation

What does this PR do?

THIS PR WAS WRITTEN BY A HUMAN. I know this is a big PR. THIS IS NOT AI SLOP.

This PR adds voice input to OpenCode. Users can record audio in the Web/App and TUI interfaces and have it transcribed directly into the prompt input. The design follows OpenCode's existing frontend/backend separation — all transcription logic lives on the server, and clients (Web/App and TUI) only handle audio recording and call the server via the SDK/API.

Core voice module (packages/opencode/src/voice/)

A new Voice service built on Effect, supporting two transcription backends:
- LALM (Large Audio Language Model, recommended) — Sends audio as a content part to a multimodal LLM (e.g. MiMo-V2.5, Gemini 3.5 Flash, Qwen3 Omni 30B A3B Instruct) for transcription.
- Whisper — Sends audio to a Whisper-compatible API endpoint (e.g. OpenAI /v1/audio/transcriptions or a self-hosted service).
Audio format conversion via ffmpeg — automatically converts non-wav/mp3 formats (webm, ogg, etc.) to mp3 before sending to the transcription service.
The entire transcription pipeline supports cancellation via AbortSignal.
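The conversion rule above can be sketched as a small pure helper. This is only an illustration, not the code in the PR: the names `baseMime`, `needsConversion` and `ffmpegArgs` are hypothetical, and the exact MIME strings treated as wav/mp3 are an assumption.

```typescript
// Hypothetical helpers. Only the "wav/mp3 pass through, everything else
// becomes mp3" rule comes from the PR description.
const PASSTHROUGH_MIME = new Set(["audio/wav", "audio/x-wav", "audio/mpeg", "audio/mp3"]);

/** Strip parameters such as ";codecs=opus" and normalise case. */
function baseMime(mime: string): string {
  return (mime.split(";")[0] ?? "").trim().toLowerCase();
}

/** True when the audio should be run through ffmpeg before transcription. */
function needsConversion(mime: string): boolean {
  return !PASSTHROUGH_MIME.has(baseMime(mime));
}

/** ffmpeg arguments that read any input from stdin and write mp3 to stdout. */
function ffmpegArgs(): string[] {
  return ["-hide_banner", "-loglevel", "error", "-i", "pipe:0", "-f", "mp3", "pipe:1"];
}
```

A caller would spawn `ffmpeg` with these arguments only when `needsConversion` returns true, and could pass the same AbortSignal to the subprocess so that cancelling a transcription also stops the conversion.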

Configuration (packages/opencode/src/config/config.ts)

A new voice config block in opencode.json, with type selecting the transcription backend:

{
  "voice": {
    "type": "lalm",
    "lalm": {
      "model": "opencode/mimo-v2.5-free"
      // "model": "opencode-go/mimo-v2.5"
      // "model": "opencode/gemini-3.5-flash"
    },
    "whisper": {
      "url": "http://127.0.0.1:5000/v1/audio/transcriptions",
      "apiKey": "sk-abc123def456ghi789",
      "model": "whisper-1"
    }
  }
}

LALM only requires specifying a model that supports audio input (auth is handled by existing provider config). Whisper requires an additional API URL and key. See the docs for the full set of config options.
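As a rough sketch of the requirements described above, the check below reports which fields are missing for the selected backend. The `VoiceConfig` type mirrors the JSON example but is an assumption about the real schema, and `missingVoiceFields` is a hypothetical helper, not a function from the PR.

```typescript
// Assumed shape, mirroring the opencode.json example above.
type VoiceConfig = {
  type: "lalm" | "whisper";
  lalm?: { model?: string };
  whisper?: { url?: string; apiKey?: string; model?: string };
};

/** Returns the config paths that are required for the chosen backend but absent. */
function missingVoiceFields(voice: VoiceConfig): string[] {
  const missing: string[] = [];
  if (voice.type === "lalm") {
    // LALM only needs an audio-capable model; auth comes from provider config.
    if (!voice.lalm?.model) missing.push("lalm.model");
  } else {
    // Whisper additionally needs an endpoint URL and a key.
    if (!voice.whisper?.url) missing.push("whisper.url");
    if (!voice.whisper?.apiKey) missing.push("whisper.apiKey");
  }
  return missing;
}
```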

Practical experience: Gemini 3.5 Flash has the best transcription quality but is also the most expensive; MiMo V2.5 is good enough for me — surprisingly capable and very cheap; Qwen3 Omni 30B A3B Instruct is great for self-hosting, runnable via llama.cpp on a personal machine for privacy-preserving deployment.

HTTP API (packages/opencode/src/server/routes/instance/httpapi/)

New POST /voice/transcribe endpoint, accepting base64 audio, optional context (e.g. text the user has already typed in the input), images, and request parameters that can override the project-level config above — consistent with the design of other prompt endpoints in the project.
The backend API follows the project's Effect conventions.
SDK types and client have been regenerated, including the new audio.transcribe endpoint.
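The override behaviour mentioned above (request parameters taking precedence over project-level config) could look roughly like the merge below. This is a sketch under assumptions: the `VoiceOptions` shape is inferred from the config example, and `resolveVoiceOptions` is a hypothetical name rather than the actual server code.

```typescript
// Assumed shape; the real request schema may differ.
type VoiceOptions = {
  type?: "lalm" | "whisper";
  lalm?: { model?: string };
  whisper?: { url?: string; apiKey?: string; model?: string };
};

/** Per-request values win; anything not overridden falls back to project config. */
function resolveVoiceOptions(project: VoiceOptions, override: VoiceOptions = {}): VoiceOptions {
  return {
    type: override.type ?? project.type,
    lalm: { ...project.lalm, ...override.lalm },
    whisper: { ...project.whisper, ...override.whisper },
  };
}
```

For example, a client could keep the project's Whisper URL and key but request a different model for a single call.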

Web/App UI (packages/app/src/components/prompt-input/voice.tsx)

VoiceButton component with microphone/stop/loading states, wired into both v1 and v2 layouts.
createVoiceInput hook manages the browser MediaRecorder lifecycle, calls transcription via the SDK, and handles retries. Recording data stays in memory (Blob), released on successful transcription, preserved on failure for retry.
Registers mod+shift+v keyboard shortcut for voice input.
Shows retry/cancel buttons when transcription fails or returns empty text — it's frustrating to record a long audio clip and see an error, so preserving the recording for retry avoids making the user re-record.
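The recording lifecycle described above (released on success, preserved on failure or empty text for retry) can be modelled as a small state machine. The sketch below is illustrative only: `VoiceState`, `VoiceEvent` and `reduceVoice` are hypothetical names, and the audio type is left generic where the real hook would hold a Blob.

```typescript
type VoiceState<A> =
  | { status: "idle" }
  | { status: "recording" }
  | { status: "transcribing"; audio: A }
  | { status: "failed"; audio: A; reason: string };

type VoiceEvent<A> =
  | { type: "start" }
  | { type: "stop"; audio: A }
  | { type: "result"; text: string }
  | { type: "error"; reason: string }
  | { type: "retry" }
  | { type: "cancel" };

function reduceVoice<A>(state: VoiceState<A>, event: VoiceEvent<A>): VoiceState<A> {
  switch (event.type) {
    case "start":
      return state.status === "idle" ? { status: "recording" } : state;
    case "stop":
      return state.status === "recording" ? { status: "transcribing", audio: event.audio } : state;
    case "result":
      if (state.status !== "transcribing") return state;
      // Empty text is treated like a failure so the recording is kept for retry.
      return event.text.trim() === ""
        ? { status: "failed", audio: state.audio, reason: "empty transcription" }
        : { status: "idle" }; // success: the audio reference is dropped
    case "error":
      return state.status === "transcribing"
        ? { status: "failed", audio: state.audio, reason: event.reason }
        : state;
    case "retry":
      return state.status === "failed" ? { status: "transcribing", audio: state.audio } : state;
    case "cancel":
      return { status: "idle" }; // discards any preserved recording
  }
}
```

Keeping the audio inside the `failed` state means the retry button never needs access to the recorder again.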

TUI (packages/opencode/src/cli/cmd/tui/)

useVoice hook manages terminal audio recording via subprocesses (ffmpeg/arecord/sox/rec — auto-detected by platform). Temporary recording files are stored in os.tmpdir() (e.g. /tmp/opencode-voice-<uuid>.mp3), cleaned up after successful transcription, preserved on failure for retry.
Renders a voice button in the prompt bar with recording/stopped/transcribing states.
Registers <leader>v as the default keybinding.
tui.json supports voice.command and voice.mime for custom recorder configuration.
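Recorder auto-detection might be sketched as follows. The candidate order per platform is an assumption made for illustration (the PR only says the recorder is auto-detected by platform), and `recorderCandidates` and `pickRecorder` are hypothetical names. The `custom` argument stands in for a user-supplied `voice.command`, whose real format is not shown here.

```typescript
// Assumption: the ordering below is illustrative, not taken from the PR.
function recorderCandidates(platform: string): string[] {
  // arecord ships with ALSA, so it is only offered on Linux.
  return platform === "linux"
    ? ["ffmpeg", "arecord", "sox", "rec"]
    : ["ffmpeg", "sox", "rec"];
}

/**
 * Picks the recorder to run. `isInstalled` is injected so the lookup
 * (for example a PATH search) stays out of this pure function.
 */
function pickRecorder(
  platform: string,
  isInstalled: (cmd: string) => boolean,
  custom?: string,
): string | undefined {
  if (custom) return custom; // an explicit user setting always wins
  return recorderCandidates(platform).find(isInstalled);
}
```

Returning `undefined` when nothing is found lets the caller show a clear message instead of failing when the subprocess is spawned.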

Documentation (packages/web/src/content/docs/)

Updated config.mdx with the full voice config reference.
Updated tui.mdx with TUI voice recorder configuration docs.

Design considerations

LALM over Whisper: Large language models have orders of magnitude more voice training data than Whisper and stronger contextual awareness, producing noticeably better transcriptions. Hence LALM is the default backend.
LALM via Vercel AI SDK, Whisper via Effect HttpClient: LALM uses the Vercel AI SDK generateText interface, while Whisper bypasses the Vercel AI SDK and builds the multipart form-data request directly through Effect's HttpClient. This is because the Vercel AI SDK's Whisper transcription interface is still experimental — the API is unstable and cumbersome, so constructing the request ourselves is simpler and more reliable.
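To show why building the Whisper request by hand is straightforward, here is a minimal sketch of the multipart body for an OpenAI-compatible `/v1/audio/transcriptions` endpoint. Note that it uses the standard `FormData` and `Blob` globals rather than Effect's HttpClient, which is what the PR uses. `buildWhisperRequest` and `WhisperConfig` are hypothetical names.

```typescript
type WhisperConfig = { url: string; apiKey: string; model: string };

/** Builds the pieces of a multipart transcription request without sending it. */
function buildWhisperRequest(cfg: WhisperConfig, audio: Blob, filename: string) {
  const body = new FormData();
  body.append("file", audio, filename); // the audio payload
  body.append("model", cfg.model);      // e.g. "whisper-1"
  return {
    url: cfg.url,
    method: "POST" as const,
    // Content-Type is left unset on purpose so the HTTP client can add
    // the multipart boundary itself.
    headers: { Authorization: `Bearer ${cfg.apiKey}` },
    body,
  };
}
```

The result can be passed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body, signal })`, where `signal` is the AbortSignal used for cancellation.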
LALM prompt structure: Composed of a system prompt, user context, and audio content. The overall structure looks roughly like this:

system: <lalm.txt — transcription rules>

messages:
  user:
    <TRANSCRIPTION_CONTEXT>
    directory: /path/to/project
    branch: main
    User: Help me create a file
    Assistant: Sure, what should the filename be?
    Text already in the input box
    </TRANSCRIPTION_CONTEXT>
    <audio starts>
    [audio content]
    <audio ends>
    Transcribe the audio between <audio starts> and <audio ends>. Output ONLY the transcription text.
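The context block in the structure above can be rendered with a few lines of string assembly. This is a sketch of the layout only; `renderContext` and its parameter names are hypothetical and the real prompt builder may differ.

```typescript
type Turn = { user: string; assistant?: string };

/** Renders the <TRANSCRIPTION_CONTEXT> block in the layout shown above. */
function renderContext(opts: {
  directory: string;
  branch: string;
  turns: Turn[];
  inputText: string; // text the user has already typed, may be empty
}): string {
  const lines = [`directory: ${opts.directory}`, `branch: ${opts.branch}`];
  for (const turn of opts.turns) {
    lines.push(`User: ${turn.user}`);
    if (turn.assistant) lines.push(`Assistant: ${turn.assistant}`);
  }
  if (opts.inputText) lines.push(opts.inputText);
  return ["<TRANSCRIPTION_CONTEXT>", ...lines, "</TRANSCRIPTION_CONTEXT>"].join("\n");
}
```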

When building conversation context, special filtering is applied. Messages are traversed from back to front, and a single user message is typically followed by multiple assistant messages (from the agent's multi-turn tool calls, etc.). For each user message, the code pairs only the first assistant message it encounters that has actual text content, which in a backward traversal is the chronologically last one, and skips synthetic text parts and summary messages. This is because in an agent loop, the last assistant message usually contains the final conclusion and is the most important one.
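The pairing rule can be sketched as a backward walk over the message list. The message and part shapes below are simplified assumptions, not OpenCode's real types, and `maxPairs` is a hypothetical limit added for the example.

```typescript
// Simplified, assumed shapes for illustration.
type Part = { type: "text"; text: string; synthetic?: boolean } | { type: "tool" };
type Message = { role: "user" | "assistant"; summary?: boolean; parts: Part[] };
type Pair = { user: string; assistant: string | undefined };

/** Joins the non-synthetic text parts of a message. */
function visibleText(msg: Message): string {
  const texts: string[] = [];
  for (const part of msg.parts) {
    if (part.type === "text" && !part.synthetic) texts.push(part.text);
  }
  return texts.join("\n").trim();
}

function buildPairs(messages: Message[], maxPairs = 3): Pair[] {
  const pairs: Pair[] = [];
  let pendingAssistant: string | undefined;
  for (let i = messages.length - 1; i >= 0 && pairs.length < maxPairs; i--) {
    const msg = messages[i]!;
    if (msg.summary) continue; // summary messages are skipped entirely
    const text = visibleText(msg);
    if (msg.role === "assistant") {
      // Walking backwards, the first assistant text seen is the
      // chronologically last reply of that turn; earlier ones are ignored.
      if (pendingAssistant === undefined && text !== "") pendingAssistant = text;
    } else {
      if (text !== "") pairs.push({ user: text, assistant: pendingAssistant });
      pendingAssistant = undefined; // start collecting for the previous turn
    }
  }
  return pairs.reverse(); // restore chronological order
}
```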

Minimized thinking level during transcription: Voice transcription requests set temperature to 0 and use smallOptions to reduce the model's thinking level, avoiding unnecessary reasoning overhead. Special handling is applied for MiMo — it enables thinking by default, so it must be explicitly disabled in provider/transform.ts.

How did you verify your code works?

I've been using the voice input feature for over two months, transcribing thousands of requests during that time. I've been continuously rebasing against upstream and iterating on the code. I understand my code. I now feel the quality is high enough to merge back into the main branch.

My primary development environment is Linux. I've also done my best to test on a MacBook and a Windows VM, but there may be gaps — if you find platform compatibility issues, please point them out.

Screenshots / recordings

hi @thdxr now you can voice prompt this :)

demo-1.mp4

demo-2.mp4

demo-3.mp4

Another project, an Android IME, uses this OpenCode transcribe API: https://voice.aquarium39.moe

demo-1.mp4

Our company's product, which uses OpenCode as its agent core, can easily implement the voice input feature.

demo-5.mp4

Checklist

I have tested my changes locally
I have not included unrelated changes in this PR

If you do not follow this template your PR will be automatically rejected.

github-actions · 2026-05-28T04:10:07Z

The following comment was made by an LLM, it may be inaccurate:

Potential Duplicates Found

Based on my search, here are related PRs that might be addressing similar functionality:

PR #11345 - feat: add first-party voice transcription with local Whisper
- Related: This PR adds voice transcription using Whisper. Your current PR also supports Whisper as one of the two transcription backends, alongside LALM.
PR #18225 - feat: Add voice input using browser speech recognition (web only)
- Related: This PR adds voice input for the web interface. Your current PR extends this with voice input across Web, App, and TUI, plus server-side transcription.
PR #18005 - feat(opencode): add native video and audio file reading support
- Related: This PR adds audio file reading support, which may overlap with your audio handling and format conversion logic.

heimoshuiyu added 2 commits May 28, 2026 11:14

add voice input feature

b9bedc0

sdk

2d31932

heimoshuiyu requested a review from adamdotdevin as a code owner May 28, 2026 04:09

github-actions Bot added the contributor label May 28, 2026

heimoshuiyu wants to merge 2 commits into anomalyco:dev from heimoshuiyu:feat/voice-input
