feat(opencode): add voice input #29663

Open
heimoshuiyu wants to merge 2 commits into anomalyco:dev from heimoshuiyu:feat/voice-input

heimoshuiyu (Contributor) commented May 28, 2026

Issue for this PR

Closes #18226
Closes #4695

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

THIS PR WAS WRITTEN BY A HUMAN. I know this is a big PR. THIS IS NOT AI SLOP.

This PR adds voice input to OpenCode. Users can record audio in the Web/App and TUI interfaces and have it transcribed directly into the prompt input. The design follows OpenCode's existing frontend/backend separation — all transcription logic lives on the server, and clients (Web/App and TUI) only handle audio recording and call the server via the SDK/API.

Core voice module (packages/opencode/src/voice/)

  • A new Voice service built on Effect, supporting two transcription backends:
    • LALM (Large Audio Language Model, recommended) — Sends audio as a content part to a multimodal LLM (e.g. MiMo-V2.5, Gemini 3.5 Flash, Qwen3 Omni 30B A3B Instruct) for transcription.
    • Whisper — Sends audio to a Whisper-compatible API endpoint (e.g. OpenAI /v1/audio/transcriptions or a self-hosted service).
  • Audio format conversion via ffmpeg — automatically converts non-wav/mp3 formats (webm, ogg, etc.) to mp3 before sending to the transcription service.
  • The entire transcription pipeline supports cancellation via AbortSignal.
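
As a rough illustration of the conversion rule above, the sketch below decides whether a recording needs to go through ffmpeg and builds an argument list that reads from stdin and writes mp3 to stdout. The helper names (`voiceNeedsConversion`, `voiceFfmpegArgs`) and the exact MIME set are assumptions for illustration, not the PR's actual code.

```typescript
// Hypothetical helpers; the PR's real implementation may differ.
// MIME types assumed to be accepted as-is (wav and mp3 variants).
const VOICE_PASSTHROUGH_MIME = new Set<string>([
  "audio/wav",
  "audio/x-wav",
  "audio/mpeg",
  "audio/mp3",
])

/** Returns true when the audio should be converted to mp3 before transcription. */
function voiceNeedsConversion(mime: string): boolean {
  // Strip parameters such as ";codecs=opus" before comparing.
  const base = (mime.split(";")[0] ?? "").trim().toLowerCase()
  return !VOICE_PASSTHROUGH_MIME.has(base)
}

/** Builds ffmpeg arguments that read any input from stdin and write mp3 to stdout. */
function voiceFfmpegArgs(): string[] {
  return ["-hide_banner", "-loglevel", "error", "-i", "pipe:0", "-f", "mp3", "pipe:1"]
}
```

In Node, `child_process.spawn` accepts a `signal` option, so the same `AbortSignal` that cancels the transcription request could also be used to stop the ffmpeg process.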

Configuration (packages/opencode/src/config/config.ts)

A new voice config block in opencode.json, with type selecting the transcription backend:

{
  "voice": {
    "type": "lalm",
    "lalm": {
      "model": "opencode/mimo-v2.5-free"
      // "model": "opencode-go/mimo-v2.5"
      // "model": "opencode/gemini-3.5-flash"
    },
    "whisper": {
      "url": "http://127.0.0.1:5000/v1/audio/transcriptions",
      "apiKey": "sk-abc123def456ghi789",
      "model": "whisper-1"
    }
  }
}

LALM only requires specifying a model that supports audio input (auth is handled by existing provider config). Whisper requires an additional API URL and key. See the docs for the full set of config options.
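
To make the rule above concrete, here is a minimal sketch of how the `voice` block could be typed and checked. The `VoiceConfigSketch` type and `checkVoiceConfig` function are hypothetical names, and the optionality of each field is an assumption based only on the description above; the real schema lives in config.ts.

```typescript
// Hypothetical shape of the "voice" block; not the PR's actual schema.
type VoiceConfigSketch = {
  type: "lalm" | "whisper"
  lalm?: { model: string }
  whisper?: { url?: string; apiKey?: string; model?: string }
}

/** Returns a list of human-readable problems; an empty list means the config looks usable. */
function checkVoiceConfig(cfg: VoiceConfigSketch): string[] {
  const problems: string[] = []
  if (cfg.type === "lalm") {
    // LALM only needs a model; auth comes from the existing provider config.
    if (!cfg.lalm?.model) problems.push("voice.lalm.model is required when type is lalm")
  } else {
    // Whisper additionally needs an endpoint URL and an API key.
    if (!cfg.whisper?.url) problems.push("voice.whisper.url is required when type is whisper")
    if (!cfg.whisper?.apiKey) problems.push("voice.whisper.apiKey is required when type is whisper")
  }
  return problems
}
```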

Practical experience: Gemini 3.5 Flash has the best transcription quality but is also the most expensive; MiMo V2.5 is good enough for me — surprisingly capable and very cheap; Qwen3 Omni 30B A3B Instruct is great for self-hosting, runnable via llama.cpp on a personal machine for privacy-preserving deployment.

HTTP API (packages/opencode/src/server/routes/instance/httpapi/)

  • New POST /voice/transcribe endpoint, accepting base64 audio, optional context (e.g. text the user has already typed in the input), images, and request parameters that can override the project-level config above — consistent with the design of other prompt endpoints in the project.
  • The backend API follows the project's Effect conventions.
  • SDK types and client have been regenerated, including the new audio.transcribe endpoint.
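
For readers who want to picture a raw call to the endpoint without the SDK, the sketch below builds a JSON request for POST /voice/transcribe. The body field names (`audio`, `mime`, `context`) are assumptions made for illustration; the actual request schema is defined by the PR and the regenerated SDK types.

```typescript
// Hypothetical request body; field names are assumptions, not the PR's schema.
type TranscribeBodySketch = {
  audio: string // base64-encoded audio bytes
  mime: string // e.g. "audio/webm"
  context?: string // e.g. text the user has already typed in the input
}

type JsonRequestSketch = {
  url: string
  method: "POST"
  headers: Record<string, string>
  body: string
}

/** Builds a plain description of the HTTP request, suitable for passing to fetch. */
function buildTranscribeRequest(baseUrl: string, body: TranscribeBodySketch): JsonRequestSketch {
  return {
    // Trim trailing slashes so the path is not doubled.
    url: baseUrl.replace(/\/+$/, "") + "/voice/transcribe",
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  }
}
```

The returned object can be handed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body, signal })`, where `signal` is an `AbortSignal` if the caller wants cancellation.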

Web/App UI (packages/app/src/components/prompt-input/voice.tsx)

  • VoiceButton component with microphone/stop/loading states, wired into both v1 and v2 layouts.
  • createVoiceInput hook manages the browser MediaRecorder lifecycle, calls transcription via the SDK, and handles retries. Recording data stays in memory (Blob), released on successful transcription, preserved on failure for retry.
  • Registers mod+shift+v keyboard shortcut for voice input.
  • Shows retry/cancel buttons when transcription fails or returns empty text — it's frustrating to record a long audio clip and see an error, so preserving the recording for retry avoids making the user re-record.
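
The preserve-on-failure behavior described above can be sketched as a small state machine. This is an illustrative reducer under assumed state and event names (`VoiceInputState`, `VoiceInputEvent`, `nextVoiceState`), not the actual `createVoiceInput` hook; the recording type `R` stands in for the in-memory Blob.

```typescript
// Hypothetical states and events; names are assumptions for illustration.
type VoiceInputState<R> =
  | { status: "idle" }
  | { status: "recording" }
  | { status: "transcribing"; recording: R }
  | { status: "failed"; recording: R; reason: string }

type VoiceInputEvent<R> =
  | { type: "start" }
  | { type: "stop"; recording: R }
  | { type: "result"; text: string }
  | { type: "error"; reason: string }
  | { type: "retry" }
  | { type: "cancel" }

function nextVoiceState<R>(state: VoiceInputState<R>, event: VoiceInputEvent<R>): VoiceInputState<R> {
  switch (event.type) {
    case "start":
      return state.status === "idle" ? { status: "recording" } : state
    case "stop":
      return state.status === "recording" ? { status: "transcribing", recording: event.recording } : state
    case "result":
      if (state.status !== "transcribing") return state
      // Empty text counts as a failure, so the recording is kept for retry.
      return event.text.trim() === ""
        ? { status: "failed", recording: state.recording, reason: "empty transcription" }
        : { status: "idle" } // success: the recording is dropped
    case "error":
      return state.status === "transcribing"
        ? { status: "failed", recording: state.recording, reason: event.reason }
        : state
    case "retry":
      // Retry reuses the preserved recording instead of re-recording.
      return state.status === "failed" ? { status: "transcribing", recording: state.recording } : state
    case "cancel":
      return { status: "idle" }
  }
}
```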

TUI (packages/opencode/src/cli/cmd/tui/)

  • useVoice hook manages terminal audio recording via subprocesses (ffmpeg/arecord/sox/rec — auto-detected by platform). Temporary recording files are stored in os.tmpdir() (e.g. /tmp/opencode-voice-<uuid>.mp3), cleaned up after successful transcription, preserved on failure for retry.
  • Renders a voice button in the prompt bar with recording/stopped/transcribing states.
  • Registers <leader>v as the default keybinding.
  • tui.json supports voice.command and voice.mime for custom recorder configuration.
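
A minimal sketch of platform-based recorder detection and the temporary file naming mentioned above. The preference order per platform and the helper names (`recorderCandidates`, `pickRecorder`, `recordingFileName`) are assumptions; only the tool names and the `opencode-voice-<uuid>.mp3` pattern come from the description.

```typescript
// Hypothetical preference order per platform; the PR's detection logic may differ.
function recorderCandidates(platform: string): string[] {
  switch (platform) {
    case "linux":
      return ["ffmpeg", "arecord", "sox", "rec"]
    case "darwin":
      return ["ffmpeg", "sox", "rec"]
    case "win32":
      return ["ffmpeg", "sox"]
    default:
      return ["ffmpeg"]
  }
}

/** Picks the first candidate that the caller reports as installed. */
function pickRecorder(platform: string, isInstalled: (cmd: string) => boolean): string | undefined {
  return recorderCandidates(platform).find(isInstalled)
}

/** File name for a temporary recording, matching the opencode-voice-<uuid>.mp3 pattern. */
function recordingFileName(id: string, ext: string = "mp3"): string {
  return `opencode-voice-${id}.${ext}`
}
```

In Node, the platform would come from `process.platform`, the id from `crypto.randomUUID()`, and the file name would be joined with `os.tmpdir()`.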

Documentation (packages/web/src/content/docs/)

  • Updated config.mdx with the full voice config reference.
  • Updated tui.mdx with TUI voice recorder configuration docs.

Design considerations

  • LALM over Whisper: Large language models have orders of magnitude more voice training data than Whisper and stronger contextual awareness, producing noticeably better transcriptions. Hence LALM is the default backend.
  • LALM via Vercel AI SDK, Whisper via Effect HttpClient: LALM uses the Vercel AI SDK generateText interface, while Whisper bypasses the Vercel AI SDK and builds the multipart form-data request directly through Effect's HttpClient. This is because the Vercel AI SDK's Whisper transcription interface is still experimental — the API is unstable and cumbersome, so constructing the request ourselves is simpler and more reliable.
  • LALM prompt structure: Composed of a system prompt, user context, and audio content. The overall structure looks roughly like this:
system: <lalm.txt — transcription rules>

messages:
  user:
    <TRANSCRIPTION_CONTEXT>
    directory: /path/to/project
    branch: main
    User: Help me create a file
    Assistant: Sure, what should the filename be?
    Text already in the input box
    </TRANSCRIPTION_CONTEXT>
    <audio starts>
    [audio content]
    <audio ends>
    Transcribe the audio between <audio starts> and <audio ends>. Output ONLY the transcription text.
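
The layout above could be assembled roughly as follows. This is a sketch with a hypothetical `PromptPartSketch` type rather than the Vercel AI SDK's own content-part types, and `buildLalmUserParts` is an illustrative name.

```typescript
// Hypothetical part type; the real code uses the AI SDK's message content parts.
type PromptPartSketch =
  | { kind: "text"; text: string }
  | { kind: "audio"; base64: string; mime: string }

/** Wraps context lines and audio in the markers shown in the prompt layout. */
function buildLalmUserParts(contextLines: string[], audioBase64: string, mime: string): PromptPartSketch[] {
  const context = ["<TRANSCRIPTION_CONTEXT>", ...contextLines, "</TRANSCRIPTION_CONTEXT>"].join("\n")
  return [
    { kind: "text", text: context + "\n<audio starts>" },
    { kind: "audio", base64: audioBase64, mime },
    {
      kind: "text",
      text:
        "<audio ends>\n" +
        "Transcribe the audio between <audio starts> and <audio ends>. Output ONLY the transcription text.",
    },
  ]
}
```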

When building conversation context, special filtering is applied. Messages are traversed from newest to oldest, and a single user message is typically followed by multiple assistant messages (from the agent's multi-turn tool calls, etc.). For each user message, the code pairs it only with the first assistant message encountered in that backward traversal that has actual text content, which is the last such message chronologically, and skips synthetic text parts and summary messages. This is because in an agent loop, the last assistant message usually contains the final conclusion and is the most important one.
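
The filtering described in the paragraph above can be sketched as a backward walk over the history. The message and part shapes here (`ChatMessageSketch`, `ChatPartSketch`) are simplified assumptions, not OpenCode's real message types, and `buildContextPairs` is an illustrative name.

```typescript
// Simplified, hypothetical message shapes for illustration only.
type ChatPartSketch = { type: "text"; text: string; synthetic?: boolean } | { type: "tool" }
type ChatMessageSketch = { role: "user" | "assistant"; summary?: boolean; parts: ChatPartSketch[] }
type ContextPair = { user: string; assistant: string | undefined }

/** Joins the non-synthetic text parts of a message. */
function visibleText(msg: ChatMessageSketch): string {
  const texts: string[] = []
  for (const part of msg.parts) {
    if (part.type === "text" && !part.synthetic) texts.push(part.text)
  }
  return texts.join("\n").trim()
}

function buildContextPairs(messages: ChatMessageSketch[], maxPairs: number): ContextPair[] {
  const pairs: ContextPair[] = []
  let finalAnswer: string | undefined
  // Walk from newest to oldest.
  for (let i = messages.length - 1; i >= 0 && pairs.length < maxPairs; i--) {
    const msg = messages[i]
    if (!msg || msg.summary) continue // skip summary messages
    const text = visibleText(msg)
    if (msg.role === "assistant") {
      // The first assistant text seen going backwards is the turn's final answer.
      if (finalAnswer === undefined && text !== "") finalAnswer = text
      continue
    }
    if (text !== "") pairs.push({ user: text, assistant: finalAnswer })
    finalAnswer = undefined
  }
  return pairs.reverse() // restore chronological order
}
```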

  • Minimized thinking level during transcription: Voice transcription requests set temperature to 0 and use smallOptions to reduce the model's thinking level, avoiding unnecessary reasoning overhead. Special handling is applied for MiMo — it enables thinking by default, so it must be explicitly disabled in provider/transform.ts.
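
For the Whisper path described in the design considerations, the multipart request can be pictured as below. This sketch uses the standard `FormData` and `Blob` globals rather than Effect's HttpClient, which is what the PR actually uses; `buildWhisperRequest` and `WhisperSettingsSketch` are hypothetical names. The `file` and `model` form fields and the Bearer authorization header follow the OpenAI-compatible /v1/audio/transcriptions convention.

```typescript
// Hypothetical settings type mirroring the "whisper" config block.
type WhisperSettingsSketch = { url: string; apiKey: string; model: string }

/** Builds a multipart request description for a Whisper-compatible endpoint. */
function buildWhisperRequest(cfg: WhisperSettingsSketch, audio: Blob, filename: string) {
  const form = new FormData()
  form.append("file", audio, filename)
  form.append("model", cfg.model)
  return {
    url: cfg.url,
    method: "POST" as const,
    // No content-type header here: fetch adds the multipart boundary itself.
    headers: { authorization: `Bearer ${cfg.apiKey}` },
    body: form,
  }
}
```

The result can be passed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body, signal })` so the request remains cancellable.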

How did you verify your code works?

I've been using the voice input feature for over two months, transcribing thousands of requests during that time. I've been continuously rebasing against upstream and iterating on the code. I understand my code. I now feel the quality is high enough to merge back into the main branch.

My primary development environment is Linux. I've also done my best to test on a MacBook and a Windows VM, but there may be gaps — if you find platform compatibility issues, please point them out.

Screenshots / recordings

hi @thdxr now you can voice prompt this :)

Screenshot_20260528_175817
demo-1.mp4
demo-2.mp4
demo-3.mp4

Another project, an Android IME, uses this OpenCode transcribe API: https://voice.aquarium39.moe

demo-1.mp4

Our company's product uses OpenCode as its agent core, so it can easily implement the voice input feature.

demo-5.mp4

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

If you do not follow this template your PR will be automatically rejected.

github-actions (Contributor) commented:

The following comment was made by an LLM, it may be inaccurate:

Potential Duplicates Found

Based on my search, here are related PRs that might be addressing similar functionality:

  1. PR #11345 - feat: add first-party voice transcription with local Whisper

  2. PR #18225 - feat: Add voice input using browser speech recognition (web only)

  3. PR #18005 - feat(opencode): add native video and audio file reading support


Development

Successfully merging this pull request may close these issues.

  • [FEATURE]:: Add voice input using browser speech recognition (web only)
  • [FEATURE]: Speech-to-Text Voice Input for Lazy People in OpenCode
