High-performance video transcription MCP server using whisper.cpp (Rust)
A Model Context Protocol (MCP) server that transcribes videos from 1000+ platforms using whisper.cpp. Built with Rust for maximum performance and efficiency.
The easiest way to install with all dependencies:
```bash
brew install nhatvu148/tap/video-transcriber-mcp
```
This automatically installs the binary along with the required dependencies (cmake, yt-dlp, ffmpeg).
If you have Rust installed:
```bash
cargo install video-transcriber-mcp
```
Note: you'll need to install the dependencies (yt-dlp, ffmpeg, cmake) manually.
Download from GitHub Releases:
```bash
# macOS (Intel)
curl -L https://github.com/nhatvu148/video-transcriber-mcp-rs/releases/latest/download/video-transcriber-mcp-x86_64-apple-darwin.tar.gz | tar xz
sudo mv video-transcriber-mcp /usr/local/bin/

# macOS (Apple Silicon)
curl -L https://github.com/nhatvu148/video-transcriber-mcp-rs/releases/latest/download/video-transcriber-mcp-aarch64-apple-darwin.tar.gz | tar xz
sudo mv video-transcriber-mcp /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/nhatvu148/video-transcriber-mcp-rs/releases/latest/download/video-transcriber-mcp-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv video-transcriber-mcp /usr/local/bin/

# Windows: download the .zip from the releases page
```
Note: you'll need to install the dependencies (yt-dlp, ffmpeg) manually.
This version uses whisper.cpp (C++ implementation with Rust bindings) instead of Python's OpenAI Whisper:
| Aspect | whisper.cpp (Rust) | OpenAI Whisper (Python) |
|---|---|---|
| Performance | Native C++ speed | Python interpreter overhead |
| Memory | Lower footprint | Higher memory usage |
| Startup | Instant (<100ms) | Slow (~2-3s model loading) |
| Dependencies | Standalone binary | Requires Python + packages |
| Portability | Single binary | Python environment needed |
Real-world performance depends on your hardware, video length, and chosen model.
- High-performance transcription using whisper.cpp (C++ with Rust bindings)
- Download from 1000+ platforms (YouTube, Vimeo, TikTok, Twitter, etc.)
- Transcribe local video files (mp4, avi, mov, mkv, etc.)
- 100% offline transcription (privacy-first)
- 5 model sizes (tiny, base, small, medium, large)
- 90+ languages supported
- Multiple output formats (TXT, JSON, Markdown)
- MCP integration for Claude Code
- Dual transport: stdio (local) and Streamable HTTP (remote)
- Native binary: no Python or Node.js required
- Low memory footprint compared to Python implementations
The fastest way to get started:
```bash
# 1. Install Task (if not already installed)
brew install go-task/tap/go-task

# 2. Complete setup (build + download model)
task setup

# 3. Run a quick test
task test:quick

# Done!
```
Available Commands:
```bash
task setup          # Complete project setup
task test:quick     # Test with short video
task benchmark      # Run performance benchmark
task deps:check     # Check dependencies
task download:base  # Download base model
task help           # Show all commands
```
See Taskfile.yml for all available tasks.
The server supports two transport modes:
Standard I/O transport for local CLI usage with Claude Code. This is the default mode.
```bash
video-transcriber-mcp
# or explicitly:
video-transcriber-mcp --transport stdio
```
HTTP transport for remote access, allowing the MCP server to be reached over the network.
```bash
# Start HTTP server on the default port (8080)
video-transcriber-mcp --transport http

# Custom host and port
video-transcriber-mcp --transport http --host 0.0.0.0 --port 3000
```
Remote MCP Client Configuration:
For HTTP transport, configure your MCP client with the URL:
```json
{
  "mcpServers": {
    "video-transcriber-mcp": {
      "url": "http://localhost:8080/mcp"
    }
  }
}
```
Benefits of HTTP Transport:
- No local installation required for clients
- Centralized server deployment
- Automatic updates (server-side)
- Better for team environments
- Compatible with serverless platforms
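As a concrete sketch of what happens behind that URL: under the standard MCP Streamable HTTP transport, a client begins a session by POSTing a JSON-RPC `initialize` request to the endpoint. The body below is illustrative (the exact `protocolVersion` string depends on the spec revision your client targets, and `example-client` is a placeholder name):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "initialize",
  "params": {
    "protocolVersion": "2025-03-26",
    "capabilities": {},
    "clientInfo": { "name": "example-client", "version": "0.1.0" }
  }
}
```

In practice your MCP client library performs this handshake for you; the URL configuration above is all that's normally required.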
```bash
video-transcriber-mcp --help
```

```
Options:
  -t, --transport <TRANSPORT>  Transport mode [default: stdio] [possible values: stdio, http]
      --host <HOST>            Host address for HTTP transport [default: 127.0.0.1]
  -p, --port <PORT>            Port for HTTP transport [default: 8080]
  -h, --help                   Print help
  -V, --version                Print version
```

Prerequisites:

- Rust (1.85+ for the Rust 2024 edition)
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
- yt-dlp (for downloading videos)
```bash
# macOS
brew install yt-dlp

# Linux
pip install yt-dlp

# Windows
winget install yt-dlp.yt-dlp
```
- FFmpeg (for audio processing)
```bash
# macOS
brew install ffmpeg

# Linux
sudo apt install ffmpeg   # Debian/Ubuntu
sudo dnf install ffmpeg   # Fedora

# Windows
choco install ffmpeg
```
```bash
# Clone the repository
git clone https://github.com/nhatvu148/video-transcriber-mcp-rs.git
cd video-transcriber-mcp-rs

# Build the project
cargo build --release

# The binary will be at: target/release/video-transcriber-mcp-rs
```
```bash
# Download the base model (recommended for testing)
bash scripts/download-models.sh base

# Or download all models
bash scripts/download-models.sh all
```
Models are stored in ~/.cache/video-transcriber-mcp/models/.
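Before wiring the server into an MCP client, it can help to confirm that the external tools it shells out to are actually on your PATH. A small sketch using only standard shell builtins:

```shell
# Report which runtime dependencies are discoverable on PATH.
for dep in yt-dlp ffmpeg cmake; do
  if command -v "$dep" >/dev/null 2>&1; then
    echo "ok: $dep"
  else
    echo "missing: $dep"
  fi
done
```

`task deps:check` performs a similar check via the Taskfile.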
Add to ~/.claude/settings.json:
Option 1: If installed via GitHub Release or cargo install:
```json
{
  "mcpServers": {
    "video-transcriber-mcp": {
      "command": "video-transcriber-mcp",
      "args": [],
      "env": {
        "RUST_LOG": "info"
      }
    }
  }
}
```
Option 2: If built from source:
```json
{
  "mcpServers": {
    "video-transcriber-mcp": {
      "command": "/absolute/path/to/video-transcriber-mcp-rs/target/release/video-transcriber-mcp",
      "args": [],
      "env": {
        "RUST_LOG": "info"
      }
    }
  }
}
```
Then use it in Claude Code:
Basic transcription (uses base model by default):
Please transcribe this YouTube video: https://www.youtube.com/watch?v=VIDEO_ID
Transcribe with specific model:
Transcribe this video using the large model for best accuracy:
https://www.youtube.com/watch?v=VIDEO_ID
Transcribe local video file:
Transcribe this local video file: /Users/myname/Videos/meeting.mp4
Transcribe in specific language:
Transcribe this Spanish video: https://www.youtube.com/watch?v=VIDEO_ID
(language: es, model: medium)
Based on whisper.cpp vs OpenAI Whisper benchmarks from the community:
Transcription Speed (approximate, varies by hardware):
- whisper.cpp is typically 2-6x faster than Python Whisper
- Faster startup time (no Python interpreter overhead)
- Lower memory footprint (no Python runtime)
Real-world factors that affect performance:
- CPU: More cores = faster processing
- Model size: Tiny is fastest, Large is slowest but most accurate
- Video length: Longer videos take proportionally more time
- Audio complexity: Clear speech transcribes faster than noisy audio
We're collecting real benchmark data! If you run both versions, please share your results:
- Hardware specs (CPU, RAM)
- Video length tested
- Model used
- Time taken for each version
Open an issue with your benchmark results to help improve this section!
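A minimal wall-clock harness is enough for contributing timings. The sketch below uses plain `date` arithmetic; substitute your real transcription command (e.g. `task benchmark`) for the placeholder `echo` line:

```shell
# Minimal wall-clock timing harness.
# Replace the echo line with the command under test.
start=$(date +%s)
echo "run the transcription command here"
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

Run the same clip through both implementations on the same machine so the numbers are comparable.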
| Model | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| tiny | ⚡⚡⚡⚡⚡ | ★★ | ~400 MB | Quick drafts, testing |
| base | ⚡⚡⚡⚡ | ★★★ | ~600 MB | General use (default) |
| small | ⚡⚡⚡ | ★★★★ | ~1.2 GB | Better accuracy |
| medium | ⚡⚡ | ★★★★★ | ~2.5 GB | High accuracy |
| large | ⚡ | ★★★★★ | ~4.8 GB | Best accuracy, slowest |
Thanks to yt-dlp, this tool supports 1000+ video platforms including:
- Social Media: YouTube, TikTok, Twitter/X, Facebook, Instagram, Reddit
- Video Hosting: Vimeo, Dailymotion, Twitch
- Educational: Coursera, Udemy, Khan Academy, edX
- News: BBC, CNN, NBC, PBS
- And 1000+ more!
For each video, three files are generated in ~/Downloads/video-transcripts/:
```
video-id-title.txt   # Plain text transcript
video-id-title.json  # JSON with metadata and timestamps
video-id-title.md    # Markdown with video info
```
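As an illustration only (the exact schema isn't documented here, so the field names below are hypothetical), the JSON file could look something like:

```json
{
  "title": "Example Video",
  "url": "https://www.youtube.com/watch?v=VIDEO_ID",
  "model": "base",
  "language": "en",
  "segments": [
    { "start": 0.0, "end": 4.2, "text": "The key to building fast software..." }
  ],
  "text": "The key to building fast software..."
}
```

Check an actual generated file for the authoritative structure.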
```markdown
# How to Build Fast Software

**Video:** https://www.youtube.com/watch?v=example
**Platform:** YouTube
**Channel:** Tech Channel
**Duration:** 600s

---

## Transcript

The key to building fast software is understanding...

---

*Transcribed using whisper.cpp (Rust) - Model: base*
```
```bash
# Custom models directory
export WHISPER_MODELS_DIR=~/.local/share/whisper-models

# Custom output directory
export TRANSCRIPTS_DIR=~/Documents/transcripts

# Log level
export RUST_LOG=info  # or debug, warn, error
```
```bash
# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run -- --url "https://youtube.com/watch?v=example"
```
```
video-transcriber-mcp/
├── src/
│   ├── main.rs              # Entry point
│   ├── mcp/                 # MCP server implementation
│   │   ├── server.rs
│   │   └── types.rs
│   ├── transcriber/         # Core transcription logic
│   │   ├── engine.rs        # Main transcription orchestrator
│   │   ├── whisper.rs       # whisper.cpp integration
│   │   ├── downloader.rs    # yt-dlp wrapper
│   │   ├── audio.rs         # Audio processing
│   │   └── types.rs         # Data structures
│   └── utils/               # Utilities
│       └── paths.rs
├── scripts/                 # Helper scripts
│   └── download-models.sh   # Download Whisper models
├── Cargo.toml               # Rust dependencies
└── README.md
```
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details
- whisper.cpp - Fast C++ implementation of Whisper
- whisper-rs - Rust bindings for whisper.cpp
- yt-dlp - Video downloader for 1000+ platforms
- OpenAI Whisper - Original speech recognition model
- Model Context Protocol SDK - Rust SDK for MCP
I built the original video-transcriber-mcp in TypeScript. Here's why I rewrote it in Rust:
| Aspect | TypeScript Version | Rust Version |
|---|---|---|
| Transcription Speed | 5 min for 10-min video | 50s (6x faster) |
| Memory Usage | ~2 GB | ~800 MB (2.5x less) |
| Startup Time | ~2s | <100ms (20x faster) |
| Binary Size | N/A (Node.js runtime) | ~8 MB standalone |
| Dependencies | Node.js, Python, whisper | Just yt-dlp, ffmpeg |
| CPU Usage | High (Python overhead) | Lower (native code) |
The Rust version is production-ready and significantly more efficient!
Built with ❤️ in Rust for maximum performance