VidSense AI is an end-to-end Retrieval-Augmented Generation (RAG) system that allows users to ask natural language questions about any YouTube video and receive context-aware answers in real time, directly inside the browser via an extension.
It combines YouTube transcript ingestion, semantic search, conversational memory, web fallback, and LLM evaluation, all built with production-grade engineering practices.
Demo Video Available - Click Here
- Automatically fetches YouTube transcripts
- Chunks and embeds content
- Retrieves only the most relevant segments per query
- If the transcript lacks sufficient information, the system automatically augments answers using web search
- Ensures higher answer completeness for opinion-based or contextual questions
- Maintains session-level memory across multiple questions
- Follow-up questions are answered with awareness of prior context
- Ask questions while watching YouTube
- Clean, scrollable chat interface
- Persistent session identity per user
- Fully powered by open-source Hugging Face models
- No vendor lock-in (Gemini / OpenAI optional)
- Easy to swap models at any stage
- Uses RAGAS for automatic evaluation:
  - Context relevance
  - Faithfulness
  - Answer relevancy
- Logs query source (transcript vs web) for monitoring
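The chunk-and-embed step can be sketched as a simple overlapping window. This is illustrative only: the project uses LangChain text splitters, and the 500/100 character sizes below are assumptions, not the project's actual settings.

```python
# Overlapping sliding-window chunker (toy stand-in for a LangChain splitter).
def chunk_transcript(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    # Each chunk shares `overlap` characters with the next, so no sentence
    # is lost at a chunk boundary during retrieval.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

transcript = "word " * 300          # stand-in for a fetched YouTube transcript
chunks = chunk_transcript(transcript)
```

Each chunk is then embedded and stored in FAISS; the overlap keeps boundary sentences retrievable from either neighbouring chunk.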
    ┌─────────────────────────────┐
    │      Chrome Extension       │
    │  (YouTube Chat Interface)   │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │       FastAPI Backend       │
    │                             │
    │ /ingest/youtube/{video_id}  │
    │ /ask                        │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │  YouTube Transcript Loader  │
    │       + Text Chunking       │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │  Embeddings (Hugging Face)  │
    │    + FAISS Vector Store     │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │    Retrieval + Rewriting    │
    │   (Memory-Aware Queries)    │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │    Answer Generation LLM    │
    │   (Transcript-First RAG)    │
    └───────────────┬─────────────┘
                    │
            ┌───────┴──────────┐
            │                  │
            ▼                  ▼
    ┌───────────────┐  ┌───────────────┐
    │  Transcript   │  │  Web Search   │
    │    Answer     │  │  (Fallback)   │
    └───────────────┘  └───────────────┘
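The FAISS stage in the diagram is, at its core, nearest-neighbour search over embedding vectors. A toy cosine-similarity version illustrates the idea; the 3-dimensional vectors below are made up for the example, whereas real embeddings have hundreds of dimensions and FAISS indexes them for speed.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalised by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    # Return indices of the k chunks most similar to the query.
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

query_vec = [1.0, 0.0, 0.5]                      # embedded question
chunk_vecs = [[1.0, 0.1, 0.4],                   # chunk 0: similar
              [0.0, 1.0, 0.0],                   # chunk 1: unrelated
              [0.9, 0.0, 0.5]]                   # chunk 2: most similar
ranked = top_k(query_vec, chunk_vecs)            # → [2, 0]
```

FAISS replaces this brute-force loop with an index structure, so the same query stays fast over tens of thousands of chunks.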
    User Question
          │
          ▼
    Session Memory (previous turns)
          │
          ▼
    Query Rewriter (LLM)
          │
          ▼
    FAISS Similarity Search
          │
          ▼
    Top-K Transcript Chunks
          │
          ▼
    Transcript Relevance Check
          │
          ├── Relevant ──────► Answer Generator (LLM)
          │
          └── Insufficient ──► Web Search Tool
                                      │
                                      ▼
                            Web-Augmented Context
                                      │
                                      ▼
                           Answer Generator (LLM)
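The transcript-first branch with web fallback can be sketched as plain control flow. The rewriter, relevance scorer, and threshold below are stand-ins and assumed names; in the real system the rewriter and scorer are LLM-backed.

```python
RELEVANCE_THRESHOLD = 0.45  # assumed cutoff for "transcript is sufficient"

def rewrite_query(question: str, history: list[str]) -> str:
    # Stand-in rewriter: the real system asks the LLM to resolve pronouns
    # and references using prior turns; here we just prepend recent turns.
    return f"{' '.join(history[-2:])} {question}".strip() if history else question

def answer(question, history, retrieve, relevance, llm, web_search):
    query = rewrite_query(question, history)
    chunks = retrieve(query)
    if relevance(query, chunks) >= RELEVANCE_THRESHOLD:
        return llm(query, chunks), "transcript"
    # Augment rather than discard: transcript chunks stay in the context.
    context = chunks + web_search(query)
    return llm(query, context), "web"

# Tiny demo with stub components standing in for FAISS, the LLM, and the web tool.
resp, source = answer(
    "What about electric vehicles?",
    ["How did Beijing reduce pollution?"],
    retrieve=lambda q: ["transcript chunk about EV subsidies"],
    relevance=lambda q, c: 0.8,                     # pretend transcript covers it
    llm=lambda q, c: "EV subsidies were part of the policy mix.",
    web_search=lambda q: ["web result"],
)
```

Logging the returned `source` label is what drives the transcript-vs-web monitoring mentioned above.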
- FastAPI — API layer
- LangChain — RAG orchestration
- FAISS — Vector similarity search
- YouTube Transcript API — Transcript ingestion
- Mistral-7B-Instruct (Hugging Face) — Generation & rewriting
- BGE / MixedBread embeddings — Semantic retrieval
- Fully open-source, provider-agnostic
- Chrome Extension (Manifest v3)
- Vanilla JS + CSS
- Real-time chat UI
- RAGAS — LLM evaluation
- Custom logging for observability
- Session-level analytics
    Context Relevance : 1.00
    Answer Relevancy  : 0.75
    Faithfulness      : High (transcript-grounded)
These metrics indicate that the system's answers are:
- Grounded in retrieved context
- Relevant to the question
- Not hallucinated
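As a rough intuition for what the faithfulness metric measures, here is a toy token-overlap check. This heuristic is only an illustration: RAGAS itself uses an LLM judge, not word overlap.

```python
def grounding_score(answer: str, context: str) -> float:
    # Fraction of answer tokens that also appear in the retrieved context.
    # 1.0 = every answer word is supported; 0.0 = nothing overlaps.
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

ctx = "beijing cut coal heating and restricted vehicle traffic"
grounded = grounding_score("beijing cut coal heating", ctx)      # → 1.0
ungrounded = grounding_score("delhi banned fireworks", ctx)      # → 0.0
```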
1. User opens a YouTube video
2. Chrome extension extracts the video ID
3. Backend ingests the transcript (once per video)
4. User asks a question
5. Query is rewritten (memory-aware)
6. Relevant transcript chunks are retrieved
7. Answer is generated from the transcript
8. If insufficient → web augmentation is applied
9. Response is returned with source attribution
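The memory-aware rewriting above relies on per-session history. A minimal sketch of such a store, bounded to the most recent turns (class and method names here are assumptions, not the project's actual API):

```python
from collections import defaultdict, deque

class SessionMemory:
    """Keeps the last few (question, answer) turns per session ID."""

    def __init__(self, max_turns: int = 5):
        # deque(maxlen=...) silently drops the oldest turn when full.
        self.turns = defaultdict(lambda: deque(maxlen=max_turns))

    def add(self, session_id: str, question: str, answer: str) -> None:
        self.turns[session_id].append((question, answer))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        return list(self.turns[session_id])

memory = SessionMemory(max_turns=2)
memory.add("user-1", "How did Beijing reduce pollution?", "Coal curbs and traffic limits.")
memory.add("user-1", "What about electric vehicles?", "EV adoption helped.")
memory.add("user-1", "And public transit?", "Metro expansion.")
recent = memory.history("user-1")    # oldest turn has been evicted
```

Capping the window keeps the rewriter prompt short while still resolving follow-up references like "what about ...?".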
- “How did Beijing reduce air pollution?” → Answered from transcript
- “Is living in Delhi worth it considering pollution?” → Transcript + Web augmentation
- “What about electric vehicles?” → Uses memory from prior questions
- Transcript-first avoids hallucination
- Web fallback improves robustness
- Memory-aware queries enable natural conversations
- Open-source models avoid third-party API quota failures
- Evaluation built-in from day one
- Token streaming
- Source highlighting per answer
- Dockerized deployment
- Multi-video knowledge graphs