VidSense AI is an end-to-end Retrieval-Augmented Generation (RAG) system that allows users to ask natural language questions about any YouTube video and receive context-aware answers in real time, directly inside the browser via an extension.
It combines YouTube transcript ingestion, semantic search, conversational memory, web fallback, and LLM evaluation, all built with production-grade engineering practices.
Demo Video Available - Click Here
- Automatically fetches YouTube transcripts
- Chunks and embeds content
- Retrieves only the most relevant segments per query
- If the transcript lacks sufficient information, the system automatically augments answers using web search
- Ensures higher answer completeness for opinion-based or contextual questions
- Maintains session-level memory across multiple questions
- Follow-up questions are answered with awareness of prior context
- Ask questions while watching YouTube
- Clean, scrollable chat interface
- Persistent session identity per user
- Fully powered by open-source Hugging Face models
- No vendor lock-in (Gemini / OpenAI optional)
- Easy to swap models at any stage
- Uses RAGAS for automatic evaluation:
  - Context relevance
  - Faithfulness
  - Answer relevancy
- Logs query source (transcript vs web) for monitoring
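The chunk-and-embed step can be sketched as a simple overlapping window. This is illustrative only: the project uses LangChain text splitters, and the 500/100 character sizes below are assumptions, not the project's actual settings.

```python
# Overlapping sliding-window chunker (toy stand-in for a LangChain splitter).
def chunk_transcript(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    # Each chunk shares `overlap` characters with the next, so no sentence
    # is lost at a chunk boundary during retrieval.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

transcript = "word " * 300          # stand-in for a fetched YouTube transcript
chunks = chunk_transcript(transcript)
```

Each chunk is then embedded and stored in FAISS; the overlap keeps boundary sentences retrievable from either neighbouring chunk.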
    ┌─────────────────────────────┐
    │      Chrome Extension       │
    │  (YouTube Chat Interface)   │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │       FastAPI Backend       │
    │                             │
    │ /ingest/youtube/{video_id}  │
    │ /ask                        │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │  YouTube Transcript Loader  │
    │       + Text Chunking       │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │  Embeddings (Hugging Face)  │
    │    + FAISS Vector Store     │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │    Retrieval + Rewriting    │
    │   (Memory-Aware Queries)    │
    └───────────────┬─────────────┘
                    │
                    ▼
    ┌─────────────────────────────┐
    │    Answer Generation LLM    │
    │   (Transcript-First RAG)    │
    └───────────────┬─────────────┘
                    │
            ┌───────┴──────────┐
            │                  │
            ▼                  ▼
    ┌───────────────┐  ┌───────────────┐
    │  Transcript   │  │  Web Search   │
    │    Answer     │  │  (Fallback)   │
    └───────────────┘  └───────────────┘
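The FAISS stage in the diagram is, at its core, nearest-neighbour search over embedding vectors. A toy cosine-similarity version illustrates the idea; the 3-dimensional vectors below are made up for the example, whereas real embeddings have hundreds of dimensions and FAISS indexes them for speed.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalised by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    # Return indices of the k chunks most similar to the query.
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

query_vec = [1.0, 0.0, 0.5]                      # embedded question
chunk_vecs = [[1.0, 0.1, 0.4],                   # chunk 0: similar
              [0.0, 1.0, 0.0],                   # chunk 1: unrelated
              [0.9, 0.0, 0.5]]                   # chunk 2: most similar
ranked = top_k(query_vec, chunk_vecs)            # → [2, 0]
```

FAISS replaces this brute-force loop with an index structure, so the same query stays fast over tens of thousands of chunks.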
    User Question
          │
          ▼
    Session Memory (previous turns)
          │
          ▼
    Query Rewriter (LLM)
          │
          ▼
    FAISS Similarity Search
          │
          ▼
    Top-K Transcript Chunks
          │
          ▼
    Transcript Relevance Check
          │
          ├── Relevant ──────► Answer Generator (LLM)
          │
          └── Insufficient ──► Web Search Tool
                                      │
                                      ▼
                            Web-Augmented Context
                                      │
                                      ▼
                           Answer Generator (LLM)
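The transcript-first branch with web fallback can be sketched as plain control flow. The rewriter, relevance scorer, and threshold below are stand-ins and assumed names; in the real system the rewriter and scorer are LLM-backed.

```python
RELEVANCE_THRESHOLD = 0.45  # assumed cutoff for "transcript is sufficient"

def rewrite_query(question: str, history: list[str]) -> str:
    # Stand-in rewriter: the real system asks the LLM to resolve pronouns
    # and references using prior turns; here we just prepend recent turns.
    return f"{' '.join(history[-2:])} {question}".strip() if history else question

def answer(question, history, retrieve, relevance, llm, web_search):
    query = rewrite_query(question, history)
    chunks = retrieve(query)
    if relevance(query, chunks) >= RELEVANCE_THRESHOLD:
        return llm(query, chunks), "transcript"
    # Augment rather than discard: transcript chunks stay in the context.
    context = chunks + web_search(query)
    return llm(query, context), "web"

# Tiny demo with stub components standing in for FAISS, the LLM, and the web tool.
resp, source = answer(
    "What about electric vehicles?",
    ["How did Beijing reduce pollution?"],
    retrieve=lambda q: ["transcript chunk about EV subsidies"],
    relevance=lambda q, c: 0.8,                     # pretend transcript covers it
    llm=lambda q, c: "EV subsidies were part of the policy mix.",
    web_search=lambda q: ["web result"],
)
```

Logging the returned `source` label is what drives the transcript-vs-web monitoring mentioned above.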
- FastAPI — API layer
- LangChain — RAG orchestration
- FAISS — Vector similarity search
- YouTube Transcript API — Transcript ingestion
- Mistral-7B-Instruct (Hugging Face) — Generation & rewriting
- BGE / MixedBread embeddings — Semantic retrieval
- Fully open-source, provider-agnostic
- Chrome Extension (Manifest v3)
- Vanilla JS + CSS
- Real-time chat UI
- RAGAS — LLM evaluation
- Custom logging for observability
- Session-level analytics
    Context Relevance : 1.00
    Answer Relevancy  : 0.75
    Faithfulness      : High (transcript-grounded)
These metrics indicate that the system's answers are:
- Grounded in retrieved context
- Relevant to the question
- Not hallucinated
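As a rough intuition for what the faithfulness metric measures, here is a toy token-overlap check. This heuristic is only an illustration: RAGAS itself uses an LLM judge, not word overlap.

```python
def grounding_score(answer: str, context: str) -> float:
    # Fraction of answer tokens that also appear in the retrieved context.
    # 1.0 = every answer word is supported; 0.0 = nothing overlaps.
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

ctx = "beijing cut coal heating and restricted vehicle traffic"
grounded = grounding_score("beijing cut coal heating", ctx)      # → 1.0
ungrounded = grounding_score("delhi banned fireworks", ctx)      # → 0.0
```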
1. User opens a YouTube video
2. Chrome extension extracts the video ID
3. Backend ingests the transcript (once per video)
4. User asks a question
5. Query is rewritten (memory-aware)
6. Relevant transcript chunks are retrieved
7. Answer is generated from the transcript
8. If insufficient → web augmentation is applied
9. Response is returned with source attribution
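The memory-aware rewriting above relies on per-session history. A minimal sketch of such a store, bounded to the most recent turns (class and method names here are assumptions, not the project's actual API):

```python
from collections import defaultdict, deque

class SessionMemory:
    """Keeps the last few (question, answer) turns per session ID."""

    def __init__(self, max_turns: int = 5):
        # deque(maxlen=...) silently drops the oldest turn when full.
        self.turns = defaultdict(lambda: deque(maxlen=max_turns))

    def add(self, session_id: str, question: str, answer: str) -> None:
        self.turns[session_id].append((question, answer))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        return list(self.turns[session_id])

memory = SessionMemory(max_turns=2)
memory.add("user-1", "How did Beijing reduce pollution?", "Coal curbs and traffic limits.")
memory.add("user-1", "What about electric vehicles?", "EV adoption helped.")
memory.add("user-1", "And public transit?", "Metro expansion.")
recent = memory.history("user-1")    # oldest turn has been evicted
```

Capping the window keeps the rewriter prompt short while still resolving follow-up references like "what about ...?".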
- “How did Beijing reduce air pollution?” → Answered from transcript
- “Is living in Delhi worth it considering pollution?” → Transcript + Web augmentation
- “What about electric vehicles?” → Uses memory from prior questions
- Transcript-first avoids hallucination
- Web fallback improves robustness
- Memory-aware queries enable natural conversations
- Open-source models avoid third-party API quota failures
- Evaluation built-in from day one
- Token streaming
- Source highlighting per answer
- Dockerized deployment
- Multi-video knowledge graphs