Inspiration
India's Government e-Marketplace (GeM) is a massive digital procurement portal, yet a vast majority of India's 63 million rural MSMEs are locked out of these economic opportunities. The core barrier? Procurement tenders are published as dense, complex English PDFs filled with bureaucratic and legal jargon. A small street-light manufacturer in Uttar Pradesh cannot naturally query or interpret these documents using broken-English keywords. We were inspired to bridge this language and technical gap by building GeM-Saathi, a multilingual AI assistant that lets MSMEs describe their business via text or voice in their native regional language and instantly matches them with highly relevant government contracts.
What we learned
We learned the profound difference between basic semantic matching and building a scalable, enterprise-grade data engineering pipeline to support it. Simple cosine similarity, $similarity(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$, isn't enough when bidding requires exact matches on constraints like turnover limits or technical certifications. We also learned how to leverage the Databricks Delta Lake Medallion architecture to provide a secure, time-travel-capable audit trail across our RAG (Retrieval-Augmented Generation) inference loops, ensuring that AI decisions remain verifiable.
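To illustrate why cosine similarity alone falls short, here is a minimal sketch of hybrid matching: hard eligibility constraints are enforced exactly first, and only the surviving tenders are ranked by semantic score. The field names (`min_turnover`, `required_certs`, `vec`) are illustrative, not our actual schema.

```python
import math

def cosine(a, b):
    # similarity(A, B) = (A . B) / (||A|| ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_tenders(profile, tenders):
    """Hybrid matching: exact constraint filtering first, semantic ranking second."""
    eligible = [
        t for t in tenders
        if profile["turnover"] >= t["min_turnover"]      # hard numeric constraint
        and t["required_certs"] <= profile["certs"]      # every certification must be held
    ]
    return sorted(eligible,
                  key=lambda t: cosine(profile["vec"], t["vec"]),
                  reverse=True)

# Toy example: the semantically perfect tender T1 is still excluded
# because the MSME misses its turnover floor.
profile = {"turnover": 5_000_000, "certs": {"ISO9001"}, "vec": [1.0, 0.0]}
tenders = [
    {"id": "T1", "min_turnover": 10_000_000, "required_certs": set(), "vec": [1.0, 0.0]},
    {"id": "T2", "min_turnover": 1_000_000, "required_certs": {"ISO9001"}, "vec": [0.9, 0.1]},
]
print([t["id"] for t in match_tenders(profile, tenders)])
```

A pure similarity ranking would have put T1 first; the constraint filter is what makes the match bid-worthy.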
How we built our project
GeM-Saathi is built upon a Medallion Architecture hosted entirely on the Databricks ecosystem:
- Ingestion (Bronze): Over 100 real GeM PDF bids are ingested, chunked, and stored natively in a Delta Lake `gem_tenders_delta` Bronze table.
- Indexing (Silver): Chunks are embedded using `all-MiniLM-L6-v2` (384-dimensional vectors) and persisted into a DBFS-backed FAISS vector index.
- Retrieval & Inference: When an MSME speaks into the app in regional Hindi, the audio is processed and routed through IIT Bombay's `IndicTrans2` model to formulate an English query. We retrieve the top 3 matching documents via FAISS.
- Generation & Audit (Gold): Sarvam AI's `Sarvam-1` LLM constructs a grounded eligibility checklist highlighting precisely why the user matches. The decision is translated back into Hindi for the Streamlit UI, and the transaction is permanently recorded in our `gem_rag_results_delta` Gold audit table. The entire solution is hosted serverlessly via Databricks Apps.
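The Silver indexing and retrieval steps above can be sketched as follows. To keep the sketch runnable without model weights, the `all-MiniLM-L6-v2` encoder is stubbed with deterministic pseudo-random unit vectors (matching its 384-dimensional output), and a plain NumPy inner-product search stands in for the FAISS index; the sample chunks are invented.

```python
import numpy as np

DIM = 384  # output dimension of the all-MiniLM-L6-v2 embedding model

def embed(texts):
    """Stand-in for SentenceTransformer('all-MiniLM-L6-v2').encode(texts).

    Returns deterministic pseudo-random unit vectors so the sketch runs
    without the model; swap in the real encoder in production.
    """
    vecs = []
    for t in texts:
        rng = np.random.default_rng(sum(map(ord, t)))  # seed from the text
        v = rng.normal(size=DIM)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs, dtype="float32")

# Silver layer: embed tender chunks. In the real pipeline these vectors
# are persisted into a DBFS-backed FAISS index.
chunks = [
    "Supply of LED street lights, 90W, BIS certified",
    "Annual maintenance contract for solar water pumps",
    "Procurement of office furniture, steel almirahs",
    "Turnkey installation of smart street lighting poles",
]
index_vecs = embed(chunks)

def retrieve(query: str, k: int = 3):
    """Retrieval: the English query (after IndicTrans2 translation) is
    embedded and matched by inner product, standing in for FAISS search."""
    q = embed([query])[0]
    scores = index_vecs @ q               # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]    # top-k highest-scoring chunks
    return [chunks[i] for i in top]
```

In the deployed app the search is served by a FAISS index loaded from DBFS, and the retrieved top-3 chunks are handed to Sarvam-1 to build the grounded eligibility checklist.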
Challenges we faced
- Deployment Constraints: Databricks Apps impose strict workspace upload limits (10 MB maximum per file). Our local vector database binaries initially exceeded these thresholds. We iteratively solved this by designing a dynamic packaging script that splits our vector blobs into micro-chunks during deployment and seamlessly reconstructs the FAISS index at container boot time.
- Multilingual Context Loss: Preserving hyper-technical numeric thresholds (like EMD requirement values or ISO specifications) across a Hindi -> English -> AI -> Hindi translation pipeline proved highly unstable. Bounding our context explicitly with FAISS retrieval and relying on India-native models like IndicTrans2 and Sarvam-1 dramatically mitigated translation decay.
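Our actual packaging script is more involved, but the split-and-reassemble idea behind the deployment workaround can be sketched like this (the 9 MB margin and the part-file naming are illustrative assumptions):

```python
from pathlib import Path

CHUNK = 9 * 1024 * 1024  # keep each part safely under the ~10 MB upload cap

def split_blob(src: Path, out_dir: Path, chunk: int = CHUNK) -> int:
    """Split a large binary (e.g. the FAISS index file) into upload-sized parts.

    Returns the number of parts written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    data = src.read_bytes()
    count = 0
    for n, i in enumerate(range(0, len(data), chunk)):
        (out_dir / f"{src.name}.part{n:03d}").write_bytes(data[i:i + chunk])
        count += 1
    return count

def reassemble(parts_dir: Path, name: str, dest: Path) -> None:
    """Run at container boot: concatenate the sorted parts back into the index."""
    parts = sorted(parts_dir.glob(f"{name}.part*"))
    dest.write_bytes(b"".join(p.read_bytes() for p in parts))
```

Zero-padded part numbers keep lexicographic sort equal to numeric order, so `reassemble` never needs a manifest file.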
Built with
- Languages: Python
- Databricks Ecosystem: Databricks Apps (Serverless Hosting), Databricks Delta Lake (Medallion Architecture & Time-Travel Auditing), Databricks File System (DBFS), PySpark
- India-Native AI Models: `IndicTrans2` (IIT Bombay) for dual-layer regional language translation, `Sarvam-1` (Sarvam AI) for contextual RAG generation.
- Vector & Embedding Tech: `all-MiniLM-L6-v2` (Sentence-Transformers) for local dense embedding execution, FAISS (CPU) for semantic indexing.
- Libraries & Frameworks: Streamlit (frontend interface), pandas, PyPDF2 & pdfplumber (PDF parsing).