Inspiration
India's Government e-Marketplace (GeM) is a massive digital procurement portal, yet a vast majority of India's 63 million rural MSMEs are locked out of these economic opportunities. The core barrier? Procurement tenders are published as dense, complex English PDFs filled with bureaucratic and legal jargon. A small street-light manufacturer in Uttar Pradesh cannot naturally query or interpret these documents using broken-English keywords. We were inspired to bridge this language and technical gap by building GeM-Saathi, a multilingual AI assistant that lets MSMEs describe their business via text or voice in their native regional language and instantly matches them with highly relevant government contracts.
What we learned
We learned the profound difference between basic semantic matching and building a scalable, enterprise-grade data engineering pipeline to support it. Simple cosine similarity, $similarity(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$, isn't enough when bidding requires exact matches on constraints like turnover limits or technical certifications. We also learned how to leverage the Databricks Delta Lake Medallion architecture to provide a secure, time-travel-capable audit trail across our RAG (Retrieval-Augmented Generation) inference loops, ensuring that AI decisions remain verifiable.
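To illustrate why cosine similarity alone falls short, here is a minimal sketch of hybrid matching: hard eligibility constraints are enforced exactly first, and only the surviving tenders are ranked by semantic score. The field names (`min_turnover`, `required_certs`, `vec`) are illustrative, not our actual schema.

```python
import math

def cosine(a, b):
    # similarity(A, B) = (A . B) / (||A|| ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_tenders(profile, tenders):
    """Hybrid matching: exact constraint filtering first, semantic ranking second."""
    eligible = [
        t for t in tenders
        if profile["turnover"] >= t["min_turnover"]      # hard numeric constraint
        and t["required_certs"] <= profile["certs"]      # every certification must be held
    ]
    return sorted(eligible,
                  key=lambda t: cosine(profile["vec"], t["vec"]),
                  reverse=True)

# Toy example: the semantically perfect tender T1 is still excluded
# because the MSME misses its turnover floor.
profile = {"turnover": 5_000_000, "certs": {"ISO9001"}, "vec": [1.0, 0.0]}
tenders = [
    {"id": "T1", "min_turnover": 10_000_000, "required_certs": set(), "vec": [1.0, 0.0]},
    {"id": "T2", "min_turnover": 1_000_000, "required_certs": {"ISO9001"}, "vec": [0.9, 0.1]},
]
print([t["id"] for t in match_tenders(profile, tenders)])
```

A pure similarity ranking would have put T1 first; the constraint filter is what makes the match bid-worthy.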
How we built our project
GeM-Saathi is built upon a Medallion Architecture hosted entirely on the Databricks ecosystem:
- Ingestion (Bronze): Over 100 real GeM PDF bids are ingested, chunked, and stored natively in a Delta Lake `gem_tenders_delta` Bronze table.
- Indexing (Silver): Chunks are embedded using `all-MiniLM-L6-v2` (384-dimensional vectors) and persisted into a DBFS-backed FAISS vector index.
- Retrieval & Inference: When an MSME speaks into the app in regional Hindi, the audio is processed and routed through IIT Bombay's `IndicTrans2` model to formulate an English query. We retrieve the top 3 matching documents via FAISS.
- Generation & Audit (Gold): Sarvam AI's `Sarvam-1` LLM constructs a grounded eligibility checklist highlighting precisely why the user matches. The decision is translated back into Hindi for the Streamlit UI, and the transaction is permanently recorded in our `gem_rag_results_delta` Gold audit table. The entire solution is hosted serverlessly via Databricks Apps.
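The Silver indexing and retrieval steps above can be sketched as follows. To keep the sketch runnable without model weights, the `all-MiniLM-L6-v2` encoder is stubbed with deterministic pseudo-random unit vectors (matching its 384-dimensional output), and a plain NumPy inner-product search stands in for the FAISS index; the sample chunks are invented.

```python
import numpy as np

DIM = 384  # output dimension of the all-MiniLM-L6-v2 embedding model

def embed(texts):
    """Stand-in for SentenceTransformer('all-MiniLM-L6-v2').encode(texts).

    Returns deterministic pseudo-random unit vectors so the sketch runs
    without the model; swap in the real encoder in production.
    """
    vecs = []
    for t in texts:
        rng = np.random.default_rng(sum(map(ord, t)))  # seed from the text
        v = rng.normal(size=DIM)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs, dtype="float32")

# Silver layer: embed tender chunks. In the real pipeline these vectors
# are persisted into a DBFS-backed FAISS index.
chunks = [
    "Supply of LED street lights, 90W, BIS certified",
    "Annual maintenance contract for solar water pumps",
    "Procurement of office furniture, steel almirahs",
    "Turnkey installation of smart street lighting poles",
]
index_vecs = embed(chunks)

def retrieve(query: str, k: int = 3):
    """Retrieval: the English query (after IndicTrans2 translation) is
    embedded and matched by inner product, standing in for FAISS search."""
    q = embed([query])[0]
    scores = index_vecs @ q               # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]    # top-k highest-scoring chunks
    return [chunks[i] for i in top]
```

In the deployed app the search is served by a FAISS index loaded from DBFS, and the retrieved top-3 chunks are handed to Sarvam-1 to build the grounded eligibility checklist.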
Challenges we faced
- Deployment Constraints: Databricks Apps impose strict workspace upload limits (10 MB maximum per file). Our local vector database binaries initially exceeded these thresholds. We iteratively solved this by designing a dynamic packaging script that splits our vector blobs into micro-chunks during deployment and seamlessly reconstructs the FAISS index at container boot time.
- Multilingual Context Loss: Preserving hyper-technical numeric thresholds (like EMD requirement values or ISO specifications) across a Hindi -> English -> AI -> Hindi translation pipeline proved highly unstable. Bounding our context explicitly with FAISS retrieval and relying on India-native models like IndicTrans2 and Sarvam-1 dramatically mitigated translation decay.
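Our actual packaging script is more involved, but the split-and-reassemble idea behind the deployment workaround can be sketched like this (the 9 MB margin and the part-file naming are illustrative assumptions):

```python
from pathlib import Path

CHUNK = 9 * 1024 * 1024  # keep each part safely under the ~10 MB upload cap

def split_blob(src: Path, out_dir: Path, chunk: int = CHUNK) -> int:
    """Split a large binary (e.g. the FAISS index file) into upload-sized parts.

    Returns the number of parts written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    data = src.read_bytes()
    count = 0
    for n, i in enumerate(range(0, len(data), chunk)):
        (out_dir / f"{src.name}.part{n:03d}").write_bytes(data[i:i + chunk])
        count += 1
    return count

def reassemble(parts_dir: Path, name: str, dest: Path) -> None:
    """Run at container boot: concatenate the sorted parts back into the index."""
    parts = sorted(parts_dir.glob(f"{name}.part*"))
    dest.write_bytes(b"".join(p.read_bytes() for p in parts))
```

Zero-padded part numbers keep lexicographic sort equal to numeric order, so `reassemble` never needs a manifest file.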
Built with
- Languages: Python
- Databricks Ecosystem: Databricks Apps (Serverless Hosting), Databricks Delta Lake (Medallion Architecture & Time-Travel Auditing), Databricks File System (DBFS), PySpark
- India-Native AI Models: `IndicTrans2` (IIT Bombay) for dual-layer regional language translation, `Sarvam-1` (Sarvam AI) for contextual RAG generation.
- Vector & Embedding Tech: `all-MiniLM-L6-v2` (Sentence-Transformers) for local dense embedding execution, FAISS (CPU) for semantic indexing.
- Libraries & Frameworks: Streamlit (frontend interface), pandas, PyPDF2 & pdfplumber (PDF parsing).