In many enterprise systems, especially in test automation, system operations, and industrial or infrastructure-related domains, a large amount of critical logic still exists in the form of legacy scripts or DSL-style code (e.g., `.inc` files).
These codebases typically have the following characteristics:
- Written in non-mainstream or legacy scripting languages
- Inconsistent naming conventions and insufficient documentation
- Spread across many projects and directories
- Difficult for new engineers or cross-team developers to understand
Traditional code search tools (file-name or keyword-based) cannot answer questions such as:
- "What does this function actually do?"
- "When should this function be used?"
- "Which part of the system handles this behavior?"
This project was created to address exactly this problem.
The goal of this project is to build a function-level knowledge question-answering system for legacy codebases, with a focus on:
- Automatically understanding legacy code
  - Extracting complete function or logic blocks from `.inc` and similar scripts
  - Using local open-source LLMs (e.g., Gemma, LLaMA) to generate semantic explanations and documentation
- Transforming code into searchable knowledge assets
  - Structuring "function code + LLM-generated explanation + metadata" into JSON
  - Preparing high-quality input for vector search and RAG pipelines
- Building a local RAG-based chatbot
  - Storing embeddings in a local vector database (Chroma)
  - Retrieving the most relevant functions based on user questions
  - Using an LLM to generate clear, contextual answers
**The system does not aim to generate new code directly.
Instead, it focuses on understanding, explaining, and making existing code queryable.**
Key principles include:
- Function-level granularity
- Explainability over raw generation
- Low-risk evolution (original code remains untouched)
- Fully local execution (no cloud dependency)
Legacy `.inc` / Script Code → Function / Code Block Extraction → LLM-Based Semantic Explanation Generation → Annotated Files + Structured JSON Output → Vector Storage (Chroma) → RAG Retrieval + Local LLM Answer Generation → Interactive Chatbot Q&A
- Scans `.inc` files and similar scripts
- Identifies function or logic block boundaries using rules and regex
- Excludes control-flow constructs (`if`, `switch`, etc.) to avoid incorrect segmentation
- Ensures each extracted block is a self-contained, interpretable unit
📁 Related directory: `src/backend/function_extractor/`
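The extraction step can be sketched as a regex pass over the source. This is a minimal illustration, not the project's actual extractor: real `.inc` dialects vary, and the `function ... endfunction` syntax, the sample script, and the `extract_functions` helper below are all assumptions chosen for demonstration.

```python
import re

# Hypothetical .inc sample; real syntax varies by dialect.
SAMPLE = """\
function SyncTime(host)
    call NetTime(host)
endfunction

if DEBUG
    log "skipped"
endif

function StartPump(id, speed)
    call Motor(id, speed)
endfunction
"""

# Match whole function blocks only; control-flow constructs
# (if/switch/endif) never match, so they cannot be mis-segmented.
FUNC_RE = re.compile(
    r"^function\s+(\w+)\s*\(([^)]*)\)\s*\n(.*?)^endfunction\s*$",
    re.MULTILINE | re.DOTALL,
)

def extract_functions(source: str):
    """Return a list of {name, params, body} dicts, one per block."""
    blocks = []
    for m in FUNC_RE.finditer(source):
        name, params, body = m.group(1), m.group(2), m.group(3)
        blocks.append({
            "name": name,
            "params": [p.strip() for p in params.split(",") if p.strip()],
            "body": body.strip(),
        })
    return blocks
```

Each returned dict is a self-contained unit that can be passed to the annotation stage unchanged.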
- Supports local open-source models:
- Gemma
- LLaMA 3 (8B Instruct)
- Uses carefully designed prompt templates to generate:
- Semantic descriptions
- Input / output explanations
- Typical usage scenarios
- Generated explanations are:
- Inserted into annotated versions of the source files
- Exported as structured JSON for downstream processing
📁 Related directories:
- `src/backend/function_extractor/models/`
- `src/backend/function_extractor/prompt/template/`
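An explanation prompt might be assembled like this. The template wording and the `build_prompt` helper are illustrative assumptions; the project's real templates live under `src/backend/function_extractor/prompt/template/` and may differ.

```python
# Illustrative prompt template covering the three outputs listed above:
# semantic description, input/output explanation, usage scenarios.
EXPLAIN_TEMPLATE = """\
You are documenting a legacy script codebase.
Explain the following function for a new engineer.

Function name: {name}
Parameters: {params}
Source:
{code}

Answer with:
1. Semantic description (what it does)
2. Input / output explanation
3. Typical usage scenarios
"""

def build_prompt(name: str, params: list, code: str) -> str:
    """Fill the template for one extracted function block."""
    return EXPLAIN_TEMPLATE.format(
        name=name, params=", ".join(params), code=code
    )
```

The filled prompt is then sent to the locally running Gemma or LLaMA model, and the response is both inserted into the annotated source file and exported as JSON.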
For each extracted function, the system records:
- Project name
- File path
- Function name and parameters
- Original code block
- LLM-generated semantic explanation
These JSON artifacts form the core knowledge base for the RAG system.
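A record might be serialized as follows. The field names here mirror the list above but are assumptions, not the project's exact schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record shape; one instance per extracted function.
@dataclass
class FunctionRecord:
    project: str            # Project name
    file_path: str          # File path
    function_name: str      # Function name
    parameters: list        # Parameter names
    code: str               # Original code block
    explanation: str        # LLM-generated semantic explanation

def to_json(record: FunctionRecord) -> str:
    """Serialize one record for the RAG knowledge base."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)
```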
- Uses Chroma as a local vector database
- Embeds function explanations and code semantics
- User query β vector similarity search β relevant functions retrieved
- Retrieved context is passed to an LLM to generate a clear and concise answer
📁 Related directory: `src/backend/rag/`
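The retrieval step boils down to nearest-neighbor search over embedding vectors. Below is a dependency-free sketch of that idea using cosine similarity over toy 3-dimensional vectors; in the actual pipeline, Chroma stores the embeddings and performs this search, and the vectors come from a real embedding model rather than being hand-written.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=1):
    """store: list of (function_name, embedding) pairs.
    Returns the k function names most similar to the query."""
    ranked = sorted(store, key=lambda it: cosine(query_vec, it[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The names returned by `top_k` identify which function records (code plus explanation) are stuffed into the LLM's context to generate the final answer.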
- "Which function handles Windows time synchronization?"
- "Where is the logic related to pump startup implemented?"
- "What does this function in an `.inc` file actually do?"
- "Is there an existing function that implements a similar test flow?"
    RAGFunctionMentorChatbot/
    ├── src/
    │   ├── backend/
    │   │   ├── function_extractor/   # Function extraction & annotation
    │   │   ├── rag/                  # RAG retrieval & QA
    │   │   └── data_io/              # File read/write utilities
    │   ├── frontend/                 # Chatbot / UI (extensible)
    │   └── misc/
    ├── data/                         # Raw code and intermediate artifacts
    ├── resources/
    ├── tests/
    ├── requirements.txt
    ├── run.sh
    └── README.md
    pip install -r requirements.txt
Or manually:
    pip install langchain chromadb fastembed streamlit streamlit-chat
Refer to:
Download and run Gemma or LLaMA models locally.
    streamlit run frontend/streamlit_app.py
Then open:
`http://localhost:8501`
- ✅ Original code remains unchanged (annotated versions are separate)
- ✅ Fully local execution; no data leaves the machine
- ✅ Designed for real-world legacy systems, not idealized greenfield projects
- ✅ Lays the foundation for refactoring, governance, and onboarding
- ✅ Function-level code extraction implemented
- ✅ LLM-based annotation and explanation validated
- ✅ Structured JSON knowledge assets generated
- 🧪 RAG retrieval and chatbot prototype in progress
A local RAG system that transforms legacy script-based code into a searchable, function-level knowledge base, enabling engineers to understand and query complex systems through natural language.