This document outlines the steps and best practices for adding support for a new programming language to CodeGraphContext. By following this guide, contributors can efficiently integrate new languages and leverage the Neo4j graph for verification.
CodeGraphContext uses a modular architecture for multi-language support:
- Generic TreeSitterParser (in graph_builder.py): Acts as a wrapper that dispatches parsing tasks to language-specific implementations.
- Language-Specific Parser Modules (in src/codegraphcontext/tools/languages/): Each language (e.g., Python, JavaScript) has its own module (e.g., python.py, javascript.py) containing:
  - Tree-sitter queries (<LANG>_QUERIES).
  - A <Lang>TreeSitterParser class that encapsulates language-specific parsing logic.
  - A pre_scan_<lang> function for initial symbol mapping.
- GraphBuilder (in graph_builder.py): Manages the overall graph building process, including file discovery, pre-scanning, and dispatching to the correct language parser.
- Create a new file: src/codegraphcontext/tools/languages/typescript.py.
- Add the necessary imports: from pathlib import Path, from typing import Any, Dict, Optional, Tuple, import logging, and import ast (if needed for AST manipulation).
- Define TS_QUERIES (Tree-sitter queries for TypeScript).
- Create a TypescriptTreeSitterParser class.
- Create a pre_scan_typescript function.
This is the most critical and often iterative step. You'll need to define queries for:
- functions: Function declarations, arrow functions, methods.
- classes: Class declarations, class expressions.
- imports: ES6 imports (import ... from ...), CommonJS require().
- calls: Function calls, method calls.
- variables: Variable declarations (let, const, var).
- docstrings: (Optional) How documentation comments are identified.
- lambda_assignments: (Optional, Python-specific) If the language has similar constructs.
Tips for Query Writing:
- Consult Tree-sitter Grammars: Find the node-types.json or grammar definition for your language (e.g., tree-sitter-typescript).
- Use tree-sitter parse: Use the tree-sitter parse command-line tool to inspect the AST of sample code snippets. This is invaluable for identifying correct node types and field names.
- Start Simple: Begin with basic queries and gradually add complexity.
- Test Iteratively: After each query, test it with sample code.
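
The query categories listed above could start out as something like the following. This is a minimal sketch, not the project's actual TS_QUERIES: the capture names (@name, @function_node, and so on) are hypothetical, and the node and field names are assumptions based on the tree-sitter-typescript grammar that should be checked against node-types.json or tree-sitter parse output.

```python
# Hypothetical starting point for TS_QUERIES in typescript.py.
# Node types and field names are assumptions taken from the
# tree-sitter-typescript grammar; verify them before relying on them.
TS_QUERIES = {
    "functions": """
        (function_declaration
            name: (identifier) @name) @function_node
        (method_definition
            name: (property_identifier) @name) @function_node
        (variable_declarator
            name: (identifier) @name
            value: (arrow_function)) @function_node
    """,
    "classes": """
        (class_declaration
            name: (type_identifier) @name) @class
    """,
    "imports": """
        (import_statement
            source: (string) @source) @import
    """,
    "calls": """
        (call_expression
            function: (identifier) @name) @call
        (call_expression
            function: (member_expression
                property: (property_identifier) @name)) @call
    """,
    "variables": """
        (variable_declarator
            name: (identifier) @name) @variable
    """,
}
```

Keeping one entry per category makes it easy to follow the "Start Simple" tip: begin with a single pattern per key, confirm it matches a sample file, then add patterns such as class expressions or require() calls.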
This class (e.g., TypescriptTreeSitterParser) will encapsulate the language-specific logic.
- __init__(self, generic_parser_wrapper):
  - Store generic_parser_wrapper, language_name, language, and parser from the generic wrapper.
  - Load TS_QUERIES using self.language.query(query_str).
- Helper Methods:
  - _get_node_text(self, node): Extracts text from a tree-sitter node.
  - _get_parent_context(self, node, types=...): (Language-specific node types for context).
  - _calculate_complexity(self, node): (Language-specific complexity nodes).
  - _get_docstring(self, body_node): (Language-specific docstring extraction).
- parse(self, path: Path, is_dependency: bool = False) -> Dict:
  - Reads the file and parses it with self.parser.
  - Calls its own _find_* methods (_find_functions, _find_classes, etc.).
  - Returns a standardized dictionary format (as seen in python.py and javascript.py).
- _find_* Methods: Implement these for each query type, extracting data from the AST and populating the standardized dictionary.
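
The overall shape of such a class might look like the sketch below. Several things here are assumptions rather than the project's code: the keys of the returned dictionary (check python.py and javascript.py for the real format), the _walk and _find_by_type helpers, and the use of a plain tree walk in place of compiled queries. The walk is swapped in deliberately because the query capture API differs between py-tree-sitter releases, while node.type, node.children, node.text, and child_by_field_name are stable.

```python
from pathlib import Path
from typing import Any, Dict, Iterator, List


class TypescriptTreeSitterParser:
    """Sketch of a language-specific parser; not the project's actual code."""

    def __init__(self, generic_parser_wrapper: Any) -> None:
        # Reuse the objects the generic wrapper has already created.
        self.generic_parser_wrapper = generic_parser_wrapper
        self.language_name = generic_parser_wrapper.language_name
        self.language = generic_parser_wrapper.language
        self.parser = generic_parser_wrapper.parser

    def _get_node_text(self, node: Any) -> str:
        # Tree-sitter nodes expose their source text as bytes.
        return node.text.decode("utf8")

    def _walk(self, node: Any) -> Iterator[Any]:
        # Hypothetical helper: depth-first traversal of the syntax tree.
        yield node
        for child in node.children:
            yield from self._walk(child)

    def _find_by_type(self, root: Any, node_type: str) -> List[Dict[str, Any]]:
        # Hypothetical helper: collect named nodes of a single type.
        found = []
        for node in self._walk(root):
            if node.type != node_type:
                continue
            name_node = node.child_by_field_name("name")
            if name_node is None:
                continue
            found.append({
                "name": self._get_node_text(name_node),
                "line_number": node.start_point[0] + 1,  # rows are 0-based
            })
        return found

    def _find_functions(self, root: Any) -> List[Dict[str, Any]]:
        return self._find_by_type(root, "function_declaration")

    def _find_classes(self, root: Any) -> List[Dict[str, Any]]:
        return self._find_by_type(root, "class_declaration")

    def parse(self, path: Path, is_dependency: bool = False) -> Dict[str, Any]:
        # Read the file, parse it, and delegate to the _find_* methods.
        tree = self.parser.parse(Path(path).read_bytes())
        root = tree.root_node
        return {  # key names are assumed, not confirmed by this guide
            "file_path": str(path),
            "functions": self._find_functions(root),
            "classes": self._find_classes(root),
            "is_dependency": is_dependency,
            "lang": self.language_name,
        }
```

Once the queries are stable, each _find_* method can switch from the tree walk to the compiled query for its category without changing the shape of parse().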
This function (e.g., pre_scan_typescript) will quickly scan files to build an initial imports_map.
- It takes files: list[Path] and parser_wrapper (an instance of TreeSitterParser).
- Uses a simplified query (e.g., for class_declaration and function_declaration) to quickly find definitions.
- Returns a dictionary mapping symbol names to file paths.
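
A pre-scan along these lines could be sketched as follows. It assumes that parser_wrapper exposes a parser attribute, and it maps each symbol name to a list of paths so that duplicate names in different files are not lost; both points are assumptions to confirm against the existing pre_scan functions. A simple tree walk stands in for the simplified query.

```python
from pathlib import Path
from typing import Any, Dict, List

# Node types treated as top-level definitions during the pre-scan.
DEFINITION_TYPES = {"class_declaration", "function_declaration"}


def pre_scan_typescript(files: List[Path], parser_wrapper: Any) -> Dict[str, List[str]]:
    """Map each class or function name to the files that define it."""
    imports_map: Dict[str, List[str]] = {}
    for path in files:
        tree = parser_wrapper.parser.parse(Path(path).read_bytes())
        stack = [tree.root_node]
        while stack:  # iterative walk over every node in the file
            node = stack.pop()
            stack.extend(node.children)
            if node.type not in DEFINITION_TYPES:
                continue
            name_node = node.child_by_field_name("name")
            if name_node is not None:
                name = name_node.text.decode("utf8")
                imports_map.setdefault(name, []).append(str(path))
    return imports_map
```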
- GraphBuilder.__init__:
  - Add '.ts': TreeSitterParser('typescript') to self.parsers.
- TreeSitterParser.__init__:
  - Add an elif self.language_name == 'typescript': block to initialize self.language_specific_parser with TypescriptTreeSitterParser(self).
- GraphBuilder._pre_scan_for_imports:
  - Add an elif '.ts' in files_by_lang: block to import pre_scan_typescript and call it.
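
The three registration points fit together roughly as in the sketch below. All class bodies are simplified stand-ins so the example runs on its own; the real graph_builder.py has more state, and the new branches would be elif branches placed after the existing languages. The way files_by_lang is built here is also an assumption.

```python
from pathlib import Path
from typing import Any, Dict, List


class TypescriptTreeSitterParser:
    """Stand-in for the class defined in languages/typescript.py."""

    def __init__(self, generic_parser_wrapper: Any) -> None:
        self.generic_parser_wrapper = generic_parser_wrapper


def pre_scan_typescript(files: List[Path], parser_wrapper: Any) -> Dict[str, List[str]]:
    """Stand-in that maps each file stem to its path instead of real symbols."""
    return {Path(f).stem: [str(f)] for f in files}


class TreeSitterParser:
    def __init__(self, language_name: str) -> None:
        self.language_name = language_name
        # In the real file this is an elif after the existing languages.
        if self.language_name == "typescript":
            self.language_specific_parser = TypescriptTreeSitterParser(self)
        else:
            raise ValueError(f"Unsupported language: {language_name}")


class GraphBuilder:
    def __init__(self) -> None:
        # Registration point 1: map the file extension to a parser.
        self.parsers = {".ts": TreeSitterParser("typescript")}

    def _pre_scan_for_imports(self, files: List[Path]) -> Dict[str, List[str]]:
        files_by_lang: Dict[str, List[Path]] = {}
        for f in files:
            files_by_lang.setdefault(Path(f).suffix, []).append(f)
        imports_map: Dict[str, List[str]] = {}
        # Registration point 3: also an elif in the real method.
        if ".ts" in files_by_lang:
            imports_map.update(
                pre_scan_typescript(files_by_lang[".ts"], self.parsers[".ts"])
            )
        return imports_map
```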
After implementing support for a new language, it's crucial to verify that the graph is being built correctly.
Create a small sample project for your new language (e.g., tests/sample_project_typescript/) with:
- Function declarations.
- Class declarations (including inheritance).
- Various import types (if applicable).
- Function calls.
- Variable declarations.
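
One way to create such a project is a small script that writes the sample files. The sketch below is illustrative only: the file names, the TypeScript contents, and the write_sample_project helper are all hypothetical, chosen so that every item in the list above appears at least once.

```python
from pathlib import Path
from typing import List

# Hypothetical sample sources covering functions, classes with inheritance,
# imports, calls, and variable declarations.
SAMPLE_FILES = {
    "shapes.ts": """\
export class Shape {
    constructor(public name: string) {}
    describe(): string {
        return `Shape ${this.name}`;
    }
}

export class Circle extends Shape {
    constructor(public radius: number) {
        super("circle");
    }
}

export function area(c: Circle): number {
    return Math.PI * c.radius * c.radius;
}
""",
    "main.ts": """\
import { Circle, area } from "./shapes";

const unit = new Circle(1);
let total = area(unit);

function report(): void {
    console.log(unit.describe(), total);
}

report();
""",
}


def write_sample_project(root: Path) -> List[Path]:
    """Write the sample files under root and return the created paths."""
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in SAMPLE_FILES.items():
        target = root / name
        target.write_text(content, encoding="utf8")
        written.append(target)
    return written
```

Calling write_sample_project(Path("tests/sample_project_typescript")) from the repository root would produce the directory named above.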
- Delete existing data (if any):
# Replace with your sample project path <tool_code>print(default_api.delete_repository(repo_path='/path/to/your/sample_project'))</tool_code>
- Index the project:
# Replace with your sample project path <tool_code>print(default_api.add_code_to_graph(path='/path/to/your/sample_project'))</tool_code>
- Monitor job status:
# Use the job_id returned by add_code_to_graph <tool_code>print(default_api.check_job_status(job_id='<your_job_id>'))</tool_code>
Use Cypher queries to inspect the generated graph.
- Check for Files and Language Tags:
  MATCH (f:File) WHERE f.path STARTS WITH '/path/to/your/sample_project' RETURN f.name, f.path, f.lang
  Expected: All files from your sample project should be listed with the correct lang tag.
- Check for Functions:
  MATCH (f:File)-[:CONTAINS]->(fn:Function) WHERE f.path STARTS WITH '/path/to/your/sample_project' AND fn.lang = '<your_language_name>' RETURN f.name AS FileName, fn.name AS FunctionName, fn.line_number AS Line
  Expected: All functions from your sample project should be listed.
- Check for Classes:
  MATCH (f:File)-[:CONTAINS]->(c:Class) WHERE f.path STARTS WITH '/path/to/your/sample_project' AND c.lang = '<your_language_name>' RETURN f.name AS FileName, c.name AS ClassName, c.line_number AS Line
  Expected: All classes from your sample project should be listed.
- Check for Imports (Module-level):
  MATCH (f:File)-[:IMPORTS]->(m:Module) WHERE f.path STARTS WITH '/path/to/your/sample_project' AND f.lang = '<your_language_name>' RETURN f.name AS FileName, m.name AS ImportedModule, m.full_import_name AS FullImportName
  Expected: All module-level imports should be listed.
- Check for Function Calls:
  MATCH (caller:Function)-[:CALLS]->(callee:Function) WHERE caller.path STARTS WITH '/path/to/your/sample_project' AND caller.lang = '<your_language_name>' RETURN caller.name AS Caller, callee.name AS Callee, caller.path AS CallerFile, callee.path AS CalleeFile
  Expected: All function calls should be correctly linked.
- Check for Class Inheritance:
  MATCH (child:Class)-[:INHERITS]->(parent:Class) WHERE child.path STARTS WITH '/path/to/your/sample_project' AND child.lang = '<your_language_name>' RETURN child.name AS ChildClass, parent.name AS ParentClass, child.path AS ChildFile, parent.path AS ParentFile
  Expected: All inheritance relationships should be correctly linked.
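
To avoid editing the path and language name by hand in every query, the checks can be parameterized. The sketch below is one possible arrangement, not part of CodeGraphContext: CHECKS and run_checks are hypothetical names, and run_query is any callable you supply that executes Cypher with parameters (for example, a thin wrapper around a Neo4j driver session's run method). Only four of the checks are included to keep it short.

```python
from typing import Any, Callable, Dict, List

# Same checks as above, with $root and $lang as Cypher parameters.
CHECKS: Dict[str, str] = {
    "functions": (
        "MATCH (f:File)-[:CONTAINS]->(fn:Function) "
        "WHERE f.path STARTS WITH $root AND fn.lang = $lang "
        "RETURN f.name AS FileName, fn.name AS FunctionName, fn.line_number AS Line"
    ),
    "classes": (
        "MATCH (f:File)-[:CONTAINS]->(c:Class) "
        "WHERE f.path STARTS WITH $root AND c.lang = $lang "
        "RETURN f.name AS FileName, c.name AS ClassName, c.line_number AS Line"
    ),
    "calls": (
        "MATCH (caller:Function)-[:CALLS]->(callee:Function) "
        "WHERE caller.path STARTS WITH $root AND caller.lang = $lang "
        "RETURN caller.name AS Caller, callee.name AS Callee"
    ),
    "inheritance": (
        "MATCH (child:Class)-[:INHERITS]->(parent:Class) "
        "WHERE child.path STARTS WITH $root AND child.lang = $lang "
        "RETURN child.name AS ChildClass, parent.name AS ParentClass"
    ),
}


def run_checks(
    run_query: Callable[[str, Dict[str, str]], Any],
    root: str,
    lang: str,
) -> Dict[str, List[Any]]:
    """Run every check and return the rows keyed by check name."""
    params = {"root": root, "lang": lang}
    return {name: list(run_query(cypher, params)) for name, cypher in CHECKS.items()}
```

An empty list for any key points to the corresponding _find_* method or relationship-building step as the first place to look.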
- NameError: Invalid node type ...: Your tree-sitter query is using a node type that doesn't exist in the language's grammar. Use tree-sitter parse to inspect the AST.
- Missing Relationships (e.g., CALLS, IMPORTS):
  - Check _find_* methods: Ensure your _find_* methods are correctly extracting the necessary data.
  - Check imports_map: Verify that the pre_scan_<lang> function is correctly populating the imports_map.
  - Check local_imports map: Ensure the local_imports map (built in _create_function_calls and _create_inheritance_links) is correctly resolving symbols.
- Incorrect lang tags: Ensure self.language_name is correctly passed and stored.
By following these steps, contributors can effectively add and verify new language support.