Turn a GitHub repository or local codebase into a beginner-friendly Markdown tutorial.
This repository is a Pocket Flow-based CLI pipeline. The current implementation crawls source files, builds semantic code chunks, asks an LLM to identify the main abstractions, analyzes how they relate, orders the chapters, writes each chapter, and finally combines everything into a tutorial folder.
- Accepts either a GitHub repository URL with `--repo` or a local directory with `--dir`
- Filters files by include patterns, exclude patterns, and a max file size
- Respects `.gitignore` when crawling a local directory
- Uses a Node.js `code-chunk` sidecar to build semantic chunks when available
- Falls back to coarse local chunks when Node.js is unavailable, so generation can still run
- Uses a multi-step LLM workflow:
  - `FetchRepo`
  - `IdentifyAbstractions`
  - `AnalyzeRelationships`
  - `OrderChapters`
  - `WriteChapters`
  - `CombineTutorial`
- Generates tutorials in `Chinese` by default
The current tutorial writer is tuned for beginner-friendly output: short code blocks, Mermaid diagrams, cross-chapter links, simple explanations, and analogy-heavy walkthroughs.
Running the generator creates a pf_guide/ folder by default:
pf_guide/
├─ index.md
├─ 01_<chapter_name>.md
├─ 02_<chapter_name>.md
└─ ...
index.md contains:
- a short project summary
- a Mermaid relationship graph
- ordered chapter links
The repository also contains previously generated examples under `docs/`.
flowchart LR
A[Repo URL or Local Dir] --> B[File Crawl]
B --> C[Semantic Chunk Inventory]
C --> D[Identify Abstractions]
D --> E[Analyze Relationships]
E --> F[Order Chapters]
F --> G[Write Chapters]
G --> H[Combine Markdown Tutorial]
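The stages in the diagram can be pictured as steps that read from and write to one shared dictionary. The sketch below only illustrates that pattern in plain Python. It does not use the Pocket Flow API, and every function body, dictionary key, and sample file is a hypothetical stand-in for the real nodes in `nodes.py`, which call an LLM instead.

```python
from typing import Any, Callable, Dict, List

Shared = Dict[str, Any]


def fetch_repo(shared: Shared) -> None:
    # Stand-in: the real step crawls a repository or local directory.
    shared["files"] = {"app.py": "def run(): ...", "db.py": "class Store: ..."}


def identify_abstractions(shared: Shared) -> None:
    # Stand-in: use file stems instead of asking an LLM.
    shared["abstractions"] = [path.rsplit(".", 1)[0] for path in shared["files"]]


def analyze_relationships(shared: Shared) -> None:
    # Stand-in: link each abstraction to the next one in the list.
    names = shared["abstractions"]
    shared["relationships"] = list(zip(names, names[1:]))


def order_chapters(shared: Shared) -> None:
    # Stand-in: alphabetical order instead of an LLM-chosen order.
    shared["chapter_order"] = sorted(shared["abstractions"])


def write_chapters(shared: Shared) -> None:
    shared["chapters"] = [f"# Chapter: {name}\n" for name in shared["chapter_order"]]


def combine_tutorial(shared: Shared) -> None:
    # Build the ordered chapter file names that an index would link to.
    shared["index"] = "\n".join(
        f"{i:02d}_{name}.md" for i, name in enumerate(shared["chapter_order"], start=1)
    )


PIPELINE: List[Callable[[Shared], None]] = [
    fetch_repo,
    identify_abstractions,
    analyze_relationships,
    order_chapters,
    write_chapters,
    combine_tutorial,
]


def run_pipeline() -> Shared:
    shared: Shared = {}
    for step in PIPELINE:
        step(shared)  # each step mutates the shared state in place
    return shared
```

Calling `run_pipeline()` returns the shared dictionary after all six steps, which makes it easy to inspect what each stage contributed.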
Implementation entry points:
- `main.py`: CLI entry and shared state setup
- `flow.py`: Pocket Flow pipeline wiring
- `nodes.py`: the six pipeline nodes and prompts
- `utils/semantic_chunks.py`: chunk inventory building and fallback chunking
- `tools/code_chunk_adapter.mjs`: Node sidecar for `code-chunk`
- `utils/call_llm.py`: provider selection, cache, logging, telemetry
C0de1ndex/ is present in this repository, but it is not part of the default main.py -> flow.py -> nodes.py execution path.
- Python 3.10+
- `pip`
- Recommended: Node.js 18+ and `npm`
- Recommended for SSH repository URLs: Git
Install dependencies:
pip install -r requirements.txt
npm install
Notes:
- Node.js is recommended, not strictly required. If `node` is not available, the Python pipeline falls back to less precise local chunks.
- `npm install` is only used for the semantic chunking sidecar through the `code-chunk` package.
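To make the fallback idea concrete, here is a minimal sketch of how a pipeline might detect Node.js and otherwise split a file into coarse, fixed-size line chunks. The function names, the 40-line chunk size, and the chunk dictionary shape are assumptions for illustration; the actual logic lives in `utils/semantic_chunks.py` and may differ.

```python
import shutil
from typing import Any, Dict, List


def node_available() -> bool:
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which("node") is not None


def coarse_chunks(path: str, text: str, lines_per_chunk: int = 40) -> List[Dict[str, Any]]:
    """Split text into fixed-size line blocks (hypothetical fallback strategy)."""
    lines = text.splitlines()
    chunks: List[Dict[str, Any]] = []
    for start in range(0, len(lines), lines_per_chunk):
        block = lines[start:start + lines_per_chunk]
        chunks.append({
            "path": path,
            "start_line": start + 1,         # 1-based, inclusive
            "end_line": start + len(block),  # 1-based, inclusive
            "text": "\n".join(block),
        })
    return chunks


def choose_chunker() -> str:
    # Prefer semantic chunking only when the Node.js sidecar could run.
    return "semantic" if node_available() else "coarse"
```

Fixed-size blocks ignore function and class boundaries, which is why this kind of fallback is less precise than semantic chunking.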
Environment variables are loaded from a local .env file via python-dotenv.
If either GEMINI_PROJECT_ID or GEMINI_API_KEY is set, the code automatically uses Gemini.
GEMINI_API_KEY=your_api_key
GEMINI_MODEL=gemini-2.5-pro-exp-03-25
For Vertex AI:
GEMINI_PROJECT_ID=your_gcp_project
GEMINI_LOCATION=us-central1
GEMINI_MODEL=gemini-2.5-pro-exp-03-25
For non-Gemini providers, the current code expects:
LLM_PROVIDER=OPENROUTER
OPENROUTER_MODEL=your_model_name
OPENROUTER_BASE_URL=https://openrouter.ai/api
OPENROUTER_API_KEY=your_api_key
The same shape works for other providers such as XAI or OLLAMA:
LLM_PROVIDER=OLLAMA
OLLAMA_MODEL=qwen2.5-coder:14b
OLLAMA_BASE_URL=http://localhost:11434
`<PROVIDER>_API_KEY` is optional for local providers such as Ollama.
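The selection rules above can be summarized in a few lines. This is a sketch of the described behavior only, not the code in `utils/call_llm.py`: the function name `resolve_provider`, the returned dictionary shape, and the error message are assumptions.

```python
from typing import Dict, Mapping, Optional


def resolve_provider(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Pick provider settings from environment-style variables (illustrative)."""
    # Gemini wins whenever either Gemini variable is set.
    if env.get("GEMINI_PROJECT_ID") or env.get("GEMINI_API_KEY"):
        return {
            "provider": "GEMINI",
            "model": env.get("GEMINI_MODEL"),
            "base_url": None,
            "api_key": env.get("GEMINI_API_KEY"),
        }
    provider = env.get("LLM_PROVIDER", "").upper()
    if not provider:
        raise ValueError("Set GEMINI_API_KEY, GEMINI_PROJECT_ID, or LLM_PROVIDER")
    # Other providers share the <PROVIDER>_MODEL / _BASE_URL / _API_KEY shape.
    return {
        "provider": provider,
        "model": env.get(f"{provider}_MODEL"),
        "base_url": env.get(f"{provider}_BASE_URL"),
        "api_key": env.get(f"{provider}_API_KEY"),  # may be None for local providers
    }
```

Passing `os.environ` after loading `.env` would apply the same rules to the real environment; passing a plain dictionary keeps the function easy to test.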
If you analyze private repositories, or want to reduce rate-limit issues for public ones, set:
GITHUB_TOKEN=your_github_token
python utils/call_llm.py
Show CLI help:
python main.py --help
Analyze a public GitHub repository:
python main.py --repo https://github.com/pallets/flask
Analyze a branch or subdirectory URL:
python main.py --repo https://github.com/langchain-ai/langgraph/tree/main/libs/langgraph
Analyze an SSH repository URL:
python main.py --repo git@github.com:owner/private-repo.git --token your_github_token
Analyze a local directory:
python main.py --dir /path/to/codebase
Generate English output instead of the default Chinese:
python main.py --repo https://github.com/pallets/flask --language English
Use custom include and exclude filters:
python main.py --dir . --include "*.py" "*.ts" "*.md" --exclude "tests/*" "docs/*"
Write output somewhere else:
python main.py --repo https://github.com/pallets/flask --output generated_tutorials
Run the local web console:
python -m webapp.server
If `webapp/bin/` is missing the Windows API Code Pack DLLs on a fresh machine, install them once:
powershell -ExecutionPolicy Bypass -File tools/install_windows_api_code_pack.ps1
Then open http://127.0.0.1:8765 in your browser. The web console currently supports:
- opening the native Windows `CommonOpenFileDialog` folder picker from the `分析目录` ("Analysis Directory") browse button and filling the selected repository path automatically
- adding local repository analysis jobs into a queue
- deleting pending/completed/failed jobs from the queue
- setting include/exclude patterns and core analysis parameters
- defaulting output to `<selected_repo>/output`
- starting the queue and watching per-task logs and output paths
- `--repo`: GitHub repository URL
- `--dir`: local directory path
- `-n, --name`: override the derived project name
- `-t, --token`: GitHub token; otherwise `GITHUB_TOKEN` is used
- `-o, --output`: base output directory, default `output`
- `-i, --include`: include glob patterns
- `-e, --exclude`: exclude glob patterns
- `-s, --max-size`: max file size in bytes, default `100000`
- `--language`: tutorial language, default `Chinese`
- `--max-abstractions`: use `auto` to let the LLM estimate a suitable chapter count, or pass a positive integer to cap the number of tutorial abstractions; default `auto`
- `--no-cache`: disable prompt-level LLM response caching
- `--max-extraction-batches`: override bounded extraction batch count
- `--llm-extraction-concurrency`: override concurrent extraction workers
--repo and --dir are mutually exclusive, and one of them is required.
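One common way to express "exactly one of two flags" with the standard library is an `argparse` mutually exclusive group marked as required. The sketch below shows that technique with a handful of the flags listed above. It is an assumption about how such a rule can be encoded, not a copy of the parser in `main.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the source-selection flags")
    # required=True means one of the group members must be given;
    # the group itself rejects passing both at once.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--repo", help="GitHub repository URL")
    source.add_argument("--dir", help="local directory path")
    parser.add_argument("--language", default="Chinese", help="tutorial language")
    parser.add_argument("-s", "--max-size", type=int, default=100000,
                        help="max file size in bytes")
    return parser
```

With this setup, `build_parser().parse_args(["--dir", "."])` succeeds, while passing both `--repo` and `--dir`, or neither, makes `argparse` print a usage error and exit with status 2.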
The current defaults are intentionally conservative.
Included by default:
- source files such as `*.py`, `*.js`, `*.ts`, `*.tsx`, `*.go`, `*.java`, `*.c`, `*.cpp`
- docs and config-like files such as `*.md`, `*.rst`, `*.yaml`, `*.yml`, `*Dockerfile`, `*Makefile`
Excluded by default:
- `assets`, `images`, `public`, `static`, `temp`
- test folders and test-like files
- `docs`, `examples`, `dist`, `build`, `legacy`, `experimental`
- `.git`, `.github`, `.next`, `.vscode`, `node_modules`, virtual environments, logs
If you want to analyze a documentation-heavy repository, pay attention to the default docs exclusion and override it explicitly.
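As an illustration of how include patterns, exclude patterns, and a size cap can interact, here is a small sketch built on the standard `fnmatch` module. The function name, the order of checks, and the exact matching semantics are assumptions; the real crawler may match paths differently.

```python
import fnmatch
from typing import Iterable


def should_include(path: str, size: int, include: Iterable[str],
                   exclude: Iterable[str], max_size: int = 100000) -> bool:
    """Return True when a file passes the size cap and both pattern lists."""
    if size > max_size:
        return False
    # An exclude match always wins over an include match in this sketch.
    if any(fnmatch.fnmatchcase(path, pattern) for pattern in exclude):
        return False
    return any(fnmatch.fnmatchcase(path, pattern) for pattern in include)
```

Under these assumptions, `docs/guide.md` is rejected while `"docs/*"` is in the exclude list and accepted once that pattern is removed, which mirrors the advice above about documentation-heavy repositories.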
The current code writes and reads these runtime artifacts:
- `llm_cache.json`: prompt-response cache
- `logs/llm_calls_YYYYMMDD.log`: raw LLM call log
- `logs/llm_metrics_YYYYMMDD.jsonl`: structured telemetry
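A prompt-response cache of this kind can be as simple as a JSON object keyed by the prompt text. The sketch below shows that idea under stated assumptions: the function name `cached_call`, the `use_cache` switch (standing in for what `--no-cache` disables), and the flat file layout are all hypothetical and may not match the real `llm_cache.json` format.

```python
import json
import os
from typing import Callable, Dict


def cached_call(prompt: str, call: Callable[[str], str],
                cache_path: str = "llm_cache.json", use_cache: bool = True) -> str:
    """Return a cached response for the prompt, calling the model only on a miss."""
    cache: Dict[str, str] = {}
    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as handle:
            cache = json.load(handle)
    if use_cache and prompt in cache:
        return cache[prompt]
    response = call(prompt)  # `call` is any function that sends the prompt to an LLM
    if use_cache:
        cache[prompt] = response
        with open(cache_path, "w", encoding="utf-8") as handle:
            json.dump(cache, handle, ensure_ascii=False, indent=2)
    return response
```

Because the key is the full prompt, any change to the prompt text produces a cache miss and a fresh call.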
Useful environment variables:
- `LOG_DIR`: override the log directory, default `logs`
- `LLM_HTTP_TIMEOUT`: HTTP timeout in seconds for provider calls, default `120`
- `LLM_TELEMETRY=0`: disable telemetry file writing
- `LLM_TELEMETRY_FILE`: custom telemetry file path
- `LLM_MAX_EXTRACTION_BATCHES`: default extraction batch cap, default `40`
- `LLM_EXTRACTION_CONCURRENCY`: default extraction concurrency, default `1`
The CLI flags --max-extraction-batches and --llm-extraction-concurrency override the corresponding environment defaults for a run.
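The precedence described here (CLI flag, then environment variable, then built-in default) can be sketched as a small helper. The name `resolve_int_setting` and its signature are hypothetical and only illustrate the ordering.

```python
from typing import Mapping, Optional


def resolve_int_setting(cli_value: Optional[int], env: Mapping[str, str],
                        env_name: str, default: int) -> int:
    """Resolve an integer setting: CLI flag first, then environment, then default."""
    if cli_value is not None:
        return cli_value
    raw = env.get(env_name)
    if raw is not None and raw.strip():
        return int(raw)  # raises ValueError on non-numeric input
    return default
```

For example, `resolve_int_setting(None, {"LLM_MAX_EXTRACTION_BATCHES": "10"}, "LLM_MAX_EXTRACTION_BATCHES", 40)` yields `10`, while passing a CLI value of `5` yields `5` regardless of the environment.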
Run the current test suite with:
python -m unittest discover tests
The tests cover:
- CLI defaults such as the default tutorial language
- semantic chunk mapping and fallback behavior
- chunk packing and file-index extraction
- the compact abstraction-planning and refinement contract
A minimal Dockerfile is included:
docker build -t tutorial-builder .
docker run --rm -it -e GEMINI_API_KEY=your_api_key -v "$(pwd)/output":/app/output tutorial-builder --repo https://github.com/pallets/flask
Important caveat:
- the current Dockerfile installs Python dependencies and Git
- it does not install Node.js or `npm`
- inside that image, semantic chunking falls back to the Python-side fallback chunks unless you extend the image yourself
.
├─ main.py
├─ flow.py
├─ nodes.py
├─ utils/
├─ tools/
├─ tests/
├─ docs/
└─ C0de1ndex/
- `docs/` contains generated example tutorials for publishing
- `tests/fixtures/` contains small polyglot fixtures for chunking tests
- `C0de1ndex/` is a separate Go-based experiment that is not called by the default Python flow
MIT
