Synthetic Document Generation Pipeline
An AI-powered pipeline that generates realistic synthetic PDF documents from text prompts and reference images. Built to provide high-quality training data for document AI and OCR models.
Generating realistic synthetic documents for ML training is difficult. Standard mock data looks "too perfect," and LLMs alone struggle to keep spatial layout consistent and to follow domain-specific rules.
pAIge fixes this by distributing the work across specialized stages: deterministic data generation, LLM-driven spatial reasoning, and visual degradation.
- Translates natural language prompts into structured Document Manifests (JSON).
- Enforces strict layout rules (margins, font hierarchies, required zones).
- Faker-First: Generates realistic, correctly formatted values (names, SSNs, IDs) before layout starts.
- Contextual Injection: The Architect (the layout LLM stage) is constrained to values from the "Field Bank," eliminating content hallucinations.
- Smart Resolution: A secondary LLM stage resolves layout placeholders into final, formatted text using the Field Bank.
- Simulates physical document aging: stains, folds, lighting gradients, and scanner noise.
- Variability: Randomized filter intensities ensure uniqueness in batch generations.
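The Faker-first and placeholder-resolution ideas above can be sketched in a few lines. This is a minimal illustration, not the project's code: it uses the standard library's `random` as a stand-in for Faker so it runs without dependencies, it replaces the LLM resolution stage with a plain deterministic substitution, and the `{{field}}` placeholder syntax, the field names, and the functions `build_field_bank` and `resolve` are all assumptions.

```python
import random
import re

# Assumed placeholder syntax: {{ field_name }}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def build_field_bank(seed: int) -> dict[str, str]:
    """Generate every document value up front (stdlib stand-in for Faker)."""
    rng = random.Random(seed)
    first = rng.choice(["Ana", "Ben", "Cara", "Dev"])
    last = rng.choice(["Reyes", "Okafor", "Lind", "Tran"])
    return {
        "full_name": f"{first} {last}",
        # SSN-shaped string (3-2-4 digits): correct format only, not a real number.
        "ssn": f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",
        "record_id": f"ID-{rng.randint(0, 999999):06d}",
    }


def resolve(template: str, bank: dict[str, str]) -> str:
    """Replace placeholders with bank values; reject fields missing from the bank."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in bank:
            raise KeyError(f"placeholder {key!r} is not in the field bank")
        return bank[key]

    return PLACEHOLDER.sub(substitute, template)


bank = build_field_bank(seed=7)
print(resolve("Applicant: {{ full_name }} (SSN {{ ssn }})", bank))
```

Raising on an unknown placeholder is one way to enforce the "only use the Field Bank" constraint: layout text can reference values but cannot introduce new ones.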
| Layer | Technology |
|---|---|
| API Framework | FastAPI |
| Orchestration | LangChain |
| LLM Providers | Groq (Llama 3.3), Cerebras (Llama 3.1), Gemini |
| PDF Engine | ReportLab |
| Data Generation | Faker |
| Visual Effects | Augraphy |
| Database | SQLite via SQLModel |
┌────────────────────────────────────────────────────────────┐
│                  pAIge Pipeline (FastAPI)                  │
│                                                            │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐     │
│  │   Stage 1   │    │   Stage 2   │    │   Stage 3   │     │
│  │    Faker    ├───►│  Architect  ├───►│  Manifest   │     │
│  │ (FieldBank) │    │    (LLM)    │    │   Export    │     │
│  └─────────────┘    └──────┬──────┘    └──────┬──────┘     │
│                            │                  │            │
│  ┌─────────────┐    ┌──────▼──────┐    ┌──────▼──────┐     │
│  │   Stage 6   │    │   Stage 5   │    │   Stage 4   │     │
│  │  Augraphy   │◄───┤  Rendering  │◄───┤ Placeholder │     │
│  │ (Degrader)  │    │    (PDF)    │    │ Resolution  │     │
│  └──────┬──────┘    └─────────────┘    └─────────────┘     │
│         │                                                  │
└─────────┼──────────────────────────────────────────────────┘
          │
          ▼
    Final Output
    (PDF / JSON)
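As a rough illustration of the stage ordering in the diagram, the sketch below chains six stub functions in numbered order over a shared context dictionary. Everything here is hypothetical: the function names, the context keys, and the stub bodies are placeholders, and the real stages (LLM calls, ReportLab rendering, Augraphy degradation) are replaced by trivial stand-ins.

```python
import random
from typing import Any, Callable

Context = dict[str, Any]
Stage = Callable[[Context], Context]


def field_bank(ctx: Context) -> Context:
    # Stage 1: values are fixed before any layout work starts.
    return {**ctx, "fields": {"full_name": "Jane Doe"}}


def architect(ctx: Context) -> Context:
    # Stage 2: an LLM would produce the layout; here it is a fixed stub.
    return {**ctx, "layout": [{"zone": "header", "text": "{{full_name}}"}]}


def export_manifest(ctx: Context) -> Context:
    # Stage 3: bundle prompt and layout into a JSON-serializable manifest.
    return {**ctx, "manifest": {"prompt": ctx["prompt"], "zones": ctx["layout"]}}


def resolve_placeholders(ctx: Context) -> Context:
    # Stage 4: swap each {{name}} placeholder for its field bank value.
    zones = []
    for zone in ctx["manifest"]["zones"]:
        text = zone["text"]
        for name, value in ctx["fields"].items():
            text = text.replace("{{" + name + "}}", value)
        zones.append({**zone, "text": text})
    return {**ctx, "manifest": {**ctx["manifest"], "zones": zones}}


def render(ctx: Context) -> Context:
    # Stage 5: stand-in for PDF rendering; joins zone text into one string.
    return {**ctx, "rendered": "\n".join(z["text"] for z in ctx["manifest"]["zones"])}


def degrade(ctx: Context) -> Context:
    # Stage 6: sample per-document effect intensities from a seeded RNG.
    rng = random.Random(ctx["seed"])
    effects = {name: round(rng.uniform(0.1, 0.9), 2) for name in ("stain", "fold", "noise")}
    return {**ctx, "effects": effects}


STAGES: list[Stage] = [field_bank, architect, export_manifest,
                       resolve_placeholders, render, degrade]


def run_pipeline(prompt: str, seed: int) -> Context:
    ctx: Context = {"prompt": prompt, "seed": seed}
    for stage in STAGES:
        ctx = stage(ctx)
    return ctx
```

Seeding the degradation step per document keeps a batch varied while still letting any single output be reproduced from its seed.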
git clone https://github.com/jacob-cob-null/pAIge.git
cd pAIge
python -m venv .venv
source .venv/bin/activate # .venv\Scripts\activate on Windows
pip install -r requirements.txt

Create a .env file:

GROQ_API_KEY=your_key
CEREBRAS_API_KEY=your_key

Start the server:

uvicorn app.main:app --reload

Project structure:

pAIge/
├── app/
│ ├── main.py # FastAPI routes
│ ├── services/ # Pipeline, SchemaGen, DataFill, Renderer
│ ├── llm/ # Provider-agnostic client + Prompts
│ └── models/ # Pydantic & SQLModel schemas
├── implementation/ # Detailed planning
└── tests/ # 23+ unit and integration tests
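One way a provider-agnostic client in `app/llm/` could decide which backend to call is by checking which API keys are present. The sketch below is an assumption about how that might look, not the project's implementation: the environment variable names come from the `.env` example, while `PROVIDER_KEYS`, `available_providers`, and `pick_provider` are hypothetical names.

```python
import os
from typing import Mapping, Optional

# Env var names are from the .env example; the provider labels and the
# preference order (Groq first) are illustrative assumptions.
PROVIDER_KEYS = {"groq": "GROQ_API_KEY", "cerebras": "CEREBRAS_API_KEY"}


def available_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return providers whose API key is set and non-empty, in preference order."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


def pick_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the first configured provider, or fail with a clear message."""
    providers = available_providers(env)
    if not providers:
        raise RuntimeError("no LLM API key found; set GROQ_API_KEY or CEREBRAS_API_KEY")
    return providers[0]
```

Accepting the environment as an optional mapping keeps the function easy to test without touching real credentials.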
MIT License