pAIge

Synthetic Document Generation Pipeline

An AI-powered pipeline that generates realistic synthetic PDF documents from text prompts and reference images. Built to provide high-quality training data for document AI and OCR models.

Pipeline Overview · Refactor Plan

🚩 The Problem

Generating realistic synthetic documents for ML training is difficult. Standard mock data looks "too perfect," and LLMs alone struggle to keep spatial layout consistent and to follow domain-specific conventions.

pAIge fixes this by distributing the work across specialized stages: deterministic data generation, LLM-driven spatial reasoning, and visual degradation.

✨ Features

📐 LLM Architect

Translates natural language prompts into structured Document Manifests (JSON).
Enforces strict layout rules (margins, font hierarchies, required zones).
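
As a rough illustration of what enforcing layout rules on a manifest could look like, here is a minimal sketch. The `Zone` and `Manifest` field names, the `REQUIRED_ZONES` set, and the specific rules are all assumptions made for this example; they are not pAIge's actual schema.

```python
import json
from dataclasses import dataclass, field


# Hypothetical manifest shape; field names are illustrative only.
@dataclass
class Zone:
    name: str         # e.g. "header" or "body"
    x: float          # left edge, in points
    y: float          # bottom edge, in points
    width: float
    height: float
    font_size: float


@dataclass
class Manifest:
    page_width: float
    page_height: float
    margin: float
    zones: list[Zone] = field(default_factory=list)


REQUIRED_ZONES = {"header", "body"}  # assumed rule set


def load_manifest(raw: str) -> Manifest:
    """Parse the JSON an LLM might return into typed objects."""
    data = json.loads(raw)
    zones = [Zone(**z) for z in data.pop("zones", [])]
    return Manifest(zones=zones, **data)


def validate(manifest: Manifest) -> list[str]:
    """Return rule violations as messages; an empty list means valid."""
    errors = []
    names = {z.name for z in manifest.zones}
    for missing in sorted(REQUIRED_ZONES - names):
        errors.append(f"missing required zone: {missing}")
    for z in manifest.zones:
        inside = (
            z.x >= manifest.margin
            and z.y >= manifest.margin
            and z.x + z.width <= manifest.page_width - manifest.margin
            and z.y + z.height <= manifest.page_height - manifest.margin
        )
        if not inside:
            errors.append(f"zone {z.name} crosses the page margin")
    # Font hierarchy rule: header text must be larger than body text.
    sizes = {z.name: z.font_size for z in manifest.zones}
    if "header" in sizes and "body" in sizes and sizes["header"] <= sizes["body"]:
        errors.append("header font must be larger than body font")
    return errors
```

Returning a list of messages instead of raising on the first failure would let a caller feed every violation back to the LLM in a single retry prompt.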

🧬 Data-First Synthesis (Field Bank)

Faker-First: Generates realistic, correctly formatted values (names, SSNs, IDs) before layout starts.
Contextual Injection: The Architect is constrained to use the "Field Bank," so document content comes from generated data rather than from LLM invention.
Smart Resolution: A secondary LLM stage resolves layout placeholders into final, formatted text using the field bank.
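
The data-first idea above can be sketched in a few lines. Two substitutions are made so the example runs without dependencies: a seeded `random.Random` stands in for Faker, and plain regex substitution stands in for the secondary LLM stage. The `{{key}}` placeholder syntax and the field names are assumptions, not pAIge's actual format.

```python
import random
import re

# Tiny name pools standing in for Faker's providers (illustrative only).
FIRST = ["Ana", "Ben", "Chloe", "Dev"]
LAST = ["Reyes", "Okafor", "Lind", "Marsh"]


def build_field_bank(seed: int) -> dict[str, str]:
    """Generate every value up front, before any layout decisions."""
    rng = random.Random(seed)
    return {
        "full_name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
        # SSN-shaped string (3-2-4 digits); not checked against real SSN rules.
        "ssn": f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",
        "case_id": f"C-{rng.randint(0, 999999):06d}",
    }


# Assumed placeholder syntax: {{key}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve(template: str, bank: dict[str, str]) -> str:
    """Fill placeholders from the bank, rejecting keys the bank lacks."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in bank:
            raise KeyError(f"placeholder {key!r} is not in the field bank")
        return bank[key]

    return PLACEHOLDER.sub(substitute, template)
```

Rejecting unknown keys is the point of the sketch: text can only reach the page if it was generated into the bank first.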

📜 Visual Degradation (Augraphy)

Simulates physical document aging: stains, folds, lighting gradients, and scanner noise.
Variability: Randomized filter intensities keep documents in a batch from looking identical.
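
One way to get per-document variability is to sample a fresh set of effect intensities for each document from a seeded generator. The sketch below shows only that sampling step. It does not call Augraphy, and the parameter names and ranges are hypothetical, not Augraphy's actual options.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationParams:
    """Hypothetical intensity knobs; not Augraphy's real parameter names."""
    stain_opacity: float
    fold_count: int
    lighting_gradient: float
    scanner_noise: float


def sample_params(rng: random.Random) -> DegradationParams:
    """Draw one document's effect intensities from assumed ranges."""
    return DegradationParams(
        stain_opacity=round(rng.uniform(0.0, 0.4), 3),
        fold_count=rng.randint(0, 3),
        lighting_gradient=round(rng.uniform(0.0, 0.3), 3),
        scanner_noise=round(rng.uniform(0.01, 0.1), 3),
    )


def sample_batch(seed: int, size: int) -> list[DegradationParams]:
    """Sample parameters for a whole batch from one seeded generator."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(size)]
```

Seeding the batch keeps a run reproducible while still giving each document its own draw, which makes exact repeats within a batch unlikely.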

🛠️ Tech Stack

Layer	Technology
API Framework	FastAPI
Orchestration	LangChain
LLM Providers	Groq (Llama 3.3), Cerebras (Llama 3.1), Gemini
PDF Engine	ReportLab
Data Generation	Faker
Visual Effects	Augraphy
Database	SQLite via SQLModel

🏗️ Architecture

┌──────────────────────────────────────────────────────────┐
│                   pAIge Pipeline (FastAPI)               │
│                                                          │
│  ┌──────────┐    ┌──────────┐    ┌─────────────┐         │
│  │ Stage 1  │    │ Stage 2  │    │   Stage 3   │         │
│  │  Faker   ├────►Architect ├────►  Manifest   │         │
│  │FieldBank │    │  (LLM)   │    │   Export    │         │
│  └──────────┘    └────┬─────┘    └──────┬──────┘         │
│                       │                 │                │
│  ┌──────────┐    ┌────▼─────┐    ┌──────▼──────┐         │
│  │ Stage 6  │    │ Stage 5  │    │   Stage 4   │         │
│  │ Augraphy ◄────┤Rendering ◄────┤ Placeholder │         │
│  │(Degrader)│    │  (PDF)   │    │ Resolution  │         │
│  └────┬─────┘    └──────────┘    └─────────────┘         │
│       │                                                  │
└───────┼──────────────────────────────────────────────────┘
        │
        ▼
   Final Output
   (PDF / JSON)
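
The stage ordering in the diagram can be read as a chain of functions that each take and return a shared context. The sketch below uses stub stages to show that shape only. Every function name, context key, and placeholder format here is hypothetical, and none of the stubs calls an LLM, ReportLab, or Augraphy.

```python
import json
from typing import Any, Callable

Context = dict[str, Any]


def build_field_bank(ctx: Context) -> Context:
    # Stage 1 stub: fixed values where generated data would go.
    return {**ctx, "fields": {"full_name": "Jane Doe", "case_id": "C-000123"}}


def architect(ctx: Context) -> Context:
    # Stage 2 stub: a hard-coded layout where an LLM call would go.
    zones = [
        {"name": "header", "text": "Case {{case_id}}"},
        {"name": "body", "text": "Applicant: {{full_name}}"},
    ]
    return {**ctx, "manifest": {"prompt": ctx["prompt"], "zones": zones}}


def export_manifest(ctx: Context) -> Context:
    # Stage 3 stub: serialize the manifest so it can be saved or inspected.
    return {**ctx, "manifest_json": json.dumps(ctx["manifest"], sort_keys=True)}


def resolve_placeholders(ctx: Context) -> Context:
    # Stage 4 stub: plain string replacement instead of an LLM stage.
    resolved = []
    for zone in ctx["manifest"]["zones"]:
        text = zone["text"]
        for key, value in ctx["fields"].items():
            text = text.replace("{{" + key + "}}", value)
        resolved.append({**zone, "text": text})
    return {**ctx, "resolved_zones": resolved}


def render(ctx: Context) -> Context:
    # Stage 5 stub: joins text where a PDF engine would draw the page.
    page = "\n".join(z["text"] for z in ctx["resolved_zones"])
    return {**ctx, "rendered": page}


def degrade(ctx: Context) -> Context:
    # Stage 6 stub: marks the page where visual effects would be applied.
    return {**ctx, "degraded": True}


STAGES: list[tuple[str, Callable[[Context], Context]]] = [
    ("field_bank", build_field_bank),
    ("architect", architect),
    ("manifest_export", export_manifest),
    ("placeholder_resolution", resolve_placeholders),
    ("rendering", render),
    ("degradation", degrade),
]


def run_pipeline(prompt: str) -> Context:
    """Run the stages in order and record which ones ran."""
    ctx: Context = {"prompt": prompt, "trace": []}
    for name, stage in STAGES:
        ctx = stage(ctx)
        ctx["trace"].append(name)
    return ctx
```

Keeping the stages in a plain list makes it easy to stop early, for example to inspect the exported manifest before rendering.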

🚀 Getting Started

1. Clone & Install

git clone https://github.com/jacob-cob-null/pAIge.git
cd pAIge
python -m venv .venv
source .venv/bin/activate  # .venv\Scripts\activate on Windows
pip install -r requirements.txt

2. Environment Variables

Create a .env file:

GROQ_API_KEY=your_key
CEREBRAS_API_KEY=your_key
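
A quick pre-flight check can catch a missing key before the server starts. The sketch below assumes both keys listed above are needed, which may not hold if you only use one provider. Its small `KEY=value` parser is a stand-in for illustration and does not reflect how pAIge itself loads the file.

```python
REQUIRED_KEYS = ("GROQ_API_KEY", "CEREBRAS_API_KEY")  # assumed to both be required


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """List required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

For example, `missing_keys(parse_dotenv(open(".env").read()))` returns an empty list when both keys have values.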

3. Run

uvicorn app.main:app --reload

📂 Project Structure

pAIge/
├── app/
│   ├── main.py              # FastAPI routes
│   ├── services/            # Pipeline, SchemaGen, DataFill, Renderer
│   ├── llm/                 # Provider-agnostic client + Prompts
│   └── models/               # Pydantic & SQLModel schemas
├── implementation/           # Detailed planning
└── tests/                    # 23+ unit and integration tests

⚖️ License

MIT License
