
pAIge

Synthetic Document Generation Pipeline

An AI-powered pipeline that generates realistic synthetic PDF documents from text prompts and reference images. Built to provide high-quality training data for document AI and OCR models.

Pipeline Overview · Refactor Plan


🚩 The Problem

Generating realistic synthetic documents for ML training is difficult. Standard mock data looks "too perfect," and LLMs alone struggle to keep spatial layouts consistent and to follow domain-specific conventions.

pAIge fixes this by distributing the work across specialized stages: deterministic data generation, LLM-driven spatial reasoning, and visual degradation.


✨ Features

📐 LLM Architect

  • Translates natural language prompts into structured Document Manifests (JSON).
  • Enforces strict layout rules (margins, font hierarchies, required zones).
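As a rough illustration of what a Document Manifest and its layout rules could look like, here is a minimal sketch. It uses plain dataclasses to stay dependency-free (the project itself lists Pydantic), and every field name, the `REQUIRED_ZONES` set, and the specific rules are assumptions made for this example, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field

# All names below are hypothetical; the real schema lives in app/models/.
REQUIRED_ZONES = {"header", "body"}  # assumed set of mandatory zones


@dataclass
class Zone:
    name: str        # e.g. "header", "body", "footer"
    x: float         # left edge, in points
    y: float         # bottom edge, in points
    width: float
    height: float
    font_size: float


@dataclass
class Manifest:
    page_width: float
    page_height: float
    margin: float
    zones: list[Zone] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of rule violations; an empty list means the layout passes."""
        errors = []
        missing = REQUIRED_ZONES - {z.name for z in self.zones}
        if missing:
            errors.append(f"missing zones: {sorted(missing)}")
        for z in self.zones:
            inside = (
                z.x >= self.margin
                and z.y >= self.margin
                and z.x + z.width <= self.page_width - self.margin
                and z.y + z.height <= self.page_height - self.margin
            )
            if not inside:
                errors.append(f"zone '{z.name}' crosses the page margin")
        # Font hierarchy: the header must be set larger than the body text.
        sizes = {z.name: z.font_size for z in self.zones}
        if "header" in sizes and "body" in sizes and sizes["header"] <= sizes["body"]:
            errors.append("header font must be larger than body font")
        return errors


def parse_manifest(raw: str) -> Manifest:
    """Parse the JSON text returned by the LLM into a Manifest."""
    data = json.loads(raw)
    zones = [Zone(**z) for z in data.pop("zones", [])]
    return Manifest(zones=zones, **data)
```

Returning a list of violations rather than raising on the first one would let a caller feed all problems back to the LLM in a single retry prompt.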

🧬 Data-First Synthesis (Field Bank)

  • Faker-First: Generates realistic, correctly formatted values (names, SSNs, IDs) before layout starts.
  • Contextual Injection: The Architect is constrained to use the "Field Bank," which curbs content hallucinations.
  • Smart Resolution: A secondary LLM stage resolves layout placeholders into final, formatted text using the field bank.
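The field bank idea can be sketched without an LLM: values are fixed up front, and layout text may only refer to them by key. The example below is a deterministic stand-in for the resolution stage, which in pAIge is LLM-driven. The `{{key}}` placeholder syntax, the `resolve` helper, and the sample values are all assumptions made for illustration.

```python
import re

# Hypothetical field bank; in the pipeline these values would come from Faker.
field_bank = {
    "full_name": "Jane Q. Sample",
    "employee_id": "EMP-004217",
    "issue_date": "2024-03-15",
}

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")  # assumed {{key}} syntax


def resolve(template: str, bank: dict) -> str:
    """Replace every {{key}} with its field-bank value; reject unknown keys."""
    def substitute(match):
        key = match.group(1)
        if key not in bank:
            # Content outside the field bank is treated as an error, not invented.
            raise KeyError(f"placeholder '{key}' is not in the field bank")
        return bank[key]
    return PLACEHOLDER.sub(substitute, template)


line = resolve("Name: {{full_name}} ({{ employee_id }})", field_bank)
```

Rejecting unknown keys is the part that matters here: any value the layout asks for but the bank does not hold surfaces as a failure instead of being filled in silently.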

📜 Visual Degradation (Augraphy)

  • Simulates physical document aging: stains, folds, lighting gradients, and scanner noise.
  • Variability: Randomized filter intensities make each document in a batch visually distinct.
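One way to get per-document variability is to sample a fresh set of intensities for every document from a seeded generator. The sketch below only shows that sampling step. The effect names and ranges are hypothetical and are not Augraphy parameters; how the sampled values map onto an actual Augraphy pipeline is left out.

```python
import random

# Hypothetical effect names and intensity ranges; not Augraphy's actual parameters.
EFFECT_RANGES = {
    "stain_opacity": (0.05, 0.40),
    "fold_count": (0, 3),
    "lighting_gradient": (0.0, 0.6),
    "scanner_noise": (0.01, 0.15),
}


def sample_degradation(seed: int) -> dict:
    """Draw one intensity per effect; the seed makes each document reproducible."""
    rng = random.Random(seed)
    params = {}
    for name, (low, high) in EFFECT_RANGES.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)  # discrete effects, e.g. fold count
        else:
            params[name] = round(rng.uniform(low, high), 3)
    return params


# One parameter set per document in a batch of five.
batch = [sample_degradation(seed) for seed in range(5)]
```

Seeding per document keeps a batch varied while still allowing any single output to be regenerated from its seed.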

🛠️ Tech Stack

| Layer | Technology |
| --- | --- |
| API Framework | FastAPI |
| Orchestration | LangChain |
| LLM Providers | Groq (Llama 3.3), Cerebras (Llama 3.1), Gemini |
| PDF Engine | ReportLab |
| Data Generation | Faker |
| Visual Effects | Augraphy |
| Database | SQLite via SQLModel |

🏗️ Architecture

┌──────────────────────────────────────────────────────────┐
│                   pAIge Pipeline (FastAPI)               │
│                                                          │
│  ┌──────────┐    ┌──────────┐    ┌─────────────┐         │
│  │ Stage 1  │    │ Stage 2  │    │   Stage 3   │         │
│  │  Faker   ├────►Architect ├────►  Manifest   │         │
│  │Field Bank│    │  (LLM)   │    │   Export    │         │
│  └──────────┘    └────┬─────┘    └──────┬──────┘         │
│                       │                 │                │
│  ┌──────────┐    ┌────▼─────┐    ┌──────▼──────┐         │
│  │ Stage 6  │    │ Stage 5  │    │   Stage 4   │         │
│  │ Augraphy ◄────┤Rendering ◄────┤ Placeholder │         │
│  │(Degrader)│    │  (PDF)   │    │ Resolution  │         │
│  └────┬─────┘    └──────────┘    └─────────────┘         │
│       │                                                  │
└───────┼──────────────────────────────────────────────────┘
        │
        ▼
   Final Output
   (PDF / JSON)
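The six stages in the diagram can be read as a chain of functions that pass a shared context along. The sketch below linearizes the flow for simplicity (the diagram also shows the Architect feeding Rendering directly). Every function here is a hypothetical stand-in that only mimics the shape of each stage's output; none of it is the project's actual code.

```python
import json

# Hypothetical stand-ins for the six stages; each takes the shared context
# and returns it with its own output added.
def generate_field_bank(ctx):
    return {**ctx, "field_bank": {"full_name": "Jane Q. Sample"}}

def architect_layout(ctx):
    return {**ctx, "manifest": {"header": "{{full_name}}"}}

def export_manifest(ctx):
    return {**ctx, "manifest_json": json.dumps(ctx["manifest"])}

def resolve_placeholders(ctx):
    resolved = {}
    for zone, text in ctx["manifest"].items():
        for key, value in ctx["field_bank"].items():
            text = text.replace("{{" + key + "}}", value)
        resolved[zone] = text
    return {**ctx, "resolved": resolved}

def render_pdf(ctx):
    # Stand-in for ReportLab rendering: just join the resolved text as bytes.
    return {**ctx, "pdf": "\n".join(ctx["resolved"].values()).encode()}

def degrade(ctx):
    # Stand-in for the Augraphy pass.
    return {**ctx, "degraded": True}

STAGES = [
    ("field_bank", generate_field_bank),
    ("architect", architect_layout),
    ("manifest_export", export_manifest),
    ("placeholder_resolution", resolve_placeholders),
    ("rendering", render_pdf),
    ("degradation", degrade),
]

def run_pipeline(prompt: str) -> dict:
    """Run every stage in order and record which stages completed."""
    ctx = {"prompt": prompt, "trace": []}
    for name, stage in STAGES:
        ctx = stage(ctx)
        ctx["trace"].append(name)
    return ctx

result = run_pipeline("an employee badge")
```

Keeping the stages in a plain list makes it easy to stop early, for example after manifest export when only the JSON output is wanted.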

🚀 Getting Started

1. Clone & Install

git clone https://github.com/jacob-cob-null/pAIge.git
cd pAIge
python -m venv .venv
source .venv/bin/activate  # .venv\Scripts\activate on Windows
pip install -r requirements.txt

2. Environment Variables

Create a .env file:

GROQ_API_KEY=your_key
CEREBRAS_API_KEY=your_key
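A missing key tends to surface late, on the first LLM call. A small startup check like the sketch below can fail fast instead. It assumes the `.env` values have already been loaded into the process environment; how pAIge loads them is not stated here, and the helper names are hypothetical.

```python
import os

REQUIRED_KEYS = ("GROQ_API_KEY", "CEREBRAS_API_KEY")


def missing_keys(env=os.environ) -> list:
    """Return the required variables that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def check_environment(env=os.environ) -> None:
    """Raise a clear error if any required key is missing."""
    missing = missing_keys(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```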

3. Run

uvicorn app.main:app --reload

📂 Project Structure

pAIge/
├── app/
│   ├── main.py              # FastAPI routes
│   ├── services/            # Pipeline, SchemaGen, DataFill, Renderer
│   ├── llm/                 # Provider-agnostic client + Prompts
│   └── models/              # Pydantic & SQLModel schemas
├── implementation/          # Detailed planning
└── tests/                   # 23+ unit and integration tests

⚖️ License

MIT License
