
IDP Azure — Intelligent Document Processing

An end-to-end Intelligent Document Processing (IDP) pipeline built entirely on the Azure ecosystem. It ingests documents of any type, extracts structured content, chunks intelligently, indexes with vector search, and exposes an agentic RAG interface — all following Domain-Driven Design (DDD) principles for clear, maintainable code.


End-to-End Pipeline Flow

The pipeline runs through six numbered stages, plus two intermediate steps (2.5 and 2.6). Each stage is a distinct bounded context with a clear responsibility, input, output, and Azure technology.

 ┌──────────────────────────────────────────────────────────────┐
 │                     DOCUMENT SOURCES                         │
 │  PDF │ DOCX │ PPTX │ XLSX │ CSV │ HTML │ Images │ TXT │ MD   │
 └───────────────────────────┬──────────────────────────────────┘
                             │
                     ┌───────▼───────┐
                     │   STAGE 1     │
                     │   Routing &   │
                     │   Validation  │
                     └───────┬───────┘
                             │  validates size, type, existence
                             │  selects optimal analyzer
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
┌────────▼────────┐ ┌───────▼────────┐ ┌────────▼────────┐
│    STAGE 2a     │ │    STAGE 2b    │ │    STAGE 2c     │
│    Content      │ │    Document    │ │    GPT-4o       │
│    Understanding│ │    Intelligence│ │    Vision       │
│    (primary)    │ │    (tables)    │ │    (images)     │
└────────┬────────┘ └───────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │  all paths produce Markdown
                             │  + structural metadata (CU)
                             │
                     ┌───────▼───────┐
                     │   STAGE 2.5   │
                     │   Noise       │
                     │   Filtering   │
                     └───────┬───────┘
                             │  removes headers/footers/
                             │  page numbers via CU roles
                             │
                     ┌───────▼───────┐
                     │   STAGE 2.6   │
                     │   Speaker     │
                     │   Notes (PPTX)│
                     └───────┬───────┘
                             │  extracts notes via
                             │  python-pptx (local)
                             │
                     ┌───────▼───────┐
                     │   STAGE 3     │
                     │   Chunking    │
                     │  (page-first) │
                     └───────┬───────┘
                             │  splits by page markers,
                             │  merges mid-sentence breaks
                             │
                     ┌───────▼───────┐
                     │   STAGE 4     │
                     │   Embedding   │
                     │   & Indexing  │
                     └───────┬───────┘
                             │  embeds + uploads to search
                             │
                     ┌───────▼───────┐
                     │   STAGE 5     │
                     │   Agentic     │
                     │   Retrieval   │
                     └───────┬───────┘
                             │  LLM-driven query planning
                             │
                     ┌───────▼───────┐
                     │   STAGE 6     │
                     │   RAG Agent   │
                     │   (response)  │
                     └───────────────┘

Stage-by-Stage Breakdown

Stage 1 — Document Routing & Validation

What: Receives a file, validates it, and decides which Azure analyzer should process it.

Aspect        Detail
Code          ingestion/router.py
Technology    Pure Python (no Azure calls)
Input         Raw file path
Output        DocumentMetadata (file type, analyzer choice, size)
Domain Model  AnalyzerChoice enum, FileType enum

How it works:

  1. Checks the file exists, is non-empty, and is under 200 MB
  2. Maps the file extension to an AnalyzerChoice:
    • .pdf, .docx, .pptx, .xlsx, .csv, .html, images → Content Understanding
    • .txt, .md → Direct Read (no API call needed)
  3. Returns a DocumentMetadata value object used by subsequent stages
file_path → validate_file() → DocumentMetadata { file_type, analyzer, size }
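
A minimal sketch of the Stage 1 routing rule. AnalyzerChoice and DocumentMetadata live in domain/models.py; their exact fields, the extension sets, and this helper's shape are illustrative assumptions (the real code raises the IDPError hierarchy described under Error Handling):

from dataclasses import dataclass
from enum import Enum
from pathlib import Path

MAX_SIZE_BYTES = 200 * 1024 * 1024  # 200 MB validation ceiling

class AnalyzerChoice(Enum):
    CONTENT_UNDERSTANDING = "content_understanding"
    DIRECT_READ = "direct_read"

CU_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".htm",
                 ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
DIRECT_EXTENSIONS = {".txt", ".md"}

@dataclass(frozen=True)
class DocumentMetadata:
    file_type: str
    analyzer: AnalyzerChoice
    size: int

def validate_file(file_path: str) -> DocumentMetadata:
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(file_path)
    size = path.stat().st_size
    if size == 0:
        raise ValueError("empty file")           # EmptyFileError in the real code
    if size > MAX_SIZE_BYTES:
        raise ValueError("file exceeds 200 MB")  # FileTooLargeError in the real code
    ext = path.suffix.lower()
    if ext in DIRECT_EXTENSIONS:
        return DocumentMetadata(ext, AnalyzerChoice.DIRECT_READ, size)
    if ext in CU_EXTENSIONS:
        return DocumentMetadata(ext, AnalyzerChoice.CONTENT_UNDERSTANDING, size)
    raise ValueError(f"unsupported file type: {ext}")  # UnsupportedFileTypeError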

Stage 2 — Document Extraction (3 Analyzers)

What: Extracts structured text from the document as Markdown. Three analyzers are available, selected according to document type.

2a. Azure Content Understanding (Primary)

Aspect      Detail
Code        ingestion/content_understanding.py
Technology  Azure Content Understanding SDK (azure-ai-contentunderstanding)
Input       Binary file bytes
Output      Markdown string with tables, headings, page structure
Used For    PDF, DOCX, PPTX, XLSX, CSV, HTML, images (built-in OCR)
Resilience  @retry_on_transient() — retries on 429/503/504 with exponential backoff
file_bytes → ContentUnderstandingClient.begin_analyze_binary() → AnalyzeResult → .markdown

2b. Azure Document Intelligence (Complementary)

Aspect      Detail
Code        ingestion/document_intelligence.py
Technology  Azure Document Intelligence SDK (azure-ai-documentintelligence)
Input       Binary file bytes
Output      Markdown with fine-grained table structure (row/column counts)
Used For    Documents needing superior layout analysis or form-field extraction
file_bytes → DocumentIntelligenceClient.begin_analyze_document(model="prebuilt-layout") → .content

2c. Azure OpenAI GPT-4o Vision (Selective)

Aspect      Detail
Code        ingestion/vision.py
Technology  Azure OpenAI (openai SDK) — GPT-4o with vision capability
Input       Image bytes (PNG/JPEG/TIFF/BMP) or PDF pages rendered as 200-DPI PNGs
Output      Detailed Markdown description (charts, diagrams, data points, tables)
Used For    Complex visual content that text extractors miss: charts, diagrams, handwriting
image_bytes → base64 encode → GPT-4o chat.completions.create(vision) → Markdown description
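
A sketch of the Stage 2c call using the openai SDK. The deployment name, prompt wording, and endpoint are assumptions; the base64 data-URL pattern is the standard way to pass image bytes to a vision-capable chat model:

import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_key="your-key",
    api_version="2024-06-01",
)

def describe_image(image_bytes: bytes, mime: str = "image/png") -> str:
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    response = client.chat.completions.create(
        model="gpt-4o",  # your vision-capable deployment
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure as detailed Markdown, including "
                         "any data points, axis labels, and tables."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content or ""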

2d. Hybrid Extraction (CU + Figure Triage + Vision)

For PDFs with complex images, a hybrid approach is available. It uses CU's native figure metadata for content-aware triage instead of crude size-based filtering:

file -> Content Understanding (text + tables + figure metadata)
     -> CU identifies figures with kind: CHART / MERMAID / UNKNOWN
     -> CHART/MERMAID: CU provides structured content directly (no Vision call)
     -> UNKNOWN (no description): crop figure region via PyMuPDF -> GPT-4o Vision
     -> combined Markdown output
Aspect      Detail
Code        ingestion/figure_triage.py, application/ingestion_service.py
Technology  Content Understanding (figure metadata) + GPT-4o Vision (fallback) + PyMuPDF (region cropping)
Output      ExtractedDocument with markdown + figure descriptions

Figure triage logic (a sketch follows the list):

  • CHART figures: CU provides Chart.js structured content and description — used directly
  • MERMAID figures: CU provides Mermaid.js syntax and description — used directly
  • UNKNOWN figures with a CU description (>= 20 chars): used directly
  • UNKNOWN figures without a description: figure region cropped from the PDF page using CU's bounding polygon, then sent to GPT-4o Vision
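
A sketch of the triage branch above. The figure object's fields (kind, description, polygon, page_number) and the describe_with_vision() helper are illustrative assumptions; the branches and the 20-character threshold mirror the rules in the list:

import fitz  # PyMuPDF

MIN_DESCRIPTION_CHARS = 20

def triage_figure(figure, pdf_path: str, describe_with_vision) -> str:
    if figure.kind in ("CHART", "MERMAID"):
        return figure.content       # CU already returned structured content
    if figure.description and len(figure.description) >= MIN_DESCRIPTION_CHARS:
        return figure.description   # CU's own description suffices
    # UNKNOWN and undescribed: crop the region and fall back to GPT-4o Vision.
    doc = fitz.open(pdf_path)
    page = doc[figure.page_number - 1]
    xs = [point.x for point in figure.polygon]
    ys = [point.y for point in figure.polygon]
    clip = fitz.Rect(min(xs), min(ys), max(xs), max(ys))
    png_bytes = page.get_pixmap(clip=clip, dpi=200).tobytes("png")
    doc.close()
    return describe_with_vision(png_bytes)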

Stage 2.5 — Noise Filtering

What: Removes headers, footers, and page numbers from extracted markdown before chunking to prevent noise from contaminating retrieval chunks.

Aspect      Detail
Code        chunking/noise.py
Technology  Azure Content Understanding paragraph roles (native metadata)
Input       Markdown string + DocumentContent (paragraph metadata)
Output      Cleaned markdown with noise elements removed

How it works:

  1. Primary path (CU metadata): CU classifies every paragraph with a semantic role. Paragraphs with role PAGE_HEADER, PAGE_FOOTER, or PAGE_NUMBER carry span offsets into the markdown. The noise filter collects these spans, merges overlapping ones, and rebuilds the markdown from the non-noise ranges.

  2. Fallback path (regex): When CU paragraph metadata is unavailable (e.g. the direct-read or DI extraction paths), a conservative regex removes only standalone page-number lines (Page X of Y, - X -, bare numbers on their own line); a sketch follows below.

Design principle: conservative by default — ambiguous content is kept, not removed.
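
A sketch of that regex fallback: it drops only standalone page-number lines and keeps everything else. The exact patterns are assumptions matching the examples above:

import re

_PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:page\s+\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-|\d+)\s*$",
    re.IGNORECASE,
)

def strip_page_number_lines(markdown: str) -> str:
    kept = [line for line in markdown.splitlines()
            if not _PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)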


Stage 2.6 — Speaker Notes Extraction (PPTX only)

What: Extracts speaker notes from PowerPoint presentations. CU and DI do not extract notes — they only process visible slide content. Notes are extracted locally using python-pptx with no API calls.

Aspect      Detail
Code        ingestion/speaker_notes.py
Technology  python-pptx (reads PPTX XML structure locally)
Input       .pptx file path
Output      List of "### Notes from Page N" formatted strings
Used For    .pptx files only (notes are lost in PDF-from-PPT conversion)

Why this matters for RAG: Slides typically have terse bullet points while speaker notes contain the full explanation, context, and reasoning. For retrieval, notes often produce better answers than slide text alone.

How it integrates: Notes are formatted with ### Notes from Page N headings and appended to the ExtractedDocument.image_descriptions list. The existing content assembly module interleaves them at the correct page position, so each slide's chunk contains both the visible slide content and the presenter's notes.

.pptx file → python-pptx reads notesSlide XML → "### Notes from Page N"
  → interleaved into full_content at page position → chunked with slide content
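
A sketch of the extraction using python-pptx. The heading format follows the "### Notes from Page N" convention described above, and everything runs locally:

from pptx import Presentation

def extract_speaker_notes(pptx_path: str) -> list[str]:
    notes: list[str] = []
    for page, slide in enumerate(Presentation(pptx_path).slides, start=1):
        if not slide.has_notes_slide:
            continue
        text = slide.notes_slide.notes_text_frame.text.strip()
        if text:
            notes.append(f"### Notes from Page {page}\n\n{text}")
    return notes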

Stage 3 — Chunking

What: Splits extracted Markdown into retrieval-optimized chunks using page markers as the primary split, with mid-sentence merge for report-style page breaks and overflow splitting for oversized pages.

Aspect      Detail
Code        chunking/pipeline.py
Technology  Page-marker splitting + Chonkie RecursiveChunker (overflow only)
Input       Markdown string (with <!-- PageBreak --> / <!-- PageNumber="N" --> markers from CU)
Output      List of Chunk domain objects (id, text, token_count, page_number)

Pipeline stages (in order):

Markdown (with CU page markers)
  │
  ▼
Split by page markers               ← <!-- PageBreak --> / <!-- PageNumber="N" -->
  │                                    (NOT by ## headings — CU renders PPT text
  │                                     boxes as ## headings which would over-split)
  ▼
Mid-sentence merge                   ← if page N ends without .!?: AND page N+1
  │                                    starts lowercase → merge (report page breaks)
  ▼
Overflow split (RecursiveChunker)     ← only for blocks exceeding chunk_size (1024)
  │                                    uses plain-text recipe, not markdown
  ▼
List[Chunk]                          ← each chunk carries page_number for citations

Why page-first, not heading-based:

Azure Content Understanding renders each text box from PPT slides as a separate ## heading in the markdown (e.g., ## Customer:, ## Challenge:). A heading-based chunker like RecursiveChunker(recipe="markdown") splits on every ##, producing hundreds of micro-chunks (6–50 tokens each) from a 20-slide deck. The page-first approach treats each page/slide as a single chunk, keeping all content together regardless of ## text-box headings.

For report-style PDFs where paragraphs flow across page boundaries, the mid-sentence merge step detects when a page break cuts a paragraph (previous page ends without sentence-terminal punctuation + next page starts with a lowercase letter) and merges those pages back together. This is conservative — only mid-sentence breaks trigger merging, not topical relatedness.
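
A sketch of the merge rule. The helper name and the exact boundary conditions are assumptions; the heuristic itself (no terminal punctuation plus a lowercase continuation) is the one described above:

SENTENCE_TERMINALS = (".", "!", "?", ":")

def merge_mid_sentence_breaks(pages: list[str]) -> list[str]:
    merged: list[str] = []
    for page in pages:
        prev = merged[-1].rstrip() if merged else ""
        nxt = page.lstrip()
        if prev and nxt and not prev.endswith(SENTENCE_TERMINALS) and nxt[0].islower():
            merged[-1] = f"{prev} {nxt}"  # report-style break: rejoin the paragraph
        else:
            merged.append(page)           # self-contained page/slide stays separate
    return merged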

Chunk sizing:

Scenario                                   Handling
Page fits within chunk_size (1024 tokens)  One chunk per page
Page exceeds chunk_size                    Split with RecursiveChunker (plain-text recipe)
Mid-sentence page break (reports)          Adjacent pages merged, then size-checked
Slide-like pages (self-contained)          No merging — each page stays separate

Stage 4 — Embedding & Indexing

What: Converts each chunk to a 3072-dimensional vector and uploads to Azure AI Search.

Aspect Detail
Code search/embeddings.py, search/index.py, search/indexing.py
Technology Azure OpenAI Embeddings (text-embedding-3-large) + Azure AI Search (azure-search-documents)
Input List of Chunk objects
Output Documents indexed in Azure AI Search

Sub-steps:

                                    ┌─────────────────────────────────┐
Chunks ──► Batch embed (16/call) ──►│  Azure AI Search Index          │
           Azure OpenAI             │  ┌───────────────────────────┐  │
           text-embedding-3-large   │  │ Fields:                   │  │
           (3072 dimensions)        │  │  • id (key)               │  │
                                    │  │  • content (searchable)   │  │
                                    │  │  • original_content       │  │
                                    │  │  • content_vector (3072d) │  │
                                    │  │  • source_file (filter)   │  │
                                    │  │  • file_type (filter)     │  │
                                    │  │  • chunk_index (sortable) │  │
                                    │  │  • page_number (filter)   │  │
                                    │  └───────────────────────────┘  │
                                    │  Vector: HNSW algorithm         │
                                    │  Vectorizer: integrated AOAI    │
                                    │  Semantic: content field ranked │
                                    └─────────────────────────────────┘

Key details:

  • Integrated vectorizer — the index is configured so Azure AI Search can call Azure OpenAI for query-time vectorization automatically (required for agentic retrieval)
  • Batch embedding — texts are embedded 16 at a time with @retry_on_transient() for rate-limit resilience
  • Buffered upload — SearchIndexingBufferedSender handles reliable batch uploads with auto-retry
  • Deterministic IDs — chunk IDs are SHA-256 hashes of source_file:chunk_index (supports re-indexing); see the sketch after this list
  • Contextual enrichment (opt-in) — when CONTEXTUAL_ENRICHMENT_ENABLED=true, an LLM generates a short document-level context prefix for each chunk before embedding. The enriched text is stored in content (for search), while the raw chunk text is preserved in original_content (for display). This follows Anthropic's Contextual Retrieval approach and, combined with the existing hybrid search and semantic reranking, can reduce retrieval failures by up to 67%. See Contextual Enrichment below for details.
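
A sketch of the batch-embed and deterministic-ID steps. The endpoint, key, and deployment name are placeholders; the batch size of 16 and the SHA-256 ID scheme follow the description above:

import hashlib
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://your-resource.openai.azure.com",
                     api_key="your-key", api_version="2024-06-01")

def embed_texts(texts: list[str], batch_size: int = 16) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-large",   # 3072-dimensional vectors
            input=texts[i:i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors

def chunk_id(source_file: str, chunk_index: int) -> str:
    # Deterministic: re-ingesting a file overwrites the same index documents.
    return hashlib.sha256(f"{source_file}:{chunk_index}".encode()).hexdigest()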

Stage 5 — Search Retrieval

What: Retrieves relevant document chunks using hybrid search (default) or LLM-powered agentic retrieval via a knowledge base. Configurable via the SEARCH_MODE environment variable.

Aspect      Detail
Code        search/retrieval.py, search/query_rewrite.py, search/knowledge.py
Technology  Azure AI Search — hybrid search with LLM query rewriting + semantic reranking (default), or Agentic Retrieval via KnowledgeBaseRetrievalClient
Input       Natural-language query (hybrid) or conversational messages (knowledge base)
Output      RetrievalResult (content, references with source/page, optional activity)

Two retrieval modes (set via SEARCH_MODE env var):

Mode 1: Hybrid Search (default — SEARCH_MODE=hybrid)

Combines three retrieval signals in a single request:

User question: "What about its revenue?"  (multi-turn follow-up)
  │
  ▼  Custom LLM Query Rewrite (GPT-5-mini)
  │  • Resolves coreferences: "its" → "Division B"
  │  • Expands short/ambiguous queries with keyword synonyms
  │  • Only fires when conversation context needs resolution
  │  • Skipped for clear single-turn queries (zero added latency)
  │
  ├──► BM25 text query (expanded: rewritten + keyword synonyms)
  ├──► Vector query (clean standalone rewrite only — no expansion noise)
  │
  ▼  Semantic Reranking (cross-encoder, uses clean rewritten query)
  │  (rescores fused results for higher relevance)
  │
  ▼  Top-K results with source citations

  • Query rewriting — a custom LLM-based pre-search rewrite (search/query_rewrite.py) handles conversational coreference resolution and conditional keyword expansion. It does not rely on Azure AI Search's built-in generative query rewrite; the custom rewriter gives full control over rewrite behavior, structured output, and prompt caching.
  • Three-channel query split — BM25 gets the expanded text for broad recall, while vector search and the semantic reranker get the clean standalone query for precision (via the semantic_query parameter); see the sketch after this list.
  • Semantic reranking — cross-encoder reranker via the default semantic configuration
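
A sketch of that three-channel split with azure-search-documents. The semantic-configuration name, k, and top values are assumptions; the expanded-vs-clean query routing follows the design above:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="idp-documents",
    credential=AzureKeyCredential("your-key"),
)

def hybrid_search(expanded_query: str, clean_query: str, top: int = 5):
    return search_client.search(
        search_text=expanded_query,              # BM25 channel: broad recall
        vector_queries=[VectorizableTextQuery(   # vector channel: precision
            text=clean_query,                    # integrated vectorizer embeds it
            k_nearest_neighbors=50,
            fields="content_vector",
        )],
        query_type="semantic",                   # cross-encoder reranking
        semantic_configuration_name="default",
        semantic_query=clean_query,              # rerank on the clean rewrite
        top=top,
    )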

Mode 2: Knowledge Base (SEARCH_MODE=knowledge_base)

LLM-powered agentic retrieval with query planning and optional answer synthesis:

User question: "Compare revenue trends in Q3 vs Q4 and list risk factors"
  │
  ▼  LLM Query Planning (GPT-5-mini)
  │
  ├──► Subquery 1: "Q3 revenue trends"     ──► hybrid search ──► results
  ├──► Subquery 2: "Q4 revenue trends"     ──► hybrid search ──► results
  └──► Subquery 3: "risk factors"          ──► hybrid search ──► results
                                                     │
                                          merge + semantic rerank
                                                     │
                                                     ▼
                                            Unified response with
                                            source citations

Architecture (3 layers):

┌────────────────────────────────────────────────────────────────┐
│                    Knowledge Base                              │
│  • LLM: GPT-5-mini (query planning — AI Search compatible)     │
│  • Decomposes complex questions into focused subqueries        │
│  • Runs subqueries in parallel                                 │
│  • Optionally synthesises a natural-language answer            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                Knowledge Source                          │  │
│  │  • Wraps the search index                                │  │
│  │  • Citation fields: id, source_file, page_number         │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │              Search Index                          │  │  │
│  │  │  • Hybrid search: BM25 text + HNSW vector          │  │  │
│  │  │  • Semantic reranking on content field             │  │  │
│  │  │  • Integrated vectorizer (auto text→vector)        │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

Multi-turn support: In knowledge base mode, pass the full conversation history — the knowledge base uses prior messages for better query planning on follow-up questions. In hybrid mode, only the last user message is used as the search query.


Stage 6 — RAG Agent

What: An intelligent conversational agent that uses agentic retrieval to answer document questions with citations, streams responses in real time, and produces structured outputs.

Aspect      Detail
Code        agent/tools.py, agent/rag_agent.py, agent/workflows.py
Technology  Microsoft Agent Framework 1.0.0 (agent-framework, agent-framework-openai)
LLM         Azure OpenAI GPT-4.1 (primary chat model)
Input       Natural-language question
Output      Answer with inline source citations

Agent architecture:

┌─────────────────────────────────────────────────────────────┐
│              DocumentAssistant (RAG Agent)                   │
│              LLM: Azure OpenAI GPT-4.1                      │
│                                                             │
│   Tools:                                                    │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ search_documents(query)                             │   │
│   │   → Calls Stage 5 (Agentic Retrieval)               │   │
│   │   → Returns content + source citations               │   │
│   ├─────────────────────────────────────────────────────┤   │
│   │ list_indexed_documents()                            │   │
│   │   → Queries search index facets                     │   │
│   │   → Returns list of files with chunk counts          │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Capabilities:                                             │
│   • Streaming responses (real-time token output)            │
│   • Structured output (Pydantic DocumentSummary)            │
│   • Multi-turn conversation with context                    │
│   • Multi-agent workflows (Retriever → Analyzer → Writer)   │
└─────────────────────────────────────────────────────────────┘

Multi-agent workflow (optional, via workflows.py):

User query
  │
  ▼
Retriever Agent ──► searches & retrieves relevant chunks
  │
  ▼
Analyzer Agent  ──► identifies themes, contradictions, insights
  │
  ▼
Writer Agent    ──► formats into a structured, cited response

Contextual Enrichment (Opt-In)

Anthropic's Contextual Retrieval approach: prepend LLM-generated context to each chunk before embedding. Combined with hybrid search (BM25 + vector + semantic rerank), this reduces retrieval failures by up to 67%.

Aspect  Detail
Code    chunking/enrichment.py
Toggle  CONTEXTUAL_ENRICHMENT_ENABLED=true (disabled by default)
Cost    1 summary call + 1 call per chunk per ingested document

How it works (a sketch follows these steps):

  1. Document summary — a single LLM call generates a 3-5 sentence summary of the entire document.
  2. Per-chunk context — for each chunk, an LLM call receives {summary} + {chunk_text} and produces 2-3 sentences situating the chunk within the document.
  3. Dual storage — enriched text (context + chunk) is stored in content for search; raw chunk text is preserved in original_content for display.
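
A sketch of that hierarchical flow. The prompts and the chat() helper (one synchronous LLM call returning text) are illustrative assumptions; the dual content/original_content storage follows step 3:

def enrich_chunks(document_text: str, chunks: list[str], chat) -> list[dict]:
    summary = chat(f"Summarise this document in 3-5 sentences:\n\n{document_text}")
    enriched = []
    for chunk in chunks:
        context = chat(
            f"Given this document summary:\n{summary}\n\n"
            "Write 2-3 sentences situating the following chunk within the document:\n"
            f"<document_chunk>\n{chunk}\n</document_chunk>"
        )
        enriched.append({
            "content": f"{context}\n\n{chunk}",  # embedded and searched
            "original_content": chunk,           # shown to the user
        })
    return enriched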

Configuration:

Env Var                               Default                  Description
CONTEXTUAL_ENRICHMENT_ENABLED         false                    Enable contextual enrichment during ingestion
CONTEXTUAL_ENRICHMENT_DEPLOYMENT      primary chat deployment  LLM deployment for enrichment
CONTEXTUAL_ENRICHMENT_MAX_CONCURRENT  5                        Max concurrent LLM calls for chunk enrichment

Notes:

  • Enrichment is fail-safe — on failure it falls back to raw chunks rather than aborting ingestion.
  • Chunk text is wrapped in <document_chunk> XML tags to prevent Azure OpenAI's content filter from misidentifying imperative business language (e.g., PPT slide text like "You must ensure...") as jailbreak attempts. If the content filter still triggers, enrichment is silently skipped for that chunk.
  • Existing documents must be re-ingested to gain enrichment.
  • The hierarchical approach (summary → per-chunk context) keeps cost manageable versus sending the full document with every chunk.

Technology Stack Summary

Stage                 Technology                         Package                           Purpose
1. Routing            Pure Python                        —                                 File validation & analyzer selection
2a. Extraction        Azure Content Understanding        azure-ai-contentunderstanding     Primary document → Markdown + metadata
2b. Extraction        Azure Document Intelligence        azure-ai-documentintelligence     Layout & table extraction
2c. Extraction        Azure OpenAI GPT-4o Vision         openai                            Chart/diagram description
2c. PDF rendering     PyMuPDF                            pymupdf                           PDF page → PNG for Vision
2d. Figure triage     Azure Content Understanding        azure-ai-contentunderstanding     Figure classification (CHART/MERMAID/UNKNOWN)
2.5. Noise filtering  Azure Content Understanding        azure-ai-contentunderstanding     Paragraph role-based noise removal
2.6. Speaker notes    python-pptx                        python-pptx                       PPTX speaker notes extraction (local, no API)
3. Chunking           Page-marker splitting + Chonkie    chonkie                           Page-first chunking with overflow splitting
4. Embedding          Azure OpenAI                       openai                            text-embedding-3-large (3072d)
4. Indexing           Azure AI Search                    azure-search-documents            Vector + BM25 hybrid index
5. Retrieval          Azure AI Search Agentic Retrieval  azure-search-documents (preview)  LLM-driven query planning
6. Agent              Microsoft Agent Framework 1.0.0    agent-framework-openai            RAG agent with tool calling
6. Model mgmt         Microsoft Foundry                  agent-framework-foundry           Centralised model deployment
Cross-cutting         Pydantic                           pydantic                          Settings validation & domain models
Cross-cutting         Azure Identity                     azure-identity                    DefaultAzureCredential auth
REST API              FastAPI                            fastapi[standard]                 REST endpoints + SSE streaming

Architecture (DDD)

The codebase follows Domain-Driven Design with clear layering:

┌─────────────────────────────────────────────────────────────────┐
│                    INTERFACE ADAPTERS                            │
│   api/ (FastAPI)  │  ui/ (Chainlit)  │  agent/ (Agent Framework)│
│   REST endpoints     Chat + upload      RAG agent tools          │
└────────────────────────────┬────────────────────────────────────┘
                             │  delegates to
┌────────────────────────────▼────────────────────────────────────┐
│                    APPLICATION SERVICES                          │
│   IngestionService  │ DocumentService │ QueryService │ SetupSvc  │
│   ingest()            list/delete()     query()        provision()│
│                                                        analyzer() │
└────────────────────────────┬────────────────────────────────────┘
                             │  coordinates
┌────────────────────────────▼────────────────────────────────────┐
│                    BOUNDED CONTEXTS (Infrastructure)             │
│   ingestion/        │  chunking/       │  search/                │
│   CU, DI, Vision       Chonkie            Embeddings, Indexing,  │
│   Router, Triage        Noise, Strategies  Knowledge, Retrieval   │
└────────────────────────────┬────────────────────────────────────┘
                             │  uses
┌────────────────────────────▼────────────────────────────────────┐
│                    DOMAIN LAYER                                  │
│   domain/models.py  — Chunk, FileType, ExtractedDocument, etc.   │
│   domain/exceptions.py — IDPError hierarchy                      │
│   shared/resilience.py — @retry_on_transient()                   │
└─────────────────────────────────────────────────────────────────┘

Dependencies flow inward: adapters → application → bounded contexts → domain.

Project Structure (DDD)

idp-azure/
├── pyproject.toml                        # Dependencies & build config
├── .env.example                          # Required environment variables
├── README.md
└── src/idp_azure/
    ├── config.py                         # 🔧 Centralised Pydantic settings
    │
    ├── domain/                           # 🏛  DOMAIN LAYER (no infrastructure deps)
    │   ├── models.py                     #    AnalyzerChoice, FileType, Chunk,
    │   │                                 #    ExtractedDocument, RetrievalResult
    │   └── exceptions.py                 #    IDPError → IngestionError,
    │                                     #    ChunkingError, IndexingError, etc.
    │
    ├── shared/                           # 🔧 SHARED KERNEL
    │   └── resilience.py                 #    @retry_on_transient() decorator
    │
    ├── application/                      # 📋 APPLICATION SERVICES (use cases)
    │   ├── ingestion_service.py          #    Ingest: extract → chunk → index
    │   ├── document_service.py           #    List & delete indexed documents
    │   ├── query_service.py              #    Query knowledge base (agentic retrieval)
    │   └── setup_service.py              #    One-time infrastructure provisioning
    │
    ├── ingestion/                        # 📥 BOUNDED CONTEXT: Ingestion
    │   ├── router.py                     #    Stage 1 — routing & validation
    │   ├── content_understanding.py      #    Stage 2a — Azure CU (markdown + metadata)
    │   ├── document_intelligence.py      #    Stage 2b — Azure DI
    │   ├── vision.py                     #    Stage 2c — GPT-4o Vision + hybrid
    │   └── figure_triage.py              #    Stage 2d — CU figure classification
    │
    ├── chunking/                         # ✂️  BOUNDED CONTEXT: Chunking
    │   ├── noise.py                      #    Stage 2.5 — noise filtering (CU roles)
    │   ├── pipeline.py                   #    Stage 3 — Chonkie pipeline
    │   └── strategies.py                 #    Per-format chunk configs
    │
    ├── search/                           # 🔍 BOUNDED CONTEXT: Search
    │   ├── embeddings.py                 #    Stage 4 — Azure OpenAI embeddings
    │   ├── index.py                      #    Stage 4 — search index creation
    │   ├── indexing.py                   #    Stage 4 — chunk upload + SearchDocument
    │   ├── knowledge.py                  #    Stage 5 — knowledge source & base
    │   └── retrieval.py                  #    Stage 5 — agentic retrieval client
    │
    ├── agent/                            # 🤖 INTERFACE ADAPTER: RAG Agent
    │   ├── tools.py                      #    Agent tools (delegate to app services)
    │   ├── rag_agent.py                  #    Agent setup & streaming
    │   └── workflows.py                  #    Multi-agent workflows
    │
    ├── api/                              # 🌐 INTERFACE ADAPTER: REST API (FastAPI)
    │   ├── app.py                        #    App factory, lifespan, exception handlers
    │   ├── dependencies.py               #    DI for application services
    │   ├── models.py                     #    Request/response Pydantic schemas
    │   └── routers/
    │       ├── documents.py              #    Upload, delete, list, setup
    │       └── query.py                  #    Query + RAG agent streaming (SSE)
    │
    └── ui/                               # 🖥  INTERFACE ADAPTER: Web UI (Chainlit)
        └── app.py                        #    Chat interface + file upload

Quick Start

# 1. Install
cd idp-azure
uv sync

# 2. Configure
cp .env.example .env
# Edit .env with your Azure resource endpoints and keys

# 3. Start the REST API server (uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py
# or for production:
uv run uvicorn idp_azure.api.app:app --host 0.0.0.0 --port ${IDP_API_PORT:-8000}

# 4. One-time infrastructure setup (creates index + knowledge base)
curl -X POST http://localhost:8000/api/setup

# 5. Upload and ingest documents
curl -X POST http://localhost:8000/api/documents -F "file=@report.pdf"

# 6. Query the knowledge base
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key financial metrics?"}'

# 7. Chat with the RAG agent (SSE streaming)
curl -N -X POST http://localhost:8000/api/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize all documents"}'

# 8. Or launch the Chainlit web UI (uses IDP_UI_PORT, default 8001)
cd frontend && uv run chainlit run app.py --port ${IDP_UI_PORT:-8001}

Web UI

A standalone chat frontend built with Chainlit. The UI communicates with the REST API backend over HTTP — they run as separate processes.

# Terminal 1 — start the backend (uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py

# Terminal 2 — start the UI (uses IDP_UI_PORT, default 8001)
cd frontend && uv run chainlit run app.py --port ${IDP_UI_PORT:-8001}

The UI opens at http://localhost:8001 by default. Ports are configured via IDP_API_PORT (backend, default 8000) and IDP_UI_PORT (UI, default 8001) in .env. Set IDP_API_URL to override the full backend URL.

Features

Feature          How it works
Document upload  Drag & drop or use the 📎 attachment icon. Supports all file types listed below (PDF, DOCX, PPTX, XLSX, CSV, HTML, images, TXT, MD). Files are uploaded to the backend API and ingested automatically.
Chat             Ask natural-language questions — the backend RAG agent streams answers with source citations via SSE.
Streaming        Responses stream token-by-token via Server-Sent Events from the backend.
Error handling   Clear messages for backend connectivity issues, unsupported files, or ingestion failures.

Configuration

The web UI needs IDP_API_PORT (or IDP_API_URL) to connect to the backend. All Azure configuration lives in the backend's .env. Upload limits are set in .chainlit/config.toml (default: 5 files, 200 MB each).


REST API

A FastAPI backend that exposes the full IDP pipeline as REST endpoints. This is the primary interface for programmatic integration, custom frontends, or microservice architectures.

# Development (with hot reload, uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py

# Production
uv run uvicorn idp_azure.api.app:app --host 0.0.0.0 --port ${IDP_API_PORT:-8000}

OpenAPI docs are available at http://localhost:${IDP_API_PORT}/docs (default: http://localhost:8000/docs).

Endpoints

Method  Path                          Description
GET     /api/health                   Health & readiness check (per-service status)
POST    /api/setup                    Create search infrastructure (one-time)
POST    /api/documents                Upload & ingest a document (multipart file upload)
GET     /api/documents                List all indexed documents with chunk counts
DELETE  /api/documents/{source_file}  Delete all chunks for a document
POST    /api/query                    Query the search index (hybrid or knowledge base, based on SEARCH_MODE)
POST    /api/agent                    Chat with the RAG agent (SSE streaming)

Examples

# Health check
curl http://localhost:8000/api/health

# Upload and ingest a document
curl -X POST http://localhost:8000/api/documents \
  -F "file=@report.pdf"

# List indexed documents
curl http://localhost:8000/api/documents

# Query the knowledge base
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key financial metrics?"}'

# Chat with the RAG agent (SSE stream)
curl -N -X POST http://localhost:8000/api/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize all documents"}'

# Delete a document
curl -X DELETE http://localhost:8000/api/documents/report.pdf

Agent Streaming (SSE)

The /api/agent endpoint streams Server-Sent Events with four event types (a minimal client sketch follows the table):

Event    Payload                        Description
session  {"session_id": "…"}            Always first — identifies the conversation session
token    Raw text chunk                 A piece of the response as it is generated
done     {"status": "complete"}         The response is finished
error    {"error": "…", "detail": "…"}  An error occurred (SessionExpired when reusing a stale ID)

Agent Session Lifecycle

Multi-turn conversations are maintained through a server-side session model built on the Microsoft Agent Framework's AgentSession. The backend owns the session; clients only hold a session ID.

 Frontend (Chainlit UI)                              Backend (/api/agent)
 ─────────────────────                               ────────────────────

 1st message
 ┌────────────────────────┐   POST /api/agent
 │ { "question": "..." }  │ ──────────────────────►  No session_id →
 └────────────────────────┘                          agent.create_session()
                                                     Store in _AgentSessionStore
                             ◄─── SSE event: session  {"session_id":"abc-123"}
                             ◄─── SSE event: token    "Here is..."
                             ◄─── SSE event: done

 Store session_id="abc-123"
 in cl.user_session

 2nd message
 ┌────────────────────────────────────────────────┐
 │ { "question": "...", "session_id": "abc-123" } │
 └────────────────────────────────────────────────┘
                                          ──►  Lookup in _AgentSessionStore
                                               Found → reuse session (keeps
                                               full conversation history)
                             ◄─── SSE event: session  {"session_id":"abc-123"}
                             ◄─── SSE event: token    ...
                             ◄─── SSE event: done

 After TTL expires (default 1 hour)
 ┌────────────────────────────────────────────────┐
 │ { "question": "...", "session_id": "abc-123" } │
 └────────────────────────────────────────────────┘
                                          ──►  Lookup → expired/evicted
                             ◄─── SSE event: error
                                  {"error":"SessionExpired","detail":"..."}

 Clear stored session_id
 Next message creates new session

Key design decisions:

  • Server-owned sessions — The AgentSession (from Microsoft Agent Framework) holds the full conversation history (all prior turns, tool calls, and responses). The frontend never stores message history; it only stores the opaque session_id string.
  • In-memory store with lazy TTL eviction — _AgentSessionStore (app.py) is a dict[str, _SessionEntry] that evicts entries on every get()/put() call when time.monotonic() - last_accessed > TTL. Each successful lookup refreshes last_accessed, so active conversations never expire; see the sketch after this list.
  • Per-session locking — Each session has an asyncio.Lock() to serialise concurrent requests to the same session, preventing interleaved agent runs.
  • Graceful expiry handling — When a session is expired, the backend returns an SSE error event with "SessionExpired". The frontend clears its stored ID so the next message creates a fresh session.
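
A sketch of that store under the assumptions above (field names are illustrative; the lazy eviction, access-refresh, and per-session lock behaviours are as described):

import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class _SessionEntry:
    session: object                  # AgentSession from the Agent Framework
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    last_accessed: float = field(default_factory=time.monotonic)

class _AgentSessionStore:
    def __init__(self, ttl_seconds: float = 3600.0):
        self._ttl = ttl_seconds
        self._entries: dict[str, _SessionEntry] = {}

    def _evict_expired(self) -> None:
        now = time.monotonic()
        expired = [k for k, e in self._entries.items()
                   if now - e.last_accessed > self._ttl]
        for key in expired:
            del self._entries[key]

    def get(self, session_id: str) -> _SessionEntry | None:
        self._evict_expired()
        entry = self._entries.get(session_id)
        if entry:
            entry.last_accessed = time.monotonic()  # active sessions never expire
        return entry

    def put(self, session_id: str, session: object) -> _SessionEntry:
        self._evict_expired()
        entry = _SessionEntry(session=session)
        self._entries[session_id] = entry
        return entry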

Configuration:

Variable           Default  Description
AGENT_SESSION_TTL  3600     Seconds of inactivity before a session is evicted

Supported File Types

Type          Extensions                      Stage 2 Analyzer             Notes
Documents     .pdf, .docx, .pptx, .xlsx       Content Understanding        Full structure preservation
Spreadsheets  .csv                            Content Understanding        Row-based chunking
Web           .html, .htm                     Content Understanding        HTML → Markdown
Images        .png, .jpg, .jpeg, .tiff, .bmp  Content Understanding (OCR)  Vision fallback for complex images
Text          .txt, .md                       Direct read                  No API call needed

Environment Variables

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-key              # omit to use DefaultAzureCredential
AZURE_OPENAI_API_VERSION=2025-03-01-preview
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4.1            # primary chat model (GPT-4.1, per Stage 6)
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-large
AZURE_OPENAI_VISION_DEPLOYMENT=gpt-4o           # vision model (GPT-4o, per Stage 2c)
AZURE_OPENAI_QUERY_PLANNING_DEPLOYMENT=gpt-5-mini

# Azure Content Understanding
CONTENTUNDERSTANDING_ENDPOINT=https://your-cu.cognitiveservices.azure.com
CONTENTUNDERSTANDING_KEY=your-key           # optional

# Azure Document Intelligence
DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-di.cognitiveservices.azure.com
DOCUMENT_INTELLIGENCE_KEY=your-key          # optional

# Azure AI Search
AZURE_SEARCH_ENDPOINT=https://your-search.search.windows.net
AZURE_SEARCH_INDEX_NAME=idp-documents
AZURE_SEARCH_ADMIN_KEY=your-key             # optional

# Search mode: "hybrid" (default) or "knowledge_base"
#   hybrid         — keyword + vector + semantic reranking + custom LLM query rewrite
#   knowledge_base — LLM-driven agentic retrieval via knowledge base (requires GPT deployment)
SEARCH_MODE=hybrid

# Microsoft Foundry (optional)
AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com

# UI → Backend connection (only needed for Chainlit UI)
IDP_API_PORT=8000
IDP_UI_PORT=8001
# IDP_API_URL=http://localhost:8000  # overrides IDP_API_PORT if set

# Agent session TTL (seconds of inactivity before session is evicted)
AGENT_SESSION_TTL=3600

Logging

Application logs (idp_azure.*) and third-party / framework logs are separated so that turning on DEBUG doesn't flood the console with SDK transport noise.

Environment variables

Variable             Default      Description
IDP_LOG_LEVEL        INFO         Log level for application code (idp_azure.*)
IDP_LIB_LOG_LEVEL    WARNING      Log level for third-party libraries (root logger)
IDP_LIB_LOG_SILENCE  (see below)  Comma-separated logger names forced to WARNING even when IDP_LIB_LOG_LEVEL is lowered; set to "" to un-silence everything

Common recipes

# Normal development — only app DEBUG, libraries stay quiet
IDP_LOG_LEVEL=DEBUG

# Debug Azure SDK / OpenAI calls (transport noise auto-silenced)
IDP_LOG_LEVEL=DEBUG IDP_LIB_LOG_LEVEL=DEBUG

# Debug absolutely everything including httpx request/response headers
IDP_LOG_LEVEL=DEBUG IDP_LIB_LOG_LEVEL=DEBUG IDP_LIB_LOG_SILENCE=""

# Debug only OpenAI, silence Azure SDK
IDP_LIB_LOG_LEVEL=DEBUG IDP_LIB_LOG_SILENCE="httpx,httpcore,urllib3,asyncio,watchfiles,opentelemetry,msal,azure"

Default-silenced loggers

When IDP_LIB_LOG_LEVEL is lowered to DEBUG, these loggers are kept at WARNING by default because they produce extreme noise:

Logger            What it emits at DEBUG
httpx / httpcore  Every outgoing HTTP request and response, including headers
urllib3           Connection-pool lifecycle (open, close, reuse)
asyncio           Event-loop internals, selector polls, task scheduling
msal              Token-cache lookups, OAuth2 handshake steps
watchfiles        File-system change events (noisy in --reload mode)
opentelemetry     Span-export batching, internal SDK state

Useful-for-debugging loggers (not silenced)

These are not silenced by default — they produce actionable output when you set IDP_LIB_LOG_LEVEL=DEBUG:

Logger           What it emits at DEBUG
azure            Azure SDK pipeline — request policies, retry logic, auth flow (covers azure-search-documents, azure-ai-documentintelligence, azure-ai-contentunderstanding, azure-identity)
openai           OpenAI SDK — request/response payloads for chat completions and embeddings
agent_framework  Microsoft Agent Framework — workflow execution, tool dispatch, orchestration
chonkie          Chunking library internals
uvicorn          ASGI server startup, shutdown, lifespan events
fastapi          Router registration, middleware chain

Error Handling

All domain exceptions inherit from IDPError for consistent handling:

IDPError
├── IngestionError
│   ├── UnsupportedFileTypeError    # unknown file extension
│   ├── FileTooLargeError           # exceeds 200 MB
│   ├── EmptyFileError              # zero-byte file
│   └── ExtractionError             # Azure service returned no content
├── ChunkingError                   # Chonkie pipeline failure
├── IndexingError                   # search index upload failure
├── RetrievalError                  # knowledge base query failure
└── AgentError                      # RAG agent failure

All Azure API calls are wrapped with @retry_on_transient() which retries on HTTP 429 (rate limit), 503 (unavailable), and 504 (timeout) with exponential backoff (2s → 4s → 8s, max 3 retries).
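
A sketch of that retry behaviour (the status-code probe on the raised exception is an assumption; the retry count and delays are the ones stated above):

import functools
import time

TRANSIENT_STATUS = {429, 503, 504}

def retry_on_transient(max_retries: int = 3, base_delay: float = 2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    status = getattr(exc, "status_code", None)
                    if status not in TRANSIENT_STATUS or attempt == max_retries:
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s
        return wrapper
    return decorator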

