
IDP Azure — Intelligent Document Processing

An end-to-end Intelligent Document Processing (IDP) pipeline built entirely on the Azure ecosystem. It ingests documents of any type, extracts structured content, chunks intelligently, indexes with vector search, and exposes an agentic RAG interface — all following Domain-Driven Design (DDD) principles for clear, maintainable code.


End-to-End Pipeline Flow

The pipeline runs through six numbered stages, plus two intermediate steps (2.5 and 2.6). Each stage is a distinct bounded context with a clear responsibility, input, output, and Azure technology.

 ┌──────────────────────────────────────────────────────────────┐
 │                     DOCUMENT SOURCES                         │
 │  PDF │ DOCX │ PPTX │ XLSX │ CSV │ HTML │ Images │ TXT │ MD   │
 └───────────────────────────┬──────────────────────────────────┘
                             │
                     ┌───────▼───────┐
                     │   STAGE 1     │
                     │   Routing &   │
                     │   Validation  │
                     └───────┬───────┘
                             │  validates size, type, existence
                             │  selects optimal analyzer
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
┌────────▼────────┐ ┌───────▼────────┐ ┌────────▼────────┐
│    STAGE 2a     │ │    STAGE 2b    │ │    STAGE 2c     │
│    Content      │ │    Document    │ │    GPT-4o       │
│    Understanding│ │    Intelligence│ │    Vision       │
│    (primary)    │ │    (tables)    │ │    (images)     │
└────────┬────────┘ └───────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │  all paths produce Markdown
                             │  + structural metadata (CU)
                             │
                     ┌───────▼───────┐
                     │   STAGE 2.5   │
                     │   Noise       │
                     │   Filtering   │
                     └───────┬───────┘
                             │  removes headers/footers/
                             │  page numbers via CU roles
                             │
                     ┌───────▼───────┐
                     │   STAGE 2.6   │
                     │   Speaker     │
                     │   Notes (PPTX)│
                     └───────┬───────┘
                             │  extracts notes via
                             │  python-pptx (local)
                             │
                     ┌───────▼───────┐
                     │   STAGE 3     │
                     │   Chunking    │
                     │  (page-first) │
                     └───────┬───────┘
                             │  splits by page markers,
                             │  merges mid-sentence breaks
                             │
                     ┌───────▼───────┐
                     │   STAGE 4     │
                     │   Embedding   │
                     │   & Indexing  │
                     └───────┬───────┘
                             │  embeds + uploads to search
                             │
                     ┌───────▼───────┐
                     │   STAGE 5     │
                     │   Agentic     │
                     │   Retrieval   │
                     └───────┬───────┘
                             │  LLM-driven query planning
                             │
                     ┌───────▼───────┐
                     │   STAGE 6     │
                     │   RAG Agent   │
                     │   (response)  │
                     └───────────────┘

Stage-by-Stage Breakdown

Stage 1 — Document Routing & Validation

What: Receives a file, validates it, and decides which Azure analyzer should process it.

Aspect        Detail
Code          ingestion/router.py
Technology    Pure Python (no Azure calls)
Input         Raw file path
Output        DocumentMetadata (file type, analyzer choice, size)
Domain Model  AnalyzerChoice enum, FileType enum

How it works:

  1. Checks the file exists, is non-empty, and is under 200 MB
  2. Maps the file extension to an AnalyzerChoice:
    • .pdf, .docx, .pptx, .xlsx, .csv, .html, images → Content Understanding
    • .txt, .md → Direct Read (no API call needed)
  3. Returns a DocumentMetadata value object used by subsequent stages
file_path → validate_file() → DocumentMetadata { file_type, analyzer, size }
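
A minimal sketch of the Stage 1 routing rule. AnalyzerChoice and DocumentMetadata live in domain/models.py; their exact fields, the extension sets, and this helper's shape are illustrative assumptions (the real code raises the IDPError hierarchy described under Error Handling):

from dataclasses import dataclass
from enum import Enum
from pathlib import Path

MAX_SIZE_BYTES = 200 * 1024 * 1024  # 200 MB validation ceiling

class AnalyzerChoice(Enum):
    CONTENT_UNDERSTANDING = "content_understanding"
    DIRECT_READ = "direct_read"

CU_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".htm",
                 ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}
DIRECT_EXTENSIONS = {".txt", ".md"}

@dataclass(frozen=True)
class DocumentMetadata:
    file_type: str
    analyzer: AnalyzerChoice
    size: int

def validate_file(file_path: str) -> DocumentMetadata:
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(file_path)
    size = path.stat().st_size
    if size == 0:
        raise ValueError("empty file")           # EmptyFileError in the real code
    if size > MAX_SIZE_BYTES:
        raise ValueError("file exceeds 200 MB")  # FileTooLargeError in the real code
    ext = path.suffix.lower()
    if ext in DIRECT_EXTENSIONS:
        return DocumentMetadata(ext, AnalyzerChoice.DIRECT_READ, size)
    if ext in CU_EXTENSIONS:
        return DocumentMetadata(ext, AnalyzerChoice.CONTENT_UNDERSTANDING, size)
    raise ValueError(f"unsupported file type: {ext}")  # UnsupportedFileTypeError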

Stage 2 — Document Extraction (3 Analyzers)

What: Extracts structured text from the document as Markdown. Three analyzers are available, selected according to document type.

2a. Azure Content Understanding (Primary)

Aspect      Detail
Code        ingestion/content_understanding.py
Technology  Azure Content Understanding SDK (azure-ai-contentunderstanding)
Input       Binary file bytes
Output      Markdown string with tables, headings, page structure
Used For    PDF, DOCX, PPTX, XLSX, CSV, HTML, images (built-in OCR)
Resilience  @retry_on_transient() — retries on 429/503/504 with exponential backoff
file_bytes → ContentUnderstandingClient.begin_analyze_binary() → AnalyzeResult → .markdown

2b. Azure Document Intelligence (Complementary)

Aspect      Detail
Code        ingestion/document_intelligence.py
Technology  Azure Document Intelligence SDK (azure-ai-documentintelligence)
Input       Binary file bytes
Output      Markdown with fine-grained table structure (row/column counts)
Used For    Documents needing superior layout analysis or form-field extraction
file_bytes → DocumentIntelligenceClient.begin_analyze_document(model="prebuilt-layout") → .content

2c. Azure OpenAI GPT-4o Vision (Selective)

Aspect      Detail
Code        ingestion/vision.py
Technology  Azure OpenAI (openai SDK) — GPT-4o with vision capability
Input       Image bytes (PNG/JPEG/TIFF/BMP) or PDF pages rendered as 200-DPI PNGs
Output      Detailed Markdown description (charts, diagrams, data points, tables)
Used For    Complex visual content that text extractors miss: charts, diagrams, handwriting
image_bytes → base64 encode → GPT-4o chat.completions.create(vision) → Markdown description
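
A sketch of the Stage 2c call using the openai SDK. The deployment name, prompt wording, and endpoint are assumptions; the base64 data-URL pattern is the standard way to pass image bytes to a vision-capable chat model:

import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_key="your-key",
    api_version="2024-06-01",
)

def describe_image(image_bytes: bytes, mime: str = "image/png") -> str:
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    response = client.chat.completions.create(
        model="gpt-4o",  # your vision-capable deployment
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure as detailed Markdown, including "
                         "any data points, axis labels, and tables."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content or ""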

2d. Hybrid Extraction (CU + Figure Triage + Vision)

For PDFs with complex images, a hybrid approach is available. It uses CU's native figure metadata for content-aware triage instead of crude size-based filtering:

file -> Content Understanding (text + tables + figure metadata)
     -> CU identifies figures with kind: CHART / MERMAID / UNKNOWN
     -> CHART/MERMAID: CU provides structured content directly (no Vision call)
     -> UNKNOWN (no description): crop figure region via PyMuPDF -> GPT-4o Vision
     -> combined Markdown output
Aspect      Detail
Code        ingestion/figure_triage.py, application/ingestion_service.py
Technology  Content Understanding (figure metadata) + GPT-4o Vision (fallback) + PyMuPDF (region cropping)
Output      ExtractedDocument with markdown + figure descriptions

Figure triage logic (a sketch follows the list):

  • CHART figures: CU provides Chart.js structured content and description — used directly
  • MERMAID figures: CU provides Mermaid.js syntax and description — used directly
  • UNKNOWN figures with a CU description (>= 20 chars): used directly
  • UNKNOWN figures without a description: figure region cropped from the PDF page using CU's bounding polygon, then sent to GPT-4o Vision
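
A sketch of the triage branch above. The figure object's fields (kind, description, polygon, page_number) and the describe_with_vision() helper are illustrative assumptions; the branches and the 20-character threshold mirror the rules in the list:

import fitz  # PyMuPDF

MIN_DESCRIPTION_CHARS = 20

def triage_figure(figure, pdf_path: str, describe_with_vision) -> str:
    if figure.kind in ("CHART", "MERMAID"):
        return figure.content       # CU already returned structured content
    if figure.description and len(figure.description) >= MIN_DESCRIPTION_CHARS:
        return figure.description   # CU's own description suffices
    # UNKNOWN and undescribed: crop the region and fall back to GPT-4o Vision.
    doc = fitz.open(pdf_path)
    page = doc[figure.page_number - 1]
    xs = [point.x for point in figure.polygon]
    ys = [point.y for point in figure.polygon]
    clip = fitz.Rect(min(xs), min(ys), max(xs), max(ys))
    png_bytes = page.get_pixmap(clip=clip, dpi=200).tobytes("png")
    doc.close()
    return describe_with_vision(png_bytes)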

Stage 2.5 — Noise Filtering

What: Removes headers, footers, and page numbers from extracted markdown before chunking to prevent noise from contaminating retrieval chunks.

Aspect      Detail
Code        chunking/noise.py
Technology  Azure Content Understanding paragraph roles (native metadata)
Input       Markdown string + DocumentContent (paragraph metadata)
Output      Cleaned markdown with noise elements removed

How it works:

  1. Primary path (CU metadata): CU classifies every paragraph with a semantic role. Paragraphs with role PAGE_HEADER, PAGE_FOOTER, or PAGE_NUMBER carry span offsets into the markdown. The noise filter collects these spans, merges overlapping ones, and rebuilds the markdown from the non-noise ranges.

  2. Fallback path (regex): When CU paragraph metadata is unavailable (e.g. the direct-read or DI extraction paths), a conservative regex removes only standalone page-number lines (Page X of Y, - X -, bare numbers on their own line); a sketch follows below.

Design principle: conservative by default — ambiguous content is kept, not removed.
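
A sketch of that regex fallback: it drops only standalone page-number lines and keeps everything else. The exact patterns are assumptions matching the examples above:

import re

_PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:page\s+\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-|\d+)\s*$",
    re.IGNORECASE,
)

def strip_page_number_lines(markdown: str) -> str:
    kept = [line for line in markdown.splitlines()
            if not _PAGE_NUMBER_LINE.match(line)]
    return "\n".join(kept)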


Stage 2.6 — Speaker Notes Extraction (PPTX only)

What: Extracts speaker notes from PowerPoint presentations. CU and DI do not extract notes — they only process visible slide content. Notes are extracted locally using python-pptx with no API calls.

Aspect      Detail
Code        ingestion/speaker_notes.py
Technology  python-pptx (reads PPTX XML structure locally)
Input       .pptx file path
Output      List of "### Notes from Page N" formatted strings
Used For    .pptx files only (notes are lost in PDF-from-PPT conversion)

Why this matters for RAG: Slides typically have terse bullet points while speaker notes contain the full explanation, context, and reasoning. For retrieval, notes often produce better answers than slide text alone.

How it integrates: Notes are formatted with ### Notes from Page N headings and appended to the ExtractedDocument.image_descriptions list. The existing content assembly module interleaves them at the correct page position, so each slide's chunk contains both the visible slide content and the presenter's notes.

.pptx file → python-pptx reads notesSlide XML → "### Notes from Page N"
  → interleaved into full_content at page position → chunked with slide content
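
A sketch of the extraction using python-pptx. The heading format follows the "### Notes from Page N" convention described above, and everything runs locally:

from pptx import Presentation

def extract_speaker_notes(pptx_path: str) -> list[str]:
    notes: list[str] = []
    for page, slide in enumerate(Presentation(pptx_path).slides, start=1):
        if not slide.has_notes_slide:
            continue
        text = slide.notes_slide.notes_text_frame.text.strip()
        if text:
            notes.append(f"### Notes from Page {page}\n\n{text}")
    return notes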

Stage 3 — Chunking

What: Splits extracted Markdown into retrieval-optimized chunks using page markers as the primary split, with mid-sentence merge for report-style page breaks and overflow splitting for oversized pages.

Aspect      Detail
Code        chunking/pipeline.py
Technology  Page-marker splitting + Chonkie RecursiveChunker (overflow only)
Input       Markdown string (with <!-- PageBreak --> / <!-- PageNumber="N" --> markers from CU)
Output      List of Chunk domain objects (id, text, token_count, page_number)

Pipeline stages (in order):

Markdown (with CU page markers)
  │
  ▼
Split by page markers               ← <!-- PageBreak --> / <!-- PageNumber="N" -->
  │                                    (NOT by ## headings — CU renders PPT text
  │                                     boxes as ## headings which would over-split)
  ▼
Mid-sentence merge                   ← if page N ends without .!?: AND page N+1
  │                                    starts lowercase → merge (report page breaks)
  ▼
Overflow split (RecursiveChunker)     ← only for blocks exceeding chunk_size (1024)
  │                                    uses plain-text recipe, not markdown
  ▼
List[Chunk]                          ← each chunk carries page_number for citations

Why page-first, not heading-based:

Azure Content Understanding renders each text box from PPT slides as a separate ## heading in the markdown (e.g., ## Customer:, ## Challenge:). A heading-based chunker like RecursiveChunker(recipe="markdown") splits on every ##, producing hundreds of micro-chunks (6–50 tokens each) from a 20-slide deck. The page-first approach treats each page/slide as a single chunk, keeping all content together regardless of ## text-box headings.

For report-style PDFs where paragraphs flow across page boundaries, the mid-sentence merge step detects when a page break cuts a paragraph (previous page ends without sentence-terminal punctuation + next page starts with a lowercase letter) and merges those pages back together. This is conservative — only mid-sentence breaks trigger merging, not topical relatedness.
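
A sketch of the merge rule. The helper name and the exact boundary conditions are assumptions; the heuristic itself (no terminal punctuation plus a lowercase continuation) is the one described above:

SENTENCE_TERMINALS = (".", "!", "?", ":")

def merge_mid_sentence_breaks(pages: list[str]) -> list[str]:
    merged: list[str] = []
    for page in pages:
        prev = merged[-1].rstrip() if merged else ""
        nxt = page.lstrip()
        if prev and nxt and not prev.endswith(SENTENCE_TERMINALS) and nxt[0].islower():
            merged[-1] = f"{prev} {nxt}"  # report-style break: rejoin the paragraph
        else:
            merged.append(page)           # self-contained page/slide stays separate
    return merged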

Chunk sizing:

Scenario                                   Handling
Page fits within chunk_size (1024 tokens)  One chunk per page
Page exceeds chunk_size                    Split with RecursiveChunker (plain-text recipe)
Mid-sentence page break (reports)          Adjacent pages merged, then size-checked
Slide-like pages (self-contained)          No merging — each page stays separate

Stage 4 — Embedding & Indexing

What: Converts each chunk to a 3072-dimensional vector and uploads to Azure AI Search.

Aspect Detail
Code search/embeddings.py, search/index.py, search/indexing.py
Technology Azure OpenAI Embeddings (text-embedding-3-large) + Azure AI Search (azure-search-documents)
Input List of Chunk objects
Output Documents indexed in Azure AI Search

Sub-steps:

                                    ┌─────────────────────────────────┐
Chunks ──► Batch embed (16/call) ──►│  Azure AI Search Index          │
           Azure OpenAI             │  ┌───────────────────────────┐  │
           text-embedding-3-large   │  │ Fields:                   │  │
           (3072 dimensions)        │  │  • id (key)               │  │
                                    │  │  • content (searchable)   │  │
                                    │  │  • original_content       │  │
                                    │  │  • content_vector (3072d) │  │
                                    │  │  • source_file (filter)   │  │
                                    │  │  • file_type (filter)     │  │
                                    │  │  • chunk_index (sortable) │  │
                                    │  │  • page_number (filter)   │  │
                                    │  └───────────────────────────┘  │
                                    │  Vector: HNSW algorithm         │
                                    │  Vectorizer: integrated AOAI    │
                                    │  Semantic: content field ranked │
                                    └─────────────────────────────────┘

Key details:

  • Integrated vectorizer — the index is configured so Azure AI Search can call Azure OpenAI for query-time vectorization automatically (required for agentic retrieval)
  • Batch embedding — texts are embedded 16 at a time with @retry_on_transient() for rate-limit resilience
  • Buffered upload — SearchIndexingBufferedSender handles reliable batch uploads with auto-retry
  • Deterministic IDs — chunk IDs are SHA-256 hashes of source_file:chunk_index (supports re-indexing); see the sketch after this list
  • Contextual enrichment (opt-in) — when CONTEXTUAL_ENRICHMENT_ENABLED=true, an LLM generates a short document-level context prefix for each chunk before embedding. The enriched text is stored in content (for search), while the raw chunk text is preserved in original_content (for display). This follows Anthropic's Contextual Retrieval approach and, combined with the existing hybrid search and semantic reranking, can reduce retrieval failures by up to 67%. See Contextual Enrichment below for details.
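
A sketch of the batch-embed and deterministic-ID steps. The endpoint, key, and deployment name are placeholders; the batch size of 16 and the SHA-256 ID scheme follow the description above:

import hashlib
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://your-resource.openai.azure.com",
                     api_key="your-key", api_version="2024-06-01")

def embed_texts(texts: list[str], batch_size: int = 16) -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-large",   # 3072-dimensional vectors
            input=texts[i:i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors

def chunk_id(source_file: str, chunk_index: int) -> str:
    # Deterministic: re-ingesting a file overwrites the same index documents.
    return hashlib.sha256(f"{source_file}:{chunk_index}".encode()).hexdigest()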

Stage 5 — Search Retrieval

What: Retrieves relevant document chunks using hybrid search (default) or LLM-powered agentic retrieval via a knowledge base. Configurable via the SEARCH_MODE environment variable.

Aspect      Detail
Code        search/retrieval.py, search/query_rewrite.py, search/knowledge.py
Technology  Azure AI Search — hybrid search with LLM query rewriting + semantic reranking (default), or Agentic Retrieval via KnowledgeBaseRetrievalClient
Input       Natural-language query (hybrid) or conversational messages (knowledge base)
Output      RetrievalResult (content, references with source/page, optional activity)

Two retrieval modes (set via SEARCH_MODE env var):

Mode 1: Hybrid Search (default — SEARCH_MODE=hybrid)

Combines three retrieval signals in a single request:

User question: "What about its revenue?"  (multi-turn follow-up)
  │
  ▼  Custom LLM Query Rewrite (GPT-5-mini)
  │  • Resolves coreferences: "its" → "Division B"
  │  • Expands short/ambiguous queries with keyword synonyms
  │  • Only fires when conversation context needs resolution
  │  • Skipped for clear single-turn queries (zero added latency)
  │
  ├──► BM25 text query (expanded: rewritten + keyword synonyms)
  ├──► Vector query (clean standalone rewrite only — no expansion noise)
  │
  ▼  Semantic Reranking (cross-encoder, uses clean rewritten query)
  │  (rescores fused results for higher relevance)
  │
  ▼  Top-K results with source citations

  • Query rewriting — a custom LLM-based pre-search rewrite (search/query_rewrite.py) handles conversational coreference resolution and conditional keyword expansion. It does not rely on Azure AI Search's built-in generative query rewrite; the custom rewriter gives full control over rewrite behavior, structured output, and prompt caching.
  • Three-channel query split — BM25 gets the expanded text for broad recall, while vector search and the semantic reranker get the clean standalone query for precision (via the semantic_query parameter); see the sketch after this list.
  • Semantic reranking — cross-encoder reranker via the default semantic configuration
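
A sketch of that three-channel split with azure-search-documents. The semantic-configuration name, k, and top values are assumptions; the expanded-vs-clean query routing follows the design above:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="idp-documents",
    credential=AzureKeyCredential("your-key"),
)

def hybrid_search(expanded_query: str, clean_query: str, top: int = 5):
    return search_client.search(
        search_text=expanded_query,              # BM25 channel: broad recall
        vector_queries=[VectorizableTextQuery(   # vector channel: precision
            text=clean_query,                    # integrated vectorizer embeds it
            k_nearest_neighbors=50,
            fields="content_vector",
        )],
        query_type="semantic",                   # cross-encoder reranking
        semantic_configuration_name="default",
        semantic_query=clean_query,              # rerank on the clean rewrite
        top=top,
    )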

Mode 2: Knowledge Base (SEARCH_MODE=knowledge_base)

LLM-powered agentic retrieval with query planning and optional answer synthesis:

User question: "Compare revenue trends in Q3 vs Q4 and list risk factors"
  │
  ▼  LLM Query Planning (GPT-5-mini)
  │
  ├──► Subquery 1: "Q3 revenue trends"     ──► hybrid search ──► results
  ├──► Subquery 2: "Q4 revenue trends"     ──► hybrid search ──► results
  └──► Subquery 3: "risk factors"          ──► hybrid search ──► results
                                                     │
                                          merge + semantic rerank
                                                     │
                                                     ▼
                                            Unified response with
                                            source citations

Architecture (3 layers):

┌────────────────────────────────────────────────────────────────┐
│                    Knowledge Base                              │
│  • LLM: GPT-5-mini (query planning — AI Search compatible)     │
│  • Decomposes complex questions into focused subqueries        │
│  • Runs subqueries in parallel                                 │
│  • Optionally synthesises a natural-language answer            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                Knowledge Source                          │  │
│  │  • Wraps the search index                                │  │
│  │  • Citation fields: id, source_file, page_number         │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │              Search Index                          │  │  │
│  │  │  • Hybrid search: BM25 text + HNSW vector          │  │  │
│  │  │  • Semantic reranking on content field             │  │  │
│  │  │  • Integrated vectorizer (auto text→vector)        │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

Multi-turn support: In knowledge base mode, pass the full conversation history — the knowledge base uses prior messages for better query planning on follow-up questions. In hybrid mode, only the last user message is used as the search query.


Stage 6 — RAG Agent

What: An intelligent conversational agent that uses agentic retrieval to answer document questions with citations, streams responses in real time, and produces structured outputs.

Aspect      Detail
Code        agent/tools.py, agent/rag_agent.py, agent/workflows.py
Technology  Microsoft Agent Framework 1.0.0 (agent-framework, agent-framework-openai)
LLM         Azure OpenAI GPT-4.1 (primary chat model)
Input       Natural-language question
Output      Answer with inline source citations

Agent architecture:

┌─────────────────────────────────────────────────────────────┐
│              DocumentAssistant (RAG Agent)                   │
│              LLM: Azure OpenAI GPT-4.1                      │
│                                                             │
│   Tools:                                                    │
│   ┌─────────────────────────────────────────────────────┐   │
│   │ search_documents(query)                             │   │
│   │   → Calls Stage 5 (Agentic Retrieval)               │   │
│   │   → Returns content + source citations               │   │
│   ├─────────────────────────────────────────────────────┤   │
│   │ list_indexed_documents()                            │   │
│   │   → Queries search index facets                     │   │
│   │   → Returns list of files with chunk counts          │   │
│   └─────────────────────────────────────────────────────┘   │
│                                                             │
│   Capabilities:                                             │
│   • Streaming responses (real-time token output)            │
│   • Structured output (Pydantic DocumentSummary)            │
│   • Multi-turn conversation with context                    │
│   • Multi-agent workflows (Retriever → Analyzer → Writer)   │
└─────────────────────────────────────────────────────────────┘

Multi-agent workflow (optional, via workflows.py):

User query
  │
  ▼
Retriever Agent ──► searches & retrieves relevant chunks
  │
  ▼
Analyzer Agent  ──► identifies themes, contradictions, insights
  │
  ▼
Writer Agent    ──► formats into a structured, cited response

Contextual Enrichment (Opt-In)

Anthropic's Contextual Retrieval approach: prepend LLM-generated context to each chunk before embedding. Combined with hybrid search (BM25 + vector + semantic rerank), this reduces retrieval failures by up to 67%.

Aspect  Detail
Code    chunking/enrichment.py
Toggle  CONTEXTUAL_ENRICHMENT_ENABLED=true (disabled by default)
Cost    1 summary call + 1 call per chunk per ingested document

How it works (a sketch follows these steps):

  1. Document summary — a single LLM call generates a 3-5 sentence summary of the entire document.
  2. Per-chunk context — for each chunk, an LLM call receives {summary} + {chunk_text} and produces 2-3 sentences situating the chunk within the document.
  3. Dual storage — enriched text (context + chunk) is stored in content for search; raw chunk text is preserved in original_content for display.
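
A sketch of that hierarchical flow. The prompts and the chat() helper (one synchronous LLM call returning text) are illustrative assumptions; the dual content/original_content storage follows step 3:

def enrich_chunks(document_text: str, chunks: list[str], chat) -> list[dict]:
    summary = chat(f"Summarise this document in 3-5 sentences:\n\n{document_text}")
    enriched = []
    for chunk in chunks:
        context = chat(
            f"Given this document summary:\n{summary}\n\n"
            "Write 2-3 sentences situating the following chunk within the document:\n"
            f"<document_chunk>\n{chunk}\n</document_chunk>"
        )
        enriched.append({
            "content": f"{context}\n\n{chunk}",  # embedded and searched
            "original_content": chunk,           # shown to the user
        })
    return enriched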

Configuration:

Env Var                               Default                  Description
CONTEXTUAL_ENRICHMENT_ENABLED         false                    Enable contextual enrichment during ingestion
CONTEXTUAL_ENRICHMENT_DEPLOYMENT      primary chat deployment  LLM deployment for enrichment
CONTEXTUAL_ENRICHMENT_MAX_CONCURRENT  5                        Max concurrent LLM calls for chunk enrichment

Notes:

  • Enrichment is fail-safe — on failure it falls back to raw chunks rather than aborting ingestion.
  • Chunk text is wrapped in <document_chunk> XML tags to prevent Azure OpenAI's content filter from misidentifying imperative business language (e.g., PPT slide text like "You must ensure...") as jailbreak attempts. If the content filter still triggers, enrichment is silently skipped for that chunk.
  • Existing documents must be re-ingested to gain enrichment.
  • The hierarchical approach (summary → per-chunk context) keeps cost manageable versus sending the full document with every chunk.

Technology Stack Summary

Stage                 Technology                         Package                           Purpose
1. Routing            Pure Python                        —                                 File validation & analyzer selection
2a. Extraction        Azure Content Understanding        azure-ai-contentunderstanding     Primary document → Markdown + metadata
2b. Extraction        Azure Document Intelligence        azure-ai-documentintelligence     Layout & table extraction
2c. Extraction        Azure OpenAI GPT-4o Vision         openai                            Chart/diagram description
2c. PDF rendering     PyMuPDF                            pymupdf                           PDF page → PNG for Vision
2d. Figure triage     Azure Content Understanding        azure-ai-contentunderstanding     Figure classification (CHART/MERMAID/UNKNOWN)
2.5. Noise filtering  Azure Content Understanding        azure-ai-contentunderstanding     Paragraph role-based noise removal
2.6. Speaker notes    python-pptx                        python-pptx                       PPTX speaker notes extraction (local, no API)
3. Chunking           Page-marker splitting + Chonkie    chonkie                           Page-first chunking with overflow splitting
4. Embedding          Azure OpenAI                       openai                            text-embedding-3-large (3072d)
4. Indexing           Azure AI Search                    azure-search-documents            Vector + BM25 hybrid index
5. Retrieval          Azure AI Search Agentic Retrieval  azure-search-documents (preview)  LLM-driven query planning
6. Agent              Microsoft Agent Framework 1.0.0    agent-framework-openai            RAG agent with tool calling
6. Model mgmt         Microsoft Foundry                  agent-framework-foundry           Centralised model deployment
Cross-cutting         Pydantic                           pydantic                          Settings validation & domain models
Cross-cutting         Azure Identity                     azure-identity                    DefaultAzureCredential auth
REST API              FastAPI                            fastapi[standard]                 REST endpoints + SSE streaming

Architecture (DDD)

The codebase follows Domain-Driven Design with clear layering:

┌─────────────────────────────────────────────────────────────────┐
│                    INTERFACE ADAPTERS                            │
│   api/ (FastAPI)  │  ui/ (Chainlit)  │  agent/ (Agent Framework)│
│   REST endpoints     Chat + upload      RAG agent tools          │
└────────────────────────────┬────────────────────────────────────┘
                             │  delegates to
┌────────────────────────────▼────────────────────────────────────┐
│                    APPLICATION SERVICES                          │
│   IngestionService  │ DocumentService │ QueryService │ SetupSvc  │
│   ingest()            list/delete()     query()        provision()│
│                                                        analyzer() │
└────────────────────────────┬────────────────────────────────────┘
                             │  coordinates
┌────────────────────────────▼────────────────────────────────────┐
│                    BOUNDED CONTEXTS (Infrastructure)             │
│   ingestion/        │  chunking/       │  search/                │
│   CU, DI, Vision       Chonkie            Embeddings, Indexing,  │
│   Router, Triage        Noise, Strategies  Knowledge, Retrieval   │
└────────────────────────────┬────────────────────────────────────┘
                             │  uses
┌────────────────────────────▼────────────────────────────────────┐
│                    DOMAIN LAYER                                  │
│   domain/models.py  — Chunk, FileType, ExtractedDocument, etc.   │
│   domain/exceptions.py — IDPError hierarchy                      │
│   shared/resilience.py — @retry_on_transient()                   │
└─────────────────────────────────────────────────────────────────┘

Dependencies flow inward: adapters → application → bounded contexts → domain.

Project Structure (DDD)

idp-azure/
├── pyproject.toml                        # Dependencies & build config
├── .env.example                          # Required environment variables
├── README.md
└── src/idp_azure/
    ├── config.py                         # 🔧 Centralised Pydantic settings
    │
    ├── domain/                           # 🏛  DOMAIN LAYER (no infrastructure deps)
    │   ├── models.py                     #    AnalyzerChoice, FileType, Chunk,
    │   │                                 #    ExtractedDocument, RetrievalResult
    │   └── exceptions.py                 #    IDPError → IngestionError,
    │                                     #    ChunkingError, IndexingError, etc.
    │
    ├── shared/                           # 🔧 SHARED KERNEL
    │   └── resilience.py                 #    @retry_on_transient() decorator
    │
    ├── application/                      # 📋 APPLICATION SERVICES (use cases)
    │   ├── ingestion_service.py          #    Ingest: extract → chunk → index
    │   ├── document_service.py           #    List & delete indexed documents
    │   ├── query_service.py              #    Query knowledge base (agentic retrieval)
    │   └── setup_service.py              #    One-time infrastructure provisioning
    │
    ├── ingestion/                        # 📥 BOUNDED CONTEXT: Ingestion
    │   ├── router.py                     #    Stage 1 — routing & validation
    │   ├── content_understanding.py      #    Stage 2a — Azure CU (markdown + metadata)
    │   ├── document_intelligence.py      #    Stage 2b — Azure DI
    │   ├── vision.py                     #    Stage 2c — GPT-4o Vision + hybrid
    │   └── figure_triage.py              #    Stage 2d — CU figure classification
    │
    ├── chunking/                         # ✂️  BOUNDED CONTEXT: Chunking
    │   ├── noise.py                      #    Stage 2.5 — noise filtering (CU roles)
    │   ├── pipeline.py                   #    Stage 3 — Chonkie pipeline
    │   └── strategies.py                 #    Per-format chunk configs
    │
    ├── search/                           # 🔍 BOUNDED CONTEXT: Search
    │   ├── embeddings.py                 #    Stage 4 — Azure OpenAI embeddings
    │   ├── index.py                      #    Stage 4 — search index creation
    │   ├── indexing.py                   #    Stage 4 — chunk upload + SearchDocument
    │   ├── knowledge.py                  #    Stage 5 — knowledge source & base
    │   └── retrieval.py                  #    Stage 5 — agentic retrieval client
    │
    ├── agent/                            # 🤖 INTERFACE ADAPTER: RAG Agent
    │   ├── tools.py                      #    Agent tools (delegate to app services)
    │   ├── rag_agent.py                  #    Agent setup & streaming
    │   └── workflows.py                  #    Multi-agent workflows
    │
    ├── api/                              # 🌐 INTERFACE ADAPTER: REST API (FastAPI)
    │   ├── app.py                        #    App factory, lifespan, exception handlers
    │   ├── dependencies.py               #    DI for application services
    │   ├── models.py                     #    Request/response Pydantic schemas
    │   └── routers/
    │       ├── documents.py              #    Upload, delete, list, setup
    │       └── query.py                  #    Query + RAG agent streaming (SSE)
    │
    └── ui/                               # 🖥  INTERFACE ADAPTER: Web UI (Chainlit)
        └── app.py                        #    Chat interface + file upload

Quick Start

# 1. Install
cd idp-azure
uv sync

# 2. Configure
cp .env.example .env
# Edit .env with your Azure resource endpoints and keys

# 3. Start the REST API server (uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py
# or for production:
uv run uvicorn idp_azure.api.app:app --host 0.0.0.0 --port ${IDP_API_PORT:-8000}

# 4. One-time infrastructure setup (creates index + knowledge base)
curl -X POST http://localhost:8000/api/setup

# 5. Upload and ingest documents
curl -X POST http://localhost:8000/api/documents -F "file=@report.pdf"

# 6. Query the knowledge base
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key financial metrics?"}'

# 7. Chat with the RAG agent (SSE streaming)
curl -N -X POST http://localhost:8000/api/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize all documents"}'

# 8. Or launch the Chainlit web UI (uses IDP_UI_PORT, default 8001)
cd frontend && uv run chainlit run app.py --port ${IDP_UI_PORT:-8001}

Web UI

A standalone chat frontend built with Chainlit. The UI communicates with the REST API backend over HTTP — they run as separate processes.

# Terminal 1 — start the backend (uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py

# Terminal 2 — start the UI (uses IDP_UI_PORT, default 8001)
cd frontend && uv run chainlit run app.py --port ${IDP_UI_PORT:-8001}

The UI opens at http://localhost:8001 by default. Ports are configured via IDP_API_PORT (backend, default 8000) and IDP_UI_PORT (UI, default 8001) in .env. Set IDP_API_URL to override the full backend URL.

Features

Feature          How it works
Document upload  Drag & drop or use the 📎 attachment icon. Supports all file types listed below (PDF, DOCX, PPTX, XLSX, CSV, HTML, images, TXT, MD). Files are uploaded to the backend API and ingested automatically.
Chat             Ask natural-language questions — the backend RAG agent streams answers with source citations via SSE.
Streaming        Responses stream token-by-token via Server-Sent Events from the backend.
Error handling   Clear messages for backend connectivity issues, unsupported files, or ingestion failures.

Configuration

The web UI needs IDP_API_PORT (or IDP_API_URL) to connect to the backend. All Azure configuration lives in the backend's .env. Upload limits are set in .chainlit/config.toml (default: 5 files, 200 MB each).


REST API

A FastAPI backend that exposes the full IDP pipeline as REST endpoints. This is the primary interface for programmatic integration, custom frontends, or microservice architectures.

# Development (with hot reload, uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py

# Production
uv run uvicorn idp_azure.api.app:app --host 0.0.0.0 --port ${IDP_API_PORT:-8000}

OpenAPI docs are available at http://localhost:${IDP_API_PORT}/docs (default: http://localhost:8000/docs).

Endpoints

Method  Path                          Description
GET     /api/health                   Health & readiness check (per-service status)
POST    /api/setup                    Create search infrastructure (one-time)
POST    /api/documents                Upload & ingest a document (multipart file upload)
GET     /api/documents                List all indexed documents with chunk counts
DELETE  /api/documents/{source_file}  Delete all chunks for a document
POST    /api/query                    Query the search index (hybrid or knowledge base, based on SEARCH_MODE)
POST    /api/agent                    Chat with the RAG agent (SSE streaming)

Examples

# Health check
curl http://localhost:8000/api/health

# Upload and ingest a document
curl -X POST http://localhost:8000/api/documents \
  -F "file=@report.pdf"

# List indexed documents
curl http://localhost:8000/api/documents

# Query the knowledge base
curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the key financial metrics?"}'

# Chat with the RAG agent (SSE stream)
curl -N -X POST http://localhost:8000/api/agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Summarize all documents"}'

# Delete a document
curl -X DELETE http://localhost:8000/api/documents/report.pdf

Agent Streaming (SSE)

The /api/agent endpoint streams Server-Sent Events with four event types (a minimal client sketch follows the table):

Event    Payload                        Description
session  {"session_id": "…"}            Always first — identifies the conversation session
token    Raw text chunk                 A piece of the response as it is generated
done     {"status": "complete"}         The response is finished
error    {"error": "…", "detail": "…"}  An error occurred (SessionExpired when reusing a stale ID)

Agent Session Lifecycle

Multi-turn conversations are maintained through a server-side session model built on the Microsoft Agent Framework's AgentSession. The backend owns the session; clients only hold a session ID.

 Frontend (Chainlit UI)                              Backend (/api/agent)
 ─────────────────────                               ────────────────────

 1st message
 ┌────────────────────────┐   POST /api/agent
 │ { "question": "..." }  │ ──────────────────────►  No session_id →
 └────────────────────────┘                          agent.create_session()
                                                     Store in _AgentSessionStore
                             ◄─── SSE event: session  {"session_id":"abc-123"}
                             ◄─── SSE event: token    "Here is..."
                             ◄─── SSE event: done

 Store session_id="abc-123"
 in cl.user_session

 2nd message
 ┌────────────────────────────────────────────────┐
 │ { "question": "...", "session_id": "abc-123" } │
 └────────────────────────────────────────────────┘
                                          ──►  Lookup in _AgentSessionStore
                                               Found → reuse session (keeps
                                               full conversation history)
                             ◄─── SSE event: session  {"session_id":"abc-123"}
                             ◄─── SSE event: token    ...
                             ◄─── SSE event: done

 After TTL expires (default 1 hour)
 ┌────────────────────────────────────────────────┐
 │ { "question": "...", "session_id": "abc-123" } │
 └────────────────────────────────────────────────┘
                                          ──►  Lookup → expired/evicted
                             ◄─── SSE event: error
                                  {"error":"SessionExpired","detail":"..."}

 Clear stored session_id
 Next message creates new session

Key design decisions:

  • Server-owned sessions — The AgentSession (from Microsoft Agent Framework) holds the full conversation history (all prior turns, tool calls, and responses). The frontend never stores message history; it only stores the opaque session_id string.
  • In-memory store with lazy TTL eviction — _AgentSessionStore (app.py) is a dict[str, _SessionEntry] that evicts entries on every get()/put() call when time.monotonic() - last_accessed > TTL. Each successful lookup refreshes last_accessed, so active conversations never expire; see the sketch after this list.
  • Per-session locking — Each session has an asyncio.Lock() to serialise concurrent requests to the same session, preventing interleaved agent runs.
  • Graceful expiry handling — When a session is expired, the backend returns an SSE error event with "SessionExpired". The frontend clears its stored ID so the next message creates a fresh session.
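
A sketch of that store under the assumptions above (field names are illustrative; the lazy eviction, access-refresh, and per-session lock behaviours are as described):

import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class _SessionEntry:
    session: object                  # AgentSession from the Agent Framework
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    last_accessed: float = field(default_factory=time.monotonic)

class _AgentSessionStore:
    def __init__(self, ttl_seconds: float = 3600.0):
        self._ttl = ttl_seconds
        self._entries: dict[str, _SessionEntry] = {}

    def _evict_expired(self) -> None:
        now = time.monotonic()
        expired = [k for k, e in self._entries.items()
                   if now - e.last_accessed > self._ttl]
        for key in expired:
            del self._entries[key]

    def get(self, session_id: str) -> _SessionEntry | None:
        self._evict_expired()
        entry = self._entries.get(session_id)
        if entry:
            entry.last_accessed = time.monotonic()  # active sessions never expire
        return entry

    def put(self, session_id: str, session: object) -> _SessionEntry:
        self._evict_expired()
        entry = _SessionEntry(session=session)
        self._entries[session_id] = entry
        return entry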

Configuration:

Variable           Default  Description
AGENT_SESSION_TTL  3600     Seconds of inactivity before a session is evicted

Supported File Types

Type          Extensions                      Stage 2 Analyzer             Notes
Documents     .pdf, .docx, .pptx, .xlsx       Content Understanding        Full structure preservation
Spreadsheets  .csv                            Content Understanding        Row-based chunking
Web           .html, .htm                     Content Understanding        HTML → Markdown
Images        .png, .jpg, .jpeg, .tiff, .bmp  Content Understanding (OCR)  Vision fallback for complex images
Text          .txt, .md                       Direct read                  No API call needed

Environment Variables

# Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-key              # omit to use DefaultAzureCredential
AZURE_OPENAI_API_VERSION=2025-03-01-preview
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4.1            # primary chat model (GPT-4.1, per Stage 6)
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-large
AZURE_OPENAI_VISION_DEPLOYMENT=gpt-4o           # vision model (GPT-4o, per Stage 2c)
AZURE_OPENAI_QUERY_PLANNING_DEPLOYMENT=gpt-5-mini

# Azure Content Understanding
CONTENTUNDERSTANDING_ENDPOINT=https://your-cu.cognitiveservices.azure.com
CONTENTUNDERSTANDING_KEY=your-key           # optional

# Azure Document Intelligence
DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-di.cognitiveservices.azure.com
DOCUMENT_INTELLIGENCE_KEY=your-key          # optional

# Azure AI Search
AZURE_SEARCH_ENDPOINT=https://your-search.search.windows.net
AZURE_SEARCH_INDEX_NAME=idp-documents
AZURE_SEARCH_ADMIN_KEY=your-key             # optional

# Search mode: "hybrid" (default) or "knowledge_base"
#   hybrid         — keyword + vector + semantic reranking + custom LLM query rewrite
#   knowledge_base — LLM-driven agentic retrieval via knowledge base (requires GPT deployment)
SEARCH_MODE=hybrid

# Microsoft Foundry (optional)
AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com

# UI → Backend connection (only needed for Chainlit UI)
IDP_API_PORT=8000
IDP_UI_PORT=8001
# IDP_API_URL=http://localhost:8000  # overrides IDP_API_PORT if set

# Agent session TTL (seconds of inactivity before session is evicted)
AGENT_SESSION_TTL=3600

Logging

Application logs (idp_azure.*) and third-party / framework logs are separated so that turning on DEBUG doesn't flood the console with SDK transport noise.

Environment variables

Variable             Default      Description
IDP_LOG_LEVEL        INFO         Log level for application code (idp_azure.*)
IDP_LIB_LOG_LEVEL    WARNING      Log level for third-party libraries (root logger)
IDP_LIB_LOG_SILENCE  (see below)  Comma-separated logger names forced to WARNING even when IDP_LIB_LOG_LEVEL is lowered; set to "" to un-silence everything

Common recipes

# Normal development — only app DEBUG, libraries stay quiet
IDP_LOG_LEVEL=DEBUG

# Debug Azure SDK / OpenAI calls (transport noise auto-silenced)
IDP_LOG_LEVEL=DEBUG IDP_LIB_LOG_LEVEL=DEBUG

# Debug absolutely everything including httpx request/response headers
IDP_LOG_LEVEL=DEBUG IDP_LIB_LOG_LEVEL=DEBUG IDP_LIB_LOG_SILENCE=""

# Debug only OpenAI, silence Azure SDK
IDP_LIB_LOG_LEVEL=DEBUG IDP_LIB_LOG_SILENCE="httpx,httpcore,urllib3,asyncio,watchfiles,opentelemetry,msal,azure"

Default-silenced loggers

When IDP_LIB_LOG_LEVEL is lowered to DEBUG, these loggers are kept at WARNING by default because they produce extreme noise:

Logger            What it emits at DEBUG
httpx / httpcore  Every outgoing HTTP request and response, including headers
urllib3           Connection-pool lifecycle (open, close, reuse)
asyncio           Event-loop internals, selector polls, task scheduling
msal              Token-cache lookups, OAuth2 handshake steps
watchfiles        File-system change events (noisy in --reload mode)
opentelemetry     Span-export batching, internal SDK state

Useful-for-debugging loggers (not silenced)

These are not silenced by default — they produce actionable output when you set IDP_LIB_LOG_LEVEL=DEBUG:

Logger           What it emits at DEBUG
azure            Azure SDK pipeline — request policies, retry logic, auth flow (covers azure-search-documents, azure-ai-documentintelligence, azure-ai-contentunderstanding, azure-identity)
openai           OpenAI SDK — request/response payloads for chat completions and embeddings
agent_framework  Microsoft Agent Framework — workflow execution, tool dispatch, orchestration
chonkie          Chunking library internals
uvicorn          ASGI server startup, shutdown, lifespan events
fastapi          Router registration, middleware chain

Error Handling

All domain exceptions inherit from IDPError for consistent handling:

IDPError
├── IngestionError
│   ├── UnsupportedFileTypeError    # unknown file extension
│   ├── FileTooLargeError           # exceeds 200 MB
│   ├── EmptyFileError              # zero-byte file
│   └── ExtractionError             # Azure service returned no content
├── ChunkingError                   # Chonkie pipeline failure
├── IndexingError                   # search index upload failure
├── RetrievalError                  # knowledge base query failure
└── AgentError                      # RAG agent failure

All Azure API calls are wrapped with @retry_on_transient() which retries on HTTP 429 (rate limit), 503 (unavailable), and 504 (timeout) with exponential backoff (2s → 4s → 8s, max 3 retries).
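
A sketch of that retry behaviour (the status-code probe on the raised exception is an assumption; the retry count and delays are the ones stated above):

import functools
import time

TRANSIENT_STATUS = {429, 503, 504}

def retry_on_transient(max_retries: int = 3, base_delay: float = 2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    status = getattr(exc, "status_code", None)
                    if status not in TRANSIENT_STATUS or attempt == max_retries:
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s
        return wrapper
    return decorator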

