An end-to-end Intelligent Document Processing (IDP) pipeline built entirely on the Azure ecosystem. It ingests documents of any type, extracts structured content, chunks intelligently, indexes with vector search, and exposes an agentic RAG interface — all following Domain-Driven Design (DDD) principles for clear, maintainable code.
The pipeline runs through the stages shown below. Each stage is a distinct bounded context with a clear responsibility, input, output, and Azure technology.
┌──────────────────────────────────────────────────────────────┐
│ DOCUMENT SOURCES │
│ PDF │ DOCX │ PPTX │ XLSX │ CSV │ HTML │ Images │ TXT │ MD │
└───────────────────────────┬──────────────────────────────────┘
│
┌───────▼───────┐
│ STAGE 1 │
│ Routing & │
│ Validation │
└───────┬───────┘
│ validates size, type, existence
│ selects optimal analyzer
│
┌───────────────────┼───────────────────┐
│ │ │
┌────────▼────────┐ ┌───────▼────────┐ ┌────────▼────────┐
│ STAGE 2a │ │ STAGE 2b │ │ STAGE 2c │
│ Content │ │ Document │ │ GPT-4o │
│ Understanding│ │ Intelligence│ │ Vision │
│ (primary) │ │ (tables) │ │ (images) │
└────────┬────────┘ └───────┬────────┘ └────────┬────────┘
│ │ │
└───────────────────┼───────────────────┘
│ all paths produce Markdown
│ + structural metadata (CU)
│
┌───────▼───────┐
│ STAGE 2.5 │
│ Noise │
│ Filtering │
└───────┬───────┘
│ removes headers/footers/
│ page numbers via CU roles
│
┌───────▼───────┐
│ STAGE 2.6 │
│ Speaker │
│ Notes (PPTX)│
└───────┬───────┘
│ extracts notes via
│ python-pptx (local)
│
┌───────▼───────┐
│ STAGE 3 │
│ Chunking │
│ (page-first) │
└───────┬───────┘
│ splits by page markers,
│ merges mid-sentence breaks
│
┌───────▼───────┐
│ STAGE 4 │
│ Embedding │
│ & Indexing │
└───────┬───────┘
│ embeds + uploads to search
│
┌───────▼───────┐
│ STAGE 5 │
│ Agentic │
│ Retrieval │
└───────┬───────┘
│ LLM-driven query planning
│
┌───────▼───────┐
│ STAGE 6 │
│ RAG Agent │
│ (response) │
└───────────────┘
What: Receives a file, validates it, and decides which Azure analyzer should process it.
| Aspect | Detail |
|---|---|
| Code | ingestion/router.py |
| Technology | Pure Python (no Azure calls) |
| Input | Raw file path |
| Output | DocumentMetadata (file type, analyzer choice, size) |
| Domain Model | AnalyzerChoice enum, FileType enum |
How it works:
- Checks the file exists, is non-empty, and is under 200 MB
- Maps the file extension to an `AnalyzerChoice`:
  - `.pdf`, `.docx`, `.pptx`, `.xlsx`, `.csv`, `.html`, images → Content Understanding
  - `.txt`, `.md` → Direct Read (no API call needed)
- Returns a `DocumentMetadata` value object used by subsequent stages
file_path → validate_file() → DocumentMetadata { file_type, analyzer, size }
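A minimal sketch of the routing logic, using the `AnalyzerChoice` and `DocumentMetadata` names from the table above; the exact fields, extension map, and error types are assumptions (the real code raises the `IDPError` subclasses listed under Error Handling):

```python
# Sketch of Stage 1 routing; the real implementation lives in ingestion/router.py.
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

MAX_SIZE_BYTES = 200 * 1024 * 1024  # 200 MB upper bound from the validation rules

class AnalyzerChoice(Enum):
    CONTENT_UNDERSTANDING = "content_understanding"
    DIRECT_READ = "direct_read"

# Extension map per the routing rules above (image extensions abbreviated)
_EXTENSION_MAP: dict[str, AnalyzerChoice] = {
    ext: AnalyzerChoice.CONTENT_UNDERSTANDING
    for ext in (".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".png", ".jpg")
} | {".txt": AnalyzerChoice.DIRECT_READ, ".md": AnalyzerChoice.DIRECT_READ}

@dataclass(frozen=True)
class DocumentMetadata:
    file_type: str
    analyzer: AnalyzerChoice
    size: int

def validate_file(file_path: str) -> DocumentMetadata:
    """Validate existence, size, and type, then pick the analyzer."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(file_path)
    size = path.stat().st_size
    if size == 0:
        raise ValueError(f"empty file: {file_path}")          # EmptyFileError in the real code
    if size > MAX_SIZE_BYTES:
        raise ValueError(f"file exceeds 200 MB: {file_path}")  # FileTooLargeError
    analyzer = _EXTENSION_MAP.get(path.suffix.lower())
    if analyzer is None:
        raise ValueError(f"unsupported type: {path.suffix}")   # UnsupportedFileTypeError
    return DocumentMetadata(path.suffix.lower(), analyzer, size)
```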
What: Extracts structured text from the document as Markdown. Three analyzers available depending on the document type.
| Aspect | Detail |
|---|---|
| Code | ingestion/content_understanding.py |
| Technology | Azure Content Understanding SDK (azure-ai-contentunderstanding) |
| Input | Binary file bytes |
| Output | Markdown string with tables, headings, page structure |
| Used For | PDF, DOCX, PPTX, XLSX, CSV, HTML, images (built-in OCR) |
| Resilience | @retry_on_transient() — retries on 429/503/504 with exponential backoff |
file_bytes → ContentUnderstandingClient.begin_analyze_binary() → AnalyzeResult → .markdown
| Aspect | Detail |
|---|---|
| Code | ingestion/document_intelligence.py |
| Technology | Azure Document Intelligence SDK (azure-ai-documentintelligence) |
| Input | Binary file bytes |
| Output | Markdown with fine-grained table structure (row/column counts) |
| Used For | Documents needing superior layout analysis or form-field extraction |
file_bytes → DocumentIntelligenceClient.begin_analyze_document(model="prebuilt-layout") → .content
| Aspect | Detail |
|---|---|
| Code | ingestion/vision.py |
| Technology | Azure OpenAI (openai SDK) — GPT-4o with vision capability |
| Input | Image bytes (PNG/JPEG/TIFF/BMP) or PDF pages rendered as 200-DPI PNGs |
| Output | Detailed Markdown description (charts, diagrams, data points, tables) |
| Used For | Complex visual content that text extractors miss: charts, diagrams, handwriting |
image_bytes → base64 encode → GPT-4o chat.completions.create(vision) → Markdown description
For PDFs with complex images, a hybrid approach is available. It uses CU's native figure metadata for content-aware triage instead of crude size-based filtering:
file -> Content Understanding (text + tables + figure metadata)
-> CU identifies figures with kind: CHART / MERMAID / UNKNOWN
-> CHART/MERMAID: CU provides structured content directly (no Vision call)
-> UNKNOWN (no description): crop figure region via PyMuPDF -> GPT-4o Vision
-> combined Markdown output
| Aspect | Detail |
|---|---|
| Code | ingestion/figure_triage.py, application/ingestion_service.py |
| Technology | Content Understanding (figure metadata) + GPT-4o Vision (fallback) + PyMuPDF (region cropping) |
| Output | ExtractedDocument with markdown + figure descriptions |
Figure triage logic (sketched below):
- `CHART` figures: CU provides Chart.js structured content and a description -- used directly
- `MERMAID` figures: CU provides Mermaid.js syntax and a description -- used directly
- `UNKNOWN` figures with a CU description (>= 20 chars): used directly
- `UNKNOWN` figures without a description: figure region cropped from the PDF page using CU's bounding polygon, sent to GPT-4o Vision
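A sketch of that decision plus the PyMuPDF crop; the `FigureKind` enum and the simplified bounding box are assumptions, while the 20-character threshold and 200-DPI render match the text above:

```python
# Sketch of the triage decision and region crop (ingestion/figure_triage.py).
from enum import Enum

import fitz  # PyMuPDF

class FigureKind(Enum):
    CHART = "chart"
    MERMAID = "mermaid"
    UNKNOWN = "unknown"

MIN_DESCRIPTION_CHARS = 20  # threshold quoted in the triage rules

def needs_vision_fallback(kind: FigureKind, description: str | None) -> bool:
    """True only when CU gives us nothing usable for this figure."""
    if kind in (FigureKind.CHART, FigureKind.MERMAID):
        return False  # CU already returns structured content for these
    return not (description and len(description) >= MIN_DESCRIPTION_CHARS)

def crop_figure(pdf_path: str, page_index: int, bbox: tuple[float, float, float, float]) -> bytes:
    """Render the figure region (from CU's bounding polygon) as PNG bytes for Vision."""
    with fitz.open(pdf_path) as doc:
        pix = doc[page_index].get_pixmap(clip=fitz.Rect(*bbox), dpi=200)
        return pix.tobytes("png")
```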
What: Removes headers, footers, and page numbers from extracted markdown before chunking to prevent noise from contaminating retrieval chunks.
| Aspect | Detail |
|---|---|
| Code | chunking/noise.py |
| Technology | Azure Content Understanding paragraph roles (native metadata) |
| Input | Markdown string + DocumentContent (paragraph metadata) |
| Output | Cleaned markdown with noise elements removed |
How it works:
- Primary path (CU metadata): CU classifies every paragraph with a semantic role. Paragraphs with role `PAGE_HEADER`, `PAGE_FOOTER`, or `PAGE_NUMBER` carry span offsets into the markdown. The noise filter collects these spans, merges overlapping ones, and rebuilds the markdown from the non-noise ranges.
- Fallback path (regex): When CU paragraph metadata is unavailable (e.g. the direct-read or DI extraction path), a conservative regex removes only standalone page-number lines (`Page X of Y`, `- X -`, bare numbers on their own line).
Design principle: conservative by default -- ambiguous content is kept, not removed.
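A sketch of the span-merge-and-rebuild step, assuming noise spans arrive as `(offset, length)` pairs taken from CU paragraph metadata:

```python
# Sketch of the span-based noise removal in chunking/noise.py.
def remove_noise_spans(markdown: str, spans: list[tuple[int, int]]) -> str:
    """Drop the given (offset, length) spans, merging overlapping ones first."""
    # Normalise to sorted (start, end) ranges
    ranges = sorted((start, start + length) for start, length in spans)
    merged: list[list[int]] = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlapping range: extend it
        else:
            merged.append([start, end])
    # Rebuild the markdown from the gaps between noise spans
    parts, cursor = [], 0
    for start, end in merged:
        parts.append(markdown[cursor:start])
        cursor = end
    parts.append(markdown[cursor:])
    return "".join(parts)
```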
What: Extracts speaker notes from PowerPoint presentations. CU and DI do not extract notes — they only process visible slide content. Notes are extracted locally using `python-pptx` with no API calls.
| Aspect | Detail |
|---|---|
| Code | ingestion/speaker_notes.py |
| Technology | python-pptx (reads PPTX XML structure locally) |
| Input | .pptx file path |
| Output | List of ### Notes from Page N formatted strings |
| Used For | .pptx files only (notes are lost in PDF-from-PPT conversion) |
Why this matters for RAG: Slides typically have terse bullet points while speaker notes contain the full explanation, context, and reasoning. For retrieval, notes often produce better answers than slide text alone.
How it integrates: Notes are formatted with ### Notes from Page N headings and appended to the ExtractedDocument.image_descriptions list. The existing content assembly module interleaves them at the correct page position, so each slide's chunk contains both the visible slide content and the presenter's notes.
.pptx file → python-pptx reads notesSlide XML → "### Notes from Page N"
→ interleaved into full_content at page position → chunked with slide content
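A sketch of the extraction using the actual python-pptx API; the heading format matches the integration note above, the rest is illustrative:

```python
# Sketch of ingestion/speaker_notes.py using python-pptx (local, no API calls).
from pptx import Presentation

def extract_speaker_notes(pptx_path: str) -> list[str]:
    notes: list[str] = []
    prs = Presentation(pptx_path)
    for page_number, slide in enumerate(prs.slides, start=1):
        if not slide.has_notes_slide:
            continue
        frame = slide.notes_slide.notes_text_frame
        text = frame.text.strip() if frame is not None else ""
        if text:
            notes.append(f"### Notes from Page {page_number}\n\n{text}")
    return notes
```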
What: Splits extracted Markdown into retrieval-optimized chunks using page markers as the primary split, with mid-sentence merge for report-style page breaks and overflow splitting for oversized pages.
| Aspect | Detail |
|---|---|
| Code | chunking/pipeline.py |
| Technology | Page-marker splitting + Chonkie RecursiveChunker (overflow only) |
| Input | Markdown string (with <!-- PageBreak --> / <!-- PageNumber="N" --> markers from CU) |
| Output | List of Chunk domain objects (id, text, token_count, page_number) |
Pipeline stages (in order):
Markdown (with CU page markers)
│
▼
Split by page markers ← <!-- PageBreak --> / <!-- PageNumber="N" -->
│ (NOT by ## headings — CU renders PPT text
│ boxes as ## headings which would over-split)
▼
Mid-sentence merge ← if page N ends without .!?: AND page N+1
│ starts lowercase → merge (report page breaks)
▼
Overflow split (RecursiveChunker) ← only for blocks exceeding chunk_size (1024)
│ uses plain-text recipe, not markdown
▼
List[Chunk] ← each chunk carries page_number for citations
Why page-first, not heading-based:
Azure Content Understanding renders each text box from PPT slides as a separate ## heading in the markdown (e.g., ## Customer:, ## Challenge:). A heading-based chunker like RecursiveChunker(recipe="markdown") splits on every ##, producing hundreds of micro-chunks (6–50 tokens each) from a 20-slide deck. The page-first approach treats each page/slide as a single chunk, keeping all content together regardless of ## text-box headings.
For report-style PDFs where paragraphs flow across page boundaries, the mid-sentence merge step detects when a page break cuts a paragraph (previous page ends without sentence-terminal punctuation + next page starts with a lowercase letter) and merges those pages back together. This is conservative — only mid-sentence breaks trigger merging, not topical relatedness.
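A sketch of that merge heuristic; the sentence-terminal set follows the diagram above, everything else is illustrative:

```python
# Sketch of the mid-sentence merge step in chunking/pipeline.py.
SENTENCE_TERMINALS = (".", "!", "?", ":")  # per the .!?: check in the diagram

def should_merge(prev_page: str, next_page: str) -> bool:
    """Merge only when a page break visibly cuts a sentence in half."""
    prev, nxt = prev_page.rstrip(), next_page.lstrip()
    if not prev or not nxt:
        return False
    return not prev.endswith(SENTENCE_TERMINALS) and nxt[0].islower()

def merge_mid_sentence_breaks(pages: list[str]) -> list[str]:
    merged = [pages[0]] if pages else []
    for page in pages[1:]:
        if should_merge(merged[-1], page):
            merged[-1] = merged[-1] + " " + page  # report-style break: merge
        else:
            merged.append(page)  # slide-like page: keep separate
    return merged
```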
Chunk sizing:
| Scenario | Handling |
|---|---|
| Page fits within `chunk_size` (1024 tokens) | One chunk per page |
| Page exceeds `chunk_size` | Split with RecursiveChunker (plain-text recipe) |
| Mid-sentence page break (reports) | Adjacent pages merged, then size-checked |
| Slide-like pages (self-contained) | No merging — each page stays separate |
What: Converts each chunk to a 3072-dimensional vector and uploads to Azure AI Search.
| Aspect | Detail |
|---|---|
| Code | search/embeddings.py, search/index.py, search/indexing.py |
| Technology | Azure OpenAI Embeddings (text-embedding-3-large) + Azure AI Search (azure-search-documents) |
| Input | List of Chunk objects |
| Output | Documents indexed in Azure AI Search |
Sub-steps:
┌─────────────────────────────────┐
Chunks ──► Batch embed (16/call) ──►│ Azure AI Search Index │
Azure OpenAI │ ┌───────────────────────────┐ │
text-embedding-3-large │ │ Fields: │ │
(3072 dimensions) │ │ • id (key) │ │
│ │ • content (searchable) │ │
│ │ • original_content │ │
│ │ • content_vector (3072d) │ │
│ │ • source_file (filter) │ │
│ │ • file_type (filter) │ │
│ │ • chunk_index (sortable) │ │
│ │ • page_number (filter) │ │
│ └───────────────────────────┘ │
│ Vector: HNSW algorithm │
│ Vectorizer: integrated AOAI │
│ Semantic: content field ranked │
└─────────────────────────────────┘
Key details:
- Integrated vectorizer — the index is configured so Azure AI Search can call Azure OpenAI for query-time vectorization automatically (required for agentic retrieval)
- Batch embedding — texts are embedded 16 at a time with `@retry_on_transient()` for rate-limit resilience
- Buffered upload — `SearchIndexingBufferedSender` handles reliable batch uploads with auto-retry
- Deterministic IDs — chunk IDs are SHA-256 hashes of `source_file:chunk_index`, which supports re-indexing (see the sketch below)
- Contextual enrichment (opt-in) — when `CONTEXTUAL_ENRICHMENT_ENABLED=true`, an LLM generates a short document-level context prefix for each chunk before embedding. The enriched text is stored in `content` (for search), while the raw chunk text is preserved in `original_content` (for display). This follows Anthropic's Contextual Retrieval approach and, combined with the existing hybrid search and semantic reranking, can reduce retrieval failures by up to 67%. See Contextual Enrichment below for details.
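A sketch of the deterministic IDs and batched embedding, using the real `openai` embeddings call; the helper shapes are assumptions:

```python
# Sketch of search/embeddings.py and search/indexing.py internals; in the
# real code the embedding call is wrapped with @retry_on_transient().
import hashlib

from openai import AzureOpenAI

BATCH_SIZE = 16  # texts per embedding request, per the notes above

def chunk_id(source_file: str, chunk_index: int) -> str:
    """Deterministic ID so re-ingesting a file overwrites its old chunks."""
    return hashlib.sha256(f"{source_file}:{chunk_index}".encode()).hexdigest()

def embed_all(client: AzureOpenAI, deployment: str, texts: list[str]) -> list[list[float]]:
    """Embed texts in batches of 16 via the Azure OpenAI embeddings API."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), BATCH_SIZE):
        response = client.embeddings.create(model=deployment, input=texts[i : i + BATCH_SIZE])
        vectors.extend(item.embedding for item in response.data)
    return vectors
```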
What: Retrieves relevant document chunks using hybrid search (default) or LLM-powered agentic retrieval via a knowledge base. Configurable via the `SEARCH_MODE` environment variable.
| Aspect | Detail |
|---|---|
| Code | search/retrieval.py, search/query_rewrite.py, search/knowledge.py |
| Technology | Azure AI Search — hybrid search with LLM query rewriting + semantic reranking (default), or Agentic Retrieval via KnowledgeBaseRetrievalClient |
| Input | Natural-language query (hybrid) or conversational messages (knowledge base) |
| Output | RetrievalResult (content, references with source/page, optional activity) |
Two retrieval modes (set via SEARCH_MODE env var):
Hybrid mode (the default) combines three retrieval signals in a single request:
User question: "What about its revenue?" (multi-turn follow-up)
│
▼ Custom LLM Query Rewrite (GPT-5-mini)
│ • Resolves coreferences: "its" → "Division B"
│ • Expands short/ambiguous queries with keyword synonyms
│ • Only fires when conversation context needs resolution
│ • Skipped for clear single-turn queries (zero added latency)
│
├──► BM25 text query (expanded: rewritten + keyword synonyms)
├──► Vector query (clean standalone rewrite only — no expansion noise)
│
▼ Semantic Reranking (cross-encoder, uses clean rewritten query)
│ (rescores fused results for higher relevance)
│
▼ Top-K results with source citations
- Query rewriting — a custom LLM-based pre-search rewrite (`search/query_rewrite.py`) handles conversational coreference resolution and conditional keyword expansion. It does not rely on Azure AI Search's built-in generative query rewrite — the custom rewriter gives full control over rewrite behavior, structured output, and prompt caching.
- Three-channel query split — BM25 gets the expanded text for broad recall, while vector search and the semantic reranker get the clean standalone query for precision (via the `semantic_query` parameter); see the sketch below.
- Semantic reranking — cross-encoder reranker via the `default` semantic configuration
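A sketch of the three-channel request with the `azure-search-documents` SDK; field names follow the Stage 4 index schema, and `k_nearest_neighbors` / `top` values are illustrative:

```python
# Sketch of the hybrid query in search/retrieval.py.
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

def hybrid_search(client: SearchClient, expanded_text: str, clean_query: str, top: int = 5):
    results = client.search(
        search_text=expanded_text,  # BM25 channel: rewritten query + keyword synonyms
        vector_queries=[
            # Vector channel: the integrated vectorizer embeds the clean rewrite
            VectorizableTextQuery(text=clean_query, k_nearest_neighbors=50, fields="content_vector")
        ],
        query_type="semantic",                   # enable cross-encoder reranking
        semantic_configuration_name="default",
        semantic_query=clean_query,              # reranker scores against the clean query
        select=["content", "source_file", "page_number"],
        top=top,
    )
    return list(results)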
Knowledge-base mode provides LLM-powered agentic retrieval with query planning and optional answer synthesis:
User question: "Compare revenue trends in Q3 vs Q4 and list risk factors"
│
▼ LLM Query Planning (GPT-5-mini)
│
├──► Subquery 1: "Q3 revenue trends" ──► hybrid search ──► results
├──► Subquery 2: "Q4 revenue trends" ──► hybrid search ──► results
└──► Subquery 3: "risk factors" ──► hybrid search ──► results
│
merge + semantic rerank
│
▼
Unified response with
source citations
Architecture (3 layers):
┌────────────────────────────────────────────────────────────────┐
│ Knowledge Base │
│ • LLM: GPT-5-mini (query planning — AI Search compatible) │
│ • Decomposes complex questions into focused subqueries │
│ • Runs subqueries in parallel │
│ • Optionally synthesises a natural-language answer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Knowledge Source │ │
│ │ • Wraps the search index │ │
│ │ • Citation fields: id, source_file, page_number │ │
│ │ ┌────────────────────────────────────────────────────┐ │ │
│ │ │ Search Index │ │ │
│ │ │ • Hybrid search: BM25 text + HNSW vector │ │ │
│ │ │ • Semantic reranking on content field │ │ │
│ │ │ • Integrated vectorizer (auto text→vector) │ │ │
│ │ └────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
Multi-turn support: In knowledge base mode, pass the full conversation history — the knowledge base uses prior messages for better query planning on follow-up questions. In hybrid mode, only the last user message is used as the search query.
What: An intelligent conversational agent that uses agentic retrieval to answer document questions with citations, stream responses, and produce structured outputs.
| Aspect | Detail |
|---|---|
| Code | agent/tools.py, agent/rag_agent.py, agent/workflows.py |
| Technology | Microsoft Agent Framework 1.0.0 (agent-framework, agent-framework-openai) |
| LLM | Azure OpenAI GPT-4.1 (primary chat model) |
| Input | Natural language question |
| Output | Answer with inline source citations |
Agent architecture:
┌─────────────────────────────────────────────────────────────┐
│ DocumentAssistant (RAG Agent) │
│ LLM: Azure OpenAI GPT-4.1 │
│ │
│ Tools: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ search_documents(query) │ │
│ │ → Calls Stage 5 (Agentic Retrieval) │ │
│ │ → Returns content + source citations │ │
│ ├─────────────────────────────────────────────────────┤ │
│ │ list_indexed_documents() │ │
│ │ → Queries search index facets │ │
│ │ → Returns list of files with chunk counts │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ Capabilities: │
│ • Streaming responses (real-time token output) │
│ • Structured output (Pydantic DocumentSummary) │
│ • Multi-turn conversation with context │
│ • Multi-agent workflows (Retriever → Analyzer → Writer) │
└─────────────────────────────────────────────────────────────┘
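A sketch of the two tools as plain async functions; the Agent Framework registration is omitted, and the service method names (`query()`, `list_documents()`) and result fields are assumptions borrowed from the application-services layer described later:

```python
# Sketch of agent/tools.py; tools delegate to application services so the
# agent layer stays thin. Method names on the services are assumptions.
def make_tools(query_service, document_service):
    async def search_documents(query: str) -> str:
        """Search indexed documents and return content with source citations."""
        result = await query_service.query(query)  # delegates to Stage 5 retrieval
        sources = ", ".join(f"{r.source_file} p.{r.page_number}" for r in result.references)
        return f"{result.content}\n\nSources: {sources}"

    async def list_indexed_documents() -> str:
        """List indexed files with chunk counts (via search index facets)."""
        docs = await document_service.list_documents()
        return "\n".join(f"- {d.source_file}: {d.chunk_count} chunks" for d in docs)

    return [search_documents, list_indexed_documents]
```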
Multi-agent workflow (optional, via workflows.py):
User query
│
▼
Retriever Agent ──► searches & retrieves relevant chunks
│
▼
Analyzer Agent ──► identifies themes, contradictions, insights
│
▼
Writer Agent ──► formats into a structured, cited response
Anthropic's Contextual Retrieval approach: prepend LLM-generated context to each chunk before embedding. Combined with hybrid search (BM25 + vector + semantic rerank), this reduces retrieval failures by up to 67%.
| Aspect | Detail |
|---|---|
| Code | chunking/enrichment.py |
| Toggle | CONTEXTUAL_ENRICHMENT_ENABLED=true (disabled by default) |
| Cost | 1 summary call + 1 call per chunk per ingested document |
How it works (sketched below):
- Document summary — a single LLM call generates a 3-5 sentence summary of the entire document.
- Per-chunk context — for each chunk, an LLM call receives `{summary} + {chunk_text}` and produces 2-3 sentences situating the chunk within the document.
- Dual storage — enriched text (`context + chunk`) is stored in `content` for search; raw chunk text is preserved in `original_content` for display.
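A sketch of the hierarchical flow; `llm` stands in for an async chat-completion helper, and the prompts are paraphrased:

```python
# Sketch of chunking/enrichment.py; the semaphore mirrors
# CONTEXTUAL_ENRICHMENT_MAX_CONCURRENT described below.
import asyncio
from collections.abc import Awaitable, Callable

LLM = Callable[[str], Awaitable[str]]

async def enrich_chunks(llm: LLM, document_text: str, chunks: list[str],
                        max_concurrent: int = 5) -> list[str]:
    # One summary call for the whole document ...
    summary = await llm(f"Summarize this document in 3-5 sentences:\n{document_text}")
    semaphore = asyncio.Semaphore(max_concurrent)

    async def enrich(chunk: str) -> str:
        async with semaphore:
            try:
                context = await llm(
                    "Given this document summary, write 2-3 sentences situating "
                    f"the chunk within the document.\nSummary: {summary}\n"
                    f"<document_chunk>{chunk}</document_chunk>"  # content-filter guard
                )
                return f"{context}\n\n{chunk}"  # enriched text goes into `content`
            except Exception:
                return chunk  # fail-safe: fall back to the raw chunk

    # ... then one bounded call per chunk
    return list(await asyncio.gather(*(enrich(c) for c in chunks)))
```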
Configuration:
| Env Var | Default | Description |
|---|---|---|
| `CONTEXTUAL_ENRICHMENT_ENABLED` | `false` | Enable contextual enrichment during ingestion |
| `CONTEXTUAL_ENRICHMENT_DEPLOYMENT` | primary chat deployment | LLM deployment for enrichment |
| `CONTEXTUAL_ENRICHMENT_MAX_CONCURRENT` | `5` | Max concurrent LLM calls for chunk enrichment |
Notes:
- Enrichment is fail-safe — on failure it falls back to raw chunks rather than aborting ingestion.
- Chunk text is wrapped in `<document_chunk>` XML tags to prevent Azure OpenAI's content filter from misidentifying imperative business language (e.g., PPT slide text like "You must ensure...") as jailbreak attempts. If the content filter still triggers, enrichment is silently skipped for that chunk.
- Existing documents must be re-ingested to gain enrichment.
- The hierarchical approach (summary → per-chunk context) keeps cost manageable versus sending the full document with every chunk.
| Stage | Technology | Package | Purpose |
|---|---|---|---|
| 1. Routing | Pure Python | -- | File validation & analyzer selection |
| 2a. Extraction | Azure Content Understanding | `azure-ai-contentunderstanding` | Primary document -> Markdown + metadata |
| 2b. Extraction | Azure Document Intelligence | `azure-ai-documentintelligence` | Layout & table extraction |
| 2c. Extraction | Azure OpenAI GPT-4o Vision | `openai` | Chart/diagram description |
| 2c. PDF rendering | PyMuPDF | `pymupdf` | PDF page -> PNG for Vision |
| 2d. Figure triage | Azure Content Understanding | `azure-ai-contentunderstanding` | Figure classification (CHART/MERMAID/UNKNOWN) |
| 2.5. Noise filtering | Azure Content Understanding | `azure-ai-contentunderstanding` | Paragraph role-based noise removal |
| 2.6. Speaker notes | python-pptx | `python-pptx` | PPTX speaker notes extraction (local, no API) |
| 3. Chunking | Page-marker splitting + Chonkie | `chonkie` | Page-first chunking with overflow splitting |
| 4. Embedding | Azure OpenAI | `openai` | text-embedding-3-large (3072d) |
| 4. Indexing | Azure AI Search | `azure-search-documents` | Vector + BM25 hybrid index |
| 5. Retrieval | Azure AI Search Agentic Retrieval | `azure-search-documents` (preview) | LLM-driven query planning |
| 6. Agent | Microsoft Agent Framework 1.0.0 | `agent-framework-openai` | RAG agent with tool calling |
| 6. Model mgmt | Microsoft Foundry | `agent-framework-foundry` | Centralised model deployment |
| Cross-cutting | Pydantic | `pydantic` | Settings validation & domain models |
| Cross-cutting | Azure Identity | `azure-identity` | DefaultAzureCredential auth |
| REST API | FastAPI | `fastapi[standard]` | REST endpoints + SSE streaming |
The codebase follows Domain-Driven Design with clear layering:
┌─────────────────────────────────────────────────────────────────┐
│ INTERFACE ADAPTERS │
│ api/ (FastAPI) │ ui/ (Chainlit) │ agent/ (Agent Framework)│
│ REST endpoints Chat + upload RAG agent tools │
└────────────────────────────┬────────────────────────────────────┘
│ delegates to
┌────────────────────────────▼────────────────────────────────────┐
│ APPLICATION SERVICES │
│ IngestionService │ DocumentService │ QueryService │ SetupSvc │
│ ingest() list/delete() query() provision()│
│ analyzer() │
└────────────────────────────┬────────────────────────────────────┘
│ coordinates
┌────────────────────────────▼────────────────────────────────────┐
│ BOUNDED CONTEXTS (Infrastructure) │
│ ingestion/ │ chunking/ │ search/ │
│ CU, DI, Vision Chonkie Embeddings, Indexing, │
│ Router, Triage Noise, Strategies Knowledge, Retrieval │
└────────────────────────────┬────────────────────────────────────┘
│ uses
┌────────────────────────────▼────────────────────────────────────┐
│ DOMAIN LAYER │
│ domain/models.py — Chunk, FileType, ExtractedDocument, etc. │
│ domain/exceptions.py — IDPError hierarchy │
│ shared/resilience.py — @retry_on_transient() │
└─────────────────────────────────────────────────────────────────┘
Dependencies flow inward: adapters → application → bounded contexts → domain.
idp-azure/
├── pyproject.toml # Dependencies & build config
├── .env.example # Required environment variables
├── README.md
└── src/idp_azure/
├── config.py # 🔧 Centralised Pydantic settings
│
├── domain/ # 🏛 DOMAIN LAYER (no infrastructure deps)
│ ├── models.py # AnalyzerChoice, FileType, Chunk,
│ │ # ExtractedDocument, RetrievalResult
│ └── exceptions.py # IDPError → IngestionError,
│ # ChunkingError, IndexingError, etc.
│
├── shared/ # 🔧 SHARED KERNEL
│ └── resilience.py # @retry_on_transient() decorator
│
├── application/ # 📋 APPLICATION SERVICES (use cases)
│ ├── ingestion_service.py # Ingest: extract → chunk → index
│ ├── document_service.py # List & delete indexed documents
│ ├── query_service.py # Query knowledge base (agentic retrieval)
│ └── setup_service.py # One-time infrastructure provisioning
│
├── ingestion/ # 📥 BOUNDED CONTEXT: Ingestion
│ ├── router.py # Stage 1 — routing & validation
│ ├── content_understanding.py # Stage 2a — Azure CU (markdown + metadata)
│ ├── document_intelligence.py # Stage 2b — Azure DI
│ ├── vision.py # Stage 2c — GPT-4o Vision + hybrid
│ └── figure_triage.py # Stage 2d — CU figure classification
│
├── chunking/ # ✂️ BOUNDED CONTEXT: Chunking
│ ├── noise.py # Stage 2.5 — noise filtering (CU roles)
│ ├── pipeline.py # Stage 3 — Chonkie pipeline
│ └── strategies.py # Per-format chunk configs
│
├── search/ # 🔍 BOUNDED CONTEXT: Search
│ ├── embeddings.py # Stage 4 — Azure OpenAI embeddings
│ ├── index.py # Stage 4 — search index creation
│ ├── indexing.py # Stage 4 — chunk upload + SearchDocument
│ ├── knowledge.py # Stage 5 — knowledge source & base
│ └── retrieval.py # Stage 5 — agentic retrieval client
│
├── agent/ # 🤖 INTERFACE ADAPTER: RAG Agent
│ ├── tools.py # Agent tools (delegate to app services)
│ ├── rag_agent.py # Agent setup & streaming
│ └── workflows.py # Multi-agent workflows
│
├── api/ # 🌐 INTERFACE ADAPTER: REST API (FastAPI)
│ ├── app.py # App factory, lifespan, exception handlers
│ ├── dependencies.py # DI for application services
│ ├── models.py # Request/response Pydantic schemas
│ └── routers/
│ ├── documents.py # Upload, delete, list, setup
│ └── query.py # Query + RAG agent streaming (SSE)
│
└── ui/ # 🖥 INTERFACE ADAPTER: Web UI (Chainlit)
└── app.py # Chat interface + file upload
# 1. Install
cd idp-azure
uv sync
# 2. Configure
cp .env.example .env
# Edit .env with your Azure resource endpoints and keys
# 3. Start the REST API server (uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py
# or for production:
uv run uvicorn idp_azure.api.app:app --host 0.0.0.0 --port ${IDP_API_PORT:-8000}
# 4. One-time infrastructure setup (creates index + knowledge base)
curl -X POST http://localhost:8000/api/setup
# 5. Upload and ingest documents
curl -X POST http://localhost:8000/api/documents -F "file=@report.pdf"
# 6. Query the knowledge base
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{"question": "What are the key financial metrics?"}'
# 7. Chat with the RAG agent (SSE streaming)
curl -N -X POST http://localhost:8000/api/agent \
-H "Content-Type: application/json" \
-d '{"question": "Summarize all documents"}'
# 8. Or launch the Chainlit web UI (uses IDP_UI_PORT, default 8001)
cd frontend && uv run chainlit run app.py --port ${IDP_UI_PORT:-8001}

A standalone chat frontend built with Chainlit. The UI communicates with the REST API backend over HTTP — they run as separate processes.
# Terminal 1 — start the backend (uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py
# Terminal 2 — start the UI (uses IDP_UI_PORT, default 8001)
cd frontend && uv run chainlit run app.py --port ${IDP_UI_PORT:-8001}

The UI opens at http://localhost:8001 by default. Ports are configured via `IDP_API_PORT` (backend, default 8000) and `IDP_UI_PORT` (UI, default 8001) in `.env`. Set `IDP_API_URL` to override the full backend URL.
| Feature | How it works |
|---|---|
| Document upload | Drag & drop or use the 📎 attachment icon. Supports all file types listed below (PDF, DOCX, PPTX, XLSX, CSV, HTML, images, TXT, MD). Files are uploaded to the backend API and ingested automatically. |
| Chat | Ask natural language questions — the backend RAG agent streams answers with source citations via SSE. |
| Streaming | Responses stream token-by-token via Server-Sent Events from the backend. |
| Error handling | Clear messages for backend connectivity issues, unsupported files, or ingestion failures. |
The web UI needs IDP_API_PORT (or IDP_API_URL) to connect to the backend. All Azure configuration lives in the backend's .env. Upload limits are set in .chainlit/config.toml (default: 5 files, 200 MB each).
A FastAPI backend that exposes the full IDP pipeline as REST endpoints. This is the primary interface for programmatic integration, custom frontends, or microservice architectures.
# Development (with hot reload, uses IDP_API_PORT, default 8000)
uv run python src/idp_azure/api/app.py
# Production
uv run uvicorn idp_azure.api.app:app --host 0.0.0.0 --port ${IDP_API_PORT:-8000}

OpenAPI docs are available at http://localhost:${IDP_API_PORT}/docs (default: http://localhost:8000/docs).
| Method | Path | Description |
|---|---|---|
| `GET` | `/api/health` | Health & readiness check (per-service status) |
| `POST` | `/api/setup` | Create search infrastructure (one-time) |
| `POST` | `/api/documents` | Upload & ingest a document (multipart file upload) |
| `GET` | `/api/documents` | List all indexed documents with chunk counts |
| `DELETE` | `/api/documents/{source_file}` | Delete all chunks for a document |
| `POST` | `/api/query` | Query the search index (hybrid or knowledge base, based on `SEARCH_MODE`) |
| `POST` | `/api/agent` | Chat with the RAG agent (SSE streaming) |
# Health check
curl http://localhost:8000/api/health
# Upload and ingest a document
curl -X POST http://localhost:8000/api/documents \
-F "file=@report.pdf"
# List indexed documents
curl http://localhost:8000/api/documents
# Query the knowledge base
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{"question": "What are the key financial metrics?"}'
# Chat with the RAG agent (SSE stream)
curl -N -X POST http://localhost:8000/api/agent \
-H "Content-Type: application/json" \
-d '{"question": "Summarize all documents"}'
# Delete a document
curl -X DELETE http://localhost:8000/api/documents/report.pdf

The `/api/agent` endpoint streams Server-Sent Events with four event types:
| Event | Payload | Description |
|---|---|---|
| `session` | `{"session_id": "…"}` | Always first — identifies the conversation session |
| `token` | Raw text chunk | A piece of the response as it's generated |
| `done` | `{"status": "complete"}` | The response is finished |
| `error` | `{"error": "…", "detail": "…"}` | An error occurred (`SessionExpired` when reusing a stale ID) |
Multi-turn conversations are maintained through a server-side session model
built on the Microsoft Agent Framework's AgentSession. The backend
owns the session; clients only hold a session ID.
Frontend (Chainlit UI) Backend (/api/agent)
───────────────────── ────────────────────
1st message
┌────────────────────────┐ POST /api/agent
│ { "question": "..." } │ ──────────────────────► No session_id →
└────────────────────────┘ agent.create_session()
Store in _AgentSessionStore
◄─── SSE event: session {"session_id":"abc-123"}
◄─── SSE event: token "Here is..."
◄─── SSE event: done
Store session_id="abc-123"
in cl.user_session
2nd message
┌───────────────────────────────────────────┐
│ { "question": "...", "session_id": "abc-123" } │
└───────────────────────────────────────────┘
──► Lookup in _AgentSessionStore
Found → reuse session (keeps
full conversation history)
◄─── SSE event: session {"session_id":"abc-123"}
◄─── SSE event: token ...
◄─── SSE event: done
After TTL expires (default 1 hour)
┌───────────────────────────────────────────┐
│ { "question": "...", "session_id": "abc-123" } │
└───────────────────────────────────────────┘
──► Lookup → expired/evicted
◄─── SSE event: error
{"error":"SessionExpired","detail":"..."}
Clear stored session_id
Next message creates new session
Key design decisions:
- Server-owned sessions — the `AgentSession` (from Microsoft Agent Framework) holds the full conversation history (all prior turns, tool calls, and responses). The frontend never stores message history; it only stores the opaque `session_id` string.
- In-memory store with lazy TTL eviction — `_AgentSessionStore` (app.py) is a `dict[str, _SessionEntry]` that evicts entries on every `get()`/`put()` call when `time.monotonic() - last_accessed > TTL`. Each successful lookup refreshes `last_accessed`, so active conversations never expire (see the sketch below).
- Per-session locking — each session has an `asyncio.Lock()` to serialise concurrent requests to the same session, preventing interleaved agent runs.
- Graceful expiry handling — when a session is expired, the backend returns an SSE `error` event with `"SessionExpired"`. The frontend clears its stored ID so the next message creates a fresh session.
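A sketch of that store; `_SessionEntry`'s exact fields are assumptions, while the lazy-eviction and refresh rules match the bullets above:

```python
# Sketch of the lazy-TTL session store described in app.py.
import asyncio
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class _SessionEntry:
    session: Any  # the framework's AgentSession object
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    last_accessed: float = field(default_factory=time.monotonic)

class _AgentSessionStore:
    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._ttl = ttl_seconds
        self._entries: dict[str, _SessionEntry] = {}

    def _evict_expired(self) -> None:
        now = time.monotonic()
        for key in [k for k, e in self._entries.items()
                    if now - e.last_accessed > self._ttl]:
            del self._entries[key]

    def get(self, session_id: str) -> _SessionEntry | None:
        self._evict_expired()  # lazy eviction on every access
        entry = self._entries.get(session_id)
        if entry:
            entry.last_accessed = time.monotonic()  # active sessions never expire
        return entry

    def put(self, session_id: str, session: Any) -> _SessionEntry:
        self._evict_expired()
        entry = _SessionEntry(session=session)
        self._entries[session_id] = entry
        return entry
```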
Configuration:
| Variable | Default | Description |
|---|---|---|
| `AGENT_SESSION_TTL` | `3600` | Seconds of inactivity before a session is evicted |
| Type | Extensions | Stage 2 Analyzer | Notes |
|---|---|---|---|
| Documents | `.pdf`, `.docx`, `.pptx`, `.xlsx` | Content Understanding | Full structure preservation |
| Spreadsheets | `.csv` | Content Understanding | Row-based chunking |
| Web | `.html`, `.htm` | Content Understanding | HTML → Markdown |
| Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` | Content Understanding (OCR) | Vision fallback for complex images |
| Text | `.txt`, `.md` | Direct read | No API call needed |
# Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-key # omit to use DefaultAzureCredential
AZURE_OPENAI_API_VERSION=2025-03-01-preview
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4.1 # primary chat model
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-3-large
AZURE_OPENAI_VISION_DEPLOYMENT=gpt-4o
AZURE_OPENAI_QUERY_PLANNING_DEPLOYMENT=gpt-5-mini
# Azure Content Understanding
CONTENTUNDERSTANDING_ENDPOINT=https://your-cu.cognitiveservices.azure.com
CONTENTUNDERSTANDING_KEY=your-key # optional
# Azure Document Intelligence
DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-di.cognitiveservices.azure.com
DOCUMENT_INTELLIGENCE_KEY=your-key # optional
# Azure AI Search
AZURE_SEARCH_ENDPOINT=https://your-search.search.windows.net
AZURE_SEARCH_INDEX_NAME=idp-documents
AZURE_SEARCH_ADMIN_KEY=your-key # optional
# Search mode: "hybrid" (default) or "knowledge_base"
# hybrid — keyword + vector + semantic reranking + custom LLM query rewrite
# knowledge_base — LLM-driven agentic retrieval via knowledge base (requires GPT deployment)
SEARCH_MODE=hybrid
# Microsoft Foundry (optional)
AZURE_AI_PROJECT_ENDPOINT=https://your-project.services.ai.azure.com
# UI → Backend connection (only needed for Chainlit UI)
IDP_API_PORT=8000
IDP_UI_PORT=8001
# IDP_API_URL=http://localhost:8000 # overrides IDP_API_PORT if set
# Agent session TTL (seconds of inactivity before session is evicted)
AGENT_SESSION_TTL=3600

Application logs (`idp_azure.*`) and third-party / framework logs are
separated so that turning on DEBUG doesn't flood the console with SDK
transport noise.
| Variable | Default | Description |
|---|---|---|
| `IDP_LOG_LEVEL` | `INFO` | Log level for application code (`idp_azure.*`) |
| `IDP_LIB_LOG_LEVEL` | `WARNING` | Log level for third-party libraries (root logger) |
| `IDP_LIB_LOG_SILENCE` | (see below) | Comma-separated logger names forced to WARNING even when `IDP_LIB_LOG_LEVEL` is lowered. Set to `""` to un-silence everything. |
# Normal development — only app DEBUG, libraries stay quiet
IDP_LOG_LEVEL=DEBUG
# Debug Azure SDK / OpenAI calls (transport noise auto-silenced)
IDP_LOG_LEVEL=DEBUG IDP_LIB_LOG_LEVEL=DEBUG
# Debug absolutely everything including httpx request/response headers
IDP_LOG_LEVEL=DEBUG IDP_LIB_LOG_LEVEL=DEBUG IDP_LIB_LOG_SILENCE=""
# Debug only OpenAI, silence Azure SDK
IDP_LIB_LOG_LEVEL=DEBUG IDP_LIB_LOG_SILENCE="httpx,httpcore,urllib3,asyncio,watchfiles,opentelemetry,msal,azure"

When `IDP_LIB_LOG_LEVEL` is lowered to DEBUG, these loggers are kept at
WARNING by default because they produce extreme noise:
| Logger | What it emits at DEBUG |
|---|---|
| `httpx` / `httpcore` | Every outgoing HTTP request and response including headers |
| `urllib3` | Connection pool lifecycle (open, close, reuse) |
| `asyncio` | Event-loop internals, selector polls, task scheduling |
| `msal` | Token cache lookups, OAuth2 handshake steps |
| `watchfiles` | File-system change events (noisy in --reload mode) |
| `opentelemetry` | Span export batching, internal SDK state |
These are not silenced by default — they produce actionable output when
you set `IDP_LIB_LOG_LEVEL=DEBUG`:
| Logger | What it emits at DEBUG |
|---|---|
| `azure` | Azure SDK pipeline — request policies, retry logic, auth flow (covers azure-search-documents, azure-ai-documentintelligence, azure-ai-contentunderstanding, azure-identity) |
| `openai` | OpenAI SDK — request/response payloads for chat completions and embeddings |
| `agent_framework` | Microsoft Agent Framework — workflow execution, tool dispatch, orchestration |
| `chonkie` | Chunking library internals |
| `uvicorn` | ASGI server startup, shutdown, lifespan events |
| `fastapi` | Router registration, middleware chain |
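A sketch of how this two-tier setup can be wired; the helper name and module location are assumptions, while the env-var semantics follow the tables above:

```python
# Sketch of the logging configuration: app loggers and library loggers
# get independent levels, and noisy transports stay silenced by default.
import logging
import os

DEFAULT_SILENCED = "httpx,httpcore,urllib3,asyncio,watchfiles,opentelemetry,msal"

def configure_logging() -> None:
    app_level = os.getenv("IDP_LOG_LEVEL", "INFO").upper()
    lib_level = os.getenv("IDP_LIB_LOG_LEVEL", "WARNING").upper()
    silenced = os.getenv("IDP_LIB_LOG_SILENCE", DEFAULT_SILENCED)

    # The root logger governs third-party libraries
    logging.basicConfig(level=lib_level)
    # The application namespace gets its own, usually more verbose, level
    logging.getLogger("idp_azure").setLevel(app_level)
    # Noisy transports stay at WARNING even when the lib level is lowered
    for name in filter(None, (n.strip() for n in silenced.split(","))):
        logging.getLogger(name).setLevel(logging.WARNING)
```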
All domain exceptions inherit from IDPError for consistent handling:
IDPError
├── IngestionError
│ ├── UnsupportedFileTypeError # unknown file extension
│ ├── FileTooLargeError # exceeds 200 MB
│ ├── EmptyFileError # zero-byte file
│ └── ExtractionError # Azure service returned no content
├── ChunkingError # Chonkie pipeline failure
├── IndexingError # search index upload failure
├── RetrievalError # knowledge base query failure
└── AgentError # RAG agent failure
All Azure API calls are wrapped with @retry_on_transient() which retries on HTTP 429 (rate limit), 503 (unavailable), and 504 (timeout) with exponential backoff (2s → 4s → 8s, max 3 retries).
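A sketch of a decorator matching that schedule; how transient status codes are detected on the caught exception is an assumption (the real decorator lives in shared/resilience.py):

```python
# Sketch of @retry_on_transient() with the 2s -> 4s -> 8s backoff above.
import asyncio
import functools

TRANSIENT_STATUS = {429, 503, 504}

def retry_on_transient(max_retries: int = 3, base_delay: float = 2.0):
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await fn(*args, **kwargs)
                except Exception as exc:
                    status = getattr(exc, "status_code", None)  # assumption
                    if status not in TRANSIENT_STATUS or attempt == max_retries:
                        raise
                    await asyncio.sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s
        return wrapper
    return decorator
```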