diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.env.example b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.env.example
new file mode 100644
index 000000000..b999eb2fb
--- /dev/null
+++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.env.example
@@ -0,0 +1,9 @@
+NVIDIA_API_KEY=nvapi-your-key-here
+NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions
+STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash
+
+# Optional overrides if Parse and StepFun require separate credentials.
+# PARSE_API_KEY=nvapi-your-parse-key-here
+# STEPFUN_API_KEY=nvapi-your-stepfun-key-here
+# PARSE_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions
+# STEPFUN_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions
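Review note on the `.env.example` above: the optional `PARSE_*` and `STEPFUN_*` variables are meant to override the shared `NVIDIA_API_KEY` and `NVAI_CHAT_COMPLETIONS_URL` values. The following is a minimal sketch of that fallback order, not the notebook's own code. The helper names `resolve_setting` and `resolve_config` are hypothetical, and the sketch assumes an empty value should be treated the same as an unset one.

```python
from __future__ import annotations

from typing import Mapping

# Default hosted endpoint, as listed in .env.example.
DEFAULT_URL = "https://integrate.api.nvidia.com/v1/chat/completions"


def resolve_setting(env: Mapping[str, str], specific: str, common: str, default: str = "") -> str:
    """Return the first non-empty value among env[specific], env[common], default."""
    return env.get(specific) or env.get(common) or default


def resolve_config(env: Mapping[str, str]) -> dict[str, str]:
    """Resolve per-model keys and URLs, preferring the model-specific override."""
    return {
        "parse_key": resolve_setting(env, "PARSE_API_KEY", "NVIDIA_API_KEY"),
        "stepfun_key": resolve_setting(env, "STEPFUN_API_KEY", "NVIDIA_API_KEY"),
        "parse_url": resolve_setting(
            env, "PARSE_CHAT_COMPLETIONS_URL", "NVAI_CHAT_COMPLETIONS_URL", DEFAULT_URL
        ),
        "stepfun_url": resolve_setting(
            env, "STEPFUN_CHAT_COMPLETIONS_URL", "NVAI_CHAT_COMPLETIONS_URL", DEFAULT_URL
        ),
    }


# Example: a shared key plus a Parse-specific override (placeholder values).
cfg = resolve_config({"NVIDIA_API_KEY": "nvapi-shared", "PARSE_API_KEY": "nvapi-parse"})
```

Passing a plain mapping instead of reading `os.environ` directly keeps the resolution logic easy to check in isolation; in practice `os.environ` can be passed as `env`.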
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.gitignore b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.gitignore
new file mode 100644
index 000000000..a1373c876
--- /dev/null
+++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.gitignore
@@ -0,0 +1,3 @@
+output_results/
+.ipynb_checkpoints/
+
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/README.md b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/README.md
new file mode 100644
index 000000000..4264a20b7
--- /dev/null
+++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/README.md
@@ -0,0 +1,93 @@
+# Document Intelligence with Nemotron Parse and StepFun
+
+Build a document intelligence workflow that combines **Nemotron Parse**
+for page layout extraction with **StepFun Step-3.7 Flash** for cropped
+image transcription and final document question answering.
+
+The notebook runs against hosted NVIDIA endpoints. No local GPU, Docker
+container, or model weights are required.
+
+## What It Does
+
+The workflow processes four pages from three public PDFs in four steps:
+
+1. Nemotron Parse extracts typed layout blocks and picture bounding boxes.
+2. StepFun classifies and transcribes each cropped picture.
+3. The notebook stitches text and picture transcriptions back into a
+   reading-order Markdown context.
+4. StepFun answers document-level questions with cited page evidence.
+
+## Models And Endpoints
+
+| Role | Model | Endpoint |
+| --- | --- | --- |
+| Layout extraction | `nvidia/nemotron-parse` | NVIDIA API Catalog chat completions |
+| Picture transcription | `stepfun-ai/step-3.7-flash` | NVIDIA API Catalog chat completions |
+| Document QA | `stepfun-ai/step-3.7-flash` | NVIDIA API Catalog chat completions |
+
+The notebook defaults both models to NVIDIA's standard
+`https://integrate.api.nvidia.com/v1/chat/completions` endpoint. If
+needed, Parse and StepFun can still be pointed at separate endpoints with
+the optional `PARSE_CHAT_COMPLETIONS_URL` and
+`STEPFUN_CHAT_COMPLETIONS_URL` variables.
+
+## Setup
+
+Install dependencies with `uv`:
+
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv sync
+```
+
+Create your local `.env`:
+
+```bash
+cp .env.example .env
+```
+
+Edit `.env` and add your key:
+
+```bash
+NVIDIA_API_KEY=nvapi-your-key-here
+NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions
+STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash
+```
+
+If Parse and StepFun require different credentials for your account, set
+these optional values:
+
+```bash
+PARSE_API_KEY=nvapi-your-parse-key-here
+STEPFUN_API_KEY=nvapi-your-stepfun-key-here
+PARSE_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions
+STEPFUN_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions
+```
+
+## Run
+
+```bash
+uv run jupyter lab stepfun_doc_intelligence_with_parse.ipynb
+```
+
+Run the notebook cells from top to bottom. The `data/documents/` folder
+already contains the demo PDFs, so the notebook can start immediately.
+
+## Project Structure
+
+```text
+.
+├── README.md
+├── .env.example
+├── pyproject.toml
+├── stepfun_doc_intelligence_with_parse.ipynb
+└── data/
+    └── documents/
+        ├── 05-03-18-political-release.pdf
+        ├── GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf
+        └── measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf
+```
+
+Running the notebook writes generated `*.parse_stepfun.json` files under
+`output_results/`; those artifacts are local run output and are not
+required in source control.
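Review note on step 3 of the README ("stitches text and picture transcriptions back into a reading-order Markdown context"): the idea can be sketched as below. This is an illustration under stated assumptions, not the notebook's implementation. The function name `weave_page_context`, the `text` field, and the sample blocks are all hypothetical; the sketch also assumes the block list is already in reading order and that transcriptions are keyed by block index.

```python
from __future__ import annotations

from typing import Any


def weave_page_context(
    blocks: list[dict[str, Any]],
    transcriptions: dict[int, str],
    page_number: int,
) -> str:
    """Join blocks in list order, inlining each Picture's transcription in place."""
    parts = [f"## Page {page_number}"]
    for index, block in enumerate(blocks):
        if block.get("type") == "Picture":
            # Assumption: transcriptions are keyed by the block's index on the page.
            body = transcriptions.get(index, "[picture not transcribed]")
            parts.append(f"[Picture {index}]\n{body}")
        else:
            text = (block.get("text") or "").strip()
            if text:
                parts.append(text)
    return "\n\n".join(parts)


# Made-up sample blocks, only to show the ordering behaviour.
blocks = [
    {"type": "Title", "text": "Quarterly summary"},
    {"type": "Picture"},
    {"type": "Text", "text": "Revenue rose in Q2."},
]
context = weave_page_context(blocks, {1: "| Quarter | Revenue |"}, page_number=3)
```

Keeping the page number in a heading at the top of each page context is one way to give the question-answering call something to cite; the notebook may format its context differently.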
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/05-03-18-political-release.pdf b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/05-03-18-political-release.pdf
new file mode 100644
index 000000000..68ad907b1
Binary files /dev/null and b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/05-03-18-political-release.pdf differ
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf
new file mode 100644
index 000000000..3cf2eedc3
Binary files /dev/null and b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf differ
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf
new file mode 100644
index 000000000..3922329d6
Binary files /dev/null and b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf differ
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/pyproject.toml b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/pyproject.toml
new file mode 100644
index 000000000..40234976c
--- /dev/null
+++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/pyproject.toml
@@ -0,0 +1,19 @@
+[project]
+name = "nemotron-parse-stepfun-document-intelligence"
+version = "0.1.0"
+description = "Document intelligence workflow using Nemotron Parse and StepFun through NVIDIA hosted endpoints"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "ipykernel>=6.0.0",
+    "jupyter>=1.0.0",
+    "pandas>=2.0.0",
+    "pillow>=10.0.0",
+    "pymupdf>=1.24.0",
+    "python-dotenv>=1.0.0",
+    "requests>=2.31.0",
+]
+
+[tool.uv]
+package = false
+
diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/stepfun_doc_intelligence_with_parse.ipynb b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/stepfun_doc_intelligence_with_parse.ipynb
new file mode 100644
index 000000000..296c7583b
--- /dev/null
+++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/stepfun_doc_intelligence_with_parse.ipynb
@@ -0,0 +1,1407 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0cb3e5c3",
+   "metadata": {},
+   "source": [
+    "<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" align=\"right\" width=\"100px\"/>\n",
+    "\n",
+    "# Document Intelligence with Nemotron Parse + StepFun Invocation Endpoint\n",
+    "\n",
+    "> **You do not need a GPU to run this notebook.** Every model call goes\n",
+    "> to NVIDIA's hosted chat-completions endpoint at\n",
+    "> `https://integrate.api.nvidia.com/v1/chat/completions`.\n",
+    "> All you need is an `NVIDIA_API_KEY` with access to the two model IDs\n",
+    "> configured in Setup below.\n",
+    "\n",
+    "This notebook builds a streamlined, **all-modality** document\n",
+    "analysis pipeline by pairing **Nemotron Parse**\n",
+    "(`nvidia/nemotron-parse`) with **StepFun Flash**\n",
+    "(`stepfun-ai/step-3.7-flash`). Parse provides the spatial\n",
+    "anchoring that turns each PDF page into typed blocks and picture\n",
+    "bounding boxes. StepFun is the vision-language model that both\n",
+    "**reads each cropped picture** and **answers questions about the\n",
+    "whole assembled document**.\n",
+    "\n",
+    "To make the case concretely, we run the pipeline against four pages\n",
+    "picked from **three different public PDFs** — a Pew Research report,\n",
+    "a social-media analytics deck, and a graduate-studies brochure —\n",
+    "each chosen to stress a **different content modality** that a single\n",
+    "page-level VLM call cannot handle alone:\n",
+    "\n",
+    "| Modality | Source | Why it needs Parse |\n",
+    "| --- | --- | --- |\n",
+    "| **Chart** | Pew Research, p. 5 | a stacked-bar chart with eight policy rows — Parse isolates the bar-graph region so StepFun's transcription is a clean markdown table, not a screenshot caption |\n",
+    "| **Multi-picture page** | social-media report, p. 11 | three Facebook-post screenshots side by side — without Parse's bbox split, the QA call cannot tell which post is *the Disneyland post* |\n",
+    "| **Infographic** | social-media report, p. 20 | a dense pixel-only demographic panel — Parse draws the panel boundary, StepFun lists every number inside |\n",
+    "| **Structured table** | Graduate Studies brochure, p. 11 | a two-column programme table — Parse surfaces it as LaTeX-tabular text, so no vision call is needed to read it |\n",
+    "\n",
+    "By the end of the tutorial you will see how the pair turns these\n",
+    "unstructured PDFs into **page-cited, phrase-quoted answers** that\n",
+    "any reader can verify against the page image.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e89f7b3",
+   "metadata": {},
+   "source": [
+    "## 1. Introduction: three roles, two model surfaces\n",
+    "\n",
+    "Our pipeline uses each hosted model for its specialty around **one\n",
+    "unified spatial context**:\n",
+    "\n",
+    "* **`nvidia/nemotron-parse` — the Architect.** A deterministic layout\n",
+    "  parser that returns every block's **type** (`Title`, `Text`,\n",
+    "  `Table`, `List-item`, `Picture`, ...), **bounding box**, and\n",
+    "  **reading order** in one call.\n",
+    "\n",
+    "* **`stepfun-ai/step-3.7-flash` — the Visual Specialist.** Every\n",
+    "  `Picture` Parse identifies becomes one StepFun call that first\n",
+    "  *classifies* the image (`Infographic`, `Bar Graph`, `Line Graph`,\n",
+    "  `Smartphone Screenshot`, ...) and then *transcribes* it with a\n",
+    "  prompt tailored to that sub-type.\n",
+    "\n",
+    "* **`stepfun-ai/step-3.7-flash` — the Reasoning Engine.** The\n",
+    "  same VLM reads the assembled document context: text blocks in\n",
+    "  reading order with picture transcriptions inlined at their spatial\n",
+    "  positions. It answers the question, cites the page it came from,\n",
+    "  and quotes the supporting phrase verbatim.\n",
+    "\n",
+    "Pipeline shape:\n",
+    "\n",
+    "```text\n",
+    "PDF page -> Nemotron Parse -> typed text/table blocks\n",
+    "                         -> picture boxes -> StepFun Visual Specialist\n",
+    "text + picture transcriptions -> reading-order page context\n",
+    "page contexts + question -> StepFun Reasoning Engine -> cited answer\n",
+    "```\n",
+    "\n",
+    "### Why pair Parse with StepFun instead of calling the VLM on the whole page?\n",
+    "\n",
+    "StepFun is a capable multimodal model, but real documents create four\n",
+    "structural problems that are easier to solve with Parse in front:\n",
+    "\n",
+    "1. **Tables and headers need structure, not pixels.** On the\n",
+    "   Graduate-Studies brochure page, Parse emits a `Table` block with a\n",
+    "   LaTeX-tabular body, so the Reasoning Engine can answer directly\n",
+    "   from text.\n",
+    "2. **A chart region needs isolation before transcription.** On the\n",
+    "   Pew Research page, Parse draws one `Picture` bbox around the chart;\n",
+    "   StepFun receives only the crop and returns a clean structured\n",
+    "   transcription.\n",
+    "3. **Multi-picture pages bleed together.** Page 11 of the social-media\n",
+    "   report has three screenshots side by side. Parse cuts them into\n",
+    "   separate boxes so StepFun reads one card at a time.\n",
+    "4. **Citations need anchors.** Parse gives every content item a `bbox`\n",
+    "   and reading-order index, so answers can cite `(p. 20)` and quote the\n",
+    "   phrase that supports the answer.\n",
+    "\n",
+    "Two design levers do the heavy lifting:\n",
+    "\n",
+    "1. **Divide and conquer on every `Picture`.** Parse emits one generic\n",
+    "   `Picture` class; StepFun sub-types each crop and transcribes it accordingly.\n",
+    "2. **Spatial-context weave.** Parse's reading-order bboxes let us\n",
+    "   interleave each picture's transcription at the exact position it\n",
+    "   occupies on the page.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0080fb2b",
+   "metadata": {},
+   "source": [
+    "## 2. Setup and prerequisites\n",
+    "\n",
+    "Five Python packages are all we need: `pymupdf` for PDF rendering,\n",
+    "`pillow` for image handling, `requests` for the API calls, `pandas`\n",
+    "for tabular display, and `python-dotenv` to load the NVIDIA key from\n",
+    "a `.env` file.\n",
+    "\n",
+    "We install with [**uv**](https://docs.astral.sh/uv/) -- the fast\n",
+    "package manager from Astral -- and fall back to `pip` automatically\n",
+    "if `uv` is not on your `PATH`.  Recommended workflow before launching\n",
+    "Jupyter:\n",
+    "\n",
+    "```bash\n",
+    "# one-time install of uv (https://docs.astral.sh/uv/getting-started/installation/)\n",
+    "curl -LsSf https://astral.sh/uv/install.sh | sh\n",
+    "\n",
+    "# create + activate an isolated environment for this notebook\n",
+    "uv venv .venv && source .venv/bin/activate\n",
+    "uv pip install jupyter\n",
+    "jupyter lab stepfun_doc_intelligence_with_parse.ipynb\n",
+    "```\n",
+    "\n",
+    "The next cell installs the runtime deps into whichever environment\n",
+    "the notebook kernel is already pointing at, so it works whether you\n",
+    "ran the steps above or are using a colleague-provided kernel.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "74b70189",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:13.485717Z",
+     "iopub.status.busy": "2026-04-28T02:39:13.485583Z",
+     "iopub.status.idle": "2026-04-28T02:39:13.583801Z",
+     "shell.execute_reply": "2026-04-28T02:39:13.583228Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import shutil, subprocess, sys\n",
+    "\n",
+    "PKGS = [\"requests\", \"pillow\", \"pymupdf\", \"pandas\", \"python-dotenv\"]\n",
+    "\n",
+    "if shutil.which(\"uv\"):\n",
+    "    print(\"[setup] installing via uv ->\", sys.executable)\n",
+    "    subprocess.check_call([\n",
+    "        \"uv\", \"pip\", \"install\", \"--quiet\",\n",
+    "        \"--python\", sys.executable,\n",
+    "        *PKGS,\n",
+    "    ])\n",
+    "else:\n",
+    "    print(\"[setup] uv not on PATH; falling back to pip. \"\n",
+    "          \"Install uv from https://docs.astral.sh/uv/ for faster installs.\")\n",
+    "    subprocess.check_call([\n",
+    "        sys.executable, \"-m\", \"pip\", \"install\", \"--quiet\", *PKGS,\n",
+    "    ])\n",
+    "\n",
+    "print(\"[setup] OK -- runtime deps ready.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4726c5b8",
+   "metadata": {},
+   "source": [
+    "### Configure endpoints and keys\n",
+    "\n",
+    "This notebook supports a compact `.env` shape. `NVAI_CHAT_COMPLETIONS_URL` is the default NVIDIA API Catalog chat-completions endpoint for both models; set `PARSE_CHAT_COMPLETIONS_URL` or `STEPFUN_CHAT_COMPLETIONS_URL` only if you need separate endpoints.\n",
+    "\n",
+    "Make the key and endpoint visible to this notebook in either of two\n",
+    "ways:\n",
+    "\n",
+    "- **`.env` file** in this notebook's directory:\n",
+    "  ```bash\n",
+    "  NVIDIA_API_KEY=nvapi-...\n",
+    "  NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions\n",
+    "  STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash\n",
+    "  ```\n",
+    "- **Shell export** before launching Jupyter:\n",
+    "  ```bash\n",
+    "  export NVIDIA_API_KEY=nvapi-...\n",
+    "  export NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions\n",
+    "  export STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash\n",
+    "  ```\n",
+    "\n",
+    "Keep credentials out of source control: this directory's `.gitignore` lists only\n",
+    "`output_results/` and `.ipynb_checkpoints/`, so make sure `.env` is ignored before you commit.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1157ce65",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:13.585295Z",
+     "iopub.status.busy": "2026-04-28T02:39:13.585176Z",
+     "iopub.status.idle": "2026-04-28T02:39:14.339790Z",
+     "shell.execute_reply": "2026-04-28T02:39:14.339256Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "import base64\n",
+    "import io\n",
+    "import json\n",
+    "import os\n",
+    "import re\n",
+    "import textwrap\n",
+    "import time\n",
+    "from pathlib import Path\n",
+    "from typing import Any\n",
+    "\n",
+    "import fitz  # PyMuPDF\n",
+    "import pandas as pd\n",
+    "import requests\n",
+    "from dotenv import load_dotenv\n",
+    "from IPython.display import Markdown, display\n",
+    "from PIL import Image, ImageDraw, ImageFont\n",
+    "\n",
+    "NOTEBOOK_CWD = Path.cwd()\n",
+    "_DEMO_RELATIVE_PATH = Path(\"oss_tutorials\") / \"Nemotron_Parse_StepFun_Document_Intelligence\"\n",
+    "\n",
+    "\n",
+    "def _resolve_demo_root() -> Path:\n",
+    "    \"\"\"Find this demo directory even when Jupyter starts from the repo root.\"\"\"\n",
+    "    candidates = [\n",
+    "        NOTEBOOK_CWD,\n",
+    "        NOTEBOOK_CWD.parent if NOTEBOOK_CWD.name == \"notebooks\" else NOTEBOOK_CWD,\n",
+    "        NOTEBOOK_CWD / _DEMO_RELATIVE_PATH,\n",
+    "        NOTEBOOK_CWD / \"Nemotron\" / _DEMO_RELATIVE_PATH,\n",
+    "    ]\n",
+    "    for parent in NOTEBOOK_CWD.parents:\n",
+    "        candidates.extend([\n",
+    "            parent,\n",
+    "            parent / _DEMO_RELATIVE_PATH,\n",
+    "            parent / \"Nemotron\" / _DEMO_RELATIVE_PATH,\n",
+    "        ])\n",
+    "\n",
+    "    seen: set[Path] = set()\n",
+    "    for candidate in candidates:\n",
+    "        candidate = candidate.resolve()\n",
+    "        if candidate in seen:\n",
+    "            continue\n",
+    "        seen.add(candidate)\n",
+    "        if any((candidate / name).exists() for name in [\"stepfun_doc_intelligence_with_parse.ipynb\"]):\n",
+    "            return candidate\n",
+    "    return NOTEBOOK_CWD\n",
+    "\n",
+    "\n",
+    "REPO_ROOT = _resolve_demo_root()\n",
+    "\n",
+    "_loaded_env_files: list[Path] = []\n",
+    "for _env_file in [REPO_ROOT.parent / \".env\", REPO_ROOT / \".env\"]:\n",
+    "    if _env_file.exists():\n",
+    "        # Load broader defaults first; let the demo-local .env win.\n",
+    "        load_dotenv(_env_file, override=(_env_file.parent == REPO_ROOT))\n",
+    "        _loaded_env_files.append(_env_file)\n",
+    "\n",
+    "# Backward-compatible .env support:\n",
+    "#   NVIDIA_API_KEY + NVAI_CHAT_COMPLETIONS_URL are the default credential\n",
+    "#   and chat-completions URL for both models. Override PARSE_* or STEPFUN_*\n",
+    "#   only if the two calls need separate credentials or endpoints.\n",
+    "COMMON_API_KEY = os.environ.get(\"NVIDIA_API_KEY\", \"YOUR_API_KEY_HERE\")\n",
+    "PARSE_API_KEY = os.environ.get(\"PARSE_API_KEY\") or COMMON_API_KEY\n",
+    "STEPFUN_API_KEY = os.environ.get(\"STEPFUN_API_KEY\") or COMMON_API_KEY\n",
+    "\n",
+    "PARSE_CHAT_COMPLETIONS_URL = os.environ.get(\"PARSE_CHAT_COMPLETIONS_URL\") or os.environ.get(\n",
+    "    \"NVAI_CHAT_COMPLETIONS_URL\",\n",
+    "    \"https://integrate.api.nvidia.com/v1/chat/completions\",\n",
+    ")\n",
+    "STEPFUN_CHAT_COMPLETIONS_URL = os.environ.get(\n",
+    "    \"STEPFUN_CHAT_COMPLETIONS_URL\",\n",
+    "    os.environ.get(\n",
+    "        \"NVAI_CHAT_COMPLETIONS_URL\",\n",
+    "        \"https://integrate.api.nvidia.com/v1/chat/completions\",\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "PARSE_MODEL = os.environ.get(\"PARSE_MODEL\", \"nvidia/nemotron-parse\")\n",
+    "STEPFUN_VLM_MODEL = os.environ.get(\"STEPFUN_VLM_MODEL\", \"stepfun-ai/step-3.7-flash\")\n",
+    "NVAI_REQUEST_TIMEOUT = int(os.environ.get(\"NVAI_REQUEST_TIMEOUT\", \"480\"))\n",
+    "NVAI_MAX_RETRIES = int(os.environ.get(\"NVAI_MAX_RETRIES\", \"2\"))\n",
+    "\n",
+    "if not PARSE_API_KEY or PARSE_API_KEY == \"YOUR_API_KEY_HERE\":\n",
+    "    raise RuntimeError(\n",
+    "        \"Parse API key is not set. Add NVIDIA_API_KEY=nvapi-... or \"\n",
+    "        \"PARSE_API_KEY=nvapi-... to a .env file in this notebook's directory.\"\n",
+    "    )\n",
+    "if not STEPFUN_API_KEY or STEPFUN_API_KEY == \"YOUR_API_KEY_HERE\":\n",
+    "    raise RuntimeError(\n",
+    "        \"StepFun API key is not set. Add NVIDIA_API_KEY=nvapi-... or \"\n",
+    "        \"STEPFUN_API_KEY=nvapi-... to a .env file in this notebook's directory.\"\n",
+    "    )\n",
+    "\n",
+    "print(f\"Demo root:       {REPO_ROOT}\")\n",
+    "print(\"Env files:       \" + (\", \".join(str(p) for p in _loaded_env_files) or \"none found\"))\n",
+    "print(f\"Parse endpoint:  {PARSE_CHAT_COMPLETIONS_URL}\")\n",
+    "print(f\"StepFun endpoint: {STEPFUN_CHAT_COMPLETIONS_URL}\")\n",
+    "print(f\"Architect:       {PARSE_MODEL}\")\n",
+    "print(f\"Specialist + QA: {STEPFUN_VLM_MODEL}\")\n",
+    "print(f\"Request timeout: {NVAI_REQUEST_TIMEOUT}s, retries: {NVAI_MAX_RETRIES}\")\n",
+    "\n",
+    "CLASS_COLORS = {\n",
+    "    \"Title\": \"#D32F2F\", \"Section-header\": \"#E91E63\", \"Text\": \"#4CAF50\",\n",
+    "    \"List-item\": \"#1976D2\", \"Caption\": \"#607D8B\", \"Table\": \"#03A9F4\",\n",
+    "    \"Picture\": \"#6D4C41\", \"Figure\": \"#6D4C41\", \"Formula\": \"#FF9800\",\n",
+    "    \"Page-header\": \"#9E9E9E\", \"Page-footer\": \"#9E9E9E\", \"Footnote\": \"#00BCD4\",\n",
+    "    \"Bibliography\": \"#512DA8\", \"TOC\": \"#FFC107\", \"DEFAULT\": \"#9E9E9E\",\n",
+    "}\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1756d4d5",
+   "metadata": {},
+   "source": [
+    "## 3. The example document set\n",
+    "\n",
+    "We pick **four pages from three different public PDFs** so the\n",
+    "pipeline stresses a different content modality on each page, and so\n",
+    "the reader can visually verify every answer against the original\n",
+    "page image.  The exact pages are:\n",
+    "\n",
+    "| `short_id` | Modality | PDF | Page |\n",
+    "| --- | --- | --- | --- |\n",
+    "| `pew`      | Chart             | `05-03-18-political-release.pdf` (Pew Research) | 5  |\n",
+    "| `social`   | Multi-picture     | `measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf` (Social-Media Analytics Report) | 11 |\n",
+    "| `linkedin` | Infographic       | same Social-Media report as above | 20 |\n",
+    "| `gpl`      | Structured table  | `GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf` (Graduate Studies brochure) | 11 |\n",
+    "\n",
+    "Every page of these PDFs is a **rasterised image** — selecting text\n",
+    "with your mouse in a PDF viewer returns nothing, so a text-first\n",
+    "parser gives you nothing to work with.  This is precisely the class\n",
+    "of document where a VLM-driven pipeline earns its keep.\n",
+    "\n",
+    "Parse's layout overlay for each of these pages is generated inline\n",
+    "by the pipeline in §4.2 — so the annotations you see in this\n",
+    "notebook are produced by the same `nvidia/nemotron-parse` call the\n",
+    "rest of the pipeline consumes, never a pre-baked asset.  Point\n",
+    "`DEMO_DOCS` at any `(pdf, page)` pairs on your disk to try your own\n",
+    "— the rest of the notebook does not change."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "69ee7206",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:14.341617Z",
+     "iopub.status.busy": "2026-04-28T02:39:14.341453Z",
+     "iopub.status.idle": "2026-04-28T02:39:14.345710Z",
+     "shell.execute_reply": "2026-04-28T02:39:14.345386Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "DOC_DIR = REPO_ROOT / \"data\" / \"documents\"\n",
+    "DOC_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "# (short_id, pdf_path, page_number, modality_label)\n",
+    "# short_id is a stable key used downstream to group outputs per page.\n",
+    "DEMO_DOCS: list[tuple[str, Path, int, str]] = [\n",
+    "    (\"pew\",      DOC_DIR / \"05-03-18-political-release.pdf\",                                    5,  \"chart\"),\n",
+    "    (\"social\",   DOC_DIR / \"measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf\",     11, \"multi-picture\"),\n",
+    "    (\"linkedin\", DOC_DIR / \"measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf\",     20, \"infographic\"),\n",
+    "    (\"gpl\",      DOC_DIR / \"GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf\",  11, \"table\"),\n",
+    "]\n",
+    "\n",
+    "# Short, human-friendly name to use in section headings (the PDF filename\n",
+    "# itself is too long to read in a heading).\n",
+    "DISPLAY_NAME = {\n",
+    "    \"pew\":      \"Pew Research -- Political Release\",\n",
+    "    \"social\":   \"Social-Media Analytics Report\",\n",
+    "    \"linkedin\": \"Social-Media Analytics Report\",\n",
+    "    \"gpl\":      \"Graduate Studies Brochure\",\n",
+    "}\n",
+    "\n",
+    "# Public source URLs for each PDF.  The notebook is self-contained:\n",
+    "# if a PDF is missing, we download it once into `DOC_DIR` on first\n",
+    "# run.  The three demo PDFs are mirrored on the MMLongBench-Doc\n",
+    "# Hugging Face dataset (`yubo2333/MMLongBench-Doc`).  Point\n",
+    "# `DOC_DIR` at any directory you control (or pre-populate it\n",
+    "# yourself) and the pipeline consumes the local copy after that.\n",
+    "_HF_DOC_ROOT = (\n",
+    "    \"https://huggingface.co/datasets/yubo2333/MMLongBench-Doc/\"\n",
+    "    \"resolve/main/documents\"\n",
+    ")\n",
+    "_PDF_SOURCES: dict[str, str] = {\n",
+    "    name: f\"{_HF_DOC_ROOT}/{name}\" for name in {\n",
+    "        \"05-03-18-political-release.pdf\",\n",
+    "        \"measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf\",\n",
+    "        \"GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf\",\n",
+    "    }\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def _ensure_pdf(pdf: Path) -> None:\n",
+    "    if pdf.exists():\n",
+    "        return\n",
+    "    url = _PDF_SOURCES.get(pdf.name)\n",
+    "    if url is None:\n",
+    "        raise FileNotFoundError(\n",
+    "            f\"PDF not found and no default URL registered: {pdf}.  \"\n",
+    "            \"Either drop the file into DOC_DIR yourself or add a \"\n",
+    "            \"URL entry to _PDF_SOURCES.\")\n",
+    "    print(f\"  [download] {pdf.name}  <-  {url}\")\n",
+    "    try:\n",
+    "        r = requests.get(url, timeout=60)\n",
+    "        r.raise_for_status()\n",
+    "    except Exception as exc:\n",
+    "        raise FileNotFoundError(\n",
+    "            f\"Could not auto-download {pdf.name} from {url}: {exc}.  \"\n",
+    "            \"Drop the PDF into DOC_DIR manually and re-run this cell.\"\n",
+    "        ) from exc\n",
+    "    pdf.write_bytes(r.content)\n",
+    "\n",
+    "\n",
+    "for sid, pdf, pn, label in DEMO_DOCS:\n",
+    "    _ensure_pdf(pdf)\n",
+    "    print(f\"  [{sid:8s}] p.{pn:<3d} {label:<14s} -> {pdf.name}  \"\n",
+    "          f\"({pdf.stat().st_size / 1024:,.0f} KB)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9b0ce18e",
+   "metadata": {},
+   "source": [
+    "## 4. The core pipeline in action\n",
+    "\n",
+    "Now, let's walk through the code that powers the pipeline."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1ccb5c03",
+   "metadata": {},
+   "source": [
+    "### 4.1. Helper functions\n",
+    "\n",
+    "One self-contained cell with every building block the pipeline needs,\n",
+    "organised into three groups:\n",
+    "\n",
+    "1. **Imaging** — `pdf_page_to_image` renders a page to pixels,\n",
+    "   `pil_to_data_url` encodes it for the API, `draw_annotations` paints\n",
+    "   bounding-box overlays (with per-class colours and sub-typed\n",
+    "   picture labels) on top of any page.\n",
+    "2. **Model surfaces** — `call_nemotron_parse` for the Architect,\n",
+    "   `call_stepfun_vlm` as a single entry point that serves both the\n",
+    "   Visual Specialist and the Reasoning Engine over the NVIDIA hosted\n",
+    "   chat-completions endpoint, plus small helpers to\n",
+    "   pull clean text or JSON out of the response.\n",
+    "3. **Pipeline stages** — `describe_picture` implements the\n",
+    "   divide-and-conquer lever (classify, then dispatch to a\n",
+    "   content-aware prompt); `assemble_page_context` implements the\n",
+    "   spatial weave (interleaves picture transcriptions into the page's\n",
+    "   prose at their reading-order position); `ask_question` is the\n",
+    "   final Reasoning Engine call, with three short answer rules that\n",
+    "   make every answer page-cited and phrase-quoted.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1fdbd3ab",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:14.347301Z",
+     "iopub.status.busy": "2026-04-28T02:39:14.347222Z",
+     "iopub.status.idle": "2026-04-28T02:39:14.389829Z",
+     "shell.execute_reply": "2026-04-28T02:39:14.389149Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# =============================================================================\n",
+    "# Group 1 — Imaging\n",
+    "# =============================================================================\n",
+    "\n",
+    "def pdf_page_to_image(pdf_path: str | Path, page_index: int, *, dpi: int = 150) -> Image.Image:\n",
+    "    \"\"\"Render a 0-indexed PDF page to an RGB PIL image at `dpi`.\"\"\"\n",
+    "    doc = fitz.open(pdf_path)\n",
+    "    try:\n",
+    "        page = doc.load_page(page_index)\n",
+    "        zoom = dpi / 72.0\n",
+    "        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))\n",
+    "        return Image.frombytes(\"RGB\", [pix.width, pix.height], pix.samples)\n",
+    "    finally:\n",
+    "        doc.close()\n",
+    "\n",
+    "\n",
+    "def pil_to_data_url(img: Image.Image, *, fmt: str = \"JPEG\", quality: int = 85) -> str:\n",
+    "    \"\"\"JPEG- or PNG-encode a PIL image and wrap it as a `data:` URL.\"\"\"\n",
+    "    if img.mode != \"RGB\":\n",
+    "        img = img.convert(\"RGB\")\n",
+    "    buf = io.BytesIO()\n",
+    "    if fmt.upper() == \"JPEG\":\n",
+    "        img.save(buf, format=\"JPEG\", quality=quality)\n",
+    "        mime = \"image/jpeg\"\n",
+    "    else:\n",
+    "        img.save(buf, format=\"PNG\")\n",
+    "        mime = \"image/png\"\n",
+    "    return f\"data:{mime};base64,\" + base64.b64encode(buf.getvalue()).decode()\n",
+    "\n",
+    "\n",
+    "def _luminance(hex_color: str) -> float:\n",
+    "    h = hex_color.lstrip(\"#\")\n",
+    "    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))\n",
+    "    return 0.2126 * r + 0.7152 * g + 0.0722 * b\n",
+    "\n",
+    "\n",
+    "def draw_annotations(image: Image.Image, blocks: list[dict[str, Any]]) -> Image.Image:\n",
+    "    \"\"\"Paint labelled bounding boxes for every block with a valid box.\n",
+    "    Labels read `<index>:<type>`; pictures that carry a `sub_type`\n",
+    "    (from the Visual Specialist) read `<index>:Picture:<sub_type>`.\n",
+    "    \"\"\"\n",
+    "    out = image.copy()\n",
+    "    draw = ImageDraw.Draw(out)\n",
+    "    W, H = out.size\n",
+    "    box_w = max(2, int(W / 600))\n",
+    "    font_size = max(14, int(W / 80))\n",
+    "    try:\n",
+    "        font = ImageFont.truetype(\"Arial.ttf\", font_size)\n",
+    "    except Exception:\n",
+    "        font = ImageFont.load_default()\n",
+    "    for i, b in enumerate(blocks):\n",
+    "        bb = b.get(\"bbox\") or {}\n",
+    "        x0, y0 = bb.get(\"xmin\", 0.0) * W, bb.get(\"ymin\", 0.0) * H\n",
+    "        x1, y1 = bb.get(\"xmax\", 0.0) * W, bb.get(\"ymax\", 0.0) * H\n",
+    "        if x1 <= x0 or y1 <= y0:\n",
+    "            continue\n",
+    "        cat = b.get(\"type\", \"DEFAULT\")\n",
+    "        sub = b.get(\"sub_type\")\n",
+    "        color = CLASS_COLORS.get(cat, CLASS_COLORS[\"DEFAULT\"])\n",
+    "        label = f\"{i}:{cat}\" + (f\":{sub}\" if sub else \"\")\n",
+    "        draw.rectangle([x0, y0, x1, y1], outline=color, width=box_w)\n",
+    "        try:\n",
+    "            tb = draw.textbbox((0, 0), label, font=font)\n",
+    "            tw, th = tb[2] - tb[0], tb[3] - tb[1]\n",
+    "        except Exception:\n",
+    "            tw, th = len(label) * 7, font_size\n",
+    "        bg = (x0, max(0, y0 - th - 6), x0 + tw + 10, max(0, y0 - 6))\n",
+    "        draw.rectangle(bg, fill=color)\n",
+    "        text_color = \"#000000\" if _luminance(color) > 140 else \"#FFFFFF\"\n",
+    "        draw.text((bg[0] + 5, bg[1] + 2), label, fill=text_color, font=font)\n",
+    "    return out\n",
+    "\n",
+    "\n",
+    "# =============================================================================\n",
+    "# Group 2 — Model surfaces\n",
+    "# =============================================================================\n",
+    "\n",
+    "def _headers(api_key: str) -> dict[str, str]:\n",
+    "    return {\n",
+    "        \"Authorization\": f\"Bearer {api_key}\",\n",
+    "        \"Content-Type\": \"application/json\",\n",
+    "        \"Accept\": \"application/json\",\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "def _post_chat_completion(\n",
+    "    body: dict[str, Any],\n",
+    "    *,\n",
+    "    endpoint: str,\n",
+    "    api_key: str,\n",
+    "    timeout: int | None = None,\n",
+    ") -> requests.Response:\n",
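+    "    \"\"\"POST to a chat-completions endpoint, retrying timeouts and\n",
+    "    connection errors with exponential backoff.  HTTP error statuses\n",
+    "    are returned to the caller, not retried.\n",
+    "    \"\"\"\n",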
+    "    timeout = timeout or NVAI_REQUEST_TIMEOUT\n",
+    "    for attempt in range(NVAI_MAX_RETRIES + 1):\n",
+    "        try:\n",
+    "            return requests.post(\n",
+    "                endpoint,\n",
+    "                headers=_headers(api_key),\n",
+    "                json=body,\n",
+    "                timeout=timeout,\n",
+    "            )\n",
+    "        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):\n",
+    "            if attempt >= NVAI_MAX_RETRIES:\n",
+    "                raise\n",
+    "            sleep_s = 2 ** attempt\n",
+    "            print(f\"  [retry] request timed out or connection failed; retrying in {sleep_s}s\")\n",
+    "            time.sleep(sleep_s)\n",
+    "\n",
+    "\n",
+    "def call_nemotron_parse(image: Image.Image) -> list[dict[str, Any]]:\n",
+    "    \"\"\"Run `nemotron-parse` on a page image.  Returns a flat list of\n",
+    "    blocks with `type`, `bbox`, and `text`.\n",
+    "    \"\"\"\n",
+    "    body = {\n",
+    "        \"model\": PARSE_MODEL,\n",
+    "        \"messages\": [{\"role\": \"user\", \"content\": [\n",
+    "            {\"type\": \"image_url\", \"image_url\": {\"url\": pil_to_data_url(image, fmt=\"PNG\")}}\n",
+    "        ]}],\n",
+    "        \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"markdown_bbox\"}}],\n",
+    "        \"tool_choice\": {\"type\": \"function\", \"function\": {\"name\": \"markdown_bbox\"}},\n",
+    "        \"max_tokens\": 8192,\n",
+    "        \"temperature\": 0.0,\n",
+    "    }\n",
+    "    r = _post_chat_completion(\n",
+    "        body,\n",
+    "        endpoint=PARSE_CHAT_COMPLETIONS_URL,\n",
+    "        api_key=PARSE_API_KEY,\n",
+    "    )\n",
+    "    r.raise_for_status()\n",
+    "    args = r.json()[\"choices\"][0][\"message\"][\"tool_calls\"][0][\"function\"][\"arguments\"]\n",
+    "    parsed = json.loads(args)\n",
+    "    blocks = parsed if isinstance(parsed, list) else parsed.get(\"tool_call_arguments\", [])\n",
+    "    if blocks and isinstance(blocks[0], list):\n",
+    "        blocks = blocks[0]\n",
+    "    return blocks or []\n",
+    "\n",
+    "\n",
+    "# StepFun can answer directly, but VLMs sometimes include reasoning-style\n",
+    "# preambles when asked to transcribe dense visual content.  The Specialist\n",
+    "# wants the final transcription only, so direct calls get a small system\n",
+    "# guard.  `extract_text` also strips any <think> blocks or echoed guard text.\n",
+    "_SYS_NO_THINK = (\n",
+    "    \"/no_think\\n\"\n",
+    "    \"Answer directly and concisely. Do NOT include any reasoning, \"\n",
+    "    \"preamble, or <think> blocks.\"\n",
+    ")\n",
+    "_SYSTEM_ECHO = re.compile(\n",
+    "    r\"^\\s*(?:/no_think\\s*)?Answer directly and concisely[^.]*\\.\\s*\"\n",
+    "    r\"Do NOT include any reasoning[^.]*\\.\\s*\",\n",
+    "    re.IGNORECASE,\n",
+    ")\n",
+    "_LEAK_HEADS = (\n",
+    "    \"okay,\", \"okay \", \"the user wants\", \"let me \", \"first,\",\n",
+    "    \"i need to\", \"i'll \", \"i will \", \"alright,\",\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def call_stepfun_vlm(\n",
+    "    prompt: str,\n",
+    "    images: list[Image.Image] | None = None,\n",
+    "    *,\n",
+    "    direct: bool = True,\n",
+    "    json_mode: bool = False,\n",
+    "    temperature: float = 0.2,\n",
+    "    top_p: float = 0.95,\n",
+    "    max_tokens: int = 2048,\n",
+    ") -> dict[str, Any]:\n",
+    "    \"\"\"One StepFun VLM call, used by both the Specialist and the\n",
+    "    Reasoning Engine.  `direct=True` adds a short system guard that\n",
+    "    asks for final answers only.  The request uses standard\n",
+    "    chat-completions fields so it can target\n",
+    "    `stepfun-ai/step-3.7-flash` through NVIDIA's hosted endpoint.\n",
+    "    \"\"\"\n",
+    "    parts: list[dict[str, Any]] = [{\"type\": \"text\", \"text\": prompt}]\n",
+    "    for img in images or []:\n",
+    "        parts.append({\"type\": \"image_url\", \"image_url\": {\"url\": pil_to_data_url(img)}})\n",
+    "    messages: list[dict[str, Any]] = []\n",
+    "    if direct:\n",
+    "        messages.append({\"role\": \"system\", \"content\": _SYS_NO_THINK})\n",
+    "    messages.append({\"role\": \"user\", \"content\": parts})\n",
+    "    body: dict[str, Any] = {\n",
+    "        \"model\": STEPFUN_VLM_MODEL,\n",
+    "        \"messages\": messages,\n",
+    "        \"max_tokens\": max_tokens,\n",
+    "        \"temperature\": temperature,\n",
+    "        \"top_p\": top_p,\n",
+    "        \"stream\": False,\n",
+    "    }\n",
+    "    if json_mode:\n",
+    "        body[\"response_format\"] = {\"type\": \"json_object\"}\n",
+    "    r = _post_chat_completion(\n",
+    "        body,\n",
+    "        endpoint=STEPFUN_CHAT_COMPLETIONS_URL,\n",
+    "        api_key=STEPFUN_API_KEY,\n",
+    "    )\n",
+    "    if r.status_code >= 400 and json_mode and \"response_format\" in body:\n",
+    "        # Some hosted VLMs reject the `response_format` field.  Retry\n",
+    "        # once without it and rely on the prompt's JSON instruction.\n",
+    "        body.pop(\"response_format\", None)\n",
+    "        r = _post_chat_completion(\n",
+    "            body,\n",
+    "            endpoint=STEPFUN_CHAT_COMPLETIONS_URL,\n",
+    "            api_key=STEPFUN_API_KEY,\n",
+    "        )\n",
+    "    r.raise_for_status()\n",
+    "    return r.json()\n",
+    "\n",
+    "\n",
+    "_THINK_RE = re.compile(r\"<think\\b[^>]*>.*?</think>\", re.DOTALL | re.IGNORECASE)\n",
+    "\n",
+    "\n",
+    "def extract_text(resp: dict[str, Any]) -> str:\n",
+    "    msg = resp.get(\"choices\", [{}])[0].get(\"message\", {})\n",
+    "    text = (\n",
+    "        msg.get(\"content\")\n",
+    "        or msg.get(\"reasoning\")\n",
+    "        or msg.get(\"reasoning_content\")\n",
+    "        or \"\"\n",
+    "    ).strip()\n",
+    "    text = _THINK_RE.sub(\"\", text).strip()\n",
+    "    text = _SYSTEM_ECHO.sub(\"\", text, count=1).strip()\n",
+    "    return text\n",
+    "\n",
+    "\n",
+    "def _is_leaky(text: str) -> bool:\n",
+    "    \"\"\"Heuristic: the Specialist sometimes leaks chain-of-thought\n",
+    "    into `content` without `<think>` tags.  If the response opens\n",
+    "    with a classic reasoning preamble, we retry the call once.\n",
+    "    \"\"\"\n",
+    "    head = text.lstrip().lower()[:80]\n",
+    "    return any(head.startswith(p) for p in _LEAK_HEADS)\n",
+    "\n",
+    "\n",
+    "def extract_json_object(resp: dict[str, Any]) -> dict[str, Any]:\n",
+    "    \"\"\"Extract a JSON object robustly.  Looks in both `content` and\n",
+    "    `reasoning` fields (the hosted endpoint occasionally collapses\n",
+    "    JSON-mode output into the reasoning stream), strips code fences,\n",
+    "    and falls back to the outermost `{...}` substring.\n",
+    "    \"\"\"\n",
+    "    msg = resp.get(\"choices\", [{}])[0].get(\"message\", {})\n",
+    "    for raw in (\n",
+    "        msg.get(\"content\") or \"\",\n",
+    "        msg.get(\"reasoning\") or \"\",\n",
+    "        msg.get(\"reasoning_content\") or \"\",\n",
+    "    ):\n",
+    "        s = _THINK_RE.sub(\"\", (raw or \"\").strip()).strip()\n",
+    "        if not s:\n",
+    "            continue\n",
+    "        if s.startswith(\"```\"):\n",
+    "            s = re.sub(r\"^```(?:json)?\\s*\", \"\", s, flags=re.IGNORECASE).strip(\"` \\n\")\n",
+    "        try:\n",
+    "            obj = json.loads(s)\n",
+    "            if isinstance(obj, dict):\n",
+    "                return obj\n",
+    "        except json.JSONDecodeError:\n",
+    "            pass\n",
+    "        start, end = s.find(\"{\"), s.rfind(\"}\")\n",
+    "        if start != -1 and end > start:\n",
+    "            try:\n",
+    "                obj = json.loads(s[start : end + 1])\n",
+    "                if isinstance(obj, dict):\n",
+    "                    return obj\n",
+    "            except json.JSONDecodeError:\n",
+    "                pass\n",
+    "    return {}\n",
+    "\n",
+    "\n",
+    "# =============================================================================\n",
+    "# Group 3 — Pipeline stages\n",
+    "# =============================================================================\n",
+    "\n",
+    "CLASSIFY_PROMPT = (\n",
+    "    \"Analyze the provided image and classify its content. Your response \"\n",
+    "    \"MUST be a single, valid JSON object with the following keys:\\n\"\n",
+    "    '- \"image_type\": one of \"Extractive\" (charts, graphs, diagrams, '\n",
+    "    'tables, flowcharts, maps) or \"Descriptive\" (photographs, '\n",
+    "    \"illustrations, artistic pieces).\\n\"\n",
+    "    '- \"sub_type\": a specific label, e.g. \"Line Graph\", \"Bar Graph\", '\n",
+    "    '\"Infographic\", \"Flowchart\", \"Pyramid Diagram\", \"Smartphone '\n",
+    "    'Screenshot\", \"Photograph\".\\n'\n",
+    "    '- \"subject_matter\": one-sentence summary of the picture topic.\\n'\n",
+    "    '- \"contains_text\": boolean, true if the image has readable text.\\n'\n",
+    "    \"Provide ONLY the JSON object and nothing else.\"\n",
+    ")\n",
+    "\n",
+    "# Divide-and-conquer dispatch table — different picture kinds call for\n",
+    "# different transcription prompts.\n",
+    "ANALYSIS_PROMPTS: dict[tuple[str, str], str] = {\n",
+    "    (\"Extractive\", \"Default\"): (\n",
+    "        \"Analyze the provided image and extract all structured \"\n",
+    "        \"information.  If the information fits a tabular format, \"\n",
+    "        \"render it as a Markdown table.  Otherwise, produce a concise \"\n",
+    "        \"summary capturing every number, label, and relationship.\"\n",
+    "    ),\n",
+    "    (\"Extractive\", \"Line Graph\"): (\n",
+    "        \"You are a data analyst.  Transcribe the data from this line \"\n",
+    "        \"graph.  State the title, X- and Y-axis labels, and for each \"\n",
+    "        \"series extract the data points as [x, y] pairs.  Return one \"\n",
+    "        \"JSON object.\"\n",
+    "    ),\n",
+    "    (\"Extractive\", \"Bar Graph\"): (\n",
+    "        \"You are a data analyst.  Transcribe this bar chart.  State \"\n",
+    "        \"the title, axis labels, category names, and the value for \"\n",
+    "        \"each bar.  Return both a Markdown table AND a one-sentence \"\n",
+    "        \"headline finding.\"\n",
+    "    ),\n",
+    "    (\"Extractive\", \"Infographic\"): (\n",
+    "        \"Transcribe this infographic.  For each panel or section, \"\n",
+    "        \"list its name and every labelled value or percentage.  \"\n",
+    "        \"Preserve the grouping the designer used.  Return a \"\n",
+    "        \"structured Markdown outline.\"\n",
+    "    ),\n",
+    "    (\"Extractive\", \"Flowchart\"): (\n",
+    "        \"Transcribe this flowchart or diagram.  List every node and \"\n",
+    "        \"every labelled edge.  State the overall flow direction.\"\n",
+    "    ),\n",
+    "    (\"Extractive\", \"Pyramid Diagram\"): (\n",
+    "        \"Transcribe this pyramid diagram.  List every tier from top \"\n",
+    "        \"to bottom with its label and any supporting text.  Infer \"\n",
+    "        \"the ordering or progression the diagram communicates.\"\n",
+    "    ),\n",
+    "    (\"Descriptive\", \"Default\"): (\n",
+    "        \"Describe this image in detail: subject matter, composition, \"\n",
+    "        \"colours, and any text visible in the scene.  Do not \"\n",
+    "        \"speculate beyond what is visible.\"\n",
+    "    ),\n",
+    "    (\"Descriptive\", \"Smartphone Screenshot\"): (\n",
+    "        \"Transcribe this smartphone UI.  Read every label, button, \"\n",
+    "        \"menu entry, and message visible.  Infer the app or screen \"\n",
+    "        \"type (e.g. SMS, contacts, home screen, call UI) and list \"\n",
+    "        \"the UI elements in reading order.\"\n",
+    "    ),\n",
+    "}\n",
+    "\n",
+    "\n",
+    "def pick_analysis_prompt(classification: dict[str, Any]) -> str:\n",
+    "    it = classification.get(\"image_type\", \"Extractive\")\n",
+    "    st = classification.get(\"sub_type\", \"Default\")\n",
+    "    return (ANALYSIS_PROMPTS.get((it, st))\n",
+    "            or ANALYSIS_PROMPTS.get((it, \"Default\"))\n",
+    "            or ANALYSIS_PROMPTS[(\"Extractive\", \"Default\")])\n",
+    "\n",
+    "\n",
+    "def crop_picture(page: Image.Image, bbox: dict[str, float]) -> Image.Image:\n",
+    "    W, H = page.size\n",
+    "    return page.crop((int(bbox[\"xmin\"] * W), int(bbox[\"ymin\"] * H),\n",
+    "                      int(bbox[\"xmax\"] * W), int(bbox[\"ymax\"] * H)))\n",
+    "\n",
+    "\n",
+    "def describe_picture(crop: Image.Image) -> dict[str, Any]:\n",
+    "    \"\"\"Divide-and-conquer: classify first, then transcribe with a\n",
+    "    content-aware prompt.  Two API calls per picture.  If the first\n",
+    "    transcription leaks reasoning into `content`, we retry once --\n",
+    "    this small resilience step keeps the final context clean so the\n",
+    "    Reasoning Engine doesn't echo chain-of-thought back in §7.\n",
+    "    \"\"\"\n",
+    "    cls_resp = call_stepfun_vlm(\n",
+    "        CLASSIFY_PROMPT, images=[crop],\n",
+    "        direct=True, json_mode=True,\n",
+    "        temperature=0.0, max_tokens=1024,\n",
+    "    )\n",
+    "    classification = extract_json_object(cls_resp) or {\n",
+    "        \"image_type\": \"Extractive\", \"sub_type\": \"Default\",\n",
+    "    }\n",
+    "    prompt = pick_analysis_prompt(classification)\n",
+    "    desc_resp = call_stepfun_vlm(\n",
+    "        prompt, images=[crop],\n",
+    "        direct=True, json_mode=False,\n",
+    "        temperature=0.2, max_tokens=2048,\n",
+    "    )\n",
+    "    description = extract_text(desc_resp)\n",
+    "    if _is_leaky(description):\n",
+    "        desc_resp = call_stepfun_vlm(\n",
+    "            prompt, images=[crop],\n",
+    "            direct=True, json_mode=False,\n",
+    "            temperature=0.2, max_tokens=2048,\n",
+    "        )\n",
+    "        description = extract_text(desc_resp)\n",
+    "    return {\"classification\": classification, \"description\": description}\n",
+    "\n",
+    "\n",
+    "def assemble_page_context(\n",
+    "    parse_blocks: list[dict[str, Any]],\n",
+    "    picture_descriptions: list[str],\n",
+    "    *,\n",
+    "    page_n: int,\n",
+    ") -> str:\n",
+    "    \"\"\"Spatial weave: walk parse's blocks in reading order, emit each\n",
+    "    text block verbatim and substitute each `Picture` block with its\n",
+    "    StepFun VLM transcription inline at the same spatial position.\n",
+    "\n",
+    "    Pages are wrapped in strong visual delimiters so the Reasoning\n",
+    "    Engine never confuses which page a fact came from when it is asked\n",
+    "    to cite it.\n",
+    "    \"\"\"\n",
+    "    header = f\"===== PAGE {page_n} =====\"\n",
+    "    footer = f\"===== END PAGE {page_n} =====\"\n",
+    "    lines = [header]\n",
+    "    pic_idx = 0\n",
+    "    for b in parse_blocks:\n",
+    "        t = b.get(\"type\")\n",
+    "        if t == \"Picture\":\n",
+    "            if pic_idx < len(picture_descriptions):\n",
+    "                lines.append(f\"[Picture on page {page_n}] \"\n",
+    "                             + picture_descriptions[pic_idx])\n",
+    "                pic_idx += 1\n",
+    "        elif b.get(\"text\"):\n",
+    "            lines.append(b[\"text\"])\n",
+    "    lines.append(footer)\n",
+    "    return \"\\n\\n\".join(lines)\n",
+    "\n",
+    "\n",
+    "QA_PROMPT_TEMPLATE = (\n",
+    "    \"Based on the following document context, please answer the \"\n",
+    "    \"question that follows.\\n\\n\"\n",
+    "    \"- DOCUMENT CONTEXT\\n{context}\\n\\n\"\n",
+    "    \"- QUESTION\\n{question}\\n\\n\"\n",
+    "    \"Answer rules:\\n\"\n",
+    "    \"  1. Cite the page number(s) the answer comes from in the form \"\n",
+    "    \"'(p. <page-number>)' immediately after the answer.\\n\"\n",
+    "    \"  2. Quote the specific phrase or value from the document \"\n",
+    "    \"context that supports the answer.\\n\"\n",
+    "    \"  3. If the answer is NOT present in the context, output exactly \"\n",
+    "    \"the string `Not answerable` — do not guess.\"\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def ask_question(question: str, context: str) -> str:\n",
+    "    \"\"\"Run the Reasoning Engine.  We keep `direct=True` + the\n",
+    "    `/no_think` system message so the final answer is a clean\n",
+    "    citation, not a transcript of the model's inner monologue.  The\n",
+    "    Specialist's transcriptions already carried the heavy visual\n",
+    "    extraction at pipeline time, so the QA call only needs to search\n",
+    "    the assembled context for the cited phrase.\n",
+    "    \"\"\"\n",
+    "    prompt = QA_PROMPT_TEMPLATE.format(context=context, question=question)\n",
+    "    resp = call_stepfun_vlm(\n",
+    "        prompt, images=None,\n",
+    "        direct=True, json_mode=False,\n",
+    "        temperature=0.2, max_tokens=1024,\n",
+    "    )\n",
+    "    answer = extract_text(resp)\n",
+    "    if _is_leaky(answer):\n",
+    "        resp = call_stepfun_vlm(\n",
+    "            prompt, images=None,\n",
+    "            direct=True, json_mode=False,\n",
+    "            temperature=0.2, max_tokens=1024,\n",
+    "        )\n",
+    "        answer = extract_text(resp)\n",
+    "    return answer\n",
+    "\n",
+    "\n",
+    "print(\"Helpers ready.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3dca00c",
+   "metadata": {},
+   "source": [
+    "### 4.2. Executing the pipeline\n",
+    "\n",
+    "Now we drive the four pages through the pipeline.  The loop below:\n",
+    "\n",
+    "1. Renders each PDF page to pixels.\n",
+    "2. Makes one **Stage 1** call per page to the Architect\n",
+    "   (`nemotron-parse`) and **displays the page with Parse's coloured\n",
+    "   layout boxes overlaid**.  That overlay *is* our proof of spatial\n",
+    "   anchoring: you can see Parse split the social-media page into\n",
+    "   three separate `Picture` boxes, wrap the Pew chart in a single\n",
+    "   `Picture` box (the Specialist later labels it `Bar Graph`), and\n",
+    "   emit the GPL table as a `Table` block that needs no Specialist\n",
+    "   call at all.\n",
+    "3. For every `Picture` block the Architect returns, crops the\n",
+    "   picture and makes a pair of **Stage 2** calls to the Visual\n",
+    "   Specialist: one to classify the picture's sub-type, one to\n",
+    "   transcribe it with a sub-type-aware prompt.  (The\n",
+    "   crop-plus-transcription receipts are shown modality by modality\n",
+    "   in §5, to avoid displaying the same images twice.)\n",
+    "4. Aggregates everything into a single `file_results` JSON structure\n",
+    "   we can inspect and query later.\n",
+    "\n",
+    "Watch especially page 11 of the social-media report: the Architect\n",
+    "returns **three** `Picture` boxes — one per Facebook-post\n",
+    "screenshot — and each gets its own targeted transcription.  That is\n",
+    "the multi-picture isolation that makes the *\"Disneyland\"* question\n",
+    "in §7 answerable with a precise number."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e40f5aae",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:14.391295Z",
+     "iopub.status.busy": "2026-04-28T02:39:14.391181Z",
+     "iopub.status.idle": "2026-04-28T02:39:42.712961Z",
+     "shell.execute_reply": "2026-04-28T02:39:42.712441Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Page images and Parse blocks keyed by short id; result bundles keyed by PDF filename.\n",
+    "page_images: dict[str, Image.Image] = {}\n",
+    "page_blocks: dict[str, list[dict[str, Any]]] = {}\n",
+    "file_results: dict[str, dict[str, Any]] = {}\n",
+    "\n",
+    "# Annotated-preview width (kept small so the notebook file stays lean).\n",
+    "_ANNOT_W = 820\n",
+    "\n",
+    "for sid, pdf, page_n, modality in DEMO_DOCS:\n",
+    "    print(\"\\n\" + \"=\" * 74)\n",
+    "    print(f\"[{sid}] {pdf.name} -- page {page_n}  ({modality})\")\n",
+    "    print(\"=\" * 74)\n",
+    "\n",
+    "    page = pdf_page_to_image(pdf, page_n - 1, dpi=150)\n",
+    "    page_images[sid] = page\n",
+    "\n",
+    "    # Stage 1 -- Architect.\n",
+    "    t0 = time.time()\n",
+    "    blocks = call_nemotron_parse(page)\n",
+    "    type_counts: dict[str, int] = {}\n",
+    "    for b in blocks:\n",
+    "        type_counts[b.get(\"type\", \"?\")] = type_counts.get(b.get(\"type\", \"?\"), 0) + 1\n",
+    "    print(f\"[Architect]  {len(blocks)} blocks in {time.time()-t0:.1f}s  \"\n",
+    "          f\"types -> {type_counts}\")\n",
+    "    page_blocks[sid] = blocks\n",
+    "\n",
+    "    # Show the coloured layout overlay in-place -- this is the annotated\n",
+    "    # view §3 promised.  Drawing it before the Specialist runs keeps\n",
+    "    # sub_type labels out of the overlay.\n",
+    "    annotated = draw_annotations(page, blocks)\n",
+    "    display(Markdown(\n",
+    "        f\"**Parse overlay** &mdash; {DISPLAY_NAME[sid]} (p. {page_n}, \"\n",
+    "        f\"modality: {modality})\"))\n",
+    "    display(annotated.resize(\n",
+    "        (_ANNOT_W, int(_ANNOT_W * annotated.height / annotated.width))))\n",
+    "\n",
+    "    bundle = file_results.setdefault(\n",
+    "        pdf.name, {\"source_filename\": pdf.name, \"pages\": []})\n",
+    "    page_entry: dict[str, Any] = {\n",
+    "        \"page_number\": page_n, \"status\": \"Layout extraction successful\",\n",
+    "        \"content\": [],\n",
+    "    }\n",
+    "\n",
+    "    # Stage 2 -- Visual Specialist for every Picture block.\n",
+    "    n_pics = sum(1 for b in blocks if b.get(\"type\") == \"Picture\")\n",
+    "    if n_pics == 0:\n",
+    "        print(\"[Specialist] skipped -- no Picture blocks on this page \"\n",
+    "              \"(Parse handled this modality as structured text).\")\n",
+    "\n",
+    "    for i, b in enumerate(blocks):\n",
+    "        item: dict[str, Any] = {\n",
+    "            \"extraction_id\": i, \"type\": b.get(\"type\"),\n",
+    "            \"bbox\": b.get(\"bbox\"), \"text\": b.get(\"text\"),\n",
+    "        }\n",
+    "        if b.get(\"type\") == \"Picture\" and b.get(\"bbox\"):\n",
+    "            crop = crop_picture(page, b[\"bbox\"])\n",
+    "            t1 = time.time()\n",
+    "            result = describe_picture(crop)\n",
+    "            sub = result[\"classification\"].get(\"sub_type\", \"?\")\n",
+    "            print(f\"[Specialist] Picture #{i:<2d}  classified as '{sub}'  \"\n",
+    "                  f\"-> described in {time.time()-t1:.1f}s  \"\n",
+    "                  f\"({len(result['description'])} chars)\")\n",
+    "            b[\"sub_type\"] = sub\n",
+    "            item[\"classification\"] = result[\"classification\"]\n",
+    "            item[\"description\"] = result[\"description\"]\n",
+    "        page_entry[\"content\"].append(item)\n",
+    "\n",
+    "    bundle[\"pages\"].append(page_entry)\n",
+    "\n",
+    "_n_pics_total = sum(1 for b in file_results.values()\n",
+    "                    for p in b[\"pages\"] for it in p[\"content\"]\n",
+    "                    if it.get(\"description\"))\n",
+    "print(\"\\n\" + \"=\" * 74)\n",
+    "print(f\"Pipeline finished: {len(DEMO_DOCS)} pages across \"\n",
+    "      f\"{len(file_results)} documents, {_n_pics_total} pictures transcribed.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c42deff6",
+   "metadata": {},
+   "source": [
+    "## 5. Divide and conquer, modality by modality\n",
+    "\n",
+    "§4.2 showed the *coloured overlays* — that is, **where** Parse drew\n",
+    "its boxes.  This section shows **what the pair actually produced**\n",
+    "inside each box, one modality at a time.  For every demo page we\n",
+    "pull out:\n",
+    "\n",
+    "1. the **exact image crop** Parse handed to the Visual Specialist\n",
+    "   (or the text block if no vision call was needed), and\n",
+    "2. the **Specialist's transcription** of that crop — the text the\n",
+    "   Reasoning Engine will read in §7.\n",
+    "\n",
+    "That crop-plus-transcription pair is the *receipt* of the\n",
+    "divide-and-conquer design.  Four modalities, four receipts:\n",
+    "\n",
+    "* **Chart** – one `Picture:Bar Graph` box → a clean Markdown table.\n",
+    "* **Multi-picture page** – *three* `Picture` boxes → three separate\n",
+    "  Facebook-post transcriptions (this is the one to linger on — with\n",
+    "  Parse turned off, the single page-level call has to juggle three\n",
+    "  cards at once).\n",
+    "* **Infographic** – one `Picture:Infographic` box → a structured\n",
+    "  panel-by-panel breakdown of the LinkedIn demographic stats.\n",
+    "* **Structured table** – **no `Picture` call at all**: Parse emits\n",
+    "  the programme list directly as a LaTeX-tabular `Table` block the\n",
+    "  Reasoning Engine reads verbatim.\n",
+    "\n",
+    "Because the Specialist's transcription is the *same text* the\n",
+    "Reasoning Engine reads in §7, any answer it returns can be\n",
+    "traced back to one of these crops at a glance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6c1d6522",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:42.719713Z",
+     "iopub.status.busy": "2026-04-28T02:39:42.719587Z",
+     "iopub.status.idle": "2026-04-28T02:39:42.785275Z",
+     "shell.execute_reply": "2026-04-28T02:39:42.784861Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Per-modality evidence: the crop(s) Parse isolated + the Visual\n",
+    "# Specialist's transcription.  We do NOT re-display full annotated\n",
+    "# pages here -- those were shown in section 4.2.  The page image is\n",
+    "# used only to cut the crops; transcriptions are rendered as\n",
+    "# Markdown so tables come out formatted.\n",
+    "MAX_DESC_CHARS = 1400\n",
+    "\n",
+    "for sid, pdf, page_n, modality in DEMO_DOCS:\n",
+    "    display(Markdown(\n",
+    "        f\"### {modality.title()} &mdash; {DISPLAY_NAME[sid]} (p. {page_n})\"\n",
+    "    ))\n",
+    "\n",
+    "    page = page_images[sid]\n",
+    "    blocks = page_blocks[sid]\n",
+    "    page_entry = next(p for p in file_results[pdf.name][\"pages\"]\n",
+    "                      if p[\"page_number\"] == page_n)\n",
+    "\n",
+    "    pic_items = [it for it in page_entry[\"content\"]\n",
+    "                 if it.get(\"type\") == \"Picture\" and it.get(\"description\")]\n",
+    "    tab_items = [it for it in page_entry[\"content\"]\n",
+    "                 if it.get(\"type\") == \"Table\" and it.get(\"text\")]\n",
+    "\n",
+    "    # Structured-table case: Parse handled it directly; no vision\n",
+    "    # call was required.\n",
+    "    if not pic_items and tab_items:\n",
+    "        display(Markdown(\n",
+    "            \"Parse surfaced this region as a `Table` block with \"\n",
+    "            \"LaTeX-tabular body &mdash; **no Visual Specialist call \"\n",
+    "            \"was needed**.  The Reasoning Engine reads the cells \"\n",
+    "            \"below as plain text:\"))\n",
+    "        print(tab_items[0][\"text\"])\n",
+    "        continue\n",
+    "\n",
+    "    block_by_id = {i: b for i, b in enumerate(blocks)}\n",
+    "    crop_w = 360 if len(pic_items) > 1 else 620\n",
+    "\n",
+    "    for item in pic_items:\n",
+    "        block = block_by_id[item[\"extraction_id\"]]\n",
+    "        cls = item[\"classification\"] or {}\n",
+    "        sub = cls.get(\"sub_type\", \"?\")\n",
+    "        subject = cls.get(\"subject_matter\", \"\")\n",
+    "        header = (f\"**Crop #{item['extraction_id']}** &mdash; \"\n",
+    "                  f\"classified as `Picture:{sub}`\")\n",
+    "        if subject:\n",
+    "            header += f\"  \\n*{subject}*\"\n",
+    "        display(Markdown(header))\n",
+    "\n",
+    "        crop = crop_picture(page, block[\"bbox\"])\n",
+    "        display(crop.resize((crop_w,\n",
+    "                             int(crop_w * crop.height / crop.width))))\n",
+    "\n",
+    "        desc = item[\"description\"]\n",
+    "        if len(desc) > MAX_DESC_CHARS:\n",
+    "            desc = desc[:MAX_DESC_CHARS].rstrip() + \"\\n\\n*... (truncated)*\"\n",
+    "        display(Markdown(desc))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e4e7b0f",
+   "metadata": {},
+   "source": [
+    "## 6. Examining the final JSON output\n",
+    "\n",
+    "The pipeline fuses both models' results into a single, structured\n",
+    "JSON object *per document*.  Every page carries a list of `content`\n",
+    "items; `Picture` items contain the Visual Specialist's\n",
+    "`classification` (image type, sub-type, subject-matter summary)\n",
+    "alongside its textual `description`, while `Table` items carry the\n",
+    "LaTeX-tabular text Parse emitted directly.\n",
+    "\n",
+    "This JSON is the *only* artefact the Reasoning Engine needs — it\n",
+    "can be serialised, cached, re-queried, or piped into downstream\n",
+    "retrieval systems.  Below we preview the Pew Research document's\n",
+    "output and save one `<stem>.parse_stepfun.json` per PDF to disk.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "758f7951",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:42.787281Z",
+     "iopub.status.busy": "2026-04-28T02:39:42.787186Z",
+     "iopub.status.idle": "2026-04-28T02:39:42.793116Z",
+     "shell.execute_reply": "2026-04-28T02:39:42.792759Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Preview the shortest bundle (the Pew one-pager) so the reader\n",
+    "# sees the full shape on screen without scrolling.\n",
+    "_preview_pdf_name = DEMO_DOCS[0][1].name  # Pew Research release\n",
+    "_preview = file_results[_preview_pdf_name]\n",
+    "print(f\"=== preview: {_preview_pdf_name} ===\")\n",
+    "_preview_lines = json.dumps(_preview, indent=2, default=str).splitlines()\n",
+    "print(\"\\n\".join(_preview_lines[:60]))\n",
+    "if len(_preview_lines) > 60:\n",
+    "    print(\"... (truncated)\\n\")\n",
+    "\n",
+    "# Save one JSON per PDF.  Downstream retrieval systems can index these\n",
+    "# directly, one embedding per page.\n",
+    "OUTPUT_DIR = Path(\"output_results\")\n",
+    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "for pdf_name, bundle in file_results.items():\n",
+    "    out = OUTPUT_DIR / (Path(pdf_name).stem + \".parse_stepfun.json\")\n",
+    "    out.write_text(json.dumps(bundle, indent=2, default=str))\n",
+    "    print(f\"  saved  {out.name}  ({len(bundle['pages'])} page(s))\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf4f5cde",
+   "metadata": {},
+   "source": [
+    "## 7. Querying the document with the Reasoning Engine\n",
+    "\n",
+    "With every page distilled into a picture-transcribed,\n",
+    "spatially-ordered context, we can ask real questions.  The *same*\n",
+    "StepFun VLM we used as Visual Specialist now puts on its\n",
+    "**Reasoning Engine** hat and answers — citing the page and quoting\n",
+    "the supporting phrase for every answer.\n",
+    "\n",
+    "We pose one question per modality; each answer is visible on the\n",
+    "annotated page tile in §3, so you can verify it by eye:\n",
+    "\n",
+    "1. **Chart (Pew Research, p. 5).**  Which policy area draws the\n",
+    "   largest share of *very* confident respondents?  A stacked-bar\n",
+    "   chart question that only works because the Specialist turned the\n",
+    "   chart into a Markdown table first.\n",
+    "2. **Multi-picture (Social-Media report, p. 11).**  *\"How many\n",
+    "   likes does the Disneyland post have?\"*  Three Facebook-post\n",
+    "   screenshots sit side by side on the page.  Without Parse, a\n",
+    "   single page-level VLM call would have to find the right card\n",
+    "   and read it in the same pass.  Parse's three separate `Picture`\n",
+    "   boxes let each card be transcribed on its own, so the answer\n",
+    "   comes from exactly one crop.\n",
+    "3. **Infographic (Social-Media report, p. 20).**  *\"What percentage\n",
+    "   of LinkedIn users have household income above $75K?\"*  The\n",
+    "   number exists only as pixels inside one panel of a demographic\n",
+    "   infographic.\n",
+    "4. **Structured table (Graduate Studies brochure, p. 11).**  *\"Which\n",
+    "   leadership programme has the longest Full-Time duration?\"*  Pure\n",
+    "   Architect work — Parse preserved the programme/duration table as\n",
+    "   LaTeX-tabular text so the Reasoning Engine can scan it without a\n",
+    "   vision call.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c985d9da",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-04-28T02:39:42.794337Z",
+     "iopub.status.busy": "2026-04-28T02:39:42.794261Z",
+     "iopub.status.idle": "2026-04-28T02:39:44.565304Z",
+     "shell.execute_reply": "2026-04-28T02:39:44.564315Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Per-document context: each PDF we processed gets its own assembled\n",
+    "# context string, stitched together from the pages we ran through\n",
+    "# the pipeline, with strong page delimiters so citations land on the\n",
+    "# right page number.\n",
+    "doc_chunks: dict[str, list[str]] = {}\n",
+    "for sid, pdf, page_n, _mod in DEMO_DOCS:\n",
+    "    entry = next(p for p in file_results[pdf.name][\"pages\"]\n",
+    "                 if p[\"page_number\"] == page_n)\n",
+    "    descs = [it[\"description\"] for it in entry[\"content\"]\n",
+    "             if it.get(\"type\") == \"Picture\" and it.get(\"description\")]\n",
+    "    chunk = assemble_page_context(page_blocks[sid], descs, page_n=page_n)\n",
+    "    doc_chunks.setdefault(pdf.name, []).append(chunk)\n",
+    "\n",
+    "# Merge each document's page chunks into one context string.\n",
+    "doc_contexts: dict[str, str] = {}\n",
+    "for pdf_name, chunks in doc_chunks.items():\n",
+    "    merged = \"\\n\\n\".join(chunks)\n",
+    "    doc_contexts[pdf_name] = merged\n",
+    "    print(f\"context: {pdf_name}  -> {len(merged):,} chars, \"\n",
+    "          f\"{len(chunks)} page(s)\")\n",
+    "print()\n",
+    "\n",
+    "# One question per demo page, routed to the doc whose context can\n",
+    "# answer it.  (short_id, question)\n",
+    "qa_plan: list[tuple[str, str]] = [\n",
+    "    (\"pew\",\n",
+    "     \"Which policy area has the largest share of respondents who are \"\n",
+    "     \"'very' confident in Donald Trump? Give the policy phrase and the \"\n",
+    "     \"percentage.\"),\n",
+    "    (\"social\",\n",
+    "     \"How many people like the Disneyland post?\"),\n",
+    "    (\"linkedin\",\n",
+    "     \"What percentage of LinkedIn users have a household income above \"\n",
+    "     \"$75K?\"),\n",
+    "    (\"gpl\",\n",
+    "     \"Among the leadership programmes listed, which programme has the \"\n",
+    "     \"longest Full-Time duration, and what is that duration?\"),\n",
+    "]\n",
+    "\n",
+    "# Each question fires a single Reasoning Engine call scoped to the\n",
+    "# document it belongs to, so citations point at the right page.\n",
+    "sid_to_pdf = {sid: pdf.name for sid, pdf, _, _ in DEMO_DOCS}\n",
+    "for sid, q in qa_plan:\n",
+    "    pdf_name = sid_to_pdf[sid]\n",
+    "    context = doc_contexts[pdf_name]\n",
+    "    print(\"=\" * 74)\n",
+    "    print(f\"DOC:      {DISPLAY_NAME[sid]}  ({pdf_name})\")\n",
+    "    print(f\"QUESTION: {q}\")\n",
+    "    t0 = time.perf_counter()\n",
+    "    answer = ask_question(q, context)\n",
+    "    print(f\"[{time.perf_counter()-t0:.1f}s]\")\n",
+    "    display(Markdown(f\"**ANSWER:**\\n\\n{answer}\"))\n",
+    "    print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "817c1d36",
+   "metadata": {},
+   "source": [
+    "## 8. Next steps\n",
+    "\n",
+    "This pipeline fits on one page and covers the modalities a real\n",
+    "PDF commonly contains (text, tables, charts, infographics, UI\n",
+    "screenshots), with a single StepFun VLM surface handling both\n",
+    "picture transcription and question answering.  For the small demo\n",
+    "set, the assembled context can be queried in one hosted VLM call;\n",
+    "larger document collections can extend it with retrieval.\n",
+    "\n",
+    "A few directions to take it from here:\n",
+    "\n",
+    "- **Try it on your documents.**  Extend `DEMO_DOCS` in §3 with any\n",
+    "  `(short_id, pdf, page, modality)` tuples you like and re-run.\n",
+    "  Pages that are rasterised images (scanned reports, marketing\n",
+    "  decks, social-media screenshots, dashboards) are the pipeline's\n",
+    "  sweet spot — they are also exactly the pages most text-first\n",
+    "  parsers drop on the floor.\n",
+    "- **Batch the picture stage when quota is tight.**  For\n",
+    "  picture-dense documents (dashboards, slide decks, screenshot\n",
+    "  catalogues), you can send every picture on a page in a single\n",
+    "  multi-image StepFun VLM call and ask for one JSON entry per\n",
+    "  picture.  This trades a little peak quality for roughly 4×\n",
+    "  fewer picture-stage API calls on picture-dense pages; use it\n",
+    "  when API-quota economics matter more than peak accuracy.\n",
+    "- **Scale past ~25 pages with retrieval.**  The\n",
+    "  per-document JSON written in §6 is a ready-made input for a\n",
+    "  vector store: embed each page, retrieve the top-k pages at\n",
+    "  question time, and feed those pages' `assemble_page_context`\n",
+    "  strings straight into the Reasoning Engine.  Parse's reading\n",
+    "  order and bbox anchors survive retrieval, so the model can\n",
+    "  still cite the right page.\n",
+    "- **Customise the picture-stage prompts.**  The `ANALYSIS_PROMPTS`\n",
+    "  dispatch table in §4.1 is the single place where you teach the\n",
+    "  pipeline what a *good* transcription looks like for your domain\n",
+    "  — product photos, engineering drawings, medical charts, CAD\n",
+    "  screenshots.  Adding a new entry is three lines of code.\n",
+    "\n",
+    "Two hosted model surfaces.  Three roles.  StepFun handles both\n",
+    "\"what does this picture mean\" and \"what is the user actually\n",
+    "asking\", which is the all-modality foundation this cookbook\n",
+    "demonstrates.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}