diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.env.example b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.env.example new file mode 100644 index 000000000..b999eb2fb --- /dev/null +++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.env.example @@ -0,0 +1,9 @@ +NVIDIA_API_KEY=nvapi-your-key-here +NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions +STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash + +# Optional overrides if Parse and StepFun require separate credentials. +# PARSE_API_KEY=nvapi-your-parse-key-here +# STEPFUN_API_KEY=nvapi-your-stepfun-key-here +# PARSE_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions +# STEPFUN_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.gitignore b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.gitignore new file mode 100644 index 000000000..a1373c876 --- /dev/null +++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/.gitignore @@ -0,0 +1,3 @@ +output_results/ +.ipynb_checkpoints/ + diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/README.md b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/README.md new file mode 100644 index 000000000..4264a20b7 --- /dev/null +++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/README.md @@ -0,0 +1,93 @@ +# Document Intelligence with Nemotron Parse and StepFun + +Build a document intelligence workflow that combines **Nemotron Parse** +for page layout extraction with **StepFun Step-3.7 Flash** for cropped +image transcription and final document question answering. + +The notebook runs against hosted NVIDIA endpoints. No local GPU, Docker +container, or model weights are required. + +## What It Does + +The workflow processes four pages from three public PDFs: + +1. 
Nemotron Parse extracts typed layout blocks and picture bounding boxes. +2. StepFun classifies and transcribes each cropped picture. +3. The notebook stitches text and picture transcriptions back into a + reading-order Markdown context. +4. StepFun answers document-level questions with cited page evidence. + +## Models And Endpoints + +| Role | Model | Endpoint | +| --- | --- | --- | +| Layout extraction | `nvidia/nemotron-parse` | NVIDIA API Catalog chat completions | +| Picture transcription | `stepfun-ai/step-3.7-flash` | NVIDIA API Catalog chat completions | +| Document QA | `stepfun-ai/step-3.7-flash` | NVIDIA API Catalog chat completions | + +The notebook defaults both models to NVIDIA's standard +`https://integrate.api.nvidia.com/v1/chat/completions` endpoint. If +needed, Parse and StepFun can still be pointed at separate endpoints with +the optional `PARSE_CHAT_COMPLETIONS_URL` and +`STEPFUN_CHAT_COMPLETIONS_URL` variables. + +## Setup + +Install dependencies with `uv`: + +```bash +curl -LsSf https://astral.sh/uv/install.sh | sh +uv sync +``` + +Create your local `.env`: + +```bash +cp .env.example .env +``` + +Edit `.env` and add your key: + +```bash +NVIDIA_API_KEY=nvapi-your-key-here +NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions +STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash +``` + +If Parse and StepFun require different credentials for your account, set +these optional values: + +```bash +PARSE_API_KEY=nvapi-your-parse-key-here +STEPFUN_API_KEY=nvapi-your-stepfun-key-here +PARSE_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions +STEPFUN_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions +``` + +## Run + +```bash +uv run jupyter lab stepfun_doc_intelligence_with_parse.ipynb +``` + +Run the notebook cells from top to bottom. The `data/documents/` folder +already contains the demo PDFs, so the notebook can start immediately. + +## Project Structure + +```text +. 
+├── README.md +├── .env.example +├── pyproject.toml +├── stepfun_doc_intelligence_with_parse.ipynb +└── data/ + └── documents/ + ├── 05-03-18-political-release.pdf + ├── GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf + └── measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf +``` + +Running the notebook writes generated `*.parse_stepfun.json` files under +`output_results/`; those artifacts are local run output and are not +required in source control. diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/05-03-18-political-release.pdf b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/05-03-18-political-release.pdf new file mode 100644 index 000000000..68ad907b1 Binary files /dev/null and b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/05-03-18-political-release.pdf differ diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf new file mode 100644 index 000000000..3cf2eedc3 Binary files /dev/null and b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf differ diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf new file mode 100644 index 000000000..3922329d6 Binary files /dev/null and b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/data/documents/measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf differ diff --git 
a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/pyproject.toml b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/pyproject.toml new file mode 100644 index 000000000..40234976c --- /dev/null +++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/pyproject.toml @@ -0,0 +1,19 @@ +[project] +name = "nemotron-parse-stepfun-document-intelligence" +version = "0.1.0" +description = "Document intelligence workflow using Nemotron Parse and StepFun through NVIDIA hosted endpoints" +readme = "README.md" +requires-python = ">=3.10" +dependencies = [ + "ipykernel>=6.0.0", + "jupyter>=1.0.0", + "pandas>=2.0.0", + "pillow>=10.0.0", + "pymupdf>=1.24.0", + "python-dotenv>=1.0.0", + "requests>=2.31.0", +] + +[tool.uv] +package = false + diff --git a/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/stepfun_doc_intelligence_with_parse.ipynb b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/stepfun_doc_intelligence_with_parse.ipynb new file mode 100644 index 000000000..296c7583b --- /dev/null +++ b/oss_tutorials/Nemotron_Parse_StepFun_Document_Intelligence/stepfun_doc_intelligence_with_parse.ipynb @@ -0,0 +1,1407 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0cb3e5c3", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Document Intelligence with Nemotron Parse + StepFun Invocation Endpoint\n", + "\n", + "> **You do not need a GPU to run this notebook.** Every model call goes\n", + "> to NVIDIA's hosted chat-completions endpoint at\n", + "> `https://integrate.api.nvidia.com/v1/chat/completions`.\n", + "> All you need is a `NVIDIA_API_KEY` with access to the two model IDs\n", + "> configured in Setup below.\n", + "\n", + "This notebook builds a streamlined, **all-modality** document\n", + "analysis pipeline by pairing **Nemotron Parse**\n", + "(`nvidia/nemotron-parse`) with **StepFun Flash**\n", + "(`stepfun-ai/step-3.7-flash`). 
Parse provides the spatial\n", + "anchoring that turns each PDF page into typed blocks and picture\n", + "bounding boxes. StepFun is the vision-language model that both\n", + "**reads each cropped picture** and **answers questions about the\n", + "whole assembled document**.\n", + "\n", + "To make the case concretely, we run the pipeline against four pages\n", + "picked from **three different public PDFs** — a Pew Research report,\n", + "a social-media analytics deck, and a graduate-studies brochure —\n", + "each chosen to stress a **different content modality** that a single\n", + "page-level VLM call cannot handle alone:\n", + "\n", + "| Modality | Source | Why it needs Parse |\n", + "| --- | --- | --- |\n", + "| **Chart** | Pew Research, p. 5 | a stacked-bar chart with eight policy rows — Parse isolates the bar-graph region so StepFun's transcription is a clean markdown table, not a screenshot caption |\n", + "| **Multi-picture page** | social-media report, p. 11 | three Facebook-post screenshots side by side — without Parse's bbox split, the QA call cannot tell which post is *the Disneyland post* |\n", + "| **Infographic** | social-media report, p. 20 | a dense pixel-only demographic panel — Parse draws the panel boundary, StepFun lists every number inside |\n", + "| **Structured table** | Graduate Studies brochure, p. 11 | a two-column programme table — Parse surfaces it as LaTeX-tabular text, so no vision call is needed to read it |\n", + "\n", + "By the end of the tutorial you will see how the pair turns these\n", + "unstructured PDFs into **page-cited, phrase-quoted answers** that\n", + "any reader can verify against the page image.\n" + ] + }, + { + "cell_type": "markdown", + "id": "5e89f7b3", + "metadata": {}, + "source": [ + "## 1. 
Introduction: three roles, two model surfaces\n", + "\n", + "Our pipeline uses each hosted model for its specialty around **one\n", + "unified spatial context**:\n", + "\n", + "* **`nvidia/nemotron-parse` — the Architect.** A deterministic layout\n", + " parser that returns every block's **type** (`Title`, `Text`,\n", + " `Table`, `List-item`, `Picture`, ...), **bounding box**, and\n", + " **reading order** in one call.\n", + "\n", + "* **`stepfun-ai/step-3.7-flash` — the Visual Specialist.** Every\n", + " `Picture` Parse identifies becomes one StepFun call that first\n", + " *classifies* the image (`Infographic`, `Bar Graph`, `Line Graph`,\n", + " `Smartphone Screenshot`, ...) and then *transcribes* it with a\n", + " prompt tailored to that sub-type.\n", + "\n", + "* **`stepfun-ai/step-3.7-flash` — the Reasoning Engine.** The\n", + " same VLM reads the assembled document context: text blocks in\n", + " reading order with picture transcriptions inlined at their spatial\n", + " positions. It answers the question, cites the page it came from,\n", + " and quotes the supporting phrase verbatim.\n", + "\n", + "Pipeline shape:\n", + "\n", + "```text\n", + "PDF page -> Nemotron Parse -> typed text/table blocks\n", + " -> picture boxes -> StepFun Visual Specialist\n", + "text + picture transcriptions -> reading-order page context\n", + "page contexts + question -> StepFun Reasoning Engine -> cited answer\n", + "```\n", + "\n", + "### Why pair Parse with StepFun instead of calling the VLM on the whole page?\n", + "\n", + "StepFun is a capable multimodal model, but real documents create four\n", + "structural problems that are easier to solve with Parse in front:\n", + "\n", + "1. **Tables and headers need structure, not pixels.** On the\n", + " Graduate-Studies brochure page, Parse emits a `Table` block with a\n", + " LaTeX-tabular body, so the Reasoning Engine can answer directly\n", + " from text.\n", + "2. 
**A chart region needs isolation before transcription.** On the\n", + " Pew Research page, Parse draws one `Picture` bbox around the chart;\n", + " StepFun receives only the crop and returns a clean structured\n", + " transcription.\n", + "3. **Multi-picture pages bleed together.** Page 11 of the social-media\n", + " report has three screenshots side by side. Parse cuts them into\n", + " separate boxes so StepFun reads one card at a time.\n", + "4. **Citations need anchors.** Parse gives every content item a `bbox`\n", + " and reading-order index, so answers can cite `(p. 20)` and quote the\n", + " phrase that supports the answer.\n", + "\n", + "Two design levers do the heavy lifting:\n", + "\n", + "1. **Divide and conquer on every `Picture`.** One class in, many kinds\n", + " of picture transcriptions out.\n", + "2. **Spatial-context weave.** Parse's reading-order bboxes let us\n", + " interleave each picture's transcription at the exact position it\n", + " occupies on the page.\n" + ] + }, + { + "cell_type": "markdown", + "id": "0080fb2b", + "metadata": {}, + "source": [ + "## 2. Setup and prerequisites\n", + "\n", + "Five Python packages are all we need: `pymupdf` for PDF rendering,\n", + "`pillow` for image handling, `requests` for the API calls, `pandas`\n", + "for tabular display, and `python-dotenv` to load the NVIDIA key from\n", + "a `.env` file.\n", + "\n", + "We install with [**uv**](https://docs.astral.sh/uv/) -- the fast\n", + "package manager from Astral -- and fall back to `pip` automatically\n", + "if `uv` is not on your `PATH`. 
Recommended workflow before launching\n", + "Jupyter:\n", + "\n", + "```bash\n", + "# one-time install of uv (https://docs.astral.sh/uv/getting-started/installation/)\n", + "curl -LsSf https://astral.sh/uv/install.sh | sh\n", + "\n", + "# create + activate an isolated environment for this notebook\n", + "uv venv .venv && source .venv/bin/activate\n", + "uv pip install jupyter\n", + "jupyter lab stepfun_doc_intelligence_with_parse.ipynb\n", + "```\n", + "\n", + "The next cell installs the runtime deps into whichever environment\n", + "the notebook kernel is already pointing at, so it works whether you\n", + "ran the steps above or are using a colleague-provided kernel.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74b70189", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:13.485717Z", + "iopub.status.busy": "2026-04-28T02:39:13.485583Z", + "iopub.status.idle": "2026-04-28T02:39:13.583801Z", + "shell.execute_reply": "2026-04-28T02:39:13.583228Z" + } + }, + "outputs": [], + "source": [ + "import shutil, subprocess, sys\n", + "\n", + "PKGS = [\"requests\", \"pillow\", \"pymupdf\", \"pandas\", \"python-dotenv\"]\n", + "\n", + "if shutil.which(\"uv\"):\n", + " print(\"[setup] installing via uv ->\", sys.executable)\n", + " subprocess.check_call([\n", + " \"uv\", \"pip\", \"install\", \"--quiet\",\n", + " \"--python\", sys.executable,\n", + " *PKGS,\n", + " ])\n", + "else:\n", + " print(\"[setup] uv not on PATH; falling back to pip. \"\n", + " \"Install uv from https://docs.astral.sh/uv/ for ~10x faster syncs.\")\n", + " subprocess.check_call([\n", + " sys.executable, \"-m\", \"pip\", \"install\", \"--quiet\", *PKGS,\n", + " ])\n", + "\n", + "print(\"[setup] OK -- runtime deps ready.\")" + ] + }, + { + "cell_type": "markdown", + "id": "4726c5b8", + "metadata": {}, + "source": [ + "### Configure endpoints and keys\n", + "\n", + "This notebook supports a compact `.env` shape. 
`NVAI_CHAT_COMPLETIONS_URL` is the default NVIDIA API Catalog chat-completions endpoint for both models; set `PARSE_CHAT_COMPLETIONS_URL` or `STEPFUN_CHAT_COMPLETIONS_URL` only if you need separate endpoints.\n", + "\n", + "Make the key and endpoint visible to this notebook in either of two\n", + "ways:\n", + "\n", + "- **`.env` file** in this notebook's directory:\n", + " ```bash\n", + " NVIDIA_API_KEY=nvapi-...\n", + " NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions\n", + " STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash\n", + " ```\n", + "- **Shell export** before launching Jupyter:\n", + " ```bash\n", + " export NVIDIA_API_KEY=nvapi-...\n", + " export NVAI_CHAT_COMPLETIONS_URL=https://integrate.api.nvidia.com/v1/chat/completions\n", + " export STEPFUN_VLM_MODEL=stepfun-ai/step-3.7-flash\n", + " ```\n", + "\n", + "The notebook does not store credentials in source control; `.env` and\n", + "`.env.local` are ignored by this directory's `.gitignore`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1157ce65", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:13.585295Z", + "iopub.status.busy": "2026-04-28T02:39:13.585176Z", + "iopub.status.idle": "2026-04-28T02:39:14.339790Z", + "shell.execute_reply": "2026-04-28T02:39:14.339256Z" + } + }, + "outputs": [], + "source": [ + "from __future__ import annotations\n", + "\n", + "import base64\n", + "import io\n", + "import json\n", + "import os\n", + "import re\n", + "import textwrap\n", + "import time\n", + "from pathlib import Path\n", + "from typing import Any\n", + "\n", + "import fitz # PyMuPDF\n", + "import pandas as pd\n", + "import requests\n", + "from dotenv import load_dotenv\n", + "from IPython.display import Markdown, display\n", + "from PIL import Image, ImageDraw, ImageFont\n", + "\n", + "NOTEBOOK_CWD = Path.cwd()\n", + "_DEMO_RELATIVE_PATH = Path(\"usage-cookbook\") / \"Nemotron-3-Nano-Omni\" / \"doc-intelligence-with-parse\"\n", 
+ "\n", + "\n", + "def _resolve_demo_root() -> Path:\n", + " \"\"\"Find this demo directory even when Jupyter starts from the repo root.\"\"\"\n", + " candidates = [\n", + " NOTEBOOK_CWD,\n", + " NOTEBOOK_CWD.parent if NOTEBOOK_CWD.name == \"notebooks\" else NOTEBOOK_CWD,\n", + " NOTEBOOK_CWD / _DEMO_RELATIVE_PATH,\n", + " NOTEBOOK_CWD / \"Nemotron\" / _DEMO_RELATIVE_PATH,\n", + " ]\n", + " for parent in NOTEBOOK_CWD.parents:\n", + " candidates.extend([\n", + " parent,\n", + " parent / _DEMO_RELATIVE_PATH,\n", + " parent / \"Nemotron\" / _DEMO_RELATIVE_PATH,\n", + " ])\n", + "\n", + " seen: set[Path] = set()\n", + " for candidate in candidates:\n", + " candidate = candidate.resolve()\n", + " if candidate in seen:\n", + " continue\n", + " seen.add(candidate)\n", + " if any((candidate / name).exists() for name in [\"stepfun_doc_intelligence_with_parse.ipynb\"]):\n", + " return candidate\n", + " return NOTEBOOK_CWD\n", + "\n", + "\n", + "REPO_ROOT = _resolve_demo_root()\n", + "\n", + "_loaded_env_files: list[Path] = []\n", + "for _env_file in [REPO_ROOT.parent / \".env\", REPO_ROOT / \".env\"]:\n", + " if _env_file.exists():\n", + " # Load broader defaults first; let the demo-local .env win.\n", + " load_dotenv(_env_file, override=(_env_file.parent == REPO_ROOT))\n", + " _loaded_env_files.append(_env_file)\n", + "\n", + "# Backward-compatible .env support:\n", + "# NVIDIA_API_KEY + NVAI_CHAT_COMPLETIONS_URL are the default credential\n", + "# and chat-completions URL for both models. 
Override PARSE_* or STEPFUN_*\n", + "# only if the two calls need separate credentials or endpoints.\n", + "COMMON_API_KEY = os.environ.get(\"NVIDIA_API_KEY\", \"YOUR_API_KEY_HERE\")\n", + "PARSE_API_KEY = os.environ.get(\"PARSE_API_KEY\") or COMMON_API_KEY\n", + "STEPFUN_API_KEY = os.environ.get(\"STEPFUN_API_KEY\") or COMMON_API_KEY\n", + "\n", + "PARSE_CHAT_COMPLETIONS_URL = os.environ.get(\n", + " \"PARSE_CHAT_COMPLETIONS_URL\",\n", + " os.environ.get(\"NVAI_CHAT_COMPLETIONS_URL\", \"https://integrate.api.nvidia.com/v1/chat/completions\"),\n", + ")\n", + "STEPFUN_CHAT_COMPLETIONS_URL = os.environ.get(\n", + " \"STEPFUN_CHAT_COMPLETIONS_URL\",\n", + " os.environ.get(\n", + " \"NVAI_CHAT_COMPLETIONS_URL\",\n", + " \"https://integrate.api.nvidia.com/v1/chat/completions\",\n", + " ),\n", + ")\n", + "\n", + "PARSE_MODEL = os.environ.get(\"PARSE_MODEL\", \"nvidia/nemotron-parse\")\n", + "STEPFUN_VLM_MODEL = os.environ.get(\"STEPFUN_VLM_MODEL\", \"stepfun-ai/step-3.7-flash\")\n", + "NVAI_REQUEST_TIMEOUT = int(os.environ.get(\"NVAI_REQUEST_TIMEOUT\", \"480\"))\n", + "NVAI_MAX_RETRIES = int(os.environ.get(\"NVAI_MAX_RETRIES\", \"2\"))\n", + "\n", + "if not PARSE_API_KEY or PARSE_API_KEY == \"YOUR_API_KEY_HERE\":\n", + " raise RuntimeError(\n", + " \"Parse API key is not set. Add NVIDIA_API_KEY=nvapi-... or \"\n", + " \"PARSE_API_KEY=nvapi-... to a .env file in this notebook's directory.\"\n", + " )\n", + "if not STEPFUN_API_KEY or STEPFUN_API_KEY == \"YOUR_API_KEY_HERE\":\n", + " raise RuntimeError(\n", + " \"StepFun API key is not set. Add NVIDIA_API_KEY=nvapi-... or \"\n", + " \"STEPFUN_API_KEY=nvapi-... 
to a .env file in this notebook's directory.\"\n", + " )\n", + "\n", + "print(f\"Demo root: {REPO_ROOT}\")\n", + "print(\"Env files: \" + (\", \".join(str(p) for p in _loaded_env_files) or \"none found\"))\n", + "print(f\"Parse endpoint: {PARSE_CHAT_COMPLETIONS_URL}\")\n", + "print(f\"StepFun endpoint:{STEPFUN_CHAT_COMPLETIONS_URL}\")\n", + "print(f\"Architect: {PARSE_MODEL}\")\n", + "print(f\"Specialist + QA: {STEPFUN_VLM_MODEL}\")\n", + "print(f\"Request timeout: {NVAI_REQUEST_TIMEOUT}s, retries: {NVAI_MAX_RETRIES}\")\n", + "\n", + "CLASS_COLORS = {\n", + " \"Title\": \"#D32F2F\", \"Section-header\": \"#E91E63\", \"Text\": \"#4CAF50\",\n", + " \"List-item\": \"#1976D2\", \"Caption\": \"#607D8B\", \"Table\": \"#03A9F4\",\n", + " \"Picture\": \"#6D4C41\", \"Figure\": \"#6D4C41\", \"Formula\": \"#FF9800\",\n", + " \"Page-header\": \"#9E9E9E\", \"Page-footer\": \"#9E9E9E\", \"Footnote\": \"#00BCD4\",\n", + " \"Bibliography\": \"#512DA8\", \"TOC\": \"#FFC107\", \"DEFAULT\": \"#9E9E9E\",\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "id": "1756d4d5", + "metadata": {}, + "source": [ + "## 3. The example document set\n", + "\n", + "We pick **four pages from three different public PDFs** so the\n", + "pipeline stresses a different content modality on each page, and so\n", + "the reader can visually verify every answer against the original\n", + "page image. 
The exact pages are:\n", + "\n", + "| `short_id` | Modality | PDF | Page |\n", + "| --- | --- | --- | --- |\n", + "| `pew` | Chart | `05-03-18-political-release.pdf` (Pew Research) | 5 |\n", + "| `social` | Multi-picture | `measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf` (Social-Media Analytics Report) | 11 |\n", + "| `linkedin` | Infographic | same Social-Media report as above | 20 |\n", + "| `gpl` | Structured table | `GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf` (Graduate Studies brochure) | 11 |\n", + "\n", + "Every page of these PDFs is a **rasterised image** — selecting text\n", + "with your mouse in a PDF viewer returns nothing, so a text-first\n", + "parser gives you nothing to work with. This is precisely the class\n", + "of document where a VLM-driven pipeline earns its keep.\n", + "\n", + "Parse's layout overlay for each of these pages is generated inline\n", + "by the pipeline in §4.2 — so the annotations you see in this\n", + "notebook are produced by the same `nvidia/nemotron-parse` call the\n", + "rest of the pipeline consumes, never a pre-baked asset. Point\n", + "`DEMO_DOCS` at any `(pdf, page)` pairs on your disk to try your own\n", + "— the rest of the notebook does not change." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69ee7206", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:14.341617Z", + "iopub.status.busy": "2026-04-28T02:39:14.341453Z", + "iopub.status.idle": "2026-04-28T02:39:14.345710Z", + "shell.execute_reply": "2026-04-28T02:39:14.345386Z" + } + }, + "outputs": [], + "source": [ + "DOC_DIR = REPO_ROOT / \"data\" / \"documents\"\n", + "DOC_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "# (short_id, pdf_path, page_number, modality_label)\n", + "# short_id is a stable key used downstream to group outputs per page.\n", + "DEMO_DOCS: list[tuple[str, Path, int, str]] = [\n", + " (\"pew\", DOC_DIR / \"05-03-18-political-release.pdf\", 5, \"chart\"),\n", + " (\"social\", DOC_DIR / \"measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf\", 11, \"multi-picture\"),\n", + " (\"linkedin\", DOC_DIR / \"measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf\", 20, \"infographic\"),\n", + " (\"gpl\", DOC_DIR / \"GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf\", 11, \"table\"),\n", + "]\n", + "\n", + "# Short, human-friendly name to use in section headings (the PDF filename\n", + "# itself is too long to read in a heading).\n", + "DISPLAY_NAME = {\n", + " \"pew\": \"Pew Research -- Political Release\",\n", + " \"social\": \"Social-Media Analytics Report\",\n", + " \"linkedin\": \"Social-Media Analytics Report\",\n", + " \"gpl\": \"Graduate Studies Brochure\",\n", + "}\n", + "\n", + "# Public source URLs for each PDF. The notebook is self-contained:\n", + "# if a PDF is missing, we download it once into `DOC_DIR` on first\n", + "# run. The three demo PDFs are mirrored on the MMLongBench-Doc\n", + "# Hugging Face dataset (`yubo2333/MMLongBench-Doc`). 
Point\n", + "# `DOC_DIR` at any directory you control (or pre-populate it\n", + "# yourself) and the pipeline consumes the local copy after that.\n", + "_HF_DOC_ROOT = (\n", + " \"https://huggingface.co/datasets/yubo2333/MMLongBench-Doc/\"\n", + " \"resolve/main/documents\"\n", + ")\n", + "_PDF_SOURCES: dict[str, str] = {\n", + " name: f\"{_HF_DOC_ROOT}/{name}\" for name in {\n", + " \"05-03-18-political-release.pdf\",\n", + " \"measuringsuccessonfacebooktwitterlinkedin-160317142140_95.pdf\",\n", + " \"GPL-Graduate-Studies-Professional-Learning-Brochure-Jul-2021.pdf\",\n", + " }\n", + "}\n", + "\n", + "\n", + "def _ensure_pdf(pdf: Path) -> None:\n", + " if pdf.exists():\n", + " return\n", + " url = _PDF_SOURCES.get(pdf.name)\n", + " if url is None:\n", + " raise FileNotFoundError(\n", + " f\"PDF not found and no default URL registered: {pdf}. \"\n", + " \"Either drop the file into DOC_DIR yourself or add a \"\n", + " \"URL entry to _PDF_SOURCES.\")\n", + " print(f\" [download] {pdf.name} <- {url}\")\n", + " try:\n", + " r = requests.get(url, timeout=60)\n", + " r.raise_for_status()\n", + " except Exception as exc:\n", + " raise FileNotFoundError(\n", + " f\"Could not auto-download {pdf.name} from {url}: {exc}. \"\n", + " \"Drop the PDF into DOC_DIR manually and re-run this cell.\"\n", + " ) from exc\n", + " pdf.write_bytes(r.content)\n", + "\n", + "\n", + "for sid, pdf, pn, label in DEMO_DOCS:\n", + " _ensure_pdf(pdf)\n", + " print(f\" [{sid:8s}] p.{pn:<3d} {label:<14s} -> {pdf.name} \"\n", + " f\"({pdf.stat().st_size / 1024:,.0f} KB)\")" + ] + }, + { + "cell_type": "markdown", + "id": "9b0ce18e", + "metadata": {}, + "source": [ + "## 4. The core pipeline in action\n", + "\n", + "Now, let's walk through the code that powers the pipeline." + ] + }, + { + "cell_type": "markdown", + "id": "1ccb5c03", + "metadata": {}, + "source": [ + "### 4.1. 
Helper functions\n", + "\n", + "One self-contained cell with every building block the pipeline needs,\n", + "organised into three groups:\n", + "\n", + "1. **Imaging** — `pdf_page_to_image` renders a page to pixels,\n", + " `pil_to_data_url` encodes it for the API, `draw_annotations` paints\n", + " bounding-box overlays (with per-class colours and sub-typed\n", + " picture labels) on top of any page.\n", + "2. **Model surfaces** — `call_nemotron_parse` for the Architect,\n", + " `call_stepfun_vlm` as a single entry point that serves both the\n", + " Visual Specialist and the Reasoning Engine over the NVIDIA hosted\n", + " chat-completions endpoint, plus small helpers to\n", + " pull clean text or JSON out of the response.\n", + "3. **Pipeline stages** — `describe_picture` implements the\n", + " divide-and-conquer lever (classify, then dispatch to a\n", + " content-aware prompt); `assemble_page_context` implements the\n", + " spatial weave (interleaves picture transcriptions into the page's\n", + " prose at their reading-order position); `ask_question` is the\n", + " final Reasoning Engine call, with three short answer rules that\n", + " make every answer page-cited and phrase-quoted.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1fdbd3ab", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:14.347301Z", + "iopub.status.busy": "2026-04-28T02:39:14.347222Z", + "iopub.status.idle": "2026-04-28T02:39:14.389829Z", + "shell.execute_reply": "2026-04-28T02:39:14.389149Z" + } + }, + "outputs": [], + "source": [ + "# =============================================================================\n", + "# Group 1 — Imaging\n", + "# =============================================================================\n", + "\n", + "def pdf_page_to_image(pdf_path: str | Path, page_index: int, *, dpi: int = 150) -> Image.Image:\n", + " \"\"\"Render a 0-indexed PDF page to an RGB PIL image at `dpi`.\"\"\"\n", + " doc = 
fitz.open(pdf_path)\n", + " try:\n", + " page = doc.load_page(page_index)\n", + " zoom = dpi / 72.0\n", + " pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))\n", + " return Image.frombytes(\"RGB\", [pix.width, pix.height], pix.samples)\n", + " finally:\n", + " doc.close()\n", + "\n", + "\n", + "def pil_to_data_url(img: Image.Image, *, fmt: str = \"JPEG\", quality: int = 85) -> str:\n", + " \"\"\"JPEG- or PNG-encode a PIL image and wrap it as a `data:` URL.\"\"\"\n", + " if img.mode != \"RGB\":\n", + " img = img.convert(\"RGB\")\n", + " buf = io.BytesIO()\n", + " if fmt.upper() == \"JPEG\":\n", + " img.save(buf, format=\"JPEG\", quality=quality)\n", + " mime = \"image/jpeg\"\n", + " else:\n", + " img.save(buf, format=\"PNG\")\n", + " mime = \"image/png\"\n", + " return f\"data:{mime};base64,\" + base64.b64encode(buf.getvalue()).decode()\n", + "\n", + "\n", + "def _luminance(hex_color: str) -> float:\n", + " h = hex_color.lstrip(\"#\")\n", + " r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))\n", + " return 0.2126 * r + 0.7152 * g + 0.0722 * b\n", + "\n", + "\n", + "def draw_annotations(image: Image.Image, blocks: list[dict[str, Any]]) -> Image.Image:\n", + " \"\"\"Paint labelled bounding boxes for every block. 
Pictures that\n", + " carry a `sub_type` (from the Visual Specialist) are labelled as\n", + " `Picture:<sub_type>` rather than just `Picture`.\n", + " \"\"\"\n", + " out = image.copy()\n", + " draw = ImageDraw.Draw(out)\n", + " W, H = out.size\n", + " box_w = max(2, int(W / 600))\n", + " font_size = max(14, int(W / 80))\n", + " try:\n", + " font = ImageFont.truetype(\"Arial.ttf\", font_size)\n", + " except Exception:\n", + " font = ImageFont.load_default()\n", + " for i, b in enumerate(blocks):\n", + " bb = b.get(\"bbox\") or {}\n", + " x0, y0 = bb.get(\"xmin\", 0.0) * W, bb.get(\"ymin\", 0.0) * H\n", + " x1, y1 = bb.get(\"xmax\", 0.0) * W, bb.get(\"ymax\", 0.0) * H\n", + " if x1 <= x0 or y1 <= y0:\n", + " continue\n", + " cat = b.get(\"type\", \"DEFAULT\")\n", + " sub = b.get(\"sub_type\")\n", + " color = CLASS_COLORS.get(cat, CLASS_COLORS[\"DEFAULT\"])\n", + " label = f\"{i}:{cat}\" + (f\":{sub}\" if sub else \"\")\n", + " draw.rectangle([x0, y0, x1, y1], outline=color, width=box_w)\n", + " try:\n", + " tb = draw.textbbox((0, 0), label, font=font)\n", + " tw, th = tb[2] - tb[0], tb[3] - tb[1]\n", + " except Exception:\n", + " tw, th = len(label) * 7, font_size\n", + " bg = (x0, max(0, y0 - th - 6), x0 + tw + 10, max(0, y0 - 6))\n", + " draw.rectangle(bg, fill=color)\n", + " text_color = \"#000000\" if _luminance(color) > 140 else \"#FFFFFF\"\n", + " draw.text((bg[0] + 5, bg[1] + 2), label, fill=text_color, font=font)\n", + " return out\n", + "\n", + "\n", + "# =============================================================================\n", + "# Group 2 — Model surfaces\n", + "# =============================================================================\n", + "\n", + "def _headers(api_key: str) -> dict[str, str]:\n", + " return {\n", + " \"Authorization\": f\"Bearer {api_key}\",\n", + " \"Content-Type\": \"application/json\",\n", + " \"Accept\": \"application/json\",\n", + " }\n", + "\n", + "\n", + "def _post_chat_completion(\n", + " body: dict[str, Any],\n", + " *,\n", 
+ " endpoint: str,\n", + " api_key: str,\n", + " timeout: int | None = None,\n", + ") -> requests.Response:\n", + " \"\"\"POST to a chat-completions endpoint with small retry protection.\"\"\"\n", + " timeout = timeout or NVAI_REQUEST_TIMEOUT\n", + " for attempt in range(NVAI_MAX_RETRIES + 1):\n", + " try:\n", + " return requests.post(\n", + " endpoint,\n", + " headers=_headers(api_key),\n", + " json=body,\n", + " timeout=timeout,\n", + " )\n", + " except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):\n", + " if attempt >= NVAI_MAX_RETRIES:\n", + " raise\n", + " sleep_s = 2 ** attempt\n", + " print(f\" [retry] hosted endpoint timed out; retrying in {sleep_s}s\")\n", + " time.sleep(sleep_s)\n", + "\n", + "\n", + "def call_nemotron_parse(image: Image.Image) -> list[dict[str, Any]]:\n", + " \"\"\"Run `nemotron-parse` on a page image. Returns a flat list of\n", + " blocks with `type`, `bbox`, and `text`.\n", + " \"\"\"\n", + " body = {\n", + " \"model\": PARSE_MODEL,\n", + " \"messages\": [{\"role\": \"user\", \"content\": [\n", + " {\"type\": \"image_url\", \"image_url\": {\"url\": pil_to_data_url(image, fmt=\"PNG\")}}\n", + " ]}],\n", + " \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"markdown_bbox\"}}],\n", + " \"tool_choice\": {\"type\": \"function\", \"function\": {\"name\": \"markdown_bbox\"}},\n", + " \"max_tokens\": 8192,\n", + " \"temperature\": 0.0,\n", + " }\n", + " r = _post_chat_completion(\n", + " body,\n", + " endpoint=PARSE_CHAT_COMPLETIONS_URL,\n", + " api_key=PARSE_API_KEY,\n", + " )\n", + " r.raise_for_status()\n", + " args = r.json()[\"choices\"][0][\"message\"][\"tool_calls\"][0][\"function\"][\"arguments\"]\n", + " parsed = json.loads(args)\n", + " blocks = parsed if isinstance(parsed, list) else parsed.get(\"tool_call_arguments\", [])\n", + " if blocks and isinstance(blocks[0], list):\n", + " blocks = blocks[0]\n", + " return blocks or []\n", + "\n", + "\n", + "# StepFun can answer directly, but VLMs 
sometimes include reasoning-style\n", + "# preambles when asked to transcribe dense visual content. The Specialist\n", + "# wants the final transcription only, so direct calls get a small system\n", + "# guard. `extract_text` also strips any <think> blocks or echoed guard text.\n", + "_SYS_NO_THINK = (\n", + " \"/no_think\\n\"\n", + " \"Answer directly and concisely. Do NOT include any reasoning, \"\n", + " \"preamble, or <think> blocks.\"\n", + ")\n", + "_SYSTEM_ECHO = re.compile(\n", + " r\"^\\s*(?:/no_think\\s*)?Answer directly and concisely[^.]*\\.\\s*\"\n", + " r\"Do NOT include any reasoning[^.]*\\.\\s*\",\n", + " re.IGNORECASE,\n", + ")\n", + "_LEAK_HEADS = (\n", + " \"okay,\", \"okay \", \"the user wants\", \"let me \", \"first,\",\n", + " \"i need to\", \"i'll \", \"i will \", \"alright,\",\n", + ")\n", + "\n", + "\n", + "def call_stepfun_vlm(\n", + " prompt: str,\n", + " images: list[Image.Image] | None = None,\n", + " *,\n", + " direct: bool = True,\n", + " json_mode: bool = False,\n", + " temperature: float = 0.2,\n", + " top_p: float = 0.95,\n", + " max_tokens: int = 2048,\n", + ") -> dict[str, Any]:\n", + " \"\"\"One StepFun VLM call, used by both the Specialist and the\n", + " Reasoning Engine. `direct=True` adds a short system guard that\n", + " asks for final answers only. 
The request uses standard\n", + " chat-completions fields so it can target\n", + " `stepfun-ai/step-3.7-flash` through NVIDIA's hosted endpoint.\n", + " \"\"\"\n", + " parts: list[dict[str, Any]] = [{\"type\": \"text\", \"text\": prompt}]\n", + " for img in images or []:\n", + " parts.append({\"type\": \"image_url\", \"image_url\": {\"url\": pil_to_data_url(img)}})\n", + " messages: list[dict[str, Any]] = []\n", + " if direct:\n", + " messages.append({\"role\": \"system\", \"content\": _SYS_NO_THINK})\n", + " messages.append({\"role\": \"user\", \"content\": parts})\n", + " body: dict[str, Any] = {\n", + " \"model\": STEPFUN_VLM_MODEL,\n", + " \"messages\": messages,\n", + " \"max_tokens\": max_tokens,\n", + " \"temperature\": temperature,\n", + " \"top_p\": top_p,\n", + " \"stream\": False,\n", + " }\n", + " if json_mode:\n", + " body[\"response_format\"] = {\"type\": \"json_object\"}\n", + " r = _post_chat_completion(\n", + " body,\n", + " endpoint=STEPFUN_CHAT_COMPLETIONS_URL,\n", + " api_key=STEPFUN_API_KEY,\n", + " )\n", + " if r.status_code >= 400 and json_mode and \"response_format\" in body:\n", + " # Some hosted VLMs ignore JSON prompts but reject response_format.\n", + " # Retry once with the prompt-only JSON instruction.\n", + " body.pop(\"response_format\", None)\n", + " r = _post_chat_completion(\n", + " body,\n", + " endpoint=STEPFUN_CHAT_COMPLETIONS_URL,\n", + " api_key=STEPFUN_API_KEY,\n", + " )\n", + " r.raise_for_status()\n", + " return r.json()\n", + "\n", + "\n", + "_THINK_RE = re.compile(r\"<think[^>]*>.*?</think>\", re.DOTALL | re.IGNORECASE)\n", + "\n", + "\n", + "def extract_text(resp: dict[str, Any]) -> str:\n", + " msg = resp.get(\"choices\", [{}])[0].get(\"message\", {})\n", + " text = (\n", + " msg.get(\"content\")\n", + " or msg.get(\"reasoning\")\n", + " or msg.get(\"reasoning_content\")\n", + " or \"\"\n", + " ).strip()\n", + " text = _THINK_RE.sub(\"\", text).strip()\n", + " text = _SYSTEM_ECHO.sub(\"\", text, count=1).strip()\n", + " return text\n", 
+ "\n", + "\n", + "def _is_leaky(text: str) -> bool:\n", + " \"\"\"Heuristic: the Specialist sometimes leaks chain-of-thought\n", + " into `content` without `<think>` tags. If the response opens\n", + " with a classic reasoning preamble, we retry the call once.\n", + " \"\"\"\n", + " head = text.lstrip().lower()[:80]\n", + " return any(head.startswith(p) for p in _LEAK_HEADS)\n", + "\n", + "\n", + "def extract_json_object(resp: dict[str, Any]) -> dict[str, Any]:\n", + " \"\"\"Extract a JSON object robustly. Looks in both `content` and\n", + " `reasoning` fields (the hosted endpoint occasionally collapses\n", + " JSON-mode output into the reasoning stream), strips code fences,\n", + " and falls back to the outermost `{...}` substring.\n", + " \"\"\"\n", + " msg = resp.get(\"choices\", [{}])[0].get(\"message\", {})\n", + " for raw in (\n", + " msg.get(\"content\") or \"\",\n", + " msg.get(\"reasoning\") or \"\",\n", + " msg.get(\"reasoning_content\") or \"\",\n", + " ):\n", + " s = _THINK_RE.sub(\"\", (raw or \"\").strip()).strip()\n", + " if not s:\n", + " continue\n", + " if s.startswith(\"```\"):\n", + " s = re.sub(r\"^```(?:json)?\\s*\", \"\", s, flags=re.IGNORECASE).strip(\"` \\n\")\n", + " try:\n", + " obj = json.loads(s)\n", + " if isinstance(obj, dict):\n", + " return obj\n", + " except json.JSONDecodeError:\n", + " pass\n", + " l, r = s.find(\"{\"), s.rfind(\"}\")\n", + " if l != -1 and r > l:\n", + " try:\n", + " obj = json.loads(s[l : r + 1])\n", + " if isinstance(obj, dict):\n", + " return obj\n", + " except json.JSONDecodeError:\n", + " pass\n", + " return {}\n", + "\n", + "\n", + "# =============================================================================\n", + "# Group 3 — Pipeline stages\n", + "# =============================================================================\n", + "\n", + "CLASSIFY_PROMPT = (\n", + " \"Analyze the provided image and classify its content. 
Your response \"\n", + " \"MUST be a single, valid JSON object with the following keys:\\n\"\n", + " '- \"image_type\": one of \"Extractive\" (charts, graphs, diagrams, '\n", + " 'tables, flowcharts, maps) or \"Descriptive\" (photographs, '\n", + " \"illustrations, artistic pieces).\\n\"\n", + " '- \"sub_type\": a specific label, e.g. \"Line Graph\", \"Bar Graph\", '\n", + " '\"Infographic\", \"Flowchart\", \"Pyramid Diagram\", \"Smartphone '\n", + " 'Screenshot\", \"Photograph\".\\n'\n", + " '- \"subject_matter\": one-sentence summary of the picture topic.\\n'\n", + " '- \"contains_text\": boolean, true if the image has readable text.\\n'\n", + " \"Provide ONLY the JSON object and nothing else.\"\n", + ")\n", + "\n", + "# Divide-and-conquer dispatch table — different picture kinds call for\n", + "# different transcription prompts.\n", + "ANALYSIS_PROMPTS: dict[tuple[str, str], str] = {\n", + " (\"Extractive\", \"Default\"): (\n", + " \"Analyze the provided image and extract all structured \"\n", + " \"information. If the information fits a tabular format, \"\n", + " \"render it as a Markdown table. Otherwise, produce a concise \"\n", + " \"summary capturing every number, label, and relationship.\"\n", + " ),\n", + " (\"Extractive\", \"Line Graph\"): (\n", + " \"You are a data analyst. Transcribe the data from this line \"\n", + " \"graph. State the title, X- and Y-axis labels, and for each \"\n", + " \"series extract the data points as [x, y] pairs. Return one \"\n", + " \"JSON object.\"\n", + " ),\n", + " (\"Extractive\", \"Bar Graph\"): (\n", + " \"You are a data analyst. Transcribe this bar chart. State \"\n", + " \"the title, axis labels, category names, and the value for \"\n", + " \"each bar. Return both a Markdown table AND a one-sentence \"\n", + " \"headline finding.\"\n", + " ),\n", + " (\"Extractive\", \"Infographic\"): (\n", + " \"Transcribe this infographic. For each panel or section, \"\n", + " \"list its name and every labelled value or percentage. 
\"\n", + " \"Preserve the grouping the designer used. Return a \"\n", + " \"structured Markdown outline.\"\n", + " ),\n", + " (\"Extractive\", \"Flowchart\"): (\n", + " \"Transcribe this flowchart or diagram. List every node and \"\n", + " \"every labelled edge. State the overall flow direction.\"\n", + " ),\n", + " (\"Extractive\", \"Pyramid Diagram\"): (\n", + " \"Transcribe this pyramid diagram. List every tier from top \"\n", + " \"to bottom with its label and any supporting text. Infer \"\n", + " \"the ordering or progression the diagram communicates.\"\n", + " ),\n", + " (\"Descriptive\", \"Default\"): (\n", + " \"Describe this image in detail: subject matter, composition, \"\n", + " \"colours, and any text visible in the scene. Do not \"\n", + " \"speculate beyond what is visible.\"\n", + " ),\n", + " (\"Descriptive\", \"Smartphone Screenshot\"): (\n", + " \"Transcribe this smartphone UI. Read every label, button, \"\n", + " \"menu entry, and message visible. Infer the app or screen \"\n", + " \"type (e.g. SMS, contacts, home screen, call UI) and list \"\n", + " \"the UI elements in reading order.\"\n", + " ),\n", + "}\n", + "\n", + "\n", + "def pick_analysis_prompt(classification: dict[str, Any]) -> str:\n", + " it = classification.get(\"image_type\", \"Extractive\")\n", + " st = classification.get(\"sub_type\", \"Default\")\n", + " return (ANALYSIS_PROMPTS.get((it, st))\n", + " or ANALYSIS_PROMPTS.get((it, \"Default\"))\n", + " or ANALYSIS_PROMPTS[(\"Extractive\", \"Default\")])\n", + "\n", + "\n", + "def crop_picture(page: Image.Image, bbox: dict[str, float]) -> Image.Image:\n", + " W, H = page.size\n", + " return page.crop((int(bbox[\"xmin\"] * W), int(bbox[\"ymin\"] * H),\n", + " int(bbox[\"xmax\"] * W), int(bbox[\"ymax\"] * H)))\n", + "\n", + "\n", + "def describe_picture(crop: Image.Image) -> dict[str, Any]:\n", + " \"\"\"Divide-and-conquer: classify first, then transcribe with a\n", + " content-aware prompt. Two API calls per picture. 
If the first\n", + " transcription leaks reasoning into `content`, we retry once --\n", + " this small resilience step keeps the final context clean so the\n", + " Reasoning Engine doesn't echo chain-of-thought back in §7.\n", + " \"\"\"\n", + " cls_resp = call_stepfun_vlm(\n", + " CLASSIFY_PROMPT, images=[crop],\n", + " direct=True, json_mode=True,\n", + " temperature=0.0, max_tokens=1024,\n", + " )\n", + " classification = extract_json_object(cls_resp) or {\n", + " \"image_type\": \"Extractive\", \"sub_type\": \"Default\",\n", + " }\n", + " prompt = pick_analysis_prompt(classification)\n", + " desc_resp = call_stepfun_vlm(\n", + " prompt, images=[crop],\n", + " direct=True, json_mode=False,\n", + " temperature=0.2, max_tokens=2048,\n", + " )\n", + " description = extract_text(desc_resp)\n", + " if _is_leaky(description):\n", + " desc_resp = call_stepfun_vlm(\n", + " prompt, images=[crop],\n", + " direct=True, json_mode=False,\n", + " temperature=0.2, max_tokens=2048,\n", + " )\n", + " description = extract_text(desc_resp)\n", + " return {\"classification\": classification, \"description\": description}\n", + "\n", + "\n", + "def assemble_page_context(\n", + " parse_blocks: list[dict[str, Any]],\n", + " picture_descriptions: list[str],\n", + " *,\n", + " page_n: int,\n", + ") -> str:\n", + " \"\"\"Spatial weave: walk parse's blocks in reading order, emit each\n", + " text block verbatim and substitute each `Picture` block with its\n", + " StepFun VLM transcription inline at the same spatial position.\n", + "\n", + " Pages are wrapped in strong visual delimiters so the Reasoning\n", + " Engine never confuses which page a fact came from when it is asked\n", + " to cite it.\n", + " \"\"\"\n", + " header = f\"===== PAGE {page_n} =====\"\n", + " footer = f\"===== END PAGE {page_n} =====\"\n", + " lines = [header]\n", + " pic_idx = 0\n", + " for b in parse_blocks:\n", + " t = b.get(\"type\")\n", + " if t == \"Picture\":\n", + " if pic_idx < 
len(picture_descriptions):\n", + " lines.append(f\"[Picture on page {page_n}] \"\n", + " + picture_descriptions[pic_idx])\n", + " pic_idx += 1\n", + " elif b.get(\"text\"):\n", + " lines.append(b[\"text\"])\n", + " lines.append(footer)\n", + " return \"\\n\\n\".join(lines)\n", + "\n", + "\n", + "QA_PROMPT_TEMPLATE = (\n", + " \"Based on the following document context, please answer the \"\n", + " \"question that follows.\\n\\n\"\n", + " \"- DOCUMENT CONTEXT\\n{context}\\n\\n\"\n", + " \"- QUESTION\\n{question}\\n\\n\"\n", + " \"Answer rules:\\n\"\n", + " \" 1. Cite the page number(s) the answer comes from in the form \"\n", + " \"'(p. <N>)' immediately after the answer.\\n\"\n", + " \" 2. Quote the specific phrase or value from the document \"\n", + " \"context that supports the answer.\\n\"\n", + " \" 3. If the answer is NOT present in the context, output exactly \"\n", + " \"the string `Not answerable` — do not guess.\"\n", + ")\n", + "\n", + "\n", + "def ask_question(question: str, context: str) -> str:\n", + " \"\"\"Run the Reasoning Engine. We keep `direct=True` + the\n", + " `/no_think` system message so the final answer is a clean\n", + " citation, not a transcript of the model's inner monologue. 
The\n", + " Specialist's transcriptions already carried the heavy visual\n", + " extraction at pipeline time, so the QA call only needs to search\n", + " the assembled context for the cited phrase.\n", + " \"\"\"\n", + " prompt = QA_PROMPT_TEMPLATE.format(context=context, question=question)\n", + " resp = call_stepfun_vlm(\n", + " prompt, images=None,\n", + " direct=True, json_mode=False,\n", + " temperature=0.2, max_tokens=1024,\n", + " )\n", + " answer = extract_text(resp)\n", + " if _is_leaky(answer):\n", + " resp = call_stepfun_vlm(\n", + " prompt, images=None,\n", + " direct=True, json_mode=False,\n", + " temperature=0.2, max_tokens=1024,\n", + " )\n", + " answer = extract_text(resp)\n", + " return answer\n", + "\n", + "\n", + "print(\"Helpers ready.\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "a3dca00c", + "metadata": {}, + "source": [ + "### 4.2. Executing the pipeline\n", + "\n", + "Now we drive the four pages through the pipeline. The loop below:\n", + "\n", + "1. Renders each PDF page to pixels.\n", + "2. Makes one **Stage 1** call per page to the Architect\n", + " (`nemotron-parse`) and **displays the page with Parse's coloured\n", + " layout boxes overlaid**. That overlay *is* our proof of spatial\n", + " anchoring: you can see Parse split the social-media page into\n", + " three separate `Picture` boxes, wrap the Pew chart in a single\n", + " `Picture:Bar Graph` box, and flip the GPL table into a\n", + " `Table` block with no `Picture` call at all.\n", + "3. For every `Picture` block the Architect returns, crops the\n", + " picture and makes a pair of **Stage 2** calls to the Visual\n", + " Specialist — one to classify the picture's sub-type, one to\n", + " transcribe it with a sub-type-aware prompt. (The crop-plus-\n", + " transcription receipts are shown modality-by-modality in §5, to\n", + " avoid re-displaying the same images twice.)\n", + "4. 
Aggregates everything into a single `file_results` JSON structure\n", + " we can inspect and query later.\n", + "\n", + "Watch especially page 11 of the social-media report: the Architect\n", + "returns **three** `Picture` boxes — one per Facebook-post\n", + "screenshot — and each gets its own targeted transcription. That is\n", + "the multi-picture isolation that makes the *\"Disneyland\"* question\n", + "in §7 answerable with a precise number." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e40f5aae", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:14.391295Z", + "iopub.status.busy": "2026-04-28T02:39:14.391181Z", + "iopub.status.idle": "2026-04-28T02:39:42.712961Z", + "shell.execute_reply": "2026-04-28T02:39:42.712441Z" + } + }, + "outputs": [], + "source": [ + "# Per-(short_id, page_n) page images and Parse blocks; per-pdf result bundles\n", + "page_images: dict[str, Image.Image] = {}\n", + "page_blocks: dict[str, list[dict[str, Any]]] = {}\n", + "file_results: dict[str, dict[str, Any]] = {}\n", + "\n", + "# Annotated-preview width (kept small so the notebook file stays lean).\n", + "_ANNOT_W = 820\n", + "\n", + "for sid, pdf, page_n, modality in DEMO_DOCS:\n", + " print(\"\\n\" + \"=\" * 74)\n", + " print(f\"[{sid}] {pdf.name} -- page {page_n} ({modality})\")\n", + " print(\"=\" * 74)\n", + "\n", + " page = pdf_page_to_image(pdf, page_n - 1, dpi=150)\n", + " page_images[sid] = page\n", + "\n", + " # Stage 1 -- Architect.\n", + " t0 = time.time()\n", + " blocks = call_nemotron_parse(page)\n", + " type_counts: dict[str, int] = {}\n", + " for b in blocks:\n", + " type_counts[b.get(\"type\", \"?\")] = type_counts.get(b.get(\"type\", \"?\"), 0) + 1\n", + " print(f\"[Architect] {len(blocks)} blocks in {time.time()-t0:.1f}s \"\n", + " f\"types -> {type_counts}\")\n", + " page_blocks[sid] = blocks\n", + "\n", + " # Show the coloured layout overlay in-place -- this is the annotated\n", + " # view §3 promised. 
Annotating a copy keeps sub_type labels out\n", + " # of the overlay until the Specialist runs below.\n", + " annotated = draw_annotations(page, blocks)\n", + " display(Markdown(\n", + " f\"**Parse overlay** — {DISPLAY_NAME[sid]} (p. {page_n}, \"\n", + " f\"modality: {modality})\"))\n", + " display(annotated.resize(\n", + " (_ANNOT_W, int(_ANNOT_W * annotated.height / annotated.width))))\n", + "\n", + " bundle = file_results.setdefault(\n", + " pdf.name, {\"source_filename\": pdf.name, \"pages\": []})\n", + " page_entry: dict[str, Any] = {\n", + " \"page_number\": page_n, \"status\": \"Layout extraction successful\",\n", + " \"content\": [],\n", + " }\n", + "\n", + " # Stage 2 -- Visual Specialist for every Picture block.\n", + " n_pics = sum(1 for b in blocks if b.get(\"type\") == \"Picture\")\n", + " if n_pics == 0:\n", + " print(\"[Specialist] skipped -- no Picture blocks on this page \"\n", + " \"(Parse handled this modality as structured text).\")\n", + "\n", + " for i, b in enumerate(blocks):\n", + " item: dict[str, Any] = {\n", + " \"extraction_id\": i, \"type\": b.get(\"type\"),\n", + " \"bbox\": b.get(\"bbox\"), \"text\": b.get(\"text\"),\n", + " }\n", + " if b.get(\"type\") == \"Picture\" and b.get(\"bbox\"):\n", + " crop = crop_picture(page, b[\"bbox\"])\n", + " t1 = time.time()\n", + " result = describe_picture(crop)\n", + " sub = result[\"classification\"].get(\"sub_type\", \"?\")\n", + " print(f\"[Specialist] Picture #{i:<2d} classified as '{sub}' \"\n", + " f\"-> described in {time.time()-t1:.1f}s \"\n", + " f\"({len(result['description'])} chars)\")\n", + " b[\"sub_type\"] = sub\n", + " item[\"classification\"] = result[\"classification\"]\n", + " item[\"description\"] = result[\"description\"]\n", + " page_entry[\"content\"].append(item)\n", + "\n", + " bundle[\"pages\"].append(page_entry)\n", + "\n", + "_n_pics_total = sum(1 for b in file_results.values()\n", + " for p in b[\"pages\"] for it in p[\"content\"]\n", + " if 
it.get(\"description\"))\n", + "print(\"\\n\" + \"=\" * 74)\n", + "print(f\"Pipeline finished: {len(DEMO_DOCS)} pages across \"\n", + " f\"{len(file_results)} documents, {_n_pics_total} pictures transcribed.\")" + ] + }, + { + "cell_type": "markdown", + "id": "c42deff6", + "metadata": {}, + "source": [ + "## 5. Divide and conquer, modality by modality\n", + "\n", + "§4.2 showed the *coloured overlays* — that is, **where** Parse drew\n", + "its boxes. This section shows **what the pair actually produced**\n", + "inside each box, one modality at a time. For every demo page we\n", + "pull out:\n", + "\n", + "1. the **exact image crop** Parse handed to the Visual Specialist\n", + " (or the text block if no vision call was needed), and\n", + "2. the **Specialist's transcription** of that crop — the text the\n", + " Reasoning Engine will read in §7.\n", + "\n", + "That crop-plus-transcription pair is the *receipt* of the divide-\n", + "and-conquer design. Four modalities, four receipts:\n", + "\n", + "* **Chart** – one `Picture:Bar Graph` box → a clean Markdown table.\n", + "* **Multi-picture page** – *three* `Picture` boxes → three separate\n", + " Facebook-post transcriptions (this is the one to linger on — with\n", + " Parse turned off, the single page-level call has to juggle three\n", + " cards at once).\n", + "* **Infographic** – one `Picture:Infographic` box → a structured\n", + " panel-by-panel breakdown of the LinkedIn demographic stats.\n", + "* **Structured table** – **no `Picture` call at all**: Parse emits\n", + " the programme list directly as a LaTeX-tabular `Table` block the\n", + " Reasoning Engine reads verbatim.\n", + "\n", + "Because the Specialist's transcription is the *same text* the\n", + "Reasoning Engine reads in §7, any answer it returns can be\n", + "back-traced to one of these crops with one glance." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6c1d6522", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:42.719713Z", + "iopub.status.busy": "2026-04-28T02:39:42.719587Z", + "iopub.status.idle": "2026-04-28T02:39:42.785275Z", + "shell.execute_reply": "2026-04-28T02:39:42.784861Z" + } + }, + "outputs": [], + "source": [ + "# Per-modality evidence: the crop(s) Parse isolated + the Visual\n", + "# Specialist's transcription. We do NOT re-display full annotated\n", + "# pages here -- those live in section 3. The page image itself is\n", + "# shown only to provide the crop; transcriptions are rendered as\n", + "# Markdown so tables come out formatted.\n", + "MAX_DESC_CHARS = 1400\n", + "\n", + "for sid, pdf, page_n, modality in DEMO_DOCS:\n", + " display(Markdown(\n", + " f\"### {modality.title()} — {DISPLAY_NAME[sid]} (p. {page_n})\"\n", + " ))\n", + "\n", + " page = page_images[sid]\n", + " blocks = page_blocks[sid]\n", + " page_entry = next(p for p in file_results[pdf.name][\"pages\"]\n", + " if p[\"page_number\"] == page_n)\n", + "\n", + " pic_items = [it for it in page_entry[\"content\"]\n", + " if it.get(\"type\") == \"Picture\" and it.get(\"description\")]\n", + " tab_items = [it for it in page_entry[\"content\"]\n", + " if it.get(\"type\") == \"Table\" and it.get(\"text\")]\n", + "\n", + " # Structured-table case: Parse handled it directly; no vision\n", + " # call was required.\n", + " if not pic_items and tab_items:\n", + " display(Markdown(\n", + " \"Parse surfaced this region as a `Table` block with \"\n", + " \"LaTeX-tabular body — **no Visual Specialist call \"\n", + " \"was needed**. 
The Reasoning Engine reads the cells \"\n", + " \"below as plain text:\"))\n", + " print(tab_items[0][\"text\"])\n", + " continue\n", + "\n", + " block_by_id = {i: b for i, b in enumerate(blocks)}\n", + " crop_w = 360 if len(pic_items) > 1 else 620\n", + "\n", + " for item in pic_items:\n", + " block = block_by_id[item[\"extraction_id\"]]\n", + " cls = item[\"classification\"] or {}\n", + " sub = cls.get(\"sub_type\", \"?\")\n", + " subject = cls.get(\"subject_matter\", \"\")\n", + " header = (f\"**Crop #{item['extraction_id']}** — \"\n", + " f\"classified as `Picture:{sub}`\")\n", + " if subject:\n", + " header += f\" \\n*{subject}*\"\n", + " display(Markdown(header))\n", + "\n", + " crop = crop_picture(page, block[\"bbox\"])\n", + " display(crop.resize((crop_w,\n", + " int(crop_w * crop.height / crop.width))))\n", + "\n", + " desc = item[\"description\"]\n", + " if len(desc) > MAX_DESC_CHARS:\n", + " desc = desc[:MAX_DESC_CHARS].rstrip() + \"\\n\\n*... (truncated)*\"\n", + " display(Markdown(desc))" + ] + }, + { + "cell_type": "markdown", + "id": "1e4e7b0f", + "metadata": {}, + "source": [ + "## 6. Examining the final JSON output\n", + "\n", + "The pipeline fuses both models' results into a single, structured\n", + "JSON object *per document*. Every page carries a list of `content`\n", + "items; `Picture` items contain the Visual Specialist's\n", + "`classification` (image type, sub-type, subject-matter summary)\n", + "alongside its textual `description`, while `Table` items carry the\n", + "LaTeX-tabular text Parse emitted directly.\n", + "\n", + "This JSON is the *only* artefact the Reasoning Engine needs — it\n", + "can be serialised, cached, re-queried, or piped into downstream\n", + "retrieval systems. 
Below we preview the Pew Research document's\n", + "output and save one `.parse_stepfun.json` per PDF to disk.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "758f7951", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:42.787281Z", + "iopub.status.busy": "2026-04-28T02:39:42.787186Z", + "iopub.status.idle": "2026-04-28T02:39:42.793116Z", + "shell.execute_reply": "2026-04-28T02:39:42.792759Z" + } + }, + "outputs": [], + "source": [ + "# Preview the shortest bundle (the Pew one-pager) so the reader\n", + "# sees the full shape on screen without scrolling.\n", + "_preview_pdf_name = DEMO_DOCS[0][1].name # Pew Research release\n", + "_preview = file_results[_preview_pdf_name]\n", + "print(f\"=== preview: {_preview_pdf_name} ===\")\n", + "for line in json.dumps(_preview, indent=2, default=str).split(\"\\n\")[:60]:\n", + " print(line)\n", + "print(\"... (truncated)\\n\")\n", + "\n", + "# Save one JSON per PDF. Downstream retrieval systems can index these\n", + "# directly, one embedding per page.\n", + "OUTPUT_DIR = Path(\"output_results\")\n", + "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "for pdf_name, bundle in file_results.items():\n", + " out = OUTPUT_DIR / (Path(pdf_name).stem + \".parse_stepfun.json\")\n", + " out.write_text(json.dumps(bundle, indent=2, default=str))\n", + " print(f\" saved {out.name} ({len(bundle['pages'])} page(s))\")" + ] + }, + { + "cell_type": "markdown", + "id": "bf4f5cde", + "metadata": {}, + "source": [ + "## 7. Querying the document with the Reasoning Engine\n", + "\n", + "With every page distilled into a picture-transcribed,\n", + "spatially-ordered context, we can ask real questions. 
The *same*\n", + "StepFun VLM we used as Visual Specialist now puts on its\n", + "**Reasoning Engine** hat and answers — citing the page and quoting\n", + "the supporting phrase for every answer.\n", + "\n", + "We pose one question per modality; each answer is visible on the\n", + "annotated page tile in §3, so you can verify it by eye:\n", + "\n", + "1. **Chart (Pew Research, p. 5).** Which policy area draws the\n", + " largest share of *very* confident respondents? A stacked-bar\n", + " chart question that only works because the Specialist turned the\n", + " chart into a Markdown table first.\n", + "2. **Multi-picture (Social-Media report, p. 11).** *\"How many\n", + " likes does the Disneyland post have?\"* Three Facebook-post\n", + " screenshots sit side by side on the page; with Parse turned off,\n", + " a single page-level VLM call would have to pick the right card\n", + " while reading it. Parse's three separate `Picture` boxes make\n", + " the answer surgical.\n", + "3. **Infographic (Social-Media report, p. 20).** *\"What percentage\n", + " of LinkedIn users have household income above $75K?\"* The digit\n", + " lives inside one panel of a pixel-only demographic infographic.\n", + "4. **Structured table (Graduate Studies brochure, p. 
11).** *\"Which\n", + " leadership programme has the longest Full-Time duration?\"* Pure\n", + " Architect work: Parse preserved the programme/duration table as\n", + " LaTeX-tabular text so the Reasoning Engine can scan it without a\n", + " vision call.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c985d9da", + "metadata": { + "execution": { + "iopub.execute_input": "2026-04-28T02:39:42.794337Z", + "iopub.status.busy": "2026-04-28T02:39:42.794261Z", + "iopub.status.idle": "2026-04-28T02:39:44.565304Z", + "shell.execute_reply": "2026-04-28T02:39:44.564315Z" + } + }, + "outputs": [], + "source": [ + "# Per-document context: each PDF we processed gets its own assembled\n", + "# context string, stitched together from the pages we ran through\n", + "# the pipeline, with strong page delimiters so citations land on the\n", + "# right page number. Page chunks are collected per PDF first, then\n", + "# joined into one string per PDF.\n", + "doc_chunks: dict[str, list[str]] = {}\n", + "for sid, pdf, page_n, _mod in DEMO_DOCS:\n", + " entry = next(p for p in file_results[pdf.name][\"pages\"]\n", + " if p[\"page_number\"] == page_n)\n", + " descs = [it[\"description\"] for it in entry[\"content\"]\n", + " if it.get(\"type\") == \"Picture\" and it.get(\"description\")]\n", + " chunk = assemble_page_context(page_blocks[sid], descs, page_n=page_n)\n", + " doc_chunks.setdefault(pdf.name, []).append(chunk)\n", + "\n", + "doc_contexts: dict[str, str] = {}\n", + "for pdf_name, chunks in doc_chunks.items():\n", + " merged = \"\\n\\n\".join(chunks)\n", + " doc_contexts[pdf_name] = merged\n", + " print(f\"context: {pdf_name} -> {len(merged):,} chars, \"\n", + " f\"{len(chunks)} page(s)\")\n", + "print()\n", + "\n", + "# One question per demo page, routed to the doc whose context can\n", + "# answer it. (short_id, question)\n", + "qa_plan: list[tuple[str, str]] = [\n", + " (\"pew\",\n", + " \"Which policy area has the largest share of respondents who are \"\n", + " \"'very' confident in Donald Trump? 
Give the policy phrase and the \"\n", + " \"percentage.\"),\n", + " (\"social\",\n", + " \"How many people like the Disneyland post?\"),\n", + " (\"linkedin\",\n", + " \"What percentage of LinkedIn users have a household income above \"\n", + " \"$75K?\"),\n", + " (\"gpl\",\n", + " \"Among the leadership programmes listed, which programme has the \"\n", + " \"longest Full-Time duration, and what is that duration?\"),\n", + "]\n", + "\n", + "# Each question fires a single Reasoning Engine call scoped to the\n", + "# document it belongs to, so citations point at the right page.\n", + "sid_to_pdf = {sid: pdf.name for sid, pdf, _, _ in DEMO_DOCS}\n", + "for sid, q in qa_plan:\n", + " pdf_name = sid_to_pdf[sid]\n", + " context = doc_contexts[pdf_name]\n", + " print(\"=\" * 74)\n", + " print(f\"DOC: {DISPLAY_NAME[sid]} ({pdf_name})\")\n", + " print(f\"QUESTION: {q}\")\n", + " t0 = time.time()\n", + " answer = ask_question(q, context)\n", + " print(f\"[{time.time()-t0:.1f}s]\")\n", + " display(Markdown(f\"**ANSWER:**\\n\\n{answer}\"))\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "id": "817c1d36", + "metadata": {}, + "source": [ + "## 8. 
Next steps\n", + "\n", + "This pipeline fits on one page and covers the modalities a real\n", + "PDF throws at you (text, tables, charts, infographics, UI\n", + "screenshots) on a single StepFun VLM surface. For the small demo\n", + "set, the assembled context can be queried in one hosted VLM call;\n", + "larger document collections can extend it with retrieval.\n", + "\n", + "A few directions to take it from here:\n", + "\n", + "- **Try it on your documents.** Extend `DEMO_DOCS` in §3 with any\n", + " `(short_id, pdf, page, modality)` tuples you like and re-run.\n", + " Pages that are rasterised images (scanned reports, marketing\n", + " decks, social-media screenshots, dashboards) are the pipeline's\n", + " sweet spot; they are also exactly the pages most text-first\n", + " parsers drop on the floor.\n", + "- **Batch the picture stage when quota is tight.** For\n", + " picture-dense documents (dashboards, slide decks, screenshot\n", + " catalogues), you can send every picture on a page in a single\n", + " multi-image StepFun VLM call and ask for one JSON entry per\n", + " picture. This trades a little peak quality for roughly 4×\n", + " fewer picture-stage API calls on picture-dense pages; use it\n", + " when API-quota economics matter more than peak accuracy.\n", + "- **Scale past ~25 pages with long-context retrieval.** The\n", + " per-document JSON written in §6 is a ready-made input for a\n", + " vector store: one embedding per page, retrieve the top-k at\n", + " question time, and feed those pages' `assemble_page_context`\n", + " strings straight into the Reasoning Engine. 
Parse's reading\n", + " order + bbox anchors survive retrieval, so the model can still\n", + " cite the right page.\n", + "- **Customise the picture-stage prompts.** The `ANALYSIS_PROMPTS`\n", + " dispatch table in §4.1 is the single place where you teach the\n", + " pipeline what a *good* transcription looks like for your domain\n", + " — product photos, engineering drawings, medical charts, CAD\n", + " screenshots. Adding a new entry is three lines of code.\n", + "\n", + "Two hosted model surfaces. Three roles. StepFun handles both\n", + "\"what does this picture mean\" and \"what is the user actually\n", + "asking\" — that is the all-modality foundation this cookbook demonstrates.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}