
feat(file_processors): add remote docling-serve provider#5412

Merged
franciscojavierarceo merged 7 commits into ogx-ai:main from alinaryan:docling-serve
Apr 15, 2026

Conversation


@alinaryan (Contributor) commented Apr 1, 2026

What does this PR do?

Adds a remote file processor provider that delegates document conversion and chunking to a Docling Serve instance, enabling GPU-accelerated layout-aware document parsing for real-time RAG applications.

Test Plan

End-to-end RAG demo

Prerequisites:

  • Ollama running with a model pulled (e.g. ollama pull llama3.2:3b-instruct-fp16)
  • Docling Serve running: uv run --with "docling-serve" docling-serve run --port 5001
1. Create a config file (e.g. `docling_serve_rag_config.yaml`):

```yaml
version: 2
distro_name: docling-serve-rag-demo
apis:
- file_processors
- files
- inference
- vector_io
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      base_url: ${env.OLLAMA_URL:=http://localhost:11434/v1}
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config:
      trust_remote_code: true
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      persistence:
        backend: kv_default
        namespace: vector_io::faiss
  files:
  - provider_id: localfs
    provider_type: inline::localfs
    config:
      storage_dir: ${env.FILES_STORAGE_DIR:=~/.llama/distributions/docling-serve-rag-demo/files}
      metadata_store:
        backend: sql_default
        table_name: files_metadata
  file_processors:
  - provider_id: docling-serve
    provider_type: remote::docling-serve
    config:
      base_url: ${env.DOCLING_SERVE_URL:=http://localhost:5001}
      api_key: ${env.DOCLING_SERVE_API_KEY:=}
```
2. Start the Llama Stack server:

```bash
OLLAMA_URL=http://localhost:11434/v1 llama stack run docling_serve_rag_config.yaml --port 8321
```

3. Run the RAG pipeline:

Upload PDF

```bash
FILE_ID=$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/document.pdf" \
  -F "purpose=assistants" | jq -r '.id')
echo "File ID: $FILE_ID"
```

Process with docling-serve

```bash
curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=$FILE_ID" \
  -F 'chunking_strategy={"type":"static","static":{"max_chunk_size_tokens":600,"chunk_overlap_tokens":75}}' \
  | jq '{processor: .metadata.processor, n_chunks: (.chunks | length), processing_time_ms: .metadata.processing_time_ms}'
```
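The static strategy above windows the document into fixed-size token chunks with overlap. A rough client-side sketch of that windowing (illustrative only: Docling Serve's `auto` strategy uses the semantic HybridChunker instead, and the whitespace split below is a stand-in for a real tokenizer):

```python
def chunk_tokens(tokens: list[str], max_chunk_size_tokens: int = 600,
                 chunk_overlap_tokens: int = 75) -> list[list[str]]:
    """Split a token list into fixed-size windows with overlap.

    Mimics a "static" chunking strategy; not Docling Serve's implementation.
    """
    if max_chunk_size_tokens <= chunk_overlap_tokens:
        raise ValueError("overlap must be smaller than chunk size")
    step = max_chunk_size_tokens - chunk_overlap_tokens
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size_tokens])
        if start + max_chunk_size_tokens >= len(tokens):
            break
    return chunks


# Naive whitespace "tokenizer" as a stand-in for a real one.
tokens = ("word " * 1000).split()
chunks = chunk_tokens(tokens, max_chunk_size_tokens=600, chunk_overlap_tokens=75)
print(len(chunks), [len(c) for c in chunks])  # 2 chunks; 75 tokens shared
```

The overlap means consecutive chunks share 75 tokens, so a sentence falling on a chunk boundary is still retrievable from at least one chunk.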

Create vector store

```bash
VECTOR_STORE_ID=$(curl -sS -X POST http://localhost:8321/v1/vector_stores \
  -H "Content-Type: application/json" \
  -d '{
    "name": "docling-serve-rag-demo",
    "metadata": {
      "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
      "embedding_dimension": 768
    }
  }' | jq -r '.id')
echo "Vector Store ID: $VECTOR_STORE_ID"
```

Insert file into vector store

```bash
curl -sS -X POST http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID/files \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "'"$FILE_ID"'",
    "chunking_strategy": {"type": "static", "static": {"max_chunk_size_tokens": 600, "chunk_overlap_tokens": 75}}
  }' | jq '{id: .id, status: .status}'
```

Verify indexing

```bash
curl -sS http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID \
  | jq '{status: .status, file_counts: .file_counts}'
```

Search the vector store

```bash
curl -sS -X POST http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID/search \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key requirements?", "max_chunks": 50}' \
  | jq '{results: [.data[0:2][] | {score: .score, preview: (.content[0].text[0:200] + "...")}]}'
```

RAG: retrieve context and generate answer

```bash
CONTEXT=$(curl -sS -X POST http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID/search \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key requirements?", "max_chunks": 50}' \
  | jq -r '.data[0:3][] | .content[0].text' | tr '"' "'" | tr '\n' ' ' | head -c 1500)

curl -sS -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.2:3b-instruct-fp16",
    "messages": [
      {"role": "system", "content": "Answer questions using only the provided context."},
      {"role": "user", "content": "Context: '"$CONTEXT"'\n\nQuestion: What are the key requirements?\n\nAnswer:"}
    ],
    "max_tokens": 300,
    "temperature": 0.3
  }' | jq -r '.choices[0].message.content'
```

@meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) Apr 1, 2026
@alinaryan force-pushed the docling-serve branch 2 times, most recently from f41025e to b62d9f0, April 8, 2026 14:19
Add a remote file processor provider that delegates document conversion
and chunking to a Docling Serve instance, enabling GPU-accelerated
layout-aware document parsing for real-time RAG applications.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
```python
headers = self._get_headers()

options = {
    "to_formats": '["md"]',
```
Collaborator: feels like this could go in `FileProcessorConfig` but can be a follow-up PR later

@alinaryan (Contributor, Author): will add in follow-up
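A possible shape for that follow-up, sketched with pydantic. This is purely illustrative: the field names and defaults below are assumptions, not the merged provider API.

```python
from pydantic import BaseModel, Field


class DoclingServeFileProcessorConfig(BaseModel):
    """Illustrative sketch: lift request options into the provider config."""

    base_url: str = "http://localhost:5001/v1"
    # Hypothetical field: output formats to request from Docling Serve,
    # instead of hard-coding '["md"]' in the request body.
    to_formats: list[str] = Field(default_factory=lambda: ["md"])
    # The review thread below also suggests making the timeout configurable.
    timeout: float = 300.0
```

With a shape like this, the adapter could build its request options from `config.to_formats` rather than a literal.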

Comment thread docs/docs/providers/file_processors/remote_docling-serve.mdx
@mattf previously requested changes Apr 13, 2026
Comment thread src/llama_stack/providers/remote/file_processor/docling_serve/config.py Outdated
Comment thread src/llama_stack/providers/remote/file_processor/docling_serve/docling_serve.py Outdated
…cessor

Use SecretStr for api_key config field with get_secret_value() in headers,
validate files_api dependency is non-None in get_adapter_impl, type
files_api as Files instead of Any, use file_id as document_id when
available, simplify filename fallback to "upload", and include /v1 in
base_url default to match standard Docling Serve API convention.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
…ocessor

Add detailed description to the docling-serve provider spec including
features, usage examples with Docker and run.yaml, and links to the
Docling Serve documentation repository.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
Signed-off-by: Alina Ryan <aliryan@redhat.com>
@alinaryan
Contributor Author

spoke offline with @franciscojavierarceo - I'm going to run a speed/scale analysis on some PDFs and will post the results here

@mattf dismissed their stale review April 14, 2026 15:09

comments addressed

@alinaryan
Contributor Author

Performance Benchmark Results

Benchmarked the remote::docling-serve file processor against a local Docling Serve instance using real-world PDFs from a mixed corpus (103 files, 49KB–63MB). All tests used chunking_strategy: auto (Docling's HybridChunker, which splits documents at semantic boundaries like headings and sections) via POST /v1alpha/file-processors/process.

| Term | Meaning |
|------|---------|
| Concurrency | Number of requests in-flight simultaneously. c=1 means sequential; c=10 means 10 requests hitting the server at once. |
| p50 / p95 | Median and 95th-percentile latency. p50 is what a typical user experiences; p95 captures the slow tail, the worst 5% of requests. |
| Throughput | Files successfully processed per second of wall-clock time. |
| OK/Total | Successful requests out of total attempted. Failures are 500 errors from Docling Serve under resource pressure. |
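For concreteness, the metrics above are plain percentiles and a wall-clock ratio over per-request timings. A sketch using Python's `statistics` module (this illustrates the definitions only; it is not the benchmark script from this branch):

```python
import statistics


def p50_p95(latencies: list[float]) -> tuple[float, float]:
    """Median and 95th-percentile latency from per-request timings."""
    # quantiles(n=100) returns 99 cut points: index 49 -> p50, index 94 -> p95.
    qs = statistics.quantiles(latencies, n=100, method="inclusive")
    return qs[49], qs[94]


def throughput(ok: int, wall_clock_seconds: float) -> float:
    """Files successfully processed per second of wall-clock time."""
    return ok / wall_clock_seconds


# One slow outlier drags p95 far above p50.
lats = [5.9, 6.0, 6.0, 6.1, 24.1]
p50, p95 = p50_p95(lats)
print(p50, p95)           # median 6.0; p95 interpolated toward the outlier
print(throughput(5, 25.0))
```

This is why the tables report both: p50 stays flat while p95 exposes the tail introduced by queuing.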

Single-file latency (concurrency=1)

| File | Size | Pages | Latency (p50) | Chunks | Throughput |
|------|------|-------|---------------|--------|------------|
| Small | 49KB | 2 | 6.0s | 7 | 0.17 files/s |
| Medium | 828KB | 40 | 16.1s | 44 | 0.06 files/s |
| Large | 63MB | 86 | 78.4s | 178 | 0.01 files/s |

Behavior under concurrent load

| File | Concurrency | OK/Total | p50 | p95 | Throughput |
|------|-------------|----------|-----|-----|------------|
| Small (2pg) | 1 | 5/5 | 6.0s | 6.0s | 0.17/s |
| Small (2pg) | 5 | 5/5 | 18.1s | 24.1s | 0.21/s |
| Small (2pg) | 10 | 5/5 | 18.0s | 24.1s | 0.21/s |
| Medium (40pg) | 1 | 5/5 | 16.1s | 16.1s | 0.06/s |
| Medium (40pg) | 5 | 5/5 | 58.9s | 73.0s | 0.07/s |
| Medium (40pg) | 10 | 5/5 | 58.9s | 73.0s | 0.07/s |
| Large (86pg) | 1 | 5/5 | 78.4s | 82.3s | 0.01/s |
| Large (86pg) | 5 | 2/5 | 115.2s | | 0.02/s |
| Large (86pg) | 10 | 0/5 | total failure | | |

Deep dive: large file failures

The large file used was 37-02-FullBook.pdf — a Johns Hopkins APL Technical Digest (academic research papers). 86 pages, 63.4MB, 297 embedded images (charts, diagrams, photos), created
in Adobe InDesign. This is a realistic worst-case for document processing: dense, image-heavy, multi-column layout.

| Concurrency | Succeeded | Failed | Time per request (OK) | Time per request (fail) |
|-------------|-----------|--------|-----------------------|-------------------------|
| 1 (sequential) | 3/3 | 0 | ~76s | |
| 3 | 3/5 | 2 | ~115s | ~122s |
| 5 | 1/5 | 4 | ~114s | ~123s |

Why they fail: each request forces Docling Serve to load a 63MB PDF into memory, extract 297 images, run layout analysis on 86 pages, and chunk the result. Sequentially this works fine (~76s per file). But at concurrency=3, that is ~190MB of PDFs plus image-processing buffers all competing for resources simultaneously. The requests that acquire resources first succeed (at a degraded ~115s instead of 76s), while the rest are killed by Docling Serve after ~122s. At concurrency=5, only one request survives.

Observations

  • Single-request performance is solid. The provider correctly delegates to Docling Serve and returns well-structured chunks with metadata. A 40-page PDF processes in ~16s with 44
    semantic chunks — reasonable for layout-aware parsing.
  • Latency degrades linearly under concurrency. Small files: 6s → 18s (3x). Medium files: 16s → 59s (3.7x). Requests queue behind each other because Docling Serve serializes work
    internally.
  • Throughput plateaus immediately. Concurrency 5 vs 10 yields identical throughput for small and medium files. The system is saturated — more concurrent requests just increase wait
    time without processing any faster.
  • Large files crash under load. This is the most critical finding. Image-heavy documents that process fine sequentially become unreliable at even moderate concurrency. Without request
    queuing, a burst of large file uploads can take down the processing pipeline entirely.
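One client-side mitigation for that last finding is to cap in-flight requests before they reach Docling Serve. A minimal sketch with `asyncio.Semaphore`; `process_file` here is a simulated stand-in for the real HTTP call, and the cap value is illustrative:

```python
import asyncio

MAX_IN_FLIGHT = 2  # illustrative cap; tune to what the server survives


async def process_file(name: str) -> str:
    # Stand-in for the real POST to the file-processors endpoint.
    await asyncio.sleep(0.01)
    return f"{name}: ok"


async def process_all(files: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(name: str) -> str:
        # At most MAX_IN_FLIGHT coroutines pass this point concurrently;
        # the rest wait here instead of piling onto the server.
        async with sem:
            return await process_file(name)

    return await asyncio.gather(*(bounded(f) for f in files))


results = asyncio.run(process_all([f"doc{i}.pdf" for i in range(5)]))
print(results)
```

With this shape, a burst of large uploads degrades to queuing delay on the client rather than 500s or killed requests on the server.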

Test setup

  • Docling Serve: running locally via uv
  • Llama Stack: remote::docling-serve provider, inline::faiss vector store, inline::sentence-transformers embeddings
  • Benchmark script: benchmarks/docling_serve_bench.py (included in this branch)
  • Test data: real-world PDFs (financial reports, product manuals, legal filings, academic research, technical docs)

```python
    "to_formats": '["md"]',
}

async with httpx.AsyncClient(timeout=300.0) as client:
```
Collaborator: in a follow-up we can make the timeout part of the file processor config.

@franciscojavierarceo left a comment:

🚢 🚢 🚢

@franciscojavierarceo added this pull request to the merge queue Apr 15, 2026
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 15, 2026
@franciscojavierarceo added this pull request to the merge queue Apr 15, 2026
Merged via the queue into ogx-ai:main with commit 75d8315 Apr 15, 2026
65 checks passed
@franciscojavierarceo deleted the docling-serve branch April 15, 2026 19:11