
feat(file_processors): add remote docling-serve provider#5412

Merged
franciscojavierarceo merged 7 commits into ogx-ai:main from alinaryan:docling-serve
Apr 15, 2026

Conversation


@alinaryan (Contributor) commented Apr 1, 2026

What does this PR do?

Adds a remote file processor provider that delegates document conversion and chunking to a Docling Serve instance, enabling GPU-accelerated layout-aware document parsing for real-time RAG applications.

Test Plan

End-to-end RAG demo

Prerequisites:

  • Ollama running with a model pulled (e.g. ollama pull llama3.2:3b-instruct-fp16)
  • Docling Serve running: uv run --with "docling-serve" docling-serve run --port 5001
1. Create a config file (e.g. `docling_serve_rag_config.yaml`):

```yaml
version: 2
distro_name: docling-serve-rag-demo
apis:
- file_processors
- files
- inference
- vector_io
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      base_url: ${env.OLLAMA_URL:=http://localhost:11434/v1}
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config:
      trust_remote_code: true
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      persistence:
        backend: kv_default
        namespace: vector_io::faiss
  files:
  - provider_id: localfs
    provider_type: inline::localfs
    config:
      storage_dir: ${env.FILES_STORAGE_DIR:=~/.llama/distributions/docling-serve-rag-demo/files}
      metadata_store:
        backend: sql_default
        table_name: files_metadata
  file_processors:
  - provider_id: docling-serve
    provider_type: remote::docling-serve
    config:
      base_url: ${env.DOCLING_SERVE_URL:=http://localhost:5001}
      api_key: ${env.DOCLING_SERVE_API_KEY:=}
```
2. Start the Llama Stack server:

```bash
OLLAMA_URL=http://localhost:11434/v1 llama stack run docling_serve_rag_config.yaml --port 8321
```

3. Run the RAG pipeline:

Upload PDF

```bash
FILE_ID=$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/document.pdf" \
  -F "purpose=assistants" | jq -r '.id')
echo "File ID: $FILE_ID"
```

Process with docling-serve

```bash
curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=$FILE_ID" \
  -F 'chunking_strategy={"type":"static","static":{"max_chunk_size_tokens":600,"chunk_overlap_tokens":75}}' \
  | jq '{processor: .metadata.processor, n_chunks: (.chunks | length), processing_time_ms: .metadata.processing_time_ms}'
```
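The static strategy above windows the document into fixed-size token chunks with overlap. A rough client-side sketch of that windowing (illustrative only: Docling Serve's `auto` strategy uses the semantic HybridChunker instead, and the whitespace split below is a stand-in for a real tokenizer):

```python
def chunk_tokens(tokens: list[str], max_chunk_size_tokens: int = 600,
                 chunk_overlap_tokens: int = 75) -> list[list[str]]:
    """Split a token list into fixed-size windows with overlap.

    Mimics a "static" chunking strategy; not Docling Serve's implementation.
    """
    if max_chunk_size_tokens <= chunk_overlap_tokens:
        raise ValueError("overlap must be smaller than chunk size")
    step = max_chunk_size_tokens - chunk_overlap_tokens
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size_tokens])
        if start + max_chunk_size_tokens >= len(tokens):
            break
    return chunks


# Naive whitespace "tokenizer" as a stand-in for a real one.
tokens = ("word " * 1000).split()
chunks = chunk_tokens(tokens, max_chunk_size_tokens=600, chunk_overlap_tokens=75)
print(len(chunks), [len(c) for c in chunks])  # 2 chunks; 75 tokens shared
```

The overlap means consecutive chunks share 75 tokens, so a sentence falling on a chunk boundary is still retrievable from at least one chunk.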

Create vector store

```bash
VECTOR_STORE_ID=$(curl -sS -X POST http://localhost:8321/v1/vector_stores \
  -H "Content-Type: application/json" \
  -d '{
    "name": "docling-serve-rag-demo",
    "metadata": {
      "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
      "embedding_dimension": 768
    }
  }' | jq -r '.id')
echo "Vector Store ID: $VECTOR_STORE_ID"
```

Insert file into vector store

```bash
curl -sS -X POST http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID/files \
  -H "Content-Type: application/json" \
  -d '{
    "file_id": "'"$FILE_ID"'",
    "chunking_strategy": {"type": "static", "static": {"max_chunk_size_tokens": 600, "chunk_overlap_tokens": 75}}
  }' | jq '{id: .id, status: .status}'
```

Verify indexing

```bash
curl -sS http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID \
  | jq '{status: .status, file_counts: .file_counts}'
```

Search the vector store

```bash
curl -sS -X POST http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID/search \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key requirements?", "max_chunks": 50}' \
  | jq '{results: [.data[0:2][] | {score: .score, preview: (.content[0].text[0:200] + "...")}]}'
```

RAG: retrieve context and generate answer

```bash
CONTEXT=$(curl -sS -X POST http://localhost:8321/v1/vector_stores/$VECTOR_STORE_ID/search \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the key requirements?", "max_chunks": 50}' \
  | jq -r '.data[0:3][] | .content[0].text' | tr '"' "'" | tr '\n' ' ' | head -c 1500)

curl -sS -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/llama3.2:3b-instruct-fp16",
    "messages": [
      {"role": "system", "content": "Answer questions using only the provided context."},
      {"role": "user", "content": "Context: '"$CONTEXT"'\n\nQuestion: What are the key requirements?\n\nAnswer:"}
    ],
    "max_tokens": 300,
    "temperature": 0.3
  }' | jq -r '.choices[0].message.content'
```

@meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) Apr 1, 2026
@alinaryan force-pushed the docling-serve branch 2 times, most recently from f41025e to b62d9f0, April 8, 2026 14:19
Add a remote file processor provider that delegates document conversion
and chunking to a Docling Serve instance, enabling GPU-accelerated
layout-aware document parsing for real-time RAG applications.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
```python
headers = self._get_headers()

options = {
    "to_formats": '["md"]',
```
Collaborator: feels like this could go in `FileProcessorConfig` but can be a follow-up PR later

@alinaryan (Contributor, Author): will add in follow-up
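A possible shape for that follow-up, sketched with pydantic. This is purely illustrative: the field names and defaults below are assumptions, not the merged provider API.

```python
from pydantic import BaseModel, Field


class DoclingServeFileProcessorConfig(BaseModel):
    """Illustrative sketch: lift request options into the provider config."""

    base_url: str = "http://localhost:5001/v1"
    # Hypothetical field: output formats to request from Docling Serve,
    # instead of hard-coding '["md"]' in the request body.
    to_formats: list[str] = Field(default_factory=lambda: ["md"])
    # The review thread below also suggests making the timeout configurable.
    timeout: float = 300.0
```

With a shape like this, the adapter could build its request options from `config.to_formats` rather than a literal.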

Comment thread docs/docs/providers/file_processors/remote_docling-serve.mdx
@mattf previously requested changes Apr 13, 2026
Comment thread src/llama_stack/providers/remote/file_processor/docling_serve/config.py Outdated
Comment thread src/llama_stack/providers/remote/file_processor/docling_serve/docling_serve.py Outdated
…cessor

Use SecretStr for api_key config field with get_secret_value() in headers,
validate files_api dependency is non-None in get_adapter_impl, type
files_api as Files instead of Any, use file_id as document_id when
available, simplify filename fallback to "upload", and include /v1 in
base_url default to match standard Docling Serve API convention.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
…ocessor

Add detailed description to the docling-serve provider spec including
features, usage examples with Docker and run.yaml, and links to the
Docling Serve documentation repository.

Signed-off-by: Alina Ryan <aliryan@redhat.com>
Signed-off-by: Alina Ryan <aliryan@redhat.com>
@alinaryan
Contributor Author

spoke offline with @franciscojavierarceo - I'm going to run a speed/scale analysis on some PDFs and will post the results here

@mattf dismissed their stale review April 14, 2026 15:09

comments addressed

@alinaryan
Contributor Author

Performance Benchmark Results

Benchmarked the remote::docling-serve file processor against a local Docling Serve instance using real-world PDFs from a mixed corpus (103 files, 49KB–63MB). All tests used chunking_strategy: auto (Docling's HybridChunker, which splits documents at semantic boundaries like headings and sections) via POST /v1alpha/file-processors/process.

| Term | Meaning |
|------|---------|
| Concurrency | Number of requests in-flight simultaneously. c=1 means sequential; c=10 means 10 requests hitting the server at once. |
| p50 / p95 | Median and 95th-percentile latency. p50 is what a typical user experiences; p95 captures the slow tail, the worst 5% of requests. |
| Throughput | Files successfully processed per second of wall-clock time. |
| OK/Total | Successful requests out of total attempted. Failures are 500 errors from Docling Serve under resource pressure. |
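For concreteness, the metrics above are plain percentiles and a wall-clock ratio over per-request timings. A sketch using Python's `statistics` module (this illustrates the definitions only; it is not the benchmark script from this branch):

```python
import statistics


def p50_p95(latencies: list[float]) -> tuple[float, float]:
    """Median and 95th-percentile latency from per-request timings."""
    # quantiles(n=100) returns 99 cut points: index 49 -> p50, index 94 -> p95.
    qs = statistics.quantiles(latencies, n=100, method="inclusive")
    return qs[49], qs[94]


def throughput(ok: int, wall_clock_seconds: float) -> float:
    """Files successfully processed per second of wall-clock time."""
    return ok / wall_clock_seconds


# One slow outlier drags p95 far above p50.
lats = [5.9, 6.0, 6.0, 6.1, 24.1]
p50, p95 = p50_p95(lats)
print(p50, p95)           # median 6.0; p95 interpolated toward the outlier
print(throughput(5, 25.0))
```

This is why the tables report both: p50 stays flat while p95 exposes the tail introduced by queuing.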

Single-file latency (concurrency=1)

| File | Size | Pages | Latency (p50) | Chunks | Throughput |
|------|------|-------|---------------|--------|------------|
| Small | 49KB | 2 | 6.0s | 7 | 0.17 files/s |
| Medium | 828KB | 40 | 16.1s | 44 | 0.06 files/s |
| Large | 63MB | 86 | 78.4s | 178 | 0.01 files/s |

Behavior under concurrent load

| File | Concurrency | OK/Total | p50 | p95 | Throughput |
|------|-------------|----------|-----|-----|------------|
| Small (2pg) | 1 | 5/5 | 6.0s | 6.0s | 0.17/s |
| Small (2pg) | 5 | 5/5 | 18.1s | 24.1s | 0.21/s |
| Small (2pg) | 10 | 5/5 | 18.0s | 24.1s | 0.21/s |
| Medium (40pg) | 1 | 5/5 | 16.1s | 16.1s | 0.06/s |
| Medium (40pg) | 5 | 5/5 | 58.9s | 73.0s | 0.07/s |
| Medium (40pg) | 10 | 5/5 | 58.9s | 73.0s | 0.07/s |
| Large (86pg) | 1 | 5/5 | 78.4s | 82.3s | 0.01/s |
| Large (86pg) | 5 | 2/5 | 115.2s | | 0.02/s |
| Large (86pg) | 10 | 0/5 | total failure | | |

Deep dive: large file failures

The large file used was 37-02-FullBook.pdf — a Johns Hopkins APL Technical Digest (academic research papers). 86 pages, 63.4MB, 297 embedded images (charts, diagrams, photos), created
in Adobe InDesign. This is a realistic worst-case for document processing: dense, image-heavy, multi-column layout.

| Concurrency | Succeeded | Failed | Time per request (OK) | Time per request (fail) |
|-------------|-----------|--------|-----------------------|-------------------------|
| 1 (sequential) | 3/3 | 0 | ~76s | |
| 3 | 3/5 | 2 | ~115s | ~122s |
| 5 | 1/5 | 4 | ~114s | ~123s |

Why they fail: each request forces Docling Serve to load a 63MB PDF into memory, extract 297 images, run layout analysis on 86 pages, and chunk the result. Sequentially this works fine (~76s per file). But at concurrency=3, that is ~190MB of PDFs plus image-processing buffers all competing for resources simultaneously. The requests that acquire resources first succeed (at a degraded ~115s instead of 76s), while the rest are killed by Docling Serve after ~122s. At concurrency=5, only one request survives.

Observations

  • Single-request performance is solid. The provider correctly delegates to Docling Serve and returns well-structured chunks with metadata. A 40-page PDF processes in ~16s with 44
    semantic chunks — reasonable for layout-aware parsing.
  • Latency degrades linearly under concurrency. Small files: 6s → 18s (3x). Medium files: 16s → 59s (3.7x). Requests queue behind each other because Docling Serve serializes work
    internally.
  • Throughput plateaus immediately. Concurrency 5 vs 10 yields identical throughput for small and medium files. The system is saturated — more concurrent requests just increase wait
    time without processing any faster.
  • Large files crash under load. This is the most critical finding. Image-heavy documents that process fine sequentially become unreliable at even moderate concurrency. Without request
    queuing, a burst of large file uploads can take down the processing pipeline entirely.
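One client-side mitigation for that last finding is to cap in-flight requests before they reach Docling Serve. A minimal sketch with `asyncio.Semaphore`; `process_file` here is a simulated stand-in for the real HTTP call, and the cap value is illustrative:

```python
import asyncio

MAX_IN_FLIGHT = 2  # illustrative cap; tune to what the server survives


async def process_file(name: str) -> str:
    # Stand-in for the real POST to the file-processors endpoint.
    await asyncio.sleep(0.01)
    return f"{name}: ok"


async def process_all(files: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(name: str) -> str:
        # At most MAX_IN_FLIGHT coroutines pass this point concurrently;
        # the rest wait here instead of piling onto the server.
        async with sem:
            return await process_file(name)

    return await asyncio.gather(*(bounded(f) for f in files))


results = asyncio.run(process_all([f"doc{i}.pdf" for i in range(5)]))
print(results)
```

With this shape, a burst of large uploads degrades to queuing delay on the client rather than 500s or killed requests on the server.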

Test setup

  • Docling Serve: running locally via uv
  • Llama Stack: remote::docling-serve provider, inline::faiss vector store, inline::sentence-transformers embeddings
  • Benchmark script: benchmarks/docling_serve_bench.py (included in this branch)
  • Test data: real-world PDFs (financial reports, product manuals, legal filings, academic research, technical docs)

```python
    "to_formats": '["md"]',
}

async with httpx.AsyncClient(timeout=300.0) as client:
```
Collaborator: in a follow-up we can make the timeout part of the file processor config.

@franciscojavierarceo left a comment:

🚢 🚢 🚢

@franciscojavierarceo added this pull request to the merge queue Apr 15, 2026
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 15, 2026
@franciscojavierarceo added this pull request to the merge queue Apr 15, 2026
Merged via the queue into ogx-ai:main with commit 75d8315 Apr 15, 2026
65 checks passed
@franciscojavierarceo deleted the docling-serve branch April 15, 2026 19:11