Skip to content

Add support for nemotron-ocr-v2#1777

Merged
jdye64 merged 3 commits intoNVIDIA:mainfrom
jdye64:nemotron-ocr-vs-supprt
Apr 21, 2026
Merged

Add support for nemotron-ocr-v2#1777
jdye64 merged 3 commits intoNVIDIA:mainfrom
jdye64:nemotron-ocr-vs-supprt

Conversation

@jdye64
Copy link
Copy Markdown
Collaborator

@jdye64 jdye64 commented Apr 1, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@jdye64 jdye64 force-pushed the nemotron-ocr-vs-supprt branch from 7ad6073 to 53c1153 Compare April 1, 2026 23:39
@jdye64 jdye64 marked this pull request as ready for review April 21, 2026 14:58
@jdye64 jdye64 requested review from a team as code owners April 21, 2026 14:58
@jdye64 jdye64 requested a review from jioffe502 April 21, 2026 14:58
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR introduces NemotronOCRV2, a new GPU-backed OCR model wrapper, along with matching GPU and CPU actor classes (OCRV2Actor, OCRV2GPUActor, OCRV2CPUActor) and a new OCRV2Actor archetype operator in ocr.py. It also renames the shared ocr_page_elements output column from ocr_v1 to ocr, applying that change to both existing v1 actors and the new v2 actors.

  • The ocr_v1ocr rename in shared.py is a breaking change that affects all existing OCRActor/OCRCPUActor consumers without a deprecation cycle — external callers reading the ocr_v1 column will silently receive an empty column after upgrading.

Confidence Score: 3/5

Not safe to merge until the breaking ocr_v1ocr column rename is intentionally acknowledged and documented, or mitigated with a deprecation path.

One P1 finding: the ocr_v1ocr column rename in shared.py is a breaking API change that silently breaks all existing v1 OCR consumers without any deprecation notice. This warrants a score below 4.

nemo_retriever/src/nemo_retriever/ocr/shared.py — the column rename is the primary concern. Also note nemo_retriever/src/nemo_retriever/model/local/nemotron_ocr_v2.py and the new actor files lack unit tests.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/model/local/nemotron_ocr_v2.py New NemotronOCRV2 model wrapper; closely mirrors v1, includes TRT best-effort compilation, but has a docstring/code mismatch on str file-path handling and no unit tests.
nemo_retriever/src/nemo_retriever/ocr/gpu_ocrv2.py New GPU OCRV2 Ray actor; correctly uses the shared ocr_page_elements function and writes error payloads to out["ocr"]; missing dedicated unit tests.
nemo_retriever/src/nemo_retriever/ocr/cpu_ocrv2.py New CPU OCRV2 actor defaulting to the build.nvidia.com endpoint; mirrors GPU actor logic cleanly; no unit tests.
nemo_retriever/src/nemo_retriever/ocr/shared.py Renames output column ocr_v1 → ocr affecting both v1 and v2 actors; breaking change for external consumers without deprecation notice.
nemo_retriever/src/nemo_retriever/ocr/ocr.py Adds OCRV2Actor archetype and module-level lazy-load entries; straightforward addition following existing patterns.
nemo_retriever/src/nemo_retriever/model/local/init.py Registers NemotronOCRV2 in all and lazy-load getattr; clean and consistent with existing pattern.
nemo_retriever/tests/test_actor_operators.py Updates existing OCRActor tests to reflect the ocr_v1 → ocr column rename; no new OCRV2 tests added.
nemo_retriever/tests/test_operator_flags_and_cpu_actors.py Updates OCRCPUActor test assertion for ocr_v1 → ocr column rename.

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class ArchetypeOperator {
        +prefers_cpu_variant()
        +cpu_variant_class()
        +gpu_variant_class()
    }
    class OCRActor {
        +prefers_cpu_variant()
        +cpu_variant_class()
        +gpu_variant_class()
        +__init__(**ocr_kwargs)
    }
    class OCRV2Actor {
        +prefers_cpu_variant()
        +cpu_variant_class()
        +gpu_variant_class()
        +__init__(**ocr_kwargs)
    }
    class OCRV2GPUActor {
        -_model: NemotronOCRV2
        -_nim_client: NIMClient
        +process(data) DataFrame
        +__call__(batch_df) DataFrame
    }
    class OCRV2CPUActor {
        -DEFAULT_INVOKE_URL: str
        -_nim_client: NIMClient
        +process(data) DataFrame
        +__call__(batch_df) DataFrame
    }
    class NemotronOCRV2 {
        +invoke(input_data) Any
        +preprocess(tensor) Tensor
        +model_name: str
        -_maybe_compile_submodules()
        -_tensor_to_png_b64(img) str
    }
    class ocr_page_elements {
        <<function>>
        +returns out["ocr"] column
    }
    ArchetypeOperator <|-- OCRActor
    ArchetypeOperator <|-- OCRV2Actor
    OCRV2Actor ..> OCRV2GPUActor : gpu_variant_class
    OCRV2Actor ..> OCRV2CPUActor : cpu_variant_class
    OCRV2GPUActor --> NemotronOCRV2 : local model
    OCRV2GPUActor --> ocr_page_elements : calls
    OCRV2CPUActor --> ocr_page_elements : calls
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ocr/shared.py
Line: 802

Comment:
**Breaking rename of `ocr_v1``ocr` column**

Renaming the output column from `ocr_v1` to `ocr` is a breaking change for any caller that reads this column from the result DataFrame. Both `OCRActor` (v1) and the new `OCRV2Actor` share this function, so v1 consumers lose their existing column name without a deprecation cycle. Per the `api-backward-compatibility` rule, fields must not be removed or renamed without a deprecation cycle and documentation notice. Consider keeping `ocr_v1` for the v1 path (passing a parameter, as suggested in the previous review) and using `ocr` only for the new v2 path — or at minimum documenting this as a breaking change in the PR description and changelog.

**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ocr/gpu_ocrv2.py
Line: 1-20

Comment:
**No unit tests for new OCRV2 actors**

`gpu_ocrv2.py`, `cpu_ocrv2.py`, and `nemotron_ocr_v2.py` are new source files with zero dedicated tests. The test file changes only update column-name assertions in pre-existing v1 tests. Per the `test-mirrors-source-structure` and `test-coverage-new-code` rules, new business logic must include unit tests covering at least the happy path and the error path. At a minimum, `OCRV2Actor.__call__` error recovery, `OCRV2CPUActor` initialization defaults, and `NemotronOCRV2.invoke` for the tensor/base64 paths should be covered.

**Rule Used:** New functionality must include corresponding unit ... ([source](https://app.greptile.com/review/custom-context?memory=test-coverage-new-code))

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/nemotron_ocr_v2.py
Line: 178-179

Comment:
**Docstring claims file-path detection that the code does not perform**

The `invoke` docstring states `str` is treated as base64 "unless it is an existing file path", but lines 178-179 unconditionally encode any string to bytes and forward it to the model — no `os.path.exists()` check is performed. Either update the docstring to reflect the actual behavior (strings are always treated as base64), or add the file-path check as described.

```suggestion
        if isinstance(input_data, str):
            if os.path.isfile(input_data):
                with open(input_data, "rb") as f:
                    return self._model(f.read(), merge_level=merge_level)
            return self._model(input_data.encode("utf-8"), merge_level=merge_level)
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (3): Last reviewed commit: "Greptile suggestions" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/ocr/ocr.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/model/local/nemotron_ocr_v2.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/ocr/shared.py
@jdye64 jdye64 merged commit 3017ad2 into NVIDIA:main Apr 21, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants