Conversation
7ad6073 to
53c1153
Compare
Greptile SummaryThis PR introduces
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/model/local/nemotron_ocr_v2.py | New NemotronOCRV2 model wrapper; closely mirrors v1, includes TRT best-effort compilation, but has a docstring/code mismatch on str file-path handling and no unit tests. |
| nemo_retriever/src/nemo_retriever/ocr/gpu_ocrv2.py | New GPU OCRV2 Ray actor; correctly uses the shared ocr_page_elements function and writes error payloads to out["ocr"]; missing dedicated unit tests. |
| nemo_retriever/src/nemo_retriever/ocr/cpu_ocrv2.py | New CPU OCRV2 actor defaulting to the build.nvidia.com endpoint; mirrors GPU actor logic cleanly; no unit tests. |
| nemo_retriever/src/nemo_retriever/ocr/shared.py | Renames output column ocr_v1 → ocr affecting both v1 and v2 actors; breaking change for external consumers without deprecation notice. |
| nemo_retriever/src/nemo_retriever/ocr/ocr.py | Adds OCRV2Actor archetype and module-level lazy-load entries; straightforward addition following existing patterns. |
| nemo_retriever/src/nemo_retriever/model/local/init.py | Registers NemotronOCRV2 in all and lazy-load getattr; clean and consistent with existing pattern. |
| nemo_retriever/tests/test_actor_operators.py | Updates existing OCRActor tests to reflect the ocr_v1 → ocr column rename; no new OCRV2 tests added. |
| nemo_retriever/tests/test_operator_flags_and_cpu_actors.py | Updates OCRCPUActor test assertion for ocr_v1 → ocr column rename. |
Class Diagram
%%{init: {'theme': 'neutral'}}%%
classDiagram
class ArchetypeOperator {
+prefers_cpu_variant()
+cpu_variant_class()
+gpu_variant_class()
}
class OCRActor {
+prefers_cpu_variant()
+cpu_variant_class()
+gpu_variant_class()
+__init__(**ocr_kwargs)
}
class OCRV2Actor {
+prefers_cpu_variant()
+cpu_variant_class()
+gpu_variant_class()
+__init__(**ocr_kwargs)
}
class OCRV2GPUActor {
-_model: NemotronOCRV2
-_nim_client: NIMClient
+process(data) DataFrame
+__call__(batch_df) DataFrame
}
class OCRV2CPUActor {
-DEFAULT_INVOKE_URL: str
-_nim_client: NIMClient
+process(data) DataFrame
+__call__(batch_df) DataFrame
}
class NemotronOCRV2 {
+invoke(input_data) Any
+preprocess(tensor) Tensor
+model_name: str
-_maybe_compile_submodules()
-_tensor_to_png_b64(img) str
}
class ocr_page_elements {
<<function>>
+returns out["ocr"] column
}
ArchetypeOperator <|-- OCRActor
ArchetypeOperator <|-- OCRV2Actor
OCRV2Actor ..> OCRV2GPUActor : gpu_variant_class
OCRV2Actor ..> OCRV2CPUActor : cpu_variant_class
OCRV2GPUActor --> NemotronOCRV2 : local model
OCRV2GPUActor --> ocr_page_elements : calls
OCRV2CPUActor --> ocr_page_elements : calls
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ocr/shared.py
Line: 802
Comment:
**Breaking rename of `ocr_v1` → `ocr` column**
Renaming the output column from `ocr_v1` to `ocr` is a breaking change for any caller that reads this column from the result DataFrame. Both `OCRActor` (v1) and the new `OCRV2Actor` share this function, so v1 consumers lose their existing column name without a deprecation cycle. Per the `api-backward-compatibility` rule, fields must not be removed or renamed without a deprecation cycle and documentation notice. Consider keeping `ocr_v1` for the v1 path (passing a parameter, as suggested in the previous review) and using `ocr` only for the new v2 path — or at minimum documenting this as a breaking change in the PR description and changelog.
**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ocr/gpu_ocrv2.py
Line: 1-20
Comment:
**No unit tests for new OCRV2 actors**
`gpu_ocrv2.py`, `cpu_ocrv2.py`, and `nemotron_ocr_v2.py` are new source files with zero dedicated tests. The test file changes only update column-name assertions in pre-existing v1 tests. Per the `test-mirrors-source-structure` and `test-coverage-new-code` rules, new business logic must include unit tests covering at least the happy path and the error path. At a minimum, `OCRV2Actor.__call__` error recovery, `OCRV2CPUActor` initialization defaults, and `NemotronOCRV2.invoke` for the tensor/base64 paths should be covered.
**Rule Used:** New functionality must include corresponding unit ... ([source](https://app.greptile.com/review/custom-context?memory=test-coverage-new-code))
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/local/nemotron_ocr_v2.py
Line: 178-179
Comment:
**Docstring claims file-path detection that the code does not perform**
The `invoke` docstring states `str` is treated as base64 "unless it is an existing file path", but lines 178-179 unconditionally encode any string to bytes and forward it to the model — no `os.path.exists()` check is performed. Either update the docstring to reflect the actual behavior (strings are always treated as base64), or add the file-path check as described.
```suggestion
if isinstance(input_data, str):
if os.path.isfile(input_data):
with open(input_data, "rb") as f:
return self._model(f.read(), merge_level=merge_level)
return self._model(input_data.encode("utf-8"), merge_level=merge_level)
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (3): Last reviewed commit: "Greptile suggestions" | Re-trigger Greptile
Description
Checklist