Commit 3d27645
baptiste-aubertin, staghado, molbap, and Cyrilvallez authored
Add LightOnOCR model implementation (#41621)
* Add LightOnOCR model implementation
* fix modular docstring error
* Improve LightOnOCR documentation and exports
* Rename LightOnOCR multi-modal projector to vision projection and add tests
* fix load without lmhead in safetensor
* temp
* Refactor LightOnOCR config to use sub_configs pattern
* rename processor kwargs
* Refactor LightOnOCR processor to use effective patch size

  Calculate effective_patch_size during initialization and use it throughout the processor. Update ProcessorKwargs defaults to include patch_size in images_kwargs. Remove redundant model_input_names property.
* Improve LightOnOCR generation support with proper KV cache handling
* add modeling tests and compile modular
* Clean up LightOnOCR code and remove unused variables

  Remove unused image_features variable and model_input_names property
* Add LightOnOCR documentation and test improvements

  Add model documentation page with config and class references. Update toctree to include LightOnOCR entry. Clean up test formatting and add vision/text models to private model exceptions.
* Refactor LightOnOCR to use standardized RopeParameters and consolidate shared components
* Rename LightOnOCR model classes and fix config parameter naming
  - Rename LightOnOCRText -> LightOnOCRTextModel and LightOnOCRVision -> LightOnOCRVisionModel
  - Fix parameter naming: image_token_index -> image_token_id
  - Set tie_word_embeddings default to False
  - Add special case for inherited Qwen3Config attributes in LightOnOCRTextConfig
* Add missing parameter documentation for LightOnOCR config
* Simplify LightOnOCR forward methods with decorators and fix loss function call
* Reorganize LightOnOCR components to place vision before text and remove debug print
* fixup
* Fix image token expansion logic in Processor
* Copy pixtral attention to have both pixtral and qwen eager attention forward
* remove LightOnOCRTextPreTrainedModel from modular to be able to return attention
* Support both tensor and list formats for image_sizes parameter
* Update tests/models/lightonocr/test_processor_lightonocr.py

  Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* Update docs/source/en/model_doc/lightonocr.md

  Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
* Move image_sizes tensor conversion from model to processor
* Simplify weight initialization to use uniform text_config initializer_range
* rename 1 letter vars
* Get image special tokens from tokenizer attributes in processor
* Return BaseModelOutputWithPast from LightOnOCRModel forward
* Add chat template to LightOnOCR processor test setup
* rm get_output_embeddings from LightOnOCRForConditionalGeneration (not needed)
* Add OCR integration test for LightOnOCR model

  Tests model can perform OCR on real receipt image and extract expected text
* Fix device/dtype handling in LightOnOCR vision processing
* Add TransformersKwargs type hints to LightOnOCR forward methods
* Make torch imports conditional and use _from_config for LightOnOCR sub-models
* Set patch_size at runtime instead of modifying class defaults in LightOnOCR processor
* type kwargs
* Remove loocr forward comments
* Add vocab_size property and fix image_token_id in LightOnOCR
  - Add vocab_size property to LightOnOCRConfig that delegates to text_config
  - Fix test parameter name from image_token_index to image_token_id
  - Add Unpack type hint to processor __call__ kwargs
  - Remove unnecessary comments from modeling forward method
* Add vocab_size setter to LightOnOCR configuration
* Fix device mismatch in vision rotary embeddings and optimize test image sizes
* Improve LightOnOCR integration test with similarity-based output validation
* Enable flex attention
* Enable flex attention
* Loocr description with blogpost
* redundant tie_word_embeddings
* remove architecture from default config
* vocab_size accessors
* remove useless tensor conversion
* remove useless conversion
* move dtype conversion to after image feature extraction
* remove useless stuff
* fixup
* export text and vision config classes
* refactor(lightonocr): remove unused weight initialization and fix tied weights mapping #0
  - Remove custom _init_weights methods (handled by base class)
  - Update _tied_weights_keys to dict format with explicit mapping
  - Update documentation date
* fix(lightonocr): fix test failures for vocab_size access and device placement #0
  - Use config.text_config.vocab_size instead of config.vocab_size for composite config
  - Remove explicit device placement from attention_mask and image_sizes tensors
  - Allow device_map='auto' to handle device placement in model parallelism tests
* ruff
* rebase 8/12/2025
* rebase 09/12/2025
* review zucchini
* review zucchini
* rebase 10/12/2025
* refactor(lighton_ocr): fix naming conventions to use snake_case and proper CamelCase #0
  - Rename model identifier from 'lightonocr' to 'lighton_ocr' (snake_case)
  - Update class names from 'LightOnOCR*' to 'LightOnOcr*' (proper CamelCase)
  - Update all auto mappings, tests, and documentation accordingly
* style(lighton_ocr): remove unnecessary import guards for torch and vision #0
* style(lighton_ocr): remove unnecessary pass statement from LightOnOcrVisionConfig #0
* refactor(lighton_ocr): consolidate RMSNorm classes and use PixtralRMSNorm base #0
* refactor(lighton_ocr): import rotary pos emb functions from pixtral instead of redefining #0
  - Remove duplicate vision_rotate_half and vision_apply_rotary_pos_emb functions
  - Import apply_rotary_pos_emb from pixtral modeling
  - Consolidate rotate_half/apply_rotary_pos_emb in generated modeling file
* refactor(lighton_ocr): remove unused LightOnOcrVisionPreTrainedModel class #0
  - Remove redundant VisionPreTrainedModel class that was not used
  - Add LightOnOcrVisionAttentionLayer to _no_split_modules in main PreTrainedModel
* refactor(lighton_ocr): simplify LightOnOcrAttention and clarify docstring #0
  - Remove redundant __init__ that only called super()
  - Update docstring to explain why class exists (avoids eager_attention_forward collision with Qwen3)
* test(lighton_ocr): remove unnecessary skipped test methods #0
* refactor(lighton_ocr): remove use_sliding_window and max_window_layers from config #0
  - Use del in __init__ to explicitly remove inherited attrs from Qwen3Config
  - Remove LightOnOCRTextConfig from check_config_attributes.py exception list
  - Fix rms_norm_eps type annotation from int to float
* fix make fixup
* docs(lighton_ocr): add docstring to LightOnOcrTextConfig and clean up check_repo #0
  - Add configuration docstring with all parameters to LightOnOcrTextConfig
  - Consolidate duplicate comments in PRIVATE_MODELS
  - Remove redundant entries from IGNORE_NON_TESTED and IGNORE_NON_AUTO_CONFIGURED
* chore(lighton_ocr): update copyright headers to LightOn Team #0
* refactor(lighton_ocr): clean up model files and add license headers #0
  - Add Apache 2.0 license headers to generated files
  - Remove unused embedding getter/setter methods from ForConditionalGeneration
  - Clean up LightOnOcrTextConfig docstring and remove Qwen references
* refactor(lighton_ocr): simplify processor token access and test setup #0
  - Access special tokens directly from tokenizer attributes instead of getattr with defaults
  - Simplify test setup to use model_id and inherited ProcessorTesterMixin methods
  - Fix return types test to handle fast image processor limitations
* refactor(lighton_ocr): unify attention functions and fix buffer registration #0
  - Remove duplicate vision_eager_attention_forward, reuse eager_attention_forward from Qwen3
  - Add num_key_value_groups attribute for GQA compatibility
  - Register original_inv_freq as buffer instead of plain attribute
* refactor(lighton_ocr): remove vision_model property alias #0
* docs(lighton_ocr): add usage example and update release date #0
* rebase 12/01/26
* Update docs/source/en/model_doc/lighton_ocr.md

  Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* review cyril
* review cyril
* review cyril
* Remove test.py from version control
* apply modular
* update years everywhere it was not updated
* fix date
* remove Attention forward implem
* Fix all Vision prefixes instead of no prefix
* move tying to main config
* fix
* add to all
* immensely simplify
* fix test
* revert check_repo

---------

Co-authored-by: Said Taghadouini <taghadouinisaid@gmail.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
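One of the bullets above refactors the processor around an "effective patch size". A rough, dependency-free sketch of that idea (these function names and the ceil-division token count are assumptions, not the transformers implementation; patch_size=14 and spatial_merge_size=2 are the defaults from the config added in this commit):

```python
import math


def effective_patch_size(patch_size: int, spatial_merge_size: int) -> int:
    # Fold the spatial merge factor into the patch size once, up front,
    # instead of multiplying at every use site.
    return patch_size * spatial_merge_size


def tokens_for_image(height: int, width: int, patch_size: int = 14, spatial_merge_size: int = 2) -> int:
    eff = effective_patch_size(patch_size, spatial_merge_size)
    # One token per merged patch, rounding partial patches up.
    return math.ceil(height / eff) * math.ceil(width / eff)


print(effective_patch_size(14, 2))   # 28
print(tokens_for_image(1540, 1540))  # 55 * 55 = 3025
```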
1 parent 77146cc commit 3d27645

17 files changed

Lines changed: 2138 additions & 4 deletions

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -1115,6 +1115,8 @@
     title: LayoutXLM
   - local: model_doc/lfm2_vl
     title: LFM2-VL
+  - local: model_doc/lighton_ocr
+    title: LightOnOcr
   - local: model_doc/lilt
     title: LiLT
   - local: model_doc/llama4
```
docs/source/en/model_doc/lighton_ocr.md

Lines changed: 80 additions & 0 deletions
````markdown
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer. -->

*This model was released on {release_date} and added to Hugging Face Transformers on 2026-01-14.*

# LightOnOcr

**LightOnOcr** is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs.

📝 **[Read the full blog post](https://huggingface.co/blog/lightonai/lightonocr/)** | 📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**

**Model Overview**

LightOnOcr combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs. It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.

## Usage

```python
import torch

from transformers import LightOnOcrForConditionalGeneration, LightOnOcrProcessor

# Pick the best available device and a dtype it supports.
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32 if device == "mps" else torch.bfloat16

model = LightOnOcrForConditionalGeneration.from_pretrained("lightonai/LightOnOCR-1B-1025", dtype=dtype).to(device)
processor = LightOnOcrProcessor.from_pretrained("lightonai/LightOnOCR-1B-1025")

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ocr/resolve/main/SROIE-receipt.jpeg"

conversation = [{"role": "user", "content": [{"type": "image", "url": url}]}]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
# Cast floating-point tensors (pixel values) to the model dtype; keep integer tensors (token ids) as-is.
inputs = {k: v.to(device=device, dtype=dtype) if v.is_floating_point() else v.to(device) for k, v in inputs.items()}

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = output_ids[0, inputs["input_ids"].shape[1] :]
output_text = processor.decode(generated_ids, skip_special_tokens=True)
print(output_text)
```

## LightOnOcrConfig

[[autodoc]] LightOnOcrConfig

## LightOnOcrProcessor

[[autodoc]] LightOnOcrProcessor
    - __call__

## LightOnOcrModel

[[autodoc]] LightOnOcrModel
    - forward

## LightOnOcrForConditionalGeneration

[[autodoc]] LightOnOcrForConditionalGeneration
    - forward
````
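The usage snippet in the new doc page decodes only `output_ids[0, inputs["input_ids"].shape[1]:]`. A toy, dependency-free illustration of why that slice strips the prompt (the `mock_generate` below and its token values are hypothetical stand-ins, not the model's API):

```python
def mock_generate(input_ids: list[int], max_new_tokens: int) -> list[int]:
    # Like model.generate(), the returned sequence begins with the prompt
    # tokens, followed by the newly sampled tokens.
    new_tokens = [100 + i for i in range(max_new_tokens)]  # stand-in for sampled ids
    return input_ids + new_tokens


prompt = [1, 2, 3, 4]
output_ids = mock_generate(prompt, max_new_tokens=3)
# Mirrors output_ids[0, inputs["input_ids"].shape[1]:] in the doc example.
generated_ids = output_ids[len(prompt):]
print(generated_ids)  # [100, 101, 102]
```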

src/transformers/models/auto/configuration_auto.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -241,6 +241,7 @@
         ("lfm2_moe", "Lfm2MoeConfig"),
         ("lfm2_vl", "Lfm2VlConfig"),
         ("lightglue", "LightGlueConfig"),
+        ("lighton_ocr", "LightOnOcrConfig"),
         ("lilt", "LiltConfig"),
         ("llama", "LlamaConfig"),
         ("llama4", "Llama4Config"),
@@ -705,6 +706,7 @@
         ("lfm2_moe", "Lfm2Moe"),
         ("lfm2_vl", "Lfm2Vl"),
         ("lightglue", "LightGlue"),
+        ("lighton_ocr", "LightOnOcr"),
         ("lilt", "LiLT"),
         ("llama", "LLaMA"),
         ("llama2", "Llama2"),
```

src/transformers/models/auto/image_processing_auto.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -130,6 +130,7 @@
         ("levit", ("LevitImageProcessor", "LevitImageProcessorFast")),
         ("lfm2_vl", (None, "Lfm2VlImageProcessorFast")),
         ("lightglue", ("LightGlueImageProcessor", "LightGlueImageProcessorFast")),
+        ("lighton_ocr", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
         ("llama4", (None, "Llama4ImageProcessorFast")),
         ("llava", ("LlavaImageProcessor", "LlavaImageProcessorFast")),
         ("llava_next", ("LlavaNextImageProcessor", "LlavaNextImageProcessorFast")),
```

src/transformers/models/auto/modeling_auto.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -240,6 +240,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("lfm2_moe", "Lfm2MoeModel"),
         ("lfm2_vl", "Lfm2VlModel"),
         ("lightglue", "LightGlueForKeypointMatching"),
+        ("lighton_ocr", "LightOnOcrModel"),
         ("lilt", "LiltModel"),
         ("llama", "LlamaModel"),
         ("llama4", "Llama4ForConditionalGeneration"),
@@ -924,6 +925,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
         ("kosmos-2", "Kosmos2ForConditionalGeneration"),
         ("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"),
         ("lfm2_vl", "Lfm2VlForConditionalGeneration"),
+        ("lighton_ocr", "LightOnOcrForConditionalGeneration"),
         ("llama4", "Llama4ForConditionalGeneration"),
         ("llava", "LlavaForConditionalGeneration"),
         ("llava_next", "LlavaNextForConditionalGeneration"),
```

src/transformers/models/auto/processing_auto.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -101,6 +101,7 @@
         ("layoutlmv3", "LayoutLMv3Processor"),
         ("layoutxlm", "LayoutXLMProcessor"),
         ("lfm2_vl", "Lfm2VlProcessor"),
+        ("lighton_ocr", "LightOnOcrProcessor"),
         ("llama4", "Llama4Processor"),
         ("llava", "LlavaProcessor"),
         ("llava_next", "LlavaNextProcessor"),
```

src/transformers/models/auto/tokenization_auto.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -169,6 +169,7 @@
         ("layoutlmv3", "LayoutLMv3Tokenizer" if is_tokenizers_available() else None),
         ("layoutxlm", "LayoutXLMTokenizer" if is_tokenizers_available() else None),
         ("led", "LEDTokenizer" if is_tokenizers_available() else None),
+        ("lighton_ocr", "Qwen2TokenizerFast" if is_tokenizers_available() else None),
         ("lilt", "RobertaTokenizer" if is_tokenizers_available() else None),
         ("longformer", "RobertaTokenizer" if is_tokenizers_available() else None),
         ("longt5", "T5Tokenizer" if is_tokenizers_available() else None),
```
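Each hunk above registers `lighton_ocr` in a `model_type` → class-name mapping that the Auto classes consult. A self-contained mock of that lookup pattern (`MockAutoConfig` and the three-entry mapping are illustrative; the real Auto classes also import and instantiate the class lazily):

```python
# Stand-in for the registries the hunks above extend.
CONFIG_MAPPING_NAMES = {
    "lightglue": "LightGlueConfig",
    "lighton_ocr": "LightOnOcrConfig",
    "lilt": "LiltConfig",
}


class MockAutoConfig:
    @staticmethod
    def for_model(model_type: str) -> str:
        try:
            class_name = CONFIG_MAPPING_NAMES[model_type]
        except KeyError:
            raise ValueError(f"Unrecognized model type: {model_type!r}")
        # A real implementation would import and instantiate the class here;
        # returning its name is enough to show the dispatch.
        return class_name


print(MockAutoConfig.for_model("lighton_ocr"))  # LightOnOcrConfig
```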
src/transformers/models/lighton_ocr/__init__.py

Lines changed: 28 additions & 0 deletions
```python
# Copyright 2026 The LightOn Team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_lighton_ocr import *
    from .modeling_lighton_ocr import *
    from .processing_lighton_ocr import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
```
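The `else` branch above swaps the package module for a `_LazyModule` so that submodules are imported only on first attribute access. A minimal, self-contained sketch of the underlying PEP 562 idea (this is not `_LazyModule` itself; the one-entry registry and module name are illustrative):

```python
import sys
import types

# Map attribute names to thunks that produce them on demand; a real lazy
# module would import the defining submodule here instead.
_registry = {"LightOnOcrConfig": lambda: "loaded LightOnOcrConfig"}

lazy = types.ModuleType("lighton_ocr_demo")


def _module_getattr(name):
    # PEP 562: Python calls the module's __getattr__ when normal lookup fails,
    # so the object is built (or imported) only on first access.
    try:
        return _registry[name]()
    except KeyError:
        raise AttributeError(f"module {lazy.__name__!r} has no attribute {name!r}")


lazy.__getattr__ = _module_getattr
sys.modules["lighton_ocr_demo"] = lazy

import lighton_ocr_demo

print(lighton_ocr_demo.LightOnOcrConfig)  # resolved lazily on first access
```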
src/transformers/models/lighton_ocr/configuration_lighton_ocr.py

Lines changed: 128 additions & 0 deletions
````python
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# This file was automatically generated from src/transformers/models/lighton_ocr/modular_lighton_ocr.py.
# Do NOT edit this file manually as any edits will be overwritten by the generation of
# the file from the modular. If any change should be done, please apply the change to the
# modular_lighton_ocr.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
# Copyright 2026 The LightOn Team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Any

from ...configuration_utils import PretrainedConfig
from ..auto import CONFIG_MAPPING, AutoConfig


class LightOnOcrConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LightOnOcrForConditionalGeneration`]. It is used
    to instantiate a LightOnOcr model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information. Instantiating a configuration with the defaults will
    yield a similar configuration to that of the LightOnOcr
    [lightonocr-hf/lightonocr-9b](https://huggingface.co/lightonocr-hf/lightonocr-9b) architecture.

    Args:
        spatial_merge_size (`int`, *optional*, defaults to 2):
            The size of spatial merging for image patches.
        image_token_id (`int`, *optional*, defaults to 151655):
            The id of the image token in the vocabulary.
        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
            Whether the model's input and output word embeddings should be tied.
        vision_config (`dict` or `LightOnOcrVisionConfig`, *optional*):
            Custom vision configuration or dictionary with vision configuration values.
        text_config (`dict` or `LightOnOcrTextConfig`, *optional*):
            Custom text configuration or dictionary with text configuration values.

    Example:

    ```python
    >>> from transformers import LightOnOcrConfig, LightOnOcrForConditionalGeneration

    >>> # Initializing a LightOnOcr configuration
    >>> configuration = LightOnOcrConfig()

    >>> # Initializing a model from the configuration
    >>> model = LightOnOcrForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```
    """

    model_type = "lighton_ocr"
    sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig}

    def __init__(
        self,
        spatial_merge_size: int = 2,
        image_token_id: int = 151655,
        tie_word_embeddings: bool = True,
        vision_config: dict[str, Any] | None = None,
        text_config: dict[str, Any] | None = None,
        **kwargs,
    ):
        self.spatial_merge_size = spatial_merge_size
        self.image_token_id = image_token_id
        self.tie_word_embeddings = tie_word_embeddings

        if vision_config is None:
            self.vision_config = CONFIG_MAPPING["pixtral"](
                attention_dropout=0,
                head_dim=64,
                hidden_act="silu",
                hidden_size=1024,
                image_size=1540,
                initializer_range=0.02,
                intermediate_size=4096,
                model_type="pixtral",
                num_attention_heads=16,
                num_channels=3,
                num_hidden_layers=24,
                patch_size=14,
                rope_theta=10000,
            )
        elif isinstance(vision_config, PretrainedConfig):
            self.vision_config = vision_config
        else:
            vision_config["model_type"] = vision_config.get("model_type", "pixtral")
            self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)

        if text_config is None:
            self.text_config = CONFIG_MAPPING["qwen3"](
                attention_dropout=0,
                head_dim=128,
                hidden_act="silu",
                hidden_size=1024,
                initializer_range=0.02,
                intermediate_size=3072,
                max_position_embeddings=40960,
                num_attention_heads=16,
                num_hidden_layers=28,
                num_key_value_heads=8,
                rms_norm_eps=1e-6,
                rope_theta=1000000,
                sliding_window=None,
                use_cache=True,
                vocab_size=151936,
            )
        elif isinstance(text_config, PretrainedConfig):
            self.text_config = text_config
        else:
            text_config["model_type"] = text_config.get("model_type", "qwen3")
            self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)

        super().__init__(**kwargs)


__all__ = ["LightOnOcrConfig"]
````
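The `__init__` above resolves each sub-config in three ways: `None` → library defaults, an existing config object → used as-is, a `dict` → built via the `model_type` entry in `CONFIG_MAPPING`. A dependency-free mock of that branching (`MockConfig`, the two-entry mapping, and `resolve_vision_config` are illustrative stand-ins, not transformers classes):

```python
class MockConfig:
    # Stand-in for PretrainedConfig: just stores whatever it is given.
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


# Stand-in for transformers' CONFIG_MAPPING: model_type -> config factory.
CONFIG_MAPPING = {
    "pixtral": lambda **kw: MockConfig(**{"model_type": "pixtral", **kw}),
    "qwen3": lambda **kw: MockConfig(**{"model_type": "qwen3", **kw}),
}


def resolve_vision_config(vision_config):
    if vision_config is None:
        # Branch 1: nothing given -> instantiate library defaults.
        return CONFIG_MAPPING["pixtral"](patch_size=14, hidden_size=1024)
    if isinstance(vision_config, MockConfig):
        # Branch 2: already a config object -> use it unchanged.
        return vision_config
    # Branch 3: plain dict -> look up model_type (default "pixtral") and build.
    vision_config["model_type"] = vision_config.get("model_type", "pixtral")
    return CONFIG_MAPPING[vision_config["model_type"]](**vision_config)


print(resolve_vision_config(None).patch_size)                # 14
print(resolve_vision_config({"patch_size": 16}).model_type)  # pixtral
```

The same three-way pattern is applied to `text_config` with `"qwen3"` as the default `model_type`.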
