Skip to content

feat(idefics3): add FMS-native Idefics3 model and HF config mapping#503

Open
toddllm wants to merge 13 commits into
foundation-model-stack:mainfrom
toddllm:review/idefics3
Open

feat(idefics3): add FMS-native Idefics3 model and HF config mapping#503
toddllm wants to merge 13 commits into
foundation-model-stack:mainfrom
toddllm:review/idefics3

Conversation

@toddllm
Copy link
Copy Markdown

@toddllm toddllm commented Jan 16, 2026

Summary

Introduce an FMS-native idefics3 architecture and wire it to HF/SmolVLM configs and SigLIP vision, with minimal supporting plumbing and focused tests. Changes are localized to Idefics3 and shared utilities; other architectures are unaffected.

What changed

  • New FMS-style Idefics3 layout

    • Add fms/models/idefics3.py as the public entrypoint (architecture registration + HF adapter registration).
    • Add fms/models/idefics3_components/ for the implementation (config, core model, connector, packing, vision adapter, HF helpers).
    • Remove the legacy fms/models/idefics3/ package.
  • HF / SmolVLM plumbing

    • Extend _map_model_config in fms/models/hf/utils.py to map Idefics3ForConditionalGeneration to architecture="idefics3", building nested SiglipVisionConfig + LLaMAConfig and deriving connector_scale + image_span_len.
    • Register HF serialization adapter steps so source="hf" can adapt SmolVLM/Idefics3 state dicts into the FMS layout.
    • In fms/models/idefics3_components/hf_adapter.py, normalize connector weight keys so HF modality_projection.* maps into connector.proj.*, and support optional FMS SigLIP usage via FmsSiglipVisionWrapper.
  • Vision + text support

    • fms/models/siglip_vision.py: strip leading "model." from SmolVLM checkpoints during HF→FMS name mapping.
    • fms/models/llama.py: add an inputs_embeds path in LLaMAHeadless.forward so packed multimodal embeddings can drive the text backbone.
  • Config + serialization polish

    • fms/utils/config.py: allow nested overrides via ModelConfig.updated() with dicts (e.g. text_config={...}).
    • fms/utils/serialization.py: switch unused adapted-checkpoint key reporting from print to logger.warning.

Tests

Run on this branch:

  • python -m pytest -q tests/models/test_idefics3.py

This exercises the tiny nested Idefics3 config (SigLIP vision + LLaMA text) and checks both forward behavior and weight keys against the expectations below.

  1. fms/models/idefics3_components/fms_model.py + fms/models/idefics3.py (overall architecture + registration).
  2. fms/models/hf/utils.py::_map_model_config (HF config → FMS config mapping).
  3. fms/models/idefics3_components/hf_adapter.py + vision_adapter.py (SmolVLM/SigLIP wrapping and weight key normalization).
  4. tests/models/test_idefics3.py

- Add FMS-native idefics3 architecture and components\n- Map HF Idefics3ForConditionalGeneration to idefics3 config\n- Wire SigLIP/LLaMA plumbing for HF checkpoints

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
- Harden packing and connector behavior\n- Shifted loss + pad/image masking\n- Padding-aware generation + cached-decode mask handling\n- Avoid is_causal in cached decode under SDPA

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
@toddllm
Copy link
Copy Markdown
Author

toddllm commented Jan 16, 2026

@toddllm toddllm marked this pull request as ready for review January 17, 2026 00:04
ModelConsistencyTestSuite,
Idefics3Fixtures,
):
@staticmethod
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one test failing as I run in my environment deployment:

FAILED tests/models/test_idefics3.py::TestIdefics3::test_model_output[False] - RuntimeError: Expected all tensors to be on the same device, but got index is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__index_select)

My environment has GPU, but I haven't set up any device to run the tests; Maybe good idea to confirm at the tests level all is running under the same device?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will test on GPU. Thank you for looking into this @flaviabeo !

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
@toddllm
Copy link
Copy Markdown
Author

toddllm commented Jan 22, 2026

@flaviabeo Nice catch with the GPU testing! I added some fixes.

Reran GPU parity after the multi-image tolerance tweak; max_abs_diff=6.151e-05 on seed 42, passes with atol=rtol=7e-5. Unit tests fixed by skipping random init for non-float buffers (position_ids).

Let me know if you see anything on your GPU.

Copy link
Copy Markdown
Collaborator

@kaoutar55 kaoutar55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @toddllm. I think the implementation is solid and follows FMS conventions. Some minor changes are needed. Please address the GPU test failure and refactor the connector key remapping duplication before merging.

@@ -7,6 +7,7 @@
# Used in Llava Next for Granite vision
from fms.models.siglip_vision import SiglipVisionConfig
from fms.models.granite import GraniteConfig
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GraniteConfig is imported but never used. Should be removed.

"layer_norm_eps": vision_cfg.layer_norm_eps,
"attention_dropout": vision_cfg.attention_dropout,
}
use_navit = getattr(vision_cfg, "use_navit_position_buckets", None)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This handles multiple fallback cases for NaViT position buckets detection.
It would be good to add a comment explaining why these fallbacks are necessary.


image_size = config_params["vision_config"].image_size
patch_size = config_params["vision_config"].patch_size
patches_per_side = int(image_size) // int(patch_size)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are already ints from the config, unnecessary casting.

Comment thread fms/models/idefics3.py
for k, v in adapted_text.items():
new_sd[f"text_model.{k}"] = v

# ---- connector
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we have duplicate logic in the two if branches (lines 101-113 and 114-123). Could be refactored: simplified version or something like this:

# Simplified version for k, v in input_sd.items(): key = k[len("model."):] if k.startswith("model.") else k if key.startswith("connector."): key = key[len("connector."):] if key.startswith("modality_projection.proj."): key = key[len("modality_projection.proj."):] new_sd[f"connector.proj.{key}"] = v elif key.startswith("modality_projection."): key = key[len("modality_projection."):] new_sd[f"connector.proj.{key}"] = v elif key == "modality_projection": new_sd["connector.proj.weight"] = v else: new_sd[f"connector.{key}"] = v

base.embeddings, "reset_parameters"
):
base.embeddings.reset_parameters()
if hasattr(base, "encoder") and hasattr(base.encoder, "reset_parameters"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complex nested hasattr checks. Consider adding a method to base classes instead:

def reset_parameters(self):
    # Let each component handle its own initialization
    if hasattr(self.vision_tower, 'reset_parameters'):
        self.vision_tower.reset_parameters()
    if hasattr(self.text_model, 'reset_parameters'):
        self.text_model.reset_parameters()
    nn.init.xavier_uniform_(self.connector.proj.weight)

Returns a dict with pixel_values, pixel_attention_mask, and num_patches.
"""

def to_pil(img_like):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The to_pil function handles many edge cases but has duplicated PIL import checks. Consider extracting to module level.

Comment thread fms/models/llama.py
**attn_kwargs: Unpack[AttentionKwargs],
):
# Embed the given vocabulary indices using the given attention mask, with pre-/post-norm and dropout as specified
if x_in is not None and inputs_embeds is not None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Type hint for x_in should be Optional[torch.Tensor] for consistency.

Comment thread fms/models/siglip_vision.py Outdated
nb_patches_w, device=position_ids.device, dtype=position_ids.dtype
)

fractional_coords_h = h_indices / nb_patches_h * (1 - 1e-6)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: What's the purpose of * (1 - 1e-6)? This should be documented.

Comment thread fms/testing/comparison.py

target_device = torch.device(device)

# Helper function to check if two devices are equivalent
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String comparison 'cuda' should use dev1.type constant or be more defensive.

- Do not apply NaViT-style bucketing when use_navit_position_buckets is disabled, even if patch_attention_mask is provided.
- Match HF bucketing semantics by using exact i/n fractional coords with bucketize(right=True).
- Add a regression test and update downstream expectations impacted by this fix.

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
- Remove unused imports and document NaViT config fallbacks.
- Refactor HF connector key remapping to avoid duplication.
- Simplify reset_parameters checks.
- Hoist Pillow import/requirements handling in preprocessing.
- Minor typing/style tweaks for device-safe comparisons.
- Update SigLIP expected weight keys to include position_ids buffer.

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
@toddllm
Copy link
Copy Markdown
Author

toddllm commented Feb 3, 2026

@kaoutar55 @flaviabeo comments addressed in latest commits

Copy link
Copy Markdown
Contributor

@alex-jw-brooks alex-jw-brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work! some thoughts - it also may be better to wrap the components in one file similar to llava next / ongoing with mistral3 (multimodal) for more consistent patterns across models in case things are refactored in the future. A bit worried things will be really inconsistent for composite models since there are not well documented interfaces for how they should be called through FMS generate etc atm

"layer_norm_eps": vision_cfg.layer_norm_eps,
"attention_dropout": vision_cfg.attention_dropout,
}
use_navit = getattr(vision_cfg, "use_navit_position_buckets", None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is use_navit_position_buckets expected to be a defined attribute on the HF config? I don't see it in the idefics3 config here.

IMO it may also be a good idea to separate build_siglip_vision_params and build_siglip_idefics3_vision_params for clarity as well

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. use_navit_position_buckets is not reliably present on HF Idefics3/SmolVLM vision configs, so we now treat it as optional and avoid assuming it exists.

I split the logic into:

  • build_siglip_vision_params(...) for generic SigLIP defaults (explicit flag if present, else default False)
  • build_siglip_idefics3_vision_params(...) for idefics3/smolvlm-scoped NaViT inference

build_idefics3_params(...) now uses the idefics3-specific builder. I also added unit coverage in tests/models/hf/test_param_builders.py for:

  • default False when attr is absent
  • smolvlm max_image_size heuristic enabling NaViT
  • no heuristic leakage for non-idefics3 parent configs

H, W: grid dimensions (e.g. 32 for 512/16)
"""
if patch_embeds.shape[1] != H * W:
raise ValueError(f"Expected {H * W} patches, got {patch_embeds.shape[1]}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the error messages here, can you either log the full shape, or use negative indices? This will get confusing if a batch dim of 1 is squeezed out somewhere since it would log the the wrong dim for the number of patches as is

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to make error messages robust to squeezed/broadcasted shapes and to include full tensor shapes.

Changes made:

  • use stable indexing (shape[-2] for patch count) instead of fixed positional assumptions
  • include full shape context in errors, e.g. x.shape=... / patch_embeds.shape=...
  • include expected/actual patch counts and H/W/scale in the message

This should make debugging much clearer when an upstream batch/image dimension changes.

self.out_features = text_hidden
self.proj = nn.Linear(self.in_features, self.out_features, bias=False)

@staticmethod
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not be a staticmethod and should just use self.scale

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, updated.

pixel_unshuffle is now an instance method and uses self.scale directly (no @staticmethod and no external scale arg). This keeps connector behavior fully owned by the module config.

return out

@torch.no_grad()
def generate(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try to align this with what is done in llava next, which is to create the merged multimodal embeddings in prepare_inputs_for_generation, using kwargs that have optional values for stuff from the HF processor, e.g., pixel values etc?

I imagine this will likely refactored in the future to be more well patterned, but this will allow for better compatibility with the existing generate, e.g., by passing prepare_inputs_for_generation as a preprocessing hook, and also keep things more consistent with projects that may want to consume it like vLLM spyre

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this alignment with LlavaNext/shared generation flow.

Changes made:

  • added prepare_inputs_for_generation(iteration, input_ids, kwargs) to build multimodal embeddings on prefill and avoid re-encoding images during cached decode
  • switched Idefics3.generate() to use fms.utils.generation.generate(..., prepare_model_inputs_hook=self.prepare_inputs_for_generation)
  • threaded optional processor kwargs (pixel_values, images, pixel_attention_mask) through the hook path

This keeps idefics3 compatible with the shared generate path and closer to the existing multimodal pattern used in LlavaNext.

if hasattr(full_model, "model") and hasattr(full_model.model, "vision_model"):
vision_model = full_model.model.vision_model
elif hasattr(full_model, "vision_model"):
vision_model = full_model.vision_model
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is this fallback case expected?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification and narrowed the fallback expectations.

Expected/default layout is full_model.model.vision_model.
Fallback path (full_model.vision_model) is only for HF wrapper/layout variation where the nested model container is absent.

I documented this inline and include model type context in fallback logging/errors so it’s explicit when/why the fallback is taken.

and "vision" not in name
and "text" not in name
):
# This is risky, better to rely on structure
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some logs to some of these fallback cases?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added logs on fallback paths.

Current behavior:

  • log info when using non-default but expected layout fallback paths
  • log context (HF model class/type) in fallback messages
  • fail loudly (with explicit error) for risky connector extraction cases rather than silently scanning/selecting ambiguous params

This keeps normal/default path quiet while making fallback behavior diagnosable.

import torch


class SmolVLMPreprocessor:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to some of the above comments on aligning with llava next for model structure and not having a custom .generate - I don't think that we should have an explicit wrapper for the HF processor (even a lazy one for conditional dependencies), since it may cause confusion with how the model is to be used with transformers

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, removed the explicit wrapper class.

Changes made:

  • deleted SmolVLMPreprocessor wrapper class
  • kept a thin helper load_smolvlm_processor(...) that returns HF AutoProcessor directly
  • kept load_smolvlm_preprocessor(...) only as a deprecated compatibility alias to avoid immediate breakage for older call sites

So the canonical interface is now the HF processor itself, aligned with your comment.

"vision",
"vision_model",
"visual",
"vision_encoder",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this degree of flexibility actually needed for idefic3, or it is written to generically common attributes for vision towers in VLMs?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened this to idefics3/smolvlm-relevant layouts rather than broad generic scanning.

Current order:

  1. checkpoint.model.vision_model (default)
  2. checkpoint.model.vision_tower (fallback, logged)
  3. top-level known attrs (vision_model, vision_tower, vision) (fallback, logged)
  4. dict-like wrapper keys only if values are actual nn.Module objects

So flexibility is now constrained to known/expected variants, with logs on non-default selections.

Comment thread fms/models/idefics3.py Outdated
elif kk == "modality_projection":
new_sd["connector.proj.weight"] = v
else:
new_sd[f"connector.{kk}"] = v
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit apprehensive about adding so much state dict key manipulation here since the serialization utils are so complex already, and other models use regexes for pattern replacement, e.g., here. 🤔 I am not a big fan of the regexes either, but might be better to do things consistently so that we don't see different behaviors for this model, since it will be hard to debug, especially if more adapter steps are added

@kaoutar55 @flaviabeo @gkumbhat any thoughts?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligned this with the existing regex-replacement pattern approach for consistency.

I replaced the branchy connector key remap logic with ordered regex replacement rules (including both model.connector.* and connector.* inputs), then gate on connector. outputs. This keeps behavior explicit while matching the style used elsewhere (e.g. llava_next) for easier debugging/maintenance.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Yes, agreed @alex-jw-brooks - better to standardize the way of doing these replacements. These can cause warnings when the weights keys are not mapped correctly and can be hard to fix a bug on the mappings.

Comment thread fms/models/siglip_vision.py Outdated
embeddings = patch_embeds.flatten(2).transpose(1, 2)
embeddings = embeddings + self.position_embedding(self.position_ids)

if (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you separate most of the nai vt handling (except the early return with no patch attn mask, which is the same as the normal case) into a separate function to keep forward more minimal? it'll be easier to look at other models that use siglip without it, e.g., granite vision

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I split most NaViT-specific logic out of SiglipVisionEmbeddings.forward into a helper (_navit_position_ids(...)).

forward now keeps the early return for the non-NaViT/no-mask path and only calls the helper for the NaViT bucketed position-id path. This keeps the main forward path minimal for non-NaViT siglip users.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see some tests failing:

______ test_config_mapping[SiglipModel-google/siglip-so400m-patch14-384] _______

arch = 'SiglipModel', model_name = 'google/siglip-so400m-patch14-384'

    @pytest.mark.parametrize("arch,model_name", MODEL_CONFIG_MAP.items())
    def test_config_mapping(arch: str, model_name: str):
        from fms.models.hf.utils import infer_model_configuration
    
        cfg_filename = get_file_kwargs_filename(model_name)
        kwargs_path = os.path.join(FMS_CONFIGS_DIR, cfg_filename)
        if not os.path.isfile(kwargs_path):
            raise FileNotFoundError(
                f"Model {model_name} has no kwargs file; have you generated it?"
            )
        with open(kwargs_path, "r") as f:
            expected_kwargs = json.load(f)
        actual_kwargs = infer_model_configuration(model_name, download_weights=False)
    
        for k, actual_v in actual_kwargs.items():
>           expected_v = expected_kwargs[k]
E           KeyError: 'use_navit_position_buckets'

tests/utils/test_config_mapping.py:62: KeyError
----------------------------- Captured stderr call -----------------------------

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 29.36it/s]
_ test_config_mapping[LlavaNextForConditionalGeneration-ibm-granite/granite-vision-3.2-2b] _

arch = 'LlavaNextForConditionalGeneration'
model_name = 'ibm-granite/granite-vision-3.2-2b'

It seems related to this change: use_navit_position_buckets - something is not inferring correctly this model configuration or the args are getting lost in between set up and call, maybe?

- add build_siglip_idefics3_vision_params for idefics3/smolvlm-specific inference
- keep generic build_siglip_vision_params behavior explicit
- wire build_idefics3_params to the specialized builder
- add unit tests for default false, smolvlm enablement, and non-idefics3 no-leak behavior

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
- make connector pixel_unshuffle use self.scale and improve shape error messages
- document expected HF layout fallbacks and add non-default path logs
- tighten vision tower extraction to known idefics3/smolvlm layouts
- factor NaViT-specific position-id handling out of siglip forward
- normalize docstrings/comments to ASCII for terminal-friendly diffs

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
…ndling

- move idefics3 generation onto fms.utils.generation with prepare_inputs_for_generation
- thread optional image kwargs through prefill/decode flow
- support pixel_attention_mask in image encoding and packing path
- fix 5D image + singleton 4D mask broadcasting in _encode_images
- add regression coverage for multi-image singleton mask handling

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
- remove custom SmolVLMPreprocessor wrapper class in favor of HF AutoProcessor helper
- keep deprecated load_smolvlm_preprocessor alias with compatibility shim
- switch connector key remapping to ordered regex replacement rules
- add regression test to lock contiguous expanded pixel_attention_mask behavior

Signed-off-by: Todd Deshane <todd.deshane@ibm.com>
@toddllm
Copy link
Copy Markdown
Author

toddllm commented Feb 24, 2026

@alex-jw-brooks comments addressed and unit and parity tests re-ran on CPU and GPU

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants