
Add Sapiens2 Model #45919

Draft
guarin wants to merge 54 commits into huggingface:main from guarin:add-sapiens2

Conversation

@guarin
Member

@guarin commented May 12, 2026

What does this PR do?

  • Adds the new Sapiens2 model from Meta

There is an open PR for the original Sapiens model (v1) from 2024: #33167. I started from scratch for v2 since it supersedes the old version.

Sapiens2 repo: https://github.com/facebookresearch/sapiens2

TODO before merge

  • Drop cv2 dependency?
  • Re-use pose pre- and post-processing from ViTPose where possible
  • Update docs
  • Tidy up all docstrings
  • Once config is settled, create PR to hub with model and processor configs

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@molbap left a comment

Nice! Left a small initial review

Comment on lines +130 to +131
if not config.use_mask_token:
    del self.mask_token
Contributor

do we need a conditional here?

Member Author

@guarin May 13, 2026

DINOv3 always has a mask token. Sapiens2 was pretrained with a mask token, but the checkpoints were uploaded without it (probably the EMA model). I had to add the conditional to handle the checkpoints without a mask token. If someone wanted to continue pretraining Sapiens2, they would need to set use_mask_token=True. Without the conditional, I get a warning about missing mask tokens in the checkpoint.
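For illustration, a minimal sketch of the flag in use (the checkpoint path is a placeholder, exact loading behavior assumed):

from transformers import Sapiens2Config, Sapiens2Model  # classes added in this PR

# Hypothetical: re-enable the mask token to continue pretraining.
config = Sapiens2Config.from_pretrained("<sapiens2-checkpoint>", use_mask_token=True)
model = Sapiens2Model(config)
# Loading the released weights into this model warns about the missing
# mask_token entry, which is expected when starting a new pretraining run.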

Contributor

A warning about the missing mask token is OK imo (a bit annoying indeed). My point here is that the deletion is conditional in the modular file, which surprised me, but it just copies it over. A del statement in a modular file entirely deletes an attribute in the expanded modeling file, else

Member Author

Is it possible that a part of your comment is missing?

Comment on lines +134 to +135
if bool_masked_pos is not None and not self.config.use_mask_token:
    raise ValueError("bool_masked_pos requires use_mask_token=True in the config")
Contributor

same question here, is it something that can happen?

Contributor

To fill before merge; it's also nice to add some usage examples here, possibly linking to documentation images that we can host on the hub

Member Author

Will do 👍🏼

Member Author

Added

)


# TODO(guarin): Double check if we cannot inherit attribute docstrings from parent class.
Contributor

unfortunately, not at the moment 😬


model_type = "sapiens2"

# TODO(guarin): This is needed to load the original checkpoints but makes unit tests fail.
Contributor

ah, how so?

Member Author

>                   file_pointer = safe_open(file, framework="pt", device="cpu")
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E                   FileNotFoundError: No such file or directory: /var/folders/c8/0vwzz_7s429ggn8376hs9q9c0000gn/T/tmptyicczri/sapiens2_0.4b_pretrain.safetensors

In a lot of inherited tests:

FAILED tests/models/sapiens2/test_modeling_sapiens2.py::Sapiens2ModelTest::test_bc_torch_dtype - FileNotFoundError: No such file or directory: /var/folders/c8/0vwzz_7s429ggn8376hs9q9c0000gn/T/tmp6xiqbxf5/sapiens2_0.4b_pretr...
FAILED tests/models/sapiens2/test_modeling_sapiens2.py::Sapiens2ModelTest::test_can_load_from_already_mapped_keys - FileNotFoundError: No such file or directory: /var/folders/c8/0vwzz_7s429ggn8376hs9q9c0000gn/T/tmpl91m5jdt/sapiens2_0.4b_pretr...
FAILED tests/models/sapiens2/test_modeling_sapiens2.py::Sapiens2ModelTest::test_can_use_safetensors - FileNotFoundError: No such file or directory: /var/folders/c8/0vwzz_7s429ggn8376hs9q9c0000gn/T/tmp9q3b8ixl/sapiens2_0.4b_pretr...
FAILED tests/models/sapiens2/test_modeling_sapiens2.py::Sapiens2ModelTest::test_correct_missing_keys - FileNotFoundError: No such file or directory: /var/folders/c8/0vwzz_7s429ggn8376hs9q9c0000gn/T/tmpij68xy5b/sapiens2_0.4b_pretr...
FAILED tests/models/sapiens2/test_modeling_sapiens2.py::Sapiens2ModelTest::test_eager_matches_sdpa_inference_00_fp16_pad_left_sdpa_kernels - FileNotFoundError: No such file or directory: /var/folders/c8/0vwzz_7s429ggn8376hs9q9c0000gn/T/tmpbda3izxx/sapiens2_0.4b_pretr...
...

I didn't have time to look into it in detail yet. It might be that the test saves and then reloads the model. When reloading, it tries to find the weights under sapiens2_0.4b_pretrain.safetensors, but they are probably saved to model.safetensors instead.

Member Author

@guarin May 13, 2026

Relevant part of the stack trace:

    def test_sdpa_can_dispatch_non_composite_models(self):
        """
        Tests if non-composite models dispatch correctly on SDPA/eager when requested so when loading the model.
        This tests only by looking at layer names, as usually SDPA layers are called "SDPAAttention".
        """
        if not self.has_attentions:
            self.skipTest(reason="Model architecture does not support attentions")
    
        if not self.all_model_classes[0]._supports_sdpa or self._is_composite:
            self.skipTest(f"{self.all_model_classes[0].__name__} does not support SDPA")
    
        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)
    
            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
>               model_sdpa = model_class.from_pretrained(tmpdirname)

Looking at model.save_pretrained, I couldn't find any mention of config.transformers_weights, which probably means the model is saved to a default path instead.

I also couldn't find any usage of transformers_weights in any other model config, so I guess the proper fix is to leave it as None and make a PR to the hub instead.

Contributor

Yes, there's no usage here because we have very few model releases that are as scarce in standard files as this one, but it might not be a bad precedent. QQ to @ArthurZucker on this - WDYT would suit better given the context: a model release with a single file named whatever.safetensors, no config.json, no index, nothing.

  1. Use transformers_weights in the new default config and make sure save/load works?
  2. Open a PR to the original repo?

If 1) is too complicated, we can use 2) and refer to the git branch of the PR (I mean model = AutoModel.from_pretrained("org/model-name", revision="refs/pr/1")).

Member Author

There is also no preprocessor config, so AutoImageProcessor doesn't work either.

Collaborator

Hey! If it only has a safetensors file, it's safe to assume zero libraries depend on it -> let's open a PR and try to get it merged by reaching out to the authors! (We can most probably push the final format of the weights as well, like we often do.)

Contributor

seems like a good draft overall!

Comment on lines +907 to +934
"sapiens2": [
WeightRenaming(r"^cls_token$", r"embeddings.cls_token"),
WeightRenaming(r"^storage_tokens$", r"embeddings.register_tokens"),
WeightRenaming(r"^patch_embed\.projection\.", r"embeddings.patch_embeddings."),
WeightRenaming(r"^rope_embed\.", r"rope_embeddings."),
WeightRenaming(r"blocks\.(\d+)\.", r"model.layer.\1."),
WeightRenaming(r"attn\.proj\.", r"attention.o_proj."),
WeightRenaming(r"attn\.wq\.", r"attention.q_proj."),
WeightRenaming(r"attn\.wk\.", r"attention.k_proj."),
WeightRenaming(r"attn\.wv\.", r"attention.v_proj."),
WeightRenaming(r"attn\.q_norm\.", r"attention.q_norm."),
WeightRenaming(r"attn\.k_norm\.", r"attention.k_norm."),
WeightRenaming(r"attn\.gamma\.weight$", r"layer_scale1.lambda1"),
WeightRenaming(r"ffn\.w3\.", r"mlp.down_proj."),
WeightRenaming(r"\.ln1\.", r".norm1."),
WeightRenaming(r"\.ln2\.", r".norm2."),
WeightRenaming(r"^ln1\.", r"norm."),
WeightConverter(
source_patterns=r"ffn\.w12\.weight",
target_patterns=[r"mlp.gate_proj.weight", r"mlp.up_proj.weight"],
operations=[Chunk(dim=0)],
),
WeightConverter(
source_patterns=r"ffn\.w12\.bias",
target_patterns=[r"mlp.gate_proj.bias", r"mlp.up_proj.bias"],
operations=[Chunk(dim=0)],
),
],
Contributor

good usage here
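For readers unfamiliar with the converter, a small sketch of what Chunk(dim=0) does to the fused ffn.w12 weight (shapes illustrative):

import torch

hidden_size, intermediate_size = 1024, 4096
w12 = torch.randn(2 * intermediate_size, hidden_size)  # fused gate+up weight in the original checkpoint
gate_proj_weight, up_proj_weight = w12.chunk(2, dim=0)  # split along dim 0 into the two target tensors
assert gate_proj_weight.shape == up_proj_weight.shape == (intermediate_size, hidden_size)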


@slow
def test_inference_no_head(self):
    # TODO(guarin): cleanup. transformers_weights required because original checkpoints are called "sapiens2_0.4b_pretrain.safetensors" instead of "model.safetensors"
Contributor

should be solved with default config values, right?

Member Author

Yes, I think so. Once the configs are on the hub, AutoImageProcessor should also load correctly; it fails for now.


@require_torch
@require_vision
class Sapiens2ModelIntegrationTest(unittest.TestCase):
Contributor

Nice to check that the model behaves as expected - FYI, for most generative models we also try to include a full end-to-end test with generation (model.generate()), if relevant

with torch.no_grad():
    outputs = model(**inputs)

# verify the last hidden states
Contributor

do these come from the original implem?

Member Author

@guarin May 13, 2026

Yes, however the tests do not always pass. Sapiens2 runs everything in bf16 by default. When I convert the model and input image to bf16 and use logits from the original model in bf16, the tests pass. For this I also had to use bf16 in the rope embed, following the original code. If I run in fp32 and compare against fp32 logits from the original model, I get some differences, even if I keep rope in bf16 as in the original repo. Not yet sure where the diff comes from.

In general, if the original model is in bf16, should we also load and test it in bf16? Or do we prefer to load in fp32 and adjust the tests accordingly?

Contributor

In general we prefer to make the test conditions match the original implementation/environment. For RoPE there's sometimes some hidden upcasting... if you have the full implementation on both sides and can't track down the origin, dumping the outputs to JSON with https://github.com/huggingface/transformers/blob/main/src/transformers/model_debugging_utils.py can help
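For reference, a sketch of how that helper can be used (assuming the model_addition_debugger_context API from that file; check the module for the exact signature):

from transformers.model_debugging_utils import model_addition_debugger_context

# Dumps per-layer inputs/outputs to JSON so the two implementations can be diffed.
with model_addition_debugger_context(model, debug_path="sapiens2_debug"):
    outputs = model(**inputs)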

Member Author

@guarin May 13, 2026

So it is fine if we use bf16 in the tests but then load the model in fp32 by default? In the meantime I'll try to figure out where the fp32 diff comes from.

Also, regarding rope: is it OK if I keep it fixed to bf16 as in Sapiens2, or would you prefer the more flexible implementation following DINOv3? If I go with fixed bf16 I'll also have to add a custom apply_rotary_pos_emb implementation with the dtype casting.

Member Author

I have now updated the rope implementation to exactly match the original code. The tests now pass for bf16 and fp32. I had to slightly relax the tolerance from 1e-4 to 1e-3 because of one value that is slightly different.

Member Author

Tracked it down; for a perfect match I had to change a couple of things:

  • Generate expected logits with
torch.backends.cudnn.allow_tf32 = False
torch.backends.cudnn.conv.fp32_precision = "ieee"
torch.backends.fp32_precision = "ieee"

as is the default in the tests

  • Update image loading to use torchvision decode_image
  • Skip the image processor and use torchvision.transforms.v2 instead (see the sketch below). I believe the difference between the two comes from the norm+rescale fusing in the ImageProcessor, which gives slightly different values

Given that these are all FP precision issues and would likely cause tests to fail on different architectures/torch versions, I propose keeping the 1e-3 tolerance.
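A sketch of the unfused preprocessing described in the last bullet (image path, size, and normalization constants are placeholders):

import torch
from torchvision.io import decode_image
from torchvision.transforms import v2

image = decode_image("person.jpg")  # uint8 CHW tensor, matching the original repo's loading
transform = v2.Compose([
    v2.Resize((1024, 768)),
    v2.ToDtype(torch.float32, scale=True),  # rescale to [0, 1] first...
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ...then normalize, no fused rescale+norm
])
pixel_values = transform(image).unsqueeze(0)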

Comment on lines +207 to +208
def test_output_hidden_states(self):
    config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
Member Author

test_output_hidden_states is defined on the tester instead of the test class for DINOv3. I moved it to the test class for Sapiens2. Might merit a follow-up PR for DINOv3.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, sapiens2

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45919&sha=6c32cb

Member

@yonigozlan left a comment

Very nice!! Looks great for a first draft. As you said, there's a bit of standardization to do with the ViTPose image processor. As for the cv2 requirement, there's no real equivalent for INTER_AREA in PIL or torchvision, so we might have no choice but to keep it.
Most of the comments are nit-picking; the overall structure looks great. One other structural nit: I usually prefer to put the image processor code at the top of the modular file, but it's not a big deal.

Comment on lines +302 to +304
if self.use_qk_norm:
    self.q_norm = nn.RMSNorm(self.head_dim, eps=config.layer_norm_eps)
    self.k_norm = nn.RMSNorm(self.head_dim, eps=config.layer_norm_eps)
Member

Nit: we can use a ternary with q/k_norm set to identity when use_qk_norm is False, so we don't need to set an attr use_qk_norm, and have only one path in forward (see internvl)
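Something like this, as a sketch of the suggested pattern (not the final code):

self.q_norm = nn.RMSNorm(self.head_dim, eps=config.layer_norm_eps) if config.use_qk_norm else nn.Identity()
self.k_norm = nn.RMSNorm(self.head_dim, eps=config.layer_norm_eps) if config.use_qk_norm else nn.Identity()
# forward always calls self.q_norm / self.k_norm; nn.Identity() is a no-op when qk norm is disabled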

Comment on lines +292 to +294
self.num_kv_heads = (
    self.num_heads if config.layer_types[layer_idx] == "full_attention" else config.num_key_value_heads
)
Member

Let's define self.num_key_value_groups and use repeat_kv in the eager method instead (like in gemma4, for example); we can take eager_attention_forward from a model other than dinov3_vit.

We can probably inherit Sapiens2Attention from another model as well, at least partially for the init
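Roughly, as a sketch of the library-wide GQA pattern (names taken from this diff):

self.num_key_value_groups = self.num_heads // self.num_kv_heads

# in eager_attention_forward, expand the KV heads to match the query heads:
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)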

Comment on lines +196 to +243
class Sapiens2RopePositionEmbedding(nn.Module):
    periods: torch.Tensor

    def __init__(self, config: Sapiens2Config):
        super().__init__()

        self.patch_size = config.patch_size
        self.pos_embed_shift = config.pos_embed_shift
        self.pos_embed_jitter = config.pos_embed_jitter
        self.pos_embed_rescale = config.pos_embed_rescale
        self.base = config.rope_theta
        self.head_dim = config.hidden_size // config.num_attention_heads
        self.pos_embed_dtype = getattr(torch, config.pos_embed_dtype)

        periods = self.base ** (
            2 * torch.arange(self.head_dim // 4, dtype=self.pos_embed_dtype) / (self.head_dim // 2)
        )
        self.register_buffer("periods", periods, persistent=True)  # persistent=True to match original checkpoints

    def forward(self, pixel_values: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        _, _, height, width = pixel_values.shape
        num_patches_h = height // self.patch_size
        num_patches_w = width // self.patch_size

        device = pixel_values.device
        device_type = device.type if isinstance(device.type, str) and device.type != "mps" else "cpu"

        with maybe_autocast(device_type=device_type, enabled=False):
            patch_coords = get_patches_center_coordinates(
                num_patches_h, num_patches_w, dtype=self.pos_embed_dtype, device=device
            )
            if self.training:
                patch_coords = augment_patches_center_coordinates(
                    patch_coords,
                    shift=self.pos_embed_shift,
                    jitter=self.pos_embed_jitter,
                    rescale=self.pos_embed_rescale,
                )

            # (height * width, 2, head_dim / 4) -> (height * width, head_dim / 2) -> (height * width, head_dim)
            angles = 2 * math.pi * patch_coords[:, :, None] / self.periods[None, None, :].to(self.pos_embed_dtype)
            angles = angles.flatten(1, 2)
            angles = angles.tile(2)

            cos = torch.cos(angles)
            sin = torch.sin(angles)

        return cos, sin
Member

Can't we fully reuse DINOv3ViTRopePositionEmbedding here?
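If the implementations really do match, the modular file could reduce to something like this sketch (assuming no behavioral differences remain):

class Sapiens2RopePositionEmbedding(DINOv3ViTRopePositionEmbedding):
    pass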

periods = self.base ** (
    2 * torch.arange(self.head_dim // 4, dtype=self.pos_embed_dtype) / (self.head_dim // 2)
)
self.register_buffer("periods", periods, persistent=True)  # persistent=True to match original checkpoints
Member

We usually name this inv_freq; also, there is no need to set persistent to True to match the original checkpoint. If we don't end up pushing new checkpoints, we can change the _keys_to_ignore_on_load_missing attr of the model instead
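E.g. something like this sketch (the attribute exists on PreTrainedModel subclasses; the exact key pattern is assumed):

class Sapiens2PreTrainedModel(PreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"rope_embeddings\.periods"]  # buffer is recomputed in __init__, safe to miss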

Comment on lines +391 to +401
class Sapiens2ConvLayer(nn.Module):
    def __init__(
        self, in_channels: int, out_channels: int, kernel_size: int = 1, bias: bool = True, activation: str = "silu"
    ):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, bias=bias)
        self.norm = nn.InstanceNorm2d(out_channels)
        self.activation = ACT2FN[activation]

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.activation(self.norm(self.conv(hidden_states)))
Member

Nit: we can inherit from e.g. BeitConvLayer and change only the norm and the init signature
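Sketched below; heavily hedged, since the exact BeitConvLayer init signature would need checking:

class Sapiens2ConvLayer(BeitConvLayer):
    def __init__(self, in_channels, out_channels, kernel_size=1, bias=True, activation="silu"):
        super().__init__(in_channels, out_channels, kernel_size=kernel_size, bias=bias, activation=activation)
        self.norm = nn.InstanceNorm2d(out_channels)  # the only change vs. the parent: swap in InstanceNorm2d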

Comment on lines +566 to +567
class Sapiens2PoseHead(Sapiens2SegmentationHead):
    pass
Member

I don't think we need this; let's just use Sapiens2SegmentationHead in Sapiens2ForPoseEstimation directly


ALLOWED_LAYER_TYPES = (
    "full_attention",
    "grouped_query_attention",  # used in Sapiens2
Member

I don't think we need this; we can just have num_key_value_groups=1 for full attention
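In other words, a sketch of the suggestion:

# GQA vs. full attention falls out of the head counts; no new layer type needed.
self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
# == 1 when num_key_value_heads == num_attention_heads, i.e. plain full attention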
