[WIP] Granite Vision Embedding Support #488

Draft
alex-jw-brooks wants to merge 13 commits into foundation-model-stack:main from alex-jw-brooks:granite_vision_embed


alex-jw-brooks commented Nov 12, 2025

In-progress support for Granite Vision 3.3 embeddings. Current state:

  • The model loads all weights successfully with no warnings
  • A very small encode() function has been added to the generate utils, with a similar API, for use in local tests of embedding models
  • Forward pass is implemented for text and image embeddings
    • Equivalency tests for text-only and image-only inputs are currently passing against HF on CPU with transformers 4.50
    • Some other applicable tests (configs, etc.) are passing; still fixing the compile & consistency failures
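For reviewers who want to picture what such a local check looks like, here is a minimal sketch of an encode-style helper plus an equivalence assertion. Everything in it is an assumption for illustration: `TinyEmbedder`, this `encode` signature, and `check_equivalence` are hypothetical stand-ins, not the helper added in this PR and not the HF or FMS model classes.

```python
import torch
import torch.nn as nn


class TinyEmbedder(nn.Module):
    """Hypothetical stand-in for an embedding model (not the FMS or HF class)."""

    def __init__(self, vocab_size: int = 32, dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Per-token hidden states of shape (batch, seq_len, dim)
        return self.proj(self.embed(input_ids))


@torch.no_grad()
def encode(model: nn.Module, input_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: one forward pass in eval mode, no sampling loop."""
    was_training = model.training
    model.eval()
    try:
        return model(input_ids)
    finally:
        # Restore whatever mode the caller had set
        model.train(was_training)


def check_equivalence(reference, candidate, input_ids, atol=1e-5, rtol=1e-5):
    """Raise AssertionError if the two models' outputs differ beyond tolerance."""
    ref_out = encode(reference, input_ids)
    cand_out = encode(candidate, input_ids)
    torch.testing.assert_close(cand_out, ref_out, atol=atol, rtol=rtol)


torch.manual_seed(0)
reference = TinyEmbedder()
candidate = TinyEmbedder()
# Copy weights so both models should produce identical outputs
candidate.load_state_dict(reference.state_dict())
input_ids = torch.randint(0, 32, (2, 5))
check_equivalence(reference, candidate, input_ids)
```

The tolerances here are arbitrary; a real comparison against HF would likely need values tuned per dtype and device.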

Next steps
[ ] Add the forward pass for the vision tower & special packing utils
[ ] Pull the config into its own class (currently using LLaVA-NeXT's, but we'll need a few extra things here)
[ ] Fix the remaining tests and add image embedding tests
[ ] Refactor to reduce duplication of utils from LLaVA-NeXT; much of the adapter code is currently copied

As this will touch code in other models, I plan to break some pieces out into dependent PRs to avoid one-off interface changes, such as allowing the granite LLM to run without its LM head. I'm opening this as a WIP in case anyone has early thoughts.
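To make the "run without its LM head" idea concrete, here is a rough sketch of one possible shape for that interface. The `TinyCausalLM` class and its `use_lm_head` flag are hypothetical and only illustrate the pattern; the actual granite interface change is not defined in this PR.

```python
import torch
import torch.nn as nn


class TinyCausalLM(nn.Module):
    """Hypothetical toy LM whose output head is optional (not the granite class)."""

    def __init__(self, vocab_size: int = 32, dim: int = 8, use_lm_head: bool = True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.body = nn.Linear(dim, dim)
        # When the head is disabled, forward() returns hidden states for embedding use
        self.lm_head = nn.Linear(dim, vocab_size, bias=False) if use_lm_head else None

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = torch.tanh(self.body(self.embed(input_ids)))
        if self.lm_head is None:
            return hidden  # (batch, seq_len, dim)
        return self.lm_head(hidden)  # (batch, seq_len, vocab_size)


torch.manual_seed(0)
full = TinyCausalLM(use_lm_head=True)
headless = TinyCausalLM(use_lm_head=False)
# strict=False lets the headless model ignore the unused lm_head weights
headless.load_state_dict(full.state_dict(), strict=False)

input_ids = torch.randint(0, 32, (2, 5))
with torch.no_grad():
    logits = full(input_ids)
    hidden = headless(input_ids)
```

With a flag like this, an embedding model could reuse the same checkpoint and decoder stack while skipping the projection to vocabulary logits.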

Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
use_high_precision_pow=True,
)
# Hack for equivalence testing
self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)

We shouldn't merge this, but it's a potential reason for the currently diverging outputs coming out of the vision tower. I wrote a quick local equivalency test that calls the vision tower directly (after loading it with granite vision) and noticed that the outputs are off by a bit without this change, so I'm leaving it here for now.

@kaoutar55 kaoutar55 self-requested a review November 18, 2025 18:46