Fix clip_skip AttributeError on Stable Diffusion pipelines with transformers>=5.6#14043

Open: Sunt-ing wants to merge 1 commit into huggingface:main from Sunt-ing:2
@Sunt-ing

What does this PR do?

Passing clip_skip to any SD1.x-family pipeline crashes on transformers>=5.6:

AttributeError: 'CLIPTextModel' object has no attribute 'text_model'

transformers 5.6 flattened CLIPTextModel (huggingface/transformers#46285): embeddings, encoder, and final_layer_norm became direct submodules and the text_model wrapper was removed (CLIPTextModelWithProjection still wraps them via text_model). The clip_skip branch of encode_prompt re-applies the final LayerNorm by hand via self.text_encoder.text_model.final_layer_norm(...), so it raises as soon as clip_skip is set. The default clip_skip=None path is unaffected because it uses last_hidden_state, which is already normalized, and never touches .text_model.

diffusers declares no upper bound on transformers and is actively migrating to 5.x, so this is a live defect rather than a problem with an unsupported version.

This reuses the exact guard already merged for the from_single_file path in #13843:

text_model = self.text_encoder.text_model if hasattr(self.text_encoder, "text_model") else self.text_encoder
prompt_embeds = text_model.final_layer_norm(prompt_embeds)
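
The guard can be exercised without loading a pipeline. The sketch below uses plain stand-in classes to show that the same expression resolves final_layer_norm under both module layouts. FlatEncoder, WrappedEncoder, and resolve_text_model are hypothetical names for illustration only, not diffusers or transformers APIs, and the string-returning final_layer_norm stubs stand in for the real LayerNorm.

```python
class FlatEncoder:
    """Stand-in for the flattened layout: final_layer_norm is a direct attribute."""

    def final_layer_norm(self, x):
        return f"flat-norm({x})"


class _InnerTextModel:
    """Stand-in for the inner module that the older layout exposes as .text_model."""

    def final_layer_norm(self, x):
        return f"wrapped-norm({x})"


class WrappedEncoder:
    """Stand-in for the wrapped layout: final_layer_norm lives under .text_model."""

    def __init__(self):
        self.text_model = _InnerTextModel()


def resolve_text_model(text_encoder):
    # Hypothetical helper wrapping the same expression as the guard above.
    return text_encoder.text_model if hasattr(text_encoder, "text_model") else text_encoder


flat_result = resolve_text_model(FlatEncoder()).final_layer_norm("h")
wrapped_result = resolve_text_model(WrappedEncoder()).final_layer_norm("h")

# The unguarded access fails on the flattened layout, as in the reported crash.
try:
    FlatEncoder().text_model
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True
```

Because the guard only falls back to the encoder itself when .text_model is missing, it leaves behavior unchanged on encoders that still have the wrapper.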

StableDiffusionPipeline.encode_prompt is the source. make fix-copies propagates the change to its 39 "# Copied from" consumers, and 6 hand-written siblings (alt_diffusion x2, animatediff video2video x2, i2vgen_xl, ledits_pp) are updated to match, for 46 files in total (1 source + 39 copies + 6 siblings). SDXL is not affected: its encode_prompt reads hidden_states[-(clip_skip + 2)] and never calls final_layer_norm, and text_encoder_2 is a CLIPTextModelWithProjection, which still has .text_model.
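
To make the SDXL indexing concrete, here is a minimal sketch of which entry hidden_states[-(clip_skip + 2)] selects. It assumes the usual output_hidden_states layout of one entry for the embedding output followed by one entry per encoder layer. The 12-layer depth, the string labels, and the helper name sdxl_clip_skip_index are illustrative assumptions, not taken from the pipeline code.

```python
def sdxl_clip_skip_index(clip_skip):
    # Hypothetical helper: the index expression quoted above for the SDXL path.
    return -(clip_skip + 2)


# Illustrative 12-layer encoder: entry 0 is the embedding output,
# entries 1..12 are the outputs of encoder layers 1..12.
num_layers = 12
hidden_states = ["embeddings"] + [f"layer_{i}" for i in range(1, num_layers + 1)]

# Map each clip_skip value to the hidden state it would select.
selected = {k: hidden_states[sdxl_clip_skip_index(k)] for k in (1, 2)}
print(selected)  # {1: 'layer_10', 2: 'layer_9'}
```

The selection is pure indexing into intermediate hidden states, so no attribute lookup on the text encoder's submodules is involved on that path.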

Reproduction (CPU, no GPU; transformers 5.12.1)
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("hf-internal-testing/tiny-stable-diffusion-pipe", safety_checker=None)
pipe.set_progress_bar_config(disable=True)
print(type(pipe.text_encoder).__name__, "has .text_model:", hasattr(pipe.text_encoder, "text_model"))

pipe(prompt="a cat", num_inference_steps=2, guidance_scale=0.0, output_type="np")                 # clip_skip=None: OK
pipe(prompt="a cat", num_inference_steps=2, guidance_scale=0.0, output_type="np", clip_skip=1)     # crashes on main
# main:  CLIPTextModel has .text_model: False
#        clip_skip=None -> image (1, 64, 64, 3)
#        clip_skip=1    -> AttributeError: 'CLIPTextModel' object has no attribute 'text_model'
# this:  clip_skip=None and clip_skip=1 both produce (1, 64, 64, 3)

Confirmed identically on real weights (stabilityai/sd-turbo, a CLIPTextModel): clip_skip=None fine, clip_skip=1 raises on main and works with this PR. The crash is in encode_prompt, before the denoise loop, so it is weight-independent and CPU-reproducible.

Tests

Added test_stable_diffusion_clip_skip to StableDiffusionPipelineFastTests: it asserts a clip_skip=1 call runs and that its output differs from clip_skip=None. The test errors on main (the AttributeError) and passes with this PR.

tests/pipelines/stable_diffusion/test_stable_diffusion.py::...::test_stable_diffusion_clip_skip  PASSED

ruff (0.9.10), check_copies, and check_dummies are clean on the changed files.

Before submitting

Who can review?

@asomoza @DN6

…formers>=5.6

transformers 5.6 flattened CLIPTextModel (huggingface/transformers#46285): embeddings/encoder/final_layer_norm became direct submodules and the text_model wrapper was removed. The clip_skip branch of encode_prompt re-applies the final LayerNorm via self.text_encoder.text_model.final_layer_norm, so any SD1.x-family pipeline call with clip_skip set raised AttributeError. Guard the access with the same hasattr check already merged for from_single_file in huggingface#13843, propagated via make fix-copies and applied to the hand-written siblings. SDXL is unaffected (uses hidden_states[-(clip_skip+2)] and a CLIPTextModelWithProjection encoder).

Signed-off-by: Ting Sun <suntcrick@gmail.com>
@github-actions (bot) added the size/L (PR with diff > 200 LOC), tests, and pipelines labels and removed the size/L (PR with diff > 200 LOC) label on Jun 22, 2026.