Feat -- Stable Audio 3 by buffett0323 · Pull Request #14119 · huggingface/diffusers

buffett0323 · 2026-07-03T23:05:16Z

What does this PR do?

Adds Stable Audio 3 Medium to Diffusers.

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Support for Stable Audio 3 Medium #13793
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

…n mask

The diffusers Stable Audio 3 pipeline produced noise instead of music because the cross-attention conditioning was built differently from the reference:

- The reference text conditioner uses padding_mode="learned": padded text positions (~245 of 256 for a short prompt) are filled with a trained `padding_embedding`, and the DiT attends to *all* positions (its cross-attention mask is intentionally disabled). Our pipeline instead zeroed padded positions and masked them out, wiping ~95% of the conditioning signal.

Changes:

- transformer_stable_audio3: add learned `prompt_padding_embedding`; in forward, replace masked text positions with it (in cond_token_dim space, before to_cond_embed), then attend to the full context, matching the reference.
- pipeline_stable_audio_3: stop zeroing padded positions (the DiT now handles them); default `silence_padding_duration` to 6.0 (reference headroom default).
- convert_..._to_diffusers: convert `conditioner.conditioners.prompt.padding_embedding` into the DiT.
- scheduling_ping_pong: pin schedule endpoints (sigmas[0]=1.0, sigmas[-1]=0.0) to match the reference LogSNRShift endpoint preservation.
- run_..._inference: coerce float16->float32 on CPU.
- tests: update transformer state-dict expectations (523 tensors, add prompt_padding_embedding) and ping-pong sigma endpoint assertions.

Verified against the reference with identical noise + identical conditioning: full 8-step ping-pong trajectories agree to 4e-6 (final latent).

Removed the dev-only parity scripts (verify_*_dit_parity, verify_*_vae_parity) from the tree.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
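The two behavioral fixes in the commit message above (learned padding instead of zero-and-mask, and pinned schedule endpoints) can be illustrated with a small sketch. This is not the PR's code: the function names `fill_padded_positions` and `pin_sigma_endpoints` are hypothetical, and it works on plain Python lists for a single unbatched sequence, whereas the actual change operates on tensors inside the transformer and scheduler.

```python
from typing import List, Sequence

Vector = List[float]


def fill_padded_positions(
    text_embeds: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    padding_embedding: Sequence[float],
) -> List[Vector]:
    """Return a copy of text_embeds where every masked (padding) position
    holds the learned padding vector instead of zeros.

    Hypothetical helper: one sequence, one vector per token position.
    """
    if len(text_embeds) != len(attention_mask):
        raise ValueError("text_embeds and attention_mask must have the same length")
    return [
        list(token) if keep else list(padding_embedding)
        for token, keep in zip(text_embeds, attention_mask)
    ]


def pin_sigma_endpoints(sigmas: Sequence[float]) -> List[float]:
    """Return a copy of the schedule with the first sigma forced to 1.0
    and the last forced to 0.0, leaving interior values untouched.

    Hypothetical helper mirroring the endpoint pinning described above.
    """
    if len(sigmas) < 2:
        raise ValueError("a schedule needs at least two sigmas")
    pinned = list(sigmas)
    pinned[0] = 1.0
    pinned[-1] = 0.0
    return pinned


if __name__ == "__main__":
    # Two real tokens followed by one padding position (mask == 0).
    embeds = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
    mask = [1, 1, 0]
    learned_pad = [0.5, -0.5]  # stands in for the trained padding vector
    print(fill_padded_positions(embeds, mask, learned_pad))

    # A shifted schedule whose endpoints drifted away from 1.0 and 0.0.
    print(pin_sigma_endpoints([0.98, 0.6, 0.3, 0.02]))
```

After this replacement every position carries a usable vector, so the cross-attention can attend to the full context without a mask. On batched tensors the same substitution would typically be written with `torch.where`, using the mask broadcast over the embedding dimension.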

buffett0323 and others added 7 commits July 3, 2026 11:47

Stable Audio 3 First Version Commit, require model dict loading on GPUs

782e0d2

Fixing bugs for sa3 VAE

a033678

Make quality fix

82656dd

Fix the unit test failure

6a4713b

Style fix

e10d398

Autopipeline available

7f9ee7e

github-actions bot added the documentation, models, tests, utils, pipelines, schedulers, fixes-issue, and size/L (PR with diff > 200 LOC) labels on Jul 3, 2026

buffett0323 wants to merge 7 commits into huggingface:main from buffett0323:feat/stable-audio-medium
