
[Blog] Muon Optimizer Support in DeepSpeed#7962

Open
delock wants to merge 18 commits into master from gma/muon_blog

Conversation

@delock
Collaborator

@delock delock commented Apr 8, 2026

Author: @PKUWZP & @delock
Blog post introducing Muon optimizer support in DeepSpeed, covering how it integrates with
ZeRO Stage 2/3, measured convergence and memory results, and the roadmap ahead.

delock and others added 16 commits April 8, 2026 23:56
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Replace placeholder claims with actual experiment results:
- Add lr sweep results for both AdamW and Muon optimizers
- Report measured GPU memory: AdamW 34.5 GiB vs Muon 31.4 GiB (9% savings)
- Remove old convergence chart (adamw_vs_muon_3b.png)
- Fix inaccurate claims (Muon 19% better, Adam OOM on 2xA100)
- Add hybrid optimizer explanation and separate lr config docs
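The hybrid scheme mentioned above, where Muon handles hidden 2D weight matrices while everything else falls back to AdamW, comes down to a partition rule over the model's parameters. The sketch below illustrates one such rule; the name substrings used to exclude embeddings and the output head are illustrative assumptions, not DeepSpeed's actual selection logic:

```python
import numpy as np

def split_params(named_params):
    """Partition parameters between Muon and AdamW.

    Heuristic sketch: Muon is defined for hidden 2D weight matrices, so
    embeddings, output heads, norms, and biases stay on AdamW. The name
    filters below are illustrative, not DeepSpeed's real criteria.
    """
    muon, adamw = [], []
    for name, p in named_params:
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

Each group can then carry its own learning rate, which is what the separate lr configuration mentioned above documents.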

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@delock delock marked this pull request as ready for review April 10, 2026 13:04
@delock delock requested review from loadams and tjruwase as code owners April 10, 2026 13:04

…g fixes

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
@delock
Collaborator Author

delock commented Apr 17, 2026

Hi @PKUWZP @tjruwase this PR is ready for review and merge, thanks!

The Muon optimizer has gained momentum, with growing adoption in the community and in large foundation models such as Kimi-K2-Thinking. DeepSpeed now supports the Muon optimizer.

## What is Muon optimizer?
Muon is an optimizer designed for the hidden 2D weight matrices of a neural network. It takes the gradient of a weight, accumulates it into a momentum buffer, applies Newton-Schulz iterations to orthogonalize the momentum matrix, and then uses this orthogonalized matrix to update the weight ([1](https://kellerjordan.github.io/posts/muon/)). Because Muon maintains only one momentum buffer (versus Adam's two), it uses less memory for optimizer states.
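The update described above can be sketched in a few lines of NumPy. Production Muon implementations use a tuned quintic Newton-Schulz polynomial (often in bfloat16) for speed; the classical cubic iteration below is a simplified sketch of the same orthogonalization idea:

```python
import numpy as np

def newton_schulz(G, steps=25, eps=1e-7):
    """Orthogonalize G: drive its singular values toward 1 while keeping
    its singular vectors, via the classical cubic Newton-Schulz step.
    (Real Muon implementations use a tuned quintic variant for speed.)"""
    X = G / (np.linalg.norm(G) + eps)  # Frobenius scaling => singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is small
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update for a 2D weight. The single momentum buffer is the
    only optimizer state, versus Adam's two (first and second moment)."""
    momentum = beta * momentum + grad
    weight = weight - lr * newton_schulz(momentum)
    return weight, momentum
```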
Collaborator

@sfc-gh-truwase sfc-gh-truwase Apr 20, 2026


The citations appear broken when rendered. For example, see weight[1] below.



![Muon optimizer convergence on Qwen2.5-3B](images/muon_loss_3b.png)

Muon optimizer converges smoothly and shows no overfitting during finetuning.
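Enabling Muon in a DeepSpeed run would go through the usual JSON config. The sketch below is an illustrative guess at the shape only: the optimizer type string and the parameter key names (particularly the separate AdamW learning rate for the hybrid fallback) are assumptions here, so consult the DeepSpeed documentation for the exact schema:

```python
# Illustrative DeepSpeed config sketch; key names under "optimizer" are
# assumptions, not the verified schema. Check the DeepSpeed docs.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "Muon",
        "params": {
            "lr": 2e-2,        # Muon lr for hidden 2D weights
            "momentum": 0.95,
            "adamw_lr": 3e-4,  # assumed separate lr for the AdamW fallback
        },
    },
}
```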
Collaborator

Can we show side-by-side convergence comparison with AdamW?

Collaborator

I think this phrase, "hidden 2D weights," requires more introduction and description. This section should also give the high-level memory-efficiency advantage of Muon over Adam before diving into implementation details.
