
[Blog] Muon Optimizer Support in DeepSpeed#7962

Open
delock wants to merge 18 commits into master from gma/muon_blog

Conversation

@delock
Collaborator

@delock delock commented Apr 8, 2026

Author: @PKUWZP & @delock
Blog post introducing Muon optimizer support in DeepSpeed, covering how it integrates with
ZeRO Stage 2/3, measured convergence and memory results, and the roadmap ahead.

delock and others added 16 commits April 8, 2026 23:56
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Replace placeholder claims with actual experiment results:
- Add lr sweep results for both AdamW and Muon optimizers
- Report measured GPU memory: AdamW 34.5 GiB vs Muon 31.4 GiB (9% savings)
- Remove old convergence chart (adamw_vs_muon_3b.png)
- Fix inaccurate claims (Muon 19% better, Adam OOM on 2xA100)
- Add hybrid optimizer explanation and separate lr config docs
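The hybrid scheme mentioned above, where Muon handles hidden 2D weight matrices while everything else falls back to AdamW, comes down to a partition rule over the model's parameters. The sketch below illustrates one such rule; the name substrings used to exclude embeddings and the output head are illustrative assumptions, not DeepSpeed's actual selection logic:

```python
import numpy as np

def split_params(named_params):
    """Partition parameters between Muon and AdamW.

    Heuristic sketch: Muon is defined for hidden 2D weight matrices, so
    embeddings, output heads, norms, and biases stay on AdamW. The name
    filters below are illustrative, not DeepSpeed's real criteria.
    """
    muon, adamw = [], []
    for name, p in named_params:
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```

Each group can then carry its own learning rate, which is what the separate lr configuration mentioned above documents.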

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@delock delock marked this pull request as ready for review April 10, 2026 13:04
@delock delock requested review from loadams and tjruwase as code owners April 10, 2026 13:04

…g fixes

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
@delock
Collaborator Author

delock commented Apr 17, 2026

Hi @PKUWZP @tjruwase this PR is ready for review and merge, thanks!

The Muon optimizer has gained momentum, with growing adoption in the community and in large foundation models such as Kimi-K2-Thinking. DeepSpeed now supports the Muon optimizer.

## What is Muon optimizer?
Muon is an optimizer designed for the hidden 2D weight matrices of a neural network. It takes the gradient of a weight, accumulates it into a momentum buffer, applies Newton-Schulz iterations to orthogonalize the momentum matrix, and then uses this orthogonalized matrix to update the weight ([1](https://kellerjordan.github.io/posts/muon/)). Because Muon maintains only one momentum buffer (versus Adam's two), it uses less memory for optimizer states.
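The update described above can be sketched in a few lines of NumPy. Production Muon implementations use a tuned quintic Newton-Schulz polynomial (often in bfloat16) for speed; the classical cubic iteration below is a simplified sketch of the same orthogonalization idea:

```python
import numpy as np

def newton_schulz(G, steps=25, eps=1e-7):
    """Orthogonalize G: drive its singular values toward 1 while keeping
    its singular vectors, via the classical cubic Newton-Schulz step.
    (Real Muon implementations use a tuned quintic variant for speed.)"""
    X = G / (np.linalg.norm(G) + eps)  # Frobenius scaling => singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is small
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update for a 2D weight. The single momentum buffer is the
    only optimizer state, versus Adam's two (first and second moment)."""
    momentum = beta * momentum + grad
    weight = weight - lr * newton_schulz(momentum)
    return weight, momentum
```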
Collaborator

@sfc-gh-truwase sfc-gh-truwase Apr 20, 2026


The citations appear broken when rendered. For example, see weight[1] below.



![Muon optimizer convergence on Qwen2.5-3B](images/muon_loss_3b.png)

Muon optimizer converges smoothly and shows no overfitting during finetuning.
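Enabling Muon in a DeepSpeed run would go through the usual JSON config. The sketch below is an illustrative guess at the shape only: the optimizer type string and the parameter key names (particularly the separate AdamW learning rate for the hybrid fallback) are assumptions here, so consult the DeepSpeed documentation for the exact schema:

```python
# Illustrative DeepSpeed config sketch; key names under "optimizer" are
# assumptions, not the verified schema. Check the DeepSpeed docs.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "Muon",
        "params": {
            "lr": 2e-2,        # Muon lr for hidden 2D weights
            "momentum": 0.95,
            "adamw_lr": 3e-4,  # assumed separate lr for the AdamW fallback
        },
    },
}
```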
Collaborator

Can we show side-by-side convergence comparison with AdamW?

Collaborator

I think this phrase, "hidden 2D weights," requires more introduction and description. This section should also give the high-level memory-efficiency advantage of Muon over Adam before diving into implementation details.
