[Perf] Adjust KV Cache for torch.compile friendly by tsdocode · Pull Request #163 · nari-labs/dia

tsdocode · 2025-05-02T11:33:08Z

Builds on top of #162.

Adjust KVCache in the style of gpt-fast:
- Use an external current_index tensor instead of a class attribute, which caused: skipping cudagraphs due to mutated inputs (36 instances)
- Add a benchmark script
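
For readers who have not seen the gpt-fast pattern, here is a minimal sketch of the idea, not the code merged in this PR. The cache keeps only preallocated buffers, and the write position is passed in as a tensor on every call instead of being stored and incremented on the module. The class layout, argument names, and shapes below are assumptions made for illustration.

```python
import torch
from torch import nn


class KVCache(nn.Module):
    """Preallocated key/value cache (hypothetical sketch, not Dia's actual class)."""

    def __init__(self, batch: int, heads: int, max_len: int, head_dim: int) -> None:
        super().__init__()
        shape = (batch, heads, max_len, head_dim)
        # Fixed-size buffers: their shape never changes during decoding.
        self.register_buffer("k", torch.zeros(shape))
        self.register_buffer("v", torch.zeros(shape))

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor, current_idx: torch.Tensor):
        # current_idx is an int64 tensor of positions supplied by the caller.
        # The module itself holds no step counter.
        self.k[:, :, current_idx] = k_new
        self.v[:, :, current_idx] = v_new
        return self.k, self.v


cache = KVCache(batch=1, heads=2, max_len=8, head_dim=4)
current_idx = torch.tensor([0])
for step in range(3):
    k_new = torch.full((1, 2, 1, 4), float(step + 1))
    v_new = torch.full((1, 2, 1, 4), float(step + 1))
    keys, values = cache.update(k_new, v_new, current_idx)
    # The position is advanced by the caller, outside the cache.
    current_idx = current_idx + 1
```

In a real decode loop the step function that calls `update` would be the part wrapped in `torch.compile`; that is left out here so the sketch runs on CPU without a compiler toolchain.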

Result:
A100:

generate step 86: speed=103.650 tokens/s, realtime factor=1.205x
generate step 172: speed=192.087 tokens/s, realtime factor=2.234x
generate step 258: speed=191.480 tokens/s, realtime factor=2.227x
generate step 344: speed=192.203 tokens/s, realtime factor=2.235x
generate step 430: speed=192.052 tokens/s, realtime factor=2.233x

4090:

generate step 86: speed=78.049 tokens/s, realtime factor=0.908x
generate step 172: speed=197.427 tokens/s, realtime factor=2.296x
generate step 258: speed=197.933 tokens/s, realtime factor=2.302x
generate step 344: speed=197.997 tokens/s, realtime factor=2.302x
generate step 430: speed=197.914 tokens/s, realtime factor=2.301x
generate step 516: speed=197.846 tokens/s, realtime factor=2.301x
generate step 602: speed=197.875 tokens/s, realtime factor=2.301x
generate step 688: speed=197.797 tokens/s, realtime factor=2.300x
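
A note on reading these logs: in every line above, the realtime factor equals the speed divided by roughly 86, and a line is printed every 86 steps. That suggests about 86 generated steps per second of audio, but the constant is inferred from the numbers here, not taken from the benchmark script. A small sketch under that assumption:

```python
# Assumption inferred from the logs: ~86 generation steps per second of audio.
FRAMES_PER_AUDIO_SECOND = 86.0


def realtime_factor(tokens_per_second: float,
                    frames_per_audio_second: float = FRAMES_PER_AUDIO_SECOND) -> float:
    """Seconds of audio produced per wall-clock second."""
    return tokens_per_second / frames_per_audio_second


# (speed, logged realtime factor) pairs copied from the results above.
logged = [(103.650, 1.205), (192.087, 2.234), (78.049, 0.908), (197.427, 2.296)]
for speed, reported in logged:
    print(f"{speed:.3f} tokens/s -> {realtime_factor(speed):.3f}x (logged {reported:.3f}x)")
```

A factor above 1.0x means audio is generated faster than it plays back, which is why the first step on the 4090 (0.908x, likely still paying compilation and warm-up cost) is the only line below realtime.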

buttercrab

LGTM

JakubCzarlinski · 2025-05-02T14:05:51Z

@tsdocode Amazing work here!

Tested this after the merge. I get another 30 tokens/s compared to #162 in my environment.

generate step 344: speed=234.115 tokens/s, realtime factor=2.722x
generate step 430: speed=233.398 tokens/s, realtime factor=2.714x
generate step 516: speed=234.825 tokens/s, realtime factor=2.731x
generate step 602: speed=233.978 tokens/s, realtime factor=2.721x
generate step 688: speed=233.303 tokens/s, realtime factor=2.713x
generate step 774: speed=233.219 tokens/s, realtime factor=2.712x

JakubCzarlinski and others added 4 commits April 30, 2025 21:12

Fix fullgraph=True compilation leading to improved runtime and memory usage.

aaa1de8

Enable cuda.matmul.allow_tf32 flag for performance.

ad0b67b

feat: adjust KVCache for torch.compile friendly

997882f

chore: add benchmark script

5940428

tsdocode mentioned this pull request May 2, 2025

Fix compiling with fullgraph=True to improve inference time. #162

Closed

fix lints & format

fd28a61

buttercrab approved these changes May 2, 2025

buttercrab merged commit 052a840 into nari-labs:main May 2, 2025
1 check passed

tsdocode mentioned this pull request May 6, 2025

Is this model suitable for real-time conversational TTS? #181

Closed

This was referenced Jun 23, 2025

Feature implementation from commits bdf7f90..6b71cc7 yashuatla/dia#1

Open

Feature implementation from commits 47d42f5..4dcff1c yashuatla/dia#2

Open

jeffDebug mentioned this pull request Dec 6, 2025

How to make DIA-1.6B run in real-time stream mode? #287

Closed

buttercrab merged 5 commits into nari-labs:main from tsdocode:perf/torch-compile

3 participants