[Perf] Adjust KV Cache for torch.compile friendly #163

Merged
buttercrab merged 5 commits into nari-labs:main from tsdocode:perf/torch-compile
May 2, 2025

@tsdocode tsdocode commented May 2, 2025

Builds on top of #162.

  • Adjust KVCache in the style of gpt-fast:
    • Use an external current_index tensor instead of a class attribute, which caused: skipping cudagraphs due to mutated inputs (36 instances)
  • Add a benchmark script
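
A minimal sketch of the pattern described above, assuming PyTorch is installed. The class name `SketchKVCache`, its shapes, and the `update` signature are hypothetical and are not taken from this PR's diff. The only idea borrowed from the description is that the write position lives in a tensor owned by the caller rather than in an attribute on the cache.

```python
import torch
from torch import nn


class SketchKVCache(nn.Module):
    """Hypothetical preallocated KV cache; the write position is passed in."""

    def __init__(self, batch, heads, max_len, head_dim, dtype=torch.float32):
        super().__init__()
        shape = (batch, heads, max_len, head_dim)
        # Buffers are allocated once, so their shapes never change between steps.
        self.register_buffer("k", torch.zeros(shape, dtype=dtype))
        self.register_buffer("v", torch.zeros(shape, dtype=dtype))

    def update(self, current_index, k_new, v_new):
        # current_index: int64 tensor of shape [1], owned by the decode loop.
        # k_new / v_new: [batch, heads, 1, head_dim] for a single decode step.
        self.k[:, :, current_index] = k_new
        self.v[:, :, current_index] = v_new
        return self.k, self.v


cache = SketchKVCache(batch=1, heads=2, max_len=4, head_dim=3)
current_index = torch.zeros(1, dtype=torch.long)

for step in range(3):
    k_new = torch.full((1, 2, 1, 3), float(step + 1))
    v_new = -k_new
    k_all, v_all = cache.update(current_index, k_new, v_new)
    # The loop, not the cache, advances the position.
    current_index += 1
```

The cache itself holds no Python-side counter here; the position is advanced in the loop, outside `update`. Whether this removes the cudagraph warning in a given setup should be checked by running under torch.compile on the target GPU.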

Result:
A100:

generate step 86: speed=103.650 tokens/s, realtime factor=1.205x
generate step 172: speed=192.087 tokens/s, realtime factor=2.234x
generate step 258: speed=191.480 tokens/s, realtime factor=2.227x
generate step 344: speed=192.203 tokens/s, realtime factor=2.235x
generate step 430: speed=192.052 tokens/s, realtime factor=2.233x

4090:

generate step 86: speed=78.049 tokens/s, realtime factor=0.908x
generate step 172: speed=197.427 tokens/s, realtime factor=2.296x
generate step 258: speed=197.933 tokens/s, realtime factor=2.302x
generate step 344: speed=197.997 tokens/s, realtime factor=2.302x
generate step 430: speed=197.914 tokens/s, realtime factor=2.301x
generate step 516: speed=197.846 tokens/s, realtime factor=2.301x
generate step 602: speed=197.875 tokens/s, realtime factor=2.301x
generate step 688: speed=197.797 tokens/s, realtime factor=2.300x
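
The log lines above are consistent with the realtime factor being the speed divided by 86 (for example, 103.650 / 1.205 is about 86, and lines are printed every 86 steps). A small sketch that parses a line and recomputes the factor follows. The constant 86 is inferred from these numbers and is an assumption, not something stated in the PR; the helper names are hypothetical.

```python
import re

# Assumption: inferred from the logs (speed / realtime factor is about 86).
STEPS_PER_AUDIO_SECOND = 86

LOG_LINE = re.compile(
    r"generate step (\d+): speed=([\d.]+) tokens/s, realtime factor=([\d.]+)x"
)


def parse_log_line(line):
    """Return (step, tokens_per_second, realtime_factor), or None if no match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def realtime_factor(tokens_per_second, steps_per_second=STEPS_PER_AUDIO_SECOND):
    """Recompute the factor as generation speed over the assumed step rate."""
    return tokens_per_second / steps_per_second


sample = "generate step 172: speed=192.087 tokens/s, realtime factor=2.234x"
step, speed, logged_factor = parse_log_line(sample)
recomputed = realtime_factor(speed)
```

Under this assumption, a factor above 1.0 means generation outpaces the assumed step rate, which matches the first 4090 line (0.908x) falling below realtime before the speed stabilizes.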

@buttercrab buttercrab left a comment

LGTM

@buttercrab buttercrab merged commit 052a840 into nari-labs:main May 2, 2025
1 check passed
@JakubCzarlinski

@tsdocode Amazing work here!

Tested this after the merge: roughly another 30 tokens/s compared to #162 in my environment.

generate step 344: speed=234.115 tokens/s, realtime factor=2.722x
generate step 430: speed=233.398 tokens/s, realtime factor=2.714x
generate step 516: speed=234.825 tokens/s, realtime factor=2.731x
generate step 602: speed=233.978 tokens/s, realtime factor=2.721x
generate step 688: speed=233.303 tokens/s, realtime factor=2.713x
generate step 774: speed=233.219 tokens/s, realtime factor=2.712x

3 participants