Perf: Some small enhancements #229
Conversation
Could you please sync your repo with the original? Yours is 16 commits behind, and it doesn't work on a MacBook due to some issues I fixed later, which aren't included in your repo. I'm mentioning this because I want to test it.
Could you fix the lint & format?
@tsdocode I know the issue is closed, but I did get to test it. I saw a ~30% increase in processing speed on my M3 MacBook Pro with 36GB of unified memory. It's a huge difference, very noticeable as well. Here are the logs from my recent run of example/simple-mac.py:

```
generate step 86: speed=6.268 tokens/s, realtime factor=0.073x
generate step 172: speed=11.486 tokens/s, realtime factor=0.134x
generate step 258: speed=11.468 tokens/s, realtime factor=0.133x
generate step 344: speed=11.472 tokens/s, realtime factor=0.133x
generate step 430: speed=11.379 tokens/s, realtime factor=0.132x
generate step 516: speed=11.226 tokens/s, realtime factor=0.131x
generate step 602: speed=11.396 tokens/s, realtime factor=0.133x
generate step 688: speed=11.337 tokens/s, realtime factor=0.132x
generate: avg steps=758.0, total duration=75.467s
```

And below are the logs of a previous run of the same script with the previous version:

```
generate: starting generation loop
generate step 86: speed=7.759 tokens/s, realtime factor=0.090x
generate step 172: speed=8.470 tokens/s, realtime factor=0.098x
generate step 258: speed=8.536 tokens/s, realtime factor=0.099x
generate step 344: speed=8.489 tokens/s, realtime factor=0.099x
generate step 430: speed=8.607 tokens/s, realtime factor=0.100x
generate step 516: speed=8.615 tokens/s, realtime factor=0.100x
generate step 602: speed=8.592 tokens/s, realtime factor=0.100x
generate step 688: speed=8.597 tokens/s, realtime factor=0.100x
generate: avg steps=747.0, total duration=91.026s
```
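A quick arithmetic check of those numbers: steady-state throughput goes from ≈8.6 to ≈11.3 tokens/s, roughly a 32% gain, consistent with the ~30% claim. The realtime factor also appears to be tokens/s divided by the codec frame rate, assuming Dia's codec emits ≈86 audio frames per second (44100 Hz / 512-sample hop; the 86-step logging interval points the same way):

```python
# Assumption: Dia's audio codec produces ~86 frames per second
# (44100 Hz sample rate / 512-sample hop).
frame_rate = 44100 / 512                    # ≈ 86.13 frames/s

old_speed, new_speed = 8.597, 11.337        # tokens/s at step 688 of each run
print(f"speed-up: {new_speed / old_speed - 1:.1%}")        # ≈ 31.9%
print(f"realtime factor: {new_speed / frame_rate:.3f}x")   # ≈ 0.132x, matches the log
```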
@V12Hero Add this to your code before running generate; it fuses the QKV operation, so maybe it will help speed things up a little more:
@tsdocode is this how you would want it placed?

```python
from dia.model import Dia
model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")
text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
for idx in range(len(model.model.decoder.layers)):
layer = model.model.decoder.layers[idx]
layer.self_attention.patch_fused_qkv()
# Set `use_torch_compile=False` when running Dia on macOS,
# because `torch.compile` is not supported there.
output = model.generate(text, use_torch_compile=False, verbose=True)
model.save_audio("simple.mp3", output)
```
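For context on what `patch_fused_qkv()` presumably does conceptually: instead of running three separate Q/K/V projections, one wider projection is run once and split, trading three small matmuls (and kernel launches) for one larger one. A minimal sketch of the pattern, illustrative only and not Dia's actual implementation (the dimensions here are made up):

```python
import torch
import torch.nn as nn

d = 1024                                    # hypothetical embedding size
x = torch.randn(2, 16, d)                   # (batch, time, dim)

# Unfused: three separate projections -> three kernel launches.
q_proj, k_proj, v_proj = (nn.Linear(d, d, bias=False) for _ in range(3))
q, k, v = q_proj(x), k_proj(x), v_proj(x)

# Fused: one wider projection, then split -> one larger matmul.
qkv_proj = nn.Linear(d, 3 * d, bias=False)
# Reuse the same weights so the outputs match (up to float rounding).
with torch.no_grad():
    qkv_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight]))
q2, k2, v2 = qkv_proj(x).chunk(3, dim=-1)

assert torch.allclose(q, q2, atol=1e-5) and torch.allclose(v, v2, atol=1e-5)
```

The single fused matmul reads the input once and tends to launch fewer kernels, which is where the small extra speed-up on GPU/MPS would come from.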
|
Yes
Ok, I can't see any difference. The slight dip in performance is because I'm now running a bit low on battery, but overall I'd say the performance is the same:

```
generate: starting generation loop
generate step 86: speed=9.641 tokens/s, realtime factor=0.112x
generate step 172: speed=11.205 tokens/s, realtime factor=0.130x
generate step 258: speed=11.287 tokens/s, realtime factor=0.131x
generate step 344: speed=11.277 tokens/s, realtime factor=0.131x
generate step 430: speed=11.266 tokens/s, realtime factor=0.131x
generate step 516: speed=11.194 tokens/s, realtime factor=0.130x
generate step 602: speed=11.253 tokens/s, realtime factor=0.131x
generate step 688: speed=11.281 tokens/s, realtime factor=0.131x
generate step 774: speed=11.173 tokens/s, realtime factor=0.130x
generate: avg steps=772.0, total duration=71.836s
```

Did in PR:
- Split `Attention` into 2 classes, `SelfAttention` and `CrossAttention`, for further optimization (a sketch of the idea follows below)

Test scripts:

Result on A100 80GB:

Other room for speed-up:
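For readers skimming the diff, here is a minimal sketch of what that split enables; this is illustrative only, not the PR's actual code. Self-attention can fuse its QKV projection and apply a causal mask, while cross-attention's keys/values come from the encoder output, which is fixed during decoding and can therefore be projected once and reused at every step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Q, K, V all derive from the decoder stream, so one fused projection works."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.h, self.hd = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = (z.view(b, t, self.h, self.hd).transpose(1, 2)
                   for z in self.qkv(x).chunk(3, dim=-1))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

class CrossAttention(nn.Module):
    """K/V derive from the fixed encoder output: project once, reuse every step."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.h, self.hd = heads, dim // heads
        self.q = nn.Linear(dim, dim, bias=False)
        self.kv = nn.Linear(dim, 2 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def precompute_kv(self, enc: torch.Tensor):
        b, s, _ = enc.shape
        k, v = (z.view(b, s, self.h, self.hd).transpose(1, 2)
                for z in self.kv(enc).chunk(2, dim=-1))
        return k, v

    def forward(self, x: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q(x).view(b, t, self.h, self.hd).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v)
        return self.out(y.transpose(1, 2).reshape(b, t, d))
```

In a generation loop, `precompute_kv` (a hypothetical helper here) would be called once on the encoder output and its result passed to every decoding step, which is the kind of optimization the class split makes easy to express.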