Comparing changes

- Added fattn-gdn.cuh/cu: flash attention CUDA kernels for S_v = 16, 32, 64, 128
- Added dispatch logic in gated_delta_net.cu (enabled when n_tokens > 32 && K == 1)
- Added C++ unit tests (6 tests covering basic, correctness, seq lengths, KDA, state retention, performance)
- Added Python integration tests
- Added documentation and benchmark scripts
- Updated ggml-cuda.cu and CMakeLists.txt for integration

Expected 1.5x-3.5x+ speedup for sequences of 64-1024+ tokens
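The dispatch condition in the commit message can be sketched as a small predicate. This is an illustration only, not the code in gated_delta_net.cu: the function names `flash_gdn_supports_head_size` and `use_flash_gdn` are hypothetical, and folding the S_v check into the dispatch decision is an assumption (the message only states that kernels exist for those four sizes and that the path is enabled when n_tokens > 32 && K == 1).

```cpp
#include <cstdint>

// Hypothetical helper: true for the S_v values the commit message says
// the flash kernels were added for (16, 32, 64, 128).
constexpr bool flash_gdn_supports_head_size(int64_t S_v) {
    return S_v == 16 || S_v == 32 || S_v == 64 || S_v == 128;
}

// Hypothetical dispatch predicate. The n_tokens and K conditions are taken
// from the commit message; requiring a supported S_v here is an assumption.
constexpr bool use_flash_gdn(int64_t n_tokens, int64_t K, int64_t S_v) {
    return n_tokens > 32 && K == 1 && flash_gdn_supports_head_size(S_v);
}

// Compile-time checks of the boundary cases.
static_assert(use_flash_gdn(33, 1, 64),  "just above the token threshold");
static_assert(!use_flash_gdn(32, 1, 64), "threshold is strict: 32 tokens is not enough");
static_assert(!use_flash_gdn(64, 2, 64), "K != 1 does not take the flash path");
static_assert(!use_flash_gdn(64, 1, 48), "S_v = 48 has no kernel in this sketch");
```

Note that the comparison is strict, so a batch of exactly 32 tokens would not take the flash path under this reading of the message.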

…Q/K head broadcast indexing in flash GDN


Commits on May 21, 2026
