Comparing changes

base repository: ggml-org/llama.cpp
base: master
head repository: opensensor/llama.cpp
compare: master
  • 2 commits
  • 14 files changed
  • 1 contributor

Commits on May 21, 2026

  1. feat: implement flash attention for gated delta net (Qwen3Next)

    - Added fattn-gdn.cuh/cu: Flash attention CUDA kernels for S_v = 16, 32, 64, 128
    - Added dispatch logic in gated_delta_net.cu (enabled when n_tokens > 32 && K == 1)
    - Added C++ unit tests (6 tests covering basic, correctness, seq lengths, KDA, state retention, performance)
    - Added Python integration tests
    - Added documentation and benchmark scripts
    - Updated ggml-cuda.cu and CMakeLists.txt for integration
    
    Expected 1.5x-3.5x+ speedup for sequences 64-1024+ tokens
    matteius committed May 21, 2026
    c7f2b19
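The dispatch condition quoted in the first commit message (n_tokens > 32 && K == 1, with kernels for S_v = 16, 32, 64, 128) can be sketched as a small host-side predicate. This is only an illustration under stated assumptions: the names `GdnPath`, `is_supported_state_size`, and `choose_gdn_path` are hypothetical and do not come from the commits, and folding the S_v check into the same predicate is a guess about how the pieces fit together, not a description of gated_delta_net.cu.

```cpp
#include <cstdint>

// Hypothetical names for illustration; the commit message states only the
// condition "n_tokens > 32 && K == 1" and the supported S_v values.
enum class GdnPath { Recurrent, Flash };

// The commit lists flash kernels for S_v = 16, 32, 64, 128.
inline bool is_supported_state_size(int64_t s_v) {
    return s_v == 16 || s_v == 32 || s_v == 64 || s_v == 128;
}

// Pick the flash path only when the stated condition holds.
// Assumption: unsupported S_v values fall back to the existing path.
inline GdnPath choose_gdn_path(int64_t n_tokens, int64_t k, int64_t s_v) {
    const bool long_enough = n_tokens > 32; // strictly greater, per the commit
    const bool single_k    = (k == 1);      // "K" as named in the commit message
    if (long_enough && single_k && is_supported_state_size(s_v)) {
        return GdnPath::Flash;
    }
    return GdnPath::Recurrent;
}
```

Note that the threshold is strict, so a batch of exactly 32 tokens would stay on the existing path under this reading.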
  2. makes fused GDN CUDA dispatch explicit for KDA/non-KDA, and corrects Q/K head broadcast indexing in flash GDN

    matteius committed May 21, 2026
    227a2fc
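The second commit mentions correcting Q/K head broadcast indexing. The commit title does not say which mapping was wrong or which one is now used, so the sketch below only shows two common conventions for mapping a value head to a shared Q/K head when there are fewer Q/K heads than value heads. The function names and the assumption that the value-head count is a multiple of the Q/K-head count are hypothetical.

```cpp
#include <cstdint>

// Grouped convention: consecutive value heads share one Q/K head.
// Assumes n_v_heads is a positive multiple of n_qk_heads.
inline int64_t qk_head_grouped(int64_t v_head, int64_t n_v_heads, int64_t n_qk_heads) {
    const int64_t group_size = n_v_heads / n_qk_heads;
    return v_head / group_size;
}

// Tiled (modulo) convention: Q/K heads repeat cyclically across value heads.
inline int64_t qk_head_tiled(int64_t v_head, int64_t n_qk_heads) {
    return v_head % n_qk_heads;
}
```

With 8 value heads and 2 Q/K heads, value head 1 maps to Q/K head 0 under the grouped convention but to Q/K head 1 under the tiled one, which is why mixing the two up produces wrong results without any out-of-bounds access.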