Releases: ggml-org/llama.cpp

b8907

23 Apr 23:00
12568ca

b8905

23 Apr 21:38
0949beb

b8902

23 Apr 13:21
550d684

b8901

23 Apr 10:59
8635e22

b8893

23 Apr 02:36
6217b49

b8892

22 Apr 22:41
0d0764d

[WebGPU] Implement async tensor api and event api (#22099)

  • Only run webgpu CI on my fork

  • Implement set_tensor_async

  • Implement synchronize api

  • Implement event creation and deletion API

  • Cleanup

  • Cleanup

  • Comment out jobs for local CI run

  • Add webgpu only workflow

  • Delete .github/workflows/build-webgpu.yml

  • Cleanup

  • Cleanup

  • Update API with function handlers

  • Run clang-format

  • Replace one-shot buffer with a direct queue.WriteBuffer using the buffer context

b8891

22 Apr 22:37
6da7168

b8890

22 Apr 22:34
8bccdbb

chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs (#22217)

  • chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs

  • Fix ty errors.

  • Fix flake8 errors

b8889

22 Apr 22:20
bcb5eeb

b8888

22 Apr 22:09
225088e

sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119)

  • sycl: size mul_mat_id staging buffers by routed rows

Previously src1_contiguous/dst_contiguous in ggml_sycl_mul_mat_id were
sized to ggml_nelements(src1/dst), which over-allocates when ne12 > 1
and can fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero for
MoE models (notably with --cpu-moe). Size them by the actual number of
routed rows (ids->ne[1] * n_ids) instead.

  • sycl: add bf16 mul_mat fast path via DNNL

When src0 is BF16 (commonly the case for lm_head / output.weight), the
existing f16 path is skipped because bf16 isn't covered, and the f32
fallback dequantizes the entire src0 slab to f32 in a single pool alloc
(row_diff*ne00 floats). For large-vocab models this can reach several
GB and fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero.

Add a bf16xbf16 -> f32 DNNL matmul fast path that uses the bf16 storage
in place and only materializes a small src1 bf16 conversion buffer. bf16
matmul accumulates in f32, so it's correct even when the op requests
GGML_PREC_F32 (as lm_head does).

  • gemm.hpp: map bfloat16 to dnnl::memory::data_type::bf16.
  • convert.{hpp,cpp}: expose ggml_get_to_bf16_sycl for f32/f16/bf16 -> bf16.
  • ggml-sycl.cpp: take the bf16 path early in ggml_sycl_op_mul_mat_sycl
    when DNNL and GGML_SYCL_HAS_BF16 are both available.
