Releases: ggml-org/llama.cpp

b8907

23 Apr 23:00
12568ca

b8905

23 Apr 21:38
0949beb

b8902

23 Apr 13:21
550d684

b8901

23 Apr 10:59
8635e22

b8893

23 Apr 02:36
6217b49

b8892

22 Apr 22:41
0d0764d

[WebGPU] Implement async tensor api and event api (#22099)

  • Only run webgpu CI on my fork

  • Implement set_tensor_async

  • Implement synchronize api

  • Implement event creation and deletion API

  • Cleanup

  • Cleanup

  • Comment out jobs for local CI run

  • Add webgpu only workflow

  • Delete .github/workflows/build-webgpu.yml

  • Cleanup

  • Cleanup

  • Update API with function handlers

  • Run clang-format

  • Replace one-shot buffer with a direct queue.WriteBuffer using the buffer context

b8891

22 Apr 22:37
6da7168

b8890

22 Apr 22:34
8bccdbb

chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs (#22217)

  • chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs

  • Fix ty errors.

  • Fix flake8 errors

b8889

22 Apr 22:20
bcb5eeb

b8888

22 Apr 22:09
225088e

sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119)

  • sycl: size mul_mat_id staging buffers by routed rows

Previously src1_contiguous/dst_contiguous in ggml_sycl_mul_mat_id were
sized to ggml_nelements(src1/dst), which over-allocates when ne12 > 1
and can fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero for
MoE models (notably with --cpu-moe). Size them by the actual number of
routed rows (ids->ne[1] * n_ids) instead.

  • sycl: add bf16 mul_mat fast path via DNNL

When src0 is BF16 (commonly the case for lm_head / output.weight), the
existing f16 path is skipped because bf16 isn't covered, and the f32
fallback dequantizes the entire src0 slab to f32 in a single pool alloc
(row_diff*ne00 floats). For large-vocab models this can reach several
GB and fail with UR_RESULT_ERROR_OUT_OF_HOST_MEMORY on Level Zero.

Add a bf16xbf16 -> f32 DNNL matmul fast path that uses the bf16 storage
in place and only materializes a small src1 bf16 conversion buffer. bf16
matmul accumulates in f32, so it's correct even when the op requests
GGML_PREC_F32 (as lm_head does).

  • gemm.hpp: map bfloat16 to dnnl::memory::data_type::bf16.
  • convert.{hpp,cpp}: expose ggml_get_to_bf16_sycl for f32/f16/bf16 -> bf16.
  • ggml-sycl.cpp: take the bf16 path early in ggml_sycl_op_mul_mat_sycl
    when DNNL and GGML_SYCL_HAS_BF16 are both available.
