ivanopcode/devnote-override-macos-metal-vram-cap

Dev Note — Override macOS Metal “VRAM” Cap on Apple Silicon for Local LLMs

Based on: Greg Stencel, "Apple Silicon limitations with usage on local LLM". Source: https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html

Why this matters

On Apple Silicon, Metal reports a recommendedMaxWorkingSetSize of roughly 75% of unified RAM. Tools that run on the Metal backend, such as Ollama and llama.cpp, treat this value as a hard ceiling: about 96 GB on a 128 GB M1 Ultra, for example. An undocumented sysctl parameter raises the cap so that larger models fit entirely on the GPU.



Prerequisites

  • Apple Silicon Mac with unified memory.
  • macOS Sonoma (14+) recommended; older versions supported with a different key.
  • Admin privileges for sudo.
  • Willingness to accept unsupported/undocumented tweaks (may impact stability).

Quick Start (Sonoma 14+)

Increase Metal’s working set so larger LLMs stay on-GPU:

# Example: set ~120 GB on a 128 GB machine (122880 MB)
sudo sysctl iogpu.wired_limit_mb=122880

No reboot required. See Verify.


Choose a Safe Limit

Leave 8–16 GB for macOS and other processes to avoid memory pressure. The 32 GB example below leaves only 2–4 GB, so watch memory pressure closely on smaller machines.

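| Total RAM | Typical default cap (~75%) | Example raised cap | iogpu.wired_limit_mb value |
|-----------|----------------------------|--------------------|----------------------------|
| 32 GB     | ~21–24 GB                  | ~28–30 GB          | 28672–30720                |
| 64 GB     | ~48 GB                     | ~56 GB             | 57344                      |
| 128 GB    | ~96 GB                     | ~120 GB            | 122880                     |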

Formula: desired_MB = desired_GiB * 1024
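
The formula can be wrapped in a small shell helper so the value is computed rather than typed by hand. This is a minimal sketch; `gib_to_mb` is a hypothetical name introduced here for illustration, not part of macOS or any LLM tool.

```shell
# gib_to_mb GIB: convert a whole number of GiB into the MB value
# expected by iogpu.wired_limit_mb (hypothetical helper name).
gib_to_mb() {
  echo $(( $1 * 1024 ))
}

# Usage (not run here):
#   gib_to_mb 120        # prints 122880
#   sudo sysctl iogpu.wired_limit_mb="$(gib_to_mb 120)"
```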


Set the Limit (Sonoma 14+)

Use the MB-based sysctl:

# Check current value (0 = system default ~75%)
sysctl iogpu.wired_limit_mb

# Set new cap (e.g., 120 GB)
sudo sysctl iogpu.wired_limit_mb=122880

Set the Limit (Ventura/Monterey)

Older macOS used a bytes key:

# Bytes value (example ~56 GB)
# 56 * 1024 * 1024 * 1024 = 60129542144
sudo sysctl debug.iogpu.wired_limit=60129542144
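
The two procedures differ only in the key name and the unit (MB on Sonoma 14+, bytes on Ventura/Monterey). The sketch below picks the right pair from the macOS major version. `wired_limit_assignment` is a hypothetical helper, and it assumes the version boundary is exactly as described in this note.

```shell
# wired_limit_assignment MACOS_MAJOR GIB
# Print the key=value pair to pass to sysctl for the given macOS major
# version (hypothetical helper; key names are the ones used in this note).
wired_limit_assignment() {
  major=$1 gib=$2
  if [ "$major" -ge 14 ]; then
    # Sonoma and later: value is in MB
    echo "iogpu.wired_limit_mb=$(( gib * 1024 ))"
  else
    # Ventura/Monterey: value is in bytes
    echo "debug.iogpu.wired_limit=$(( gib * 1024 * 1024 * 1024 ))"
  fi
}

# Usage on a Mac (not run here):
#   major=$(sw_vers -productVersion | cut -d. -f1)
#   sudo sysctl "$(wired_limit_assignment "$major" 120)"
```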

Verify

  • Ollama / llama.cpp logs should show a higher ggml_metal_init: recommendedMaxWorkingSetSize = XXXXX MB

  • Read back the sysctl:

    sysctl iogpu.wired_limit_mb

  • Watch Activity Monitor → Memory Pressure while running the model.
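
To check the first item without reading the log by eye, the reported value can be pulled out with `sed`. This is a sketch that assumes the log line has the shape shown above (`recommendedMaxWorkingSetSize = <number> MB`); `extract_working_set_mb` and the log file name are hypothetical.

```shell
# extract_working_set_mb: read log text on stdin and print the number that
# follows "recommendedMaxWorkingSetSize =" (last match wins).
# Hypothetical helper; the log format is assumed from the line quoted above.
extract_working_set_mb() {
  sed -n 's/.*recommendedMaxWorkingSetSize *= *\([0-9.]*\).*/\1/p' | tail -n 1
}

# Usage (not run here), with llm.log standing in for wherever your tool logs:
#   extract_working_set_mb < llm.log
```

If the printed number is still near the default cap after you raised the limit, the tool was probably started before the sysctl change and needs a restart.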


Persist Across Reboots (Optional)

Editing /etc/sysctl.conf requires root and, per the source, may also require disabling SIP. If that is too invasive, re-run the one-line sysctl command manually after each boot instead.

# /etc/sysctl.conf (create if missing)
# Sonoma 14+ (MB)
iogpu.wired_limit_mb=122880

Reboot to apply.
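
If you edit the file by script rather than by hand, it helps to replace any earlier entry instead of appending a duplicate. The following is a minimal sketch; `set_persistent_limit` is a hypothetical helper, and it takes the file path as an argument so it can be tried on a scratch copy before touching /etc/sysctl.conf.

```shell
# set_persistent_limit FILE MB: drop any existing iogpu.wired_limit_mb line
# from FILE, then append the new value. FILE is created if missing.
# Hypothetical helper; run as root when FILE is /etc/sysctl.conf.
set_persistent_limit() {
  file=$1 mb=$2
  touch "$file" || return 1
  tmp=$(mktemp) || return 1
  # grep -v exits non-zero when nothing is left to print; that is fine here
  grep -v '^iogpu\.wired_limit_mb=' "$file" > "$tmp" || true
  echo "iogpu.wired_limit_mb=$mb" >> "$tmp"
  # Write back through cat so the file keeps its owner and permissions
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (not run here), on a scratch copy first:
#   set_persistent_limit ./sysctl.conf.test 122880
```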


Revert

# Reset to system default (~75%)
sudo sysctl iogpu.wired_limit_mb=0

# If persisted, remove the line from /etc/sysctl.conf and reboot.

Operational Tips & Cautions

  • Do not set to 100% of RAM—reserve headroom (8–16 GB).
  • If Memory Pressure turns yellow/red or system starts swapping, lower the cap.
  • Changes are unsupported by Apple; future macOS updates may alter behavior.
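
The headroom rule can be checked before applying a value. This sketch assumes the 8 GiB minimum suggested above as its default; `check_cap` is a hypothetical helper name.

```shell
# check_cap TOTAL_MB CAP_MB [MIN_HEADROOM_MB]
# Succeed only if the cap leaves at least MIN_HEADROOM_MB for the rest of
# the system (default 8192 MB, i.e. 8 GiB). Hypothetical helper.
check_cap() {
  total_mb=$1 cap_mb=$2 min_headroom_mb=${3:-8192}
  headroom_mb=$(( total_mb - cap_mb ))
  if [ "$headroom_mb" -lt "$min_headroom_mb" ]; then
    echo "cap leaves only ${headroom_mb} MB of headroom" >&2
    return 1
  fi
  echo "ok: ${headroom_mb} MB of headroom"
}

# Usage on a Mac (not run here); hw.memsize reports total RAM in bytes:
#   total_mb=$(( $(sysctl -n hw.memsize) / 1048576 ))
#   check_cap "$total_mb" 122880 && sudo sysctl iogpu.wired_limit_mb=122880
```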

Notes for Local LLM Usage

  • Raising the cap helps keep the entire model plus its KV cache on the GPU, which raises tokens/sec.
  • Quantization (e.g., 4–8 bit) reduces the memory footprint; balance context length against available memory.
  • Hybrid fallback is possible (layers that do not fit run on the CPU), but it slows generation.
  • Keep Ollama/llama.cpp updated; Metal/MPS backends evolve frequently.
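
For a rough sense of whether a model fits under a given cap, the size of the weights alone can be estimated as parameters × bits per weight / 8. The sketch below is a lower bound only: it assumes exactly the stated bits per parameter and ignores the KV cache, quantization metadata, and runtime buffers, so real usage will be higher. `weights_mib` is a hypothetical helper name.

```shell
# weights_mib PARAMS_BILLIONS BITS
# Approximate size of the weights alone, in MiB, assuming exactly BITS bits
# per parameter. Ignores KV cache and runtime overhead (hypothetical helper).
weights_mib() {
  echo $(( $1 * 1000000000 * $2 / 8 / 1048576 ))
}

# Usage (not run here):
#   weights_mib 70 4     # prints 33378 (about 32.6 GiB)
#   weights_mib 70 8     # prints 66757 (about 65.2 GiB)
```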

Credit: This procedure and context are derived from Greg Stencel’s post, "Apple Silicon limitations with usage on local LLM": https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html
