Based on: Greg Stencel, Apple Silicon limitations with usage on local LLM Source: stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm .html
On Apple Silicon, Metal exposes a recommendedMaxWorkingSetSize, typically about 75% of unified RAM. Tools like Ollama/llama.cpp (MPS/Metal) treat this value as a hard ceiling on GPU memory, e.g. ~96 GB on a 128 GB M1 Ultra. You can raise the cap with an undocumented sysctl kernel parameter so that larger models fit entirely on the GPU.
- Prerequisites
- Quick Start (Sonoma 14+)
- Choose a Safe Limit
- Set the Limit (Sonoma 14+)
- Set the Limit (Ventura/Monterey)
- Verify
- Persist Across Reboots (Optional)
- Revert
- Operational Tips & Cautions
- Notes for Local LLM Usage
- Apple Silicon Mac with unified memory.
- macOS Sonoma (14+) recommended; older versions supported with a different key.
- Admin privileges for sudo.
- Willingness to accept unsupported/undocumented tweaks (may impact stability).
Increase Metal’s working set so larger LLMs stay on-GPU:
# Example: set ~120 GB on a 128 GB machine (122880 MB)
sudo sysctl iogpu.wired_limit_mb=122880
No reboot required. See Verify.
Leave 8–16 GB for macOS and other processes to avoid memory pressure.
| Total RAM | Typical Default Cap (~75%) | Example Raised Cap | MB Value |
|---|---|---|---|
| 32 GB | ~21–24 GB | ~28–30 GB | 28672–30720 |
| 64 GB | ~48 GB | ~56 GB | 57344 |
| 128 GB | ~96 GB | ~120 GB | 122880 |
Formula:
desired_MB = desired_GiB * 1024
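The formula and the headroom advice above can be combined into a small calculation. The following is a minimal sketch, not part of the original procedure: the function names `gib_to_mb` and `safe_limit_mb` are hypothetical, and the 8 GiB minimum simply mirrors the low end of the 8–16 GB headroom suggested above.

```shell
# Convert a size in GiB to the MB value used by iogpu.wired_limit_mb.
gib_to_mb() {
  echo $(( $1 * 1024 ))
}

# Hypothetical helper: total RAM minus headroom (both in GiB), printed in MB.
# Refuses headroom below 8 GiB, the low end of the range recommended above.
safe_limit_mb() {
  total_gib=$1
  headroom_gib=${2:-8}
  if [ "$headroom_gib" -lt 8 ]; then
    echo "headroom below 8 GiB is not recommended" >&2
    return 1
  fi
  gib_to_mb $(( total_gib - headroom_gib ))
}

safe_limit_mb 128 8   # prints 122880
safe_limit_mb 64 8    # prints 57344
```

The printed values match the 128 GB and 64 GB rows of the table; on a machine that struggles under memory pressure, passing a larger headroom such as 16 gives a more conservative cap.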
Use the MB-based sysctl:
# Check current value (0 = system default ~75%)
sysctl iogpu.wired_limit_mb
# Set new cap (e.g., 120 GB)
sudo sysctl iogpu.wired_limit_mb=122880
Older macOS used a bytes key:
# Bytes value (example ~56 GB)
# 56 * 1024 * 1024 * 1024 = 60129542144
sudo sysctl debug.iogpu.wired_limit=60129542144
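Because the two macOS generations differ in both key name and unit, it can help to derive the assignment from the macOS major version. This is a sketch under the assumptions stated in this guide (MB key on 14+, bytes key before that); the function name `wired_limit_assignment` is hypothetical, and it only prints the assignment rather than applying it.

```shell
# Print the sysctl assignment for a given macOS major version and cap in GiB.
# Assumes, as described above: 14+ uses iogpu.wired_limit_mb (MB),
# older releases use debug.iogpu.wired_limit (bytes).
wired_limit_assignment() {
  macos_major=$1
  cap_gib=$2
  if [ "$macos_major" -ge 14 ]; then
    echo "iogpu.wired_limit_mb=$(( cap_gib * 1024 ))"
  else
    echo "debug.iogpu.wired_limit=$(( cap_gib * 1024 * 1024 * 1024 ))"
  fi
}

wired_limit_assignment 14 120   # prints iogpu.wired_limit_mb=122880
wired_limit_assignment 13 56    # prints debug.iogpu.wired_limit=60129542144
```

On a real machine the major version could be taken from `sw_vers -productVersion | cut -d. -f1`, and the printed assignment passed to `sudo sysctl`.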
- Ollama / llama.cpp logs should show a higher value:
  ggml_metal_init: recommendedMaxWorkingSetSize = XXXXX MB
- Read back the sysctl:
  sysctl iogpu.wired_limit_mb
- Watch Activity Monitor → Memory Pressure while running the model.
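To compare the logged value against the cap you set, the number can be pulled out of the log line with a short filter. This is a minimal sketch: the function name `working_set_mb` is hypothetical, the sample line is made up for illustration and follows the format shown above, and real logs may format the number differently.

```shell
# Read log text on stdin and print the integer part of the MB value
# reported after "recommendedMaxWorkingSetSize =".
working_set_mb() {
  sed -n 's/.*recommendedMaxWorkingSetSize *= *\([0-9][0-9]*\).*/\1/p'
}

# Sample line for illustration only, not captured from a real run.
echo "ggml_metal_init: recommendedMaxWorkingSetSize = 122880.00 MB" | working_set_mb
# prints 122880
```

If the printed number is still near the default after raising the cap, the runtime was probably started before the sysctl change and needs a restart.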
Editing system files may require disabling SIP. Because a value set with sysctl does not survive a reboot, consider simply re-running the one-liner manually when needed instead.
# /etc/sysctl.conf (create if missing)
# Sonoma 14+ (MB)
iogpu.wired_limit_mb=122880
Reboot to apply.
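If you do persist the setting, it is easy to end up with duplicate or stale lines after changing the value a few times. The sketch below shows one way to keep a single entry; the function name `set_wired_limit_line` is hypothetical, and the demo deliberately writes to a temporary file rather than /etc/sysctl.conf.

```shell
# Hypothetical helper: store iogpu.wired_limit_mb=<mb> in a sysctl.conf-style
# file, replacing any earlier assignment of that key instead of duplicating it.
set_wired_limit_line() {
  conf_file=$1
  limit_mb=$2
  tmp_file=$(mktemp) || return 1
  # Keep every existing line except a previous assignment of the same key.
  if [ -f "$conf_file" ]; then
    grep -v '^iogpu\.wired_limit_mb=' "$conf_file" > "$tmp_file" || true
  fi
  echo "iogpu.wired_limit_mb=$limit_mb" >> "$tmp_file"
  cat "$tmp_file" > "$conf_file"
  rm -f "$tmp_file"
}

# Demo against a scratch file: the second call replaces the first value.
demo_file=$(mktemp)
set_wired_limit_line "$demo_file" 57344
set_wired_limit_line "$demo_file" 122880
cat "$demo_file"   # prints iogpu.wired_limit_mb=122880
rm -f "$demo_file"
```

Pointing the helper at the real file would require root, and whether the file is honored at boot depends on the macOS version, so verify after rebooting.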
# Reset to system default (~75%)
sudo sysctl iogpu.wired_limit_mb=0
# If persisted, remove the line from /etc/sysctl.conf and reboot.
- Do not set to 100% of RAM; reserve headroom (8–16 GB).
- If Memory Pressure turns yellow/red or system starts swapping, lower the cap.
- Changes are unsupported by Apple; future macOS updates may alter behavior.
- Raising the cap helps keep entire models + KV cache on GPU for faster tokens/sec.
- Quantization (e.g., 4–8 bit) reduces footprint; balance context length vs memory.
- Hybrid fallback is possible (overflow on CPU) but slows generation.
- Keep Ollama/llama.cpp updated; Metal/MPS backends evolve frequently.
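A rough size estimate helps decide whether a given model and quantization can fit under the cap at all. The sketch below uses the simple approximation weights ≈ parameters × bits / 8 bytes; it ignores per-format quantization overhead, and the KV cache and runtime overhead are left as a caller-supplied allowance. The function names `weights_mib` and `fits_under_cap` are hypothetical.

```shell
# Approximate weight size in MiB for a model with <billions> of parameters
# stored at <bits> per weight: parameters * bits / 8 bytes, converted to MiB.
weights_mib() {
  echo $(( $1 * 1000000000 * $2 / 8 / 1048576 ))
}

# Succeeds when weights plus a caller-supplied allowance (KV cache, overhead),
# both in MiB, fit under the given cap in MiB.
fits_under_cap() {
  params_b=$1
  bits=$2
  extra_mib=$3
  cap_mib=$4
  [ $(( $(weights_mib "$params_b" "$bits") + extra_mib )) -le "$cap_mib" ]
}

weights_mib 70 4   # prints 33378
weights_mib 70 8   # prints 66757

if fits_under_cap 70 8 8192 122880; then
  echo "70B at 8-bit with an 8 GiB allowance fits under a 120 GB cap"
fi
```

Longer context lengths grow the KV cache, so the allowance should be raised accordingly when balancing context length against memory.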
Credit: This procedure and context are derived from Greg Stencel’s post: Apple Silicon limitations with usage on local LLM — https://stencel.io/posts/apple-silicon-limitations-with-usage-on-local-llm%20.html