Research — March 2026

Local AI Models on Apple Silicon

Autonomous coding agents on your Mac. Coding vs reasoning models, real-world RAM after macOS overhead, and tok/s by memory bandwidth.

Coding Models
Qwen3 Coder
Alibaba · Code-optimized MoE
Qwen3-Coder-30B-A3B · MoE · ~18 GB
Qwen3-Coder-Next 80B/3B · 70.6% SWE-bench · ~48 GB
Mistral / Devstral
Mistral AI · Code-first
Codestral 25.01 · 86.6% HumanEval · ~13 GB
Devstral Small 2 · 24B · new · ~14 GB
Devstral 2 (123B, 72.2% SWE-bench) is too large for a consumer Mac.
Reasoning Models
Qwen 3.5
Alibaba · Feb–Mar 2026 · Sonnet-class
Qwen3.5-27B · Dense · ~16 GB
Qwen3.5-35B-A3B · MoE · ~20 GB
Qwen3.5-122B-A10B · MoE · ~70 GB
Qwen3.5-397B-A17B · Frontier · ~230 GB
DeepSeek R1
DeepSeek · Dense Distillations
R1-Distill-14B · 14B · ~9 GB
R1-Distill-32B · 32B · ~20 GB
R1-Distill-70B · Best value · ~40 GB
The 70B at Q8 is ~70 GB. R2/V4 not yet released.
Phi-4
Microsoft · Beats DS-R1-70B on AIME
Phi-4-reasoning-plus · 77.7% AIME · ~10 GB
Phi-4-reasoning-vision · 15B multimodal · ~10 GB
68.8% LiveCodeBench. Outstanding for a 14B model.
Llama 4
Meta · MoE · April 2025
Llama 4 Scout · 109B/17B · ~12 GB*
Llama 4 Maverick · 400B/17B · ~230 GB
*Scout: 17B active params, 10M context window. Full weights still need to be loaded.
Gemma 3 · 27B Dense · ~15 GB
Google · MLX-optimized · QAT models available · runs 20-50% faster on MLX than on llama.cpp
macOS Memory Overhead

Metal caps GPU-allocatable memory at ~75-80% of unified memory. Reserve 8-12 GB for macOS + apps.

64 GB total unified memory → ~50 GB usable
128 GB total unified memory → ~100 GB usable
256 GB total unified memory → ~220 GB usable
A sysctl tweak (iogpu.wired_limit_mb) can raise the GPU limit on dedicated machines. Keep model weights at ≤60-70% of total RAM to leave headroom for KV cache + context; a rough fit check is sketched below.
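A minimal fit-check sketch of the two rules above. The ~75-80% Metal cap and the 60-70% weight budget are heuristics, and the function names and constants are illustrative, not from any library:

def usable_gb(total_gb: float, metal_cap: float = 0.78) -> float:
    """Rough GPU-allocatable memory under the default Metal wired limit (~75-80%)."""
    return total_gb * metal_cap

def fits_comfortably(model_gb: float, total_gb: float, weight_budget: float = 0.65) -> bool:
    """Rule of thumb: keep model weights within ~60-70% of total RAM so the
    rest stays free for KV cache, context, macOS, and apps."""
    return model_gb <= usable_gb(total_gb) and model_gb <= total_gb * weight_budget

# DS R1-Distill-70B at Q8 (~70 GB) fits a 128 GB machine with headroom...
print(fits_comfortably(70, 128))   # True  (70 <= ~100 GB usable, 70 <= ~83 GB budget)
# ...but a 64 GB machine cannot hold it at all (~50 GB usable).
print(fits_comfortably(70, 64))    # False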
RAM Footprint at Q4 — vs Usable Memory
Phi-4 Reasoning+ · 10 GB
Llama 4 Scout · 12 GB
Codestral 25.01 · 13 GB
Devstral Small 2 · 14 GB
Qwen3-Coder-30B · 18 GB
DS R1-Distill-32B · 20 GB
DS R1-Distill-70B · 40 GB
Qwen3-Coder-Next · 48 GB
DS R1-70B (Q8) · 70 GB
Qwen3-235B-A22B · 143 GB
Qwen3.5-397B · 230 GB
Memory Bandwidth & Tokens/Second
Token generation is memory-bandwidth-bound: tok/s ≈ bandwidth ÷ model size in bytes. Prefill is compute-bound (M5 neural accelerators: ~4x faster time to first token).
Chip · Bandwidth · 14B Q4 · 32B Q4 · 70B Q4 · 70B Q8
M4 Pro · 273 GB/s · ~45 t/s · ~22 t/s · won't fit · won't fit
M4 Max · 546 GB/s · ~80 t/s · ~40 t/s · ~28 t/s · ~16 t/s
M5 Max · 614 GB/s · ~90 t/s · ~48 t/s · ~32 t/s · ~18 t/s
M3 Ultra · 800 GB/s · ~110 t/s · ~55 t/s · ~50 t/s · ~28 t/s
M5 Ultra (expected) · ~1,228 GB/s · ~140 t/s · ~80 t/s · ~65 t/s · ~36 t/s
Approximate MLX tok/s for token generation. llama.cpp is 20-50% slower. MoE models are faster (only the active params are read per token); a rough sketch of the estimate follows below.
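A back-of-envelope version of the tok/s ≈ bandwidth ÷ model size rule. The constants (bits per weight, and treating decode as pure weight streaming) are assumptions, so it gives a ceiling rather than reproducing the table exactly:

def q4_weight_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate DRAM footprint of Q4 weights in GB (4-bit weights plus scales)."""
    return params_b * bits_per_weight / 8

def tok_s_ceiling(bandwidth_gb_s: float, active_params_b: float) -> float:
    """Decode re-reads the (active) weights once per generated token, so
    bandwidth divided by bytes read per token bounds tok/s from above."""
    return bandwidth_gb_s / q4_weight_gb(active_params_b)

# Dense 32B at Q4 on an M4 Max (546 GB/s): ~18 GB of weights streamed per token.
print(round(tok_s_ceiling(546, 32)))   # ~30
# Qwen3-Coder-30B-A3B on the same chip: only ~3B params are active per token,
# which is why MoE models decode far faster than their total size suggests.
print(round(tok_s_ceiling(546, 3)))    # ~324 (a ceiling; measured MoE throughput is much lower)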
Best Models by Mac Tier
Mac Mini
M4 Pro
64 GB · 273 GB/s
~50 GB usable
~$1,400
vs Sonnet: ~70%
  • Qwen3-Coder-30B-A3B · 69.6% SWE-bench · MoE · 18 GB · ~45 t/s (3B active)
  • Phi-4-reasoning-plus · 77.7% AIME · 10 GB · ~45 t/s
  • Codestral 25.01 · 86.6% HumanEval · 13 GB · ~35 t/s
Best combo: Qwen3-Coder-30B + Phi-4 together (28 GB, room to spare)
Mac Studio
M5 Ultra (expected mid-2026)
256 GB · ~1,228 GB/s
~220 GB usable
2x bandwidth of M5 Max
~$8,000 – $10,000
vs Sonnet: ~90-95%
  • Qwen3-235B-A22B (Q4) · Frontier MoE · 143 GB · 77 GB to spare · ~55 t/s (22B active)
  • Qwen3.5-397B-A17B · Latest flagship · ~230 GB · tight fit (likely needs the iogpu.wired_limit_mb bump) · ~70 t/s (17B active)
  • DS R1-70B FP16 + Coder-Next · Full precision · 140 GB + 48 GB = 188 GB · ~36 t/s (70B) · both models resident simultaneously
Best: Qwen3-235B-A22B (143 GB) + Coder-Next (48 GB) = 191 GB. Frontier reasoning + coding, fully local.
Inference Stack — Performance Hierarchy
Runner · Best For · Speed
vllm-mlx · Production serving, continuous batching, paged KV cache · Fastest overall: 2x llama.cpp, 6x Ollama (8 concurrent)
MLX (mlx-lm) · Single-stream inference, zero-copy unified memory · 20-50% faster than llama.cpp; purpose-built for Metal
MLC-LLM · Cross-platform deployment · Slightly behind MLX
Ollama · Easiest setup, OpenAI-compatible API · llama.cpp backend; best DX for agent integration
llama.cpp · Portable C++ with Metal backend · Good, but Metal is an add-on, not native like MLX
llamafile · Single-file distribution (cosmopolitan libc) · Same as llama.cpp; convenience, not speed
No C++ engine beats MLX on Apple Silicon. MLX was built by Apple specifically for unified memory + Metal. llama.cpp added Metal support retroactively.
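For the single-stream MLX path, a minimal mlx-lm sketch. It assumes the pip-installable mlx-lm package; the checkpoint name is a placeholder for whichever MLX-community 4-bit conversion you actually use:

# pip install mlx-lm   (Apple Silicon / Metal only)
from mlx_lm import load, generate

# Placeholder checkpoint name; substitute the quantized model you pulled.
model, tokenizer = load("mlx-community/Qwen2.5-Coder-32B-Instruct-4bit")

prompt = "Write a Python function that merges two sorted lists."
# verbose=True prints generation stats, including measured tok/s.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(text)

For agent integration, Ollama's OpenAI-compatible API (default port 11434) is usually the lower-friction route; see the triage sketch under the solo-developer stack below.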
Solo Developer Recommendation
M5 Max MacBook Pro
128 GB is the sweet spot.

DS R1-70B Q8 (70 GB) + Phi-4 (10 GB) = 80 GB, leaving 48 GB for KV cache + context. 614 GB/s bandwidth gives ~18 t/s on 70B. Neural accelerators cut TTFT by 4x vs M4.

~$5,200
TOTAL STARTUP
ROI in ~17 months vs API
Hardware
MacBook Pro M5 Max
128 GB · 614 GB/s · 40-core GPU
Stack
MLX + Ollama
vllm-mlx for serving
Primary Reasoning
DS R1-Distill-70B (Q8)
~70 GB · ~18 t/s
Primary Coding
Qwen3-Coder-Next
~48 GB · ~90+ t/s (MoE)
Fast Triage
Phi-4-reasoning-plus
14B · ~10 GB · ~90 t/s
API Budget
$20–50/mo for hard tasks
Claude / frontier calls
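A hedged sketch of that triage pattern: route most calls to the local Ollama endpoint and escalate only flagged-hard tasks to a paid frontier API. The model tags and the hardness flag are placeholders, not a prescribed setup:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint on localhost:11434;
# the api_key only needs to be a non-empty string.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
frontier = OpenAI()  # reads OPENAI_API_KEY; swap in your frontier provider's client

def run_task(prompt: str, hard: bool = False) -> str:
    client = frontier if hard else local
    # "phi4-reasoning" is a placeholder tag; use whatever `ollama list` shows locally,
    # and a real frontier model name for the hard path.
    model = "frontier-model" if hard else "phi4-reasoning"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(run_task("Plan a repo-wide variable rename, step by step."))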
Why not 64 GB Mac Mini?
~50 GB usable. 273 GB/s = slow on 70B. 32B models need babysitting. Good as a secondary inference server.
Why not 256 GB Mac Studio?
M5 Ultra not shipping yet (mid-2026). Double the cost for ~10-15% better output. Buy when revenue justifies it.