Autonomous coding agents on your Mac: coding vs reasoning models, real-world RAM after macOS overhead, and tok/s as a function of memory bandwidth. Decode on Apple Silicon is memory-bandwidth-bound, so estimated throughput scales almost linearly with GB/s:
| Chip | Bandwidth | 14B Q4 | 32B Q4 | 70B Q4 | 70B Q8 |
|---|---|---|---|---|---|
| M4 Pro | 273 GB/s | ~45 t/s | ~22 t/s | won't fit | won't fit |
| M4 Max | 546 GB/s | ~80 t/s | ~40 t/s | ~28 t/s | ~16 t/s |
| M5 Max | 614 GB/s | ~90 t/s | ~48 t/s | ~32 t/s | ~18 t/s |
| M3 Ultra | 819 GB/s | ~110 t/s | ~55 t/s | ~50 t/s | ~28 t/s |
| M5 Ultra (exp.) | ~1,228 GB/s | ~140 t/s | ~80 t/s | ~65 t/s | ~36 t/s |
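
These figures can be sanity-checked with back-of-envelope arithmetic: generating one token streams every weight through the memory bus once, so decode speed is roughly bandwidth divided by model size. A minimal sketch, assuming illustrative model sizes and a 0.7 bandwidth-efficiency factor (both assumptions, not measurements; real numbers vary with quant format and kernel quality):

```python
# Back-of-envelope decode-speed estimate:
#     t/s ~= bandwidth / model_size * efficiency

GB = 1e9

MODELS = {           # rough on-disk sizes (assumed)
    "14B Q4": 9 * GB,
    "32B Q4": 19 * GB,
    "70B Q4": 40 * GB,
    "70B Q8": 70 * GB,
}

CHIPS = {            # bandwidths from the table above
    "M4 Pro": 273 * GB,
    "M4 Max": 546 * GB,
    "M3 Ultra": 819 * GB,
}

EFFICIENCY = 0.7     # real runs reach ~50-80% of theoretical bandwidth

for chip, bw in CHIPS.items():
    row = ", ".join(
        f"{name}: ~{bw / size * EFFICIENCY:.0f} t/s"
        for name, size in MODELS.items()
    )
    print(f"{chip}: {row}")
```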
Runner choice matters nearly as much as the chip:

| Runner | Best For | Relative Speed |
|---|---|---|
| vllm-mlx | Production serving, continuous batching, paged KV cache | Fastest overall · 2x llama.cpp · 6x Ollama (8 concurrent) |
| MLX (mlx-lm) | Single-stream inference, zero-copy unified memory | 20-50% faster than llama.cpp · purpose-built for Metal |
| MLC-LLM | Cross-platform deployment | Slightly behind MLX |
| Ollama | Easiest setup, OpenAI-compatible API | llama.cpp backend · best DX for agent integration |
| llama.cpp | Portable C++ with Metal backend | Good, but Metal is an add-on rather than native as in MLX |
| llamafile | Single-file distribution (cosmopolitan libc) | Same as llama.cpp · convenience, not speed |
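
Ollama's agent-integration advantage is its OpenAI-compatible endpoint, served at `http://localhost:11434/v1` by `ollama serve`, which lets any OpenAI-client-based agent point at a local model unchanged. A minimal example with the standard `openai` client; the model name is illustrative (pull it first with `ollama pull qwen2.5-coder:14b`):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible route
    api_key="ollama",  # required by the client library, ignored by Ollama
)

resp = client.chat.completions.create(
    model="qwen2.5-coder:14b",
    messages=[
        {"role": "user", "content": "Refactor this loop into a list comprehension."}
    ],
)
print(resp.choices[0].message.content)
```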
A concrete dual-model budget on a 128 GB M5 Max: DeepSeek R1 70B at Q8 (~70 GB) plus Phi-4 (~10 GB) totals ~80 GB, leaving ~48 GB on paper for KV cache and context (less once macOS overhead is counted). At 614 GB/s, the 70B model decodes at roughly 18 t/s, and the M5's GPU neural accelerators cut time-to-first-token by about 4x versus the M4 generation, since prefill is compute-bound rather than bandwidth-bound.
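
The budget made explicit, as a sketch: the 128 GB total is implied by 80 + 48 above, but the 16 GB macOS reserve is an assumption, and the `sysctl` knob mentioned in the comments is the standard way to raise the GPU wired-memory cap on recent macOS if the default limit is too conservative.

```python
TOTAL_GB = 128
MACOS_RESERVE_GB = 16  # rough allowance for the OS and other apps (assumption)

MODELS_GB = {"DeepSeek-R1-70B Q8": 70, "Phi-4": 10}

# On recent macOS, `sudo sysctl iogpu.wired_limit_mb=<MB>` can raise the
# GPU wired-memory cap above its default fraction of total RAM.
usable = TOTAL_GB - MACOS_RESERVE_GB
kv_budget = usable - sum(MODELS_GB.values())
print(f"usable after overhead: {usable} GB")           # -> 112 GB
print(f"left for KV cache + context: {kv_budget} GB")  # -> 32 GB
```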