macOS-native MLX server with smart caching. Claude Code, OpenClaw, and Cursor respond in 5 seconds, not 90.
Coding agents invalidate the KV cache dozens of times per session. 4mac persists every cache block to SSD — so when the agent circles back to a previous prefix, it's restored from disk in milliseconds, not recomputed from scratch.
Cache blocks are persisted to disk in safetensors format. Two-tier architecture: hot blocks stay in RAM, cold blocks go to SSD with LRU policy. Previously seen prefixes are restored across requests and server restarts — never recomputed.
Handles concurrent requests through mlx-lm's BatchGenerator. Up to 4.14x generation speedup at 8x concurrency. No more queuing behind a single request.
Start, stop, and monitor the server from your menu bar. Web dashboard for model management, chat, and real-time metrics. Signed, notarized, with in-app auto-update. Not Electron.
LLM, VLM, embedding, and reranker models loaded simultaneously. LRU eviction when memory runs low. Browse and download models directly from the admin dashboard.
Compatible with Claude Code, OpenClaw, Cursor, and any OpenAI-compatible client. Native /v1/messages Anthropic endpoint. Web dashboard generates the exact config command for each tool.
Supports all major tool calling formats: JSON, Qwen, Gemma, GLM, MiniMax. MCP tool integration and tool result trimming for oversized outputs. Configurable per model.
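Tool result trimming can be illustrated with a small sketch: keep the head and tail of an oversized result and mark what was cut. The function name, limits, and strategy here are hypothetical; 4mac's actual behavior is configurable per model.

```python
def trim_tool_result(text: str, max_chars: int = 4000) -> str:
    """Trim an oversized tool result, preserving its head and tail.
    Illustrative only -- real limits and strategy are configurable."""
    if len(text) <= max_chars:
        return text
    marker = f"\n...[{len(text) - max_chars} chars trimmed]...\n"
    head = max_chars // 2
    tail = max_chars - head
    return text[:head] + marker + text[-tail:]
```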
All benchmarks on M3 Ultra 512GB. Single request and continuous batching across four popular models.
MiniMax-M2.5-8bit · M3 Ultra 512GB
| CONTEXT | PROMPT TPS | TOKEN TPS | PEAK MEM |
|---|---|---|---|
| 1k | 588 tok/s | 34.0 tok/s | 227 GB |
| 4k | 704 tok/s | 30.3 tok/s | 228 GB |
| 8k | 663 tok/s | 26.3 tok/s | 229 GB |
| 32k | 426 tok/s | 14.9 tok/s | 235 GB |
pp1024 / tg128 · no cache reuse
| BATCH | TOKEN TPS | SPEEDUP |
|---|---|---|
| 1x | 34.0 tok/s | 1.00x |
| 2x | 49.7 tok/s | 1.46x |
| 4x | 109.8 tok/s | 3.23x |
| 8x | 126.3 tok/s | 3.71x |
Qwen3.5-122B-A10B-4bit · M3 Ultra 512GB
| CONTEXT | PROMPT TPS | TOKEN TPS | PEAK MEM |
|---|---|---|---|
| 1k | 768 tok/s | 56.6 tok/s | 65.5 GB |
| 8k | 941 tok/s | 54.0 tok/s | 69 GB |
| 16k | 886 tok/s | 48.3 tok/s | 71 GB |
| 32k | 765 tok/s | 42.4 tok/s | 73 GB |
pp1024 / tg128 · no cache reuse
| BATCH | TOKEN TPS | SPEEDUP |
|---|---|---|
| 1x | 56.6 tok/s | 1.00x |
| 2x | 92.1 tok/s | 1.63x |
| 4x | 135.1 tok/s | 2.39x |
| 8x | 190.2 tok/s | 3.36x |
Qwen3-Coder-Next-8bit · M3 Ultra 512GB
| CONTEXT | PROMPT TPS | TOKEN TPS | PEAK MEM |
|---|---|---|---|
| 1k | 1,462 tok/s | 58.7 tok/s | 80 GB |
| 8k | 2,009 tok/s | 54.9 tok/s | 83 GB |
| 16k | 1,896 tok/s | 52.3 tok/s | 83 GB |
| 32k | 1,624 tok/s | 45.1 tok/s | 85 GB |
pp1024 / tg128 · no cache reuse
| BATCH | TOKEN TPS | SPEEDUP |
|---|---|---|
| 1x | 58.7 tok/s | 1.00x |
| 2x | 100.5 tok/s | 1.71x |
| 4x | 164.0 tok/s | 2.79x |
| 8x | 243.3 tok/s | 4.14x |
GLM-5-4bit · M3 Ultra 512GB
| CONTEXT | PROMPT TPS | TOKEN TPS | PEAK MEM |
|---|---|---|---|
| 1k | 187 tok/s | 16.7 tok/s | 392 GB |
| 4k | 180 tok/s | 13.7 tok/s | 394 GB |
| 16k | 117 tok/s | 12.0 tok/s | 403 GB |
| 32k | 78 tok/s | 10.7 tok/s | 415 GB |
pp1024 / tg128 · no cache reuse
| BATCH | TOKEN TPS | SPEEDUP |
|---|---|---|
| 1x | 16.7 tok/s | 1.00x |
| 2x | 23.7 tok/s | 1.42x |
| 4x | 47.0 tok/s | 2.81x |
| 8x | 60.3 tok/s | 3.61x |
"The Qwen3.5 models running on 4mac are so fast that they make running local AI on a Mac worthwhile. It is so much faster than LM Studio and the tool calling is so much more reliable."
4mac is built exclusively for Apple Silicon using the native MLX framework, maximizing Unified Memory bandwidth. Unlike Ollama and LM Studio (which rely on llama.cpp), 4mac provides a deeply integrated macOS native experience with Paged SSD KV caching for continuous coding agent sessions.
Any Apple Silicon Mac (M1/M2/M3/M4 series). For 70B+ models, we recommend at least 64 GB of Unified Memory.
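As a rough back-of-envelope for sizing, weight memory scales with parameter count times quantization width (this estimate covers weights only, before KV cache and activation overhead):

```python
def weight_footprint_gb(params_billions: float, bits: int) -> float:
    """Rough weight-only memory estimate: params * (bits / 8) bytes."""
    return params_billions * 1e9 * bits / 8 / 1e9

# A 70B model at 4-bit quantization needs roughly 35 GB for weights alone,
# which is why 64 GB of Unified Memory is a comfortable floor.
```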
Yes. 4mac is a native drop-in replacement for the OpenAI and Anthropic APIs. Just point your client's base URL to http://localhost:8000/v1 and it works out of the box with full tool-calling support.
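For example, most clients can be redirected with a single environment variable. The variable names below follow common client conventions (Claude Code reads `ANTHROPIC_BASE_URL`; OpenAI SDKs read `OPENAI_BASE_URL`); check your tool's documentation if it uses a different setting:

```shell
# Claude Code -> 4mac's native /v1/messages endpoint
export ANTHROPIC_BASE_URL="http://localhost:8000"

# OpenAI-compatible clients and SDKs
export OPENAI_BASE_URL="http://localhost:8000/v1"
```

The web dashboard generates the exact command for each supported tool.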
No! 4mac respects your existing LM Studio downloads folder. You can browse and load all your previously downloaded safetensors models effortlessly.
We support virtually all modern architectures uploaded to HuggingFace in MLX format, including Qwen, Llama 3, Mistral, GLM, MiniMax, and DeepSeek.
Download the DMG or install from source. Reuses your existing LM Studio model directory — no re-download needed.
Drag to Applications. The welcome screen walks you through choosing a model directory, starting the server, and downloading your first model. Signed and notarized.
Download DMG