Local AI, no more waiting
on your Mac.

macOS-native MLX server with smart caching. Claude Code, OpenClaw, and Cursor respond in 5 seconds, not 90.

Download DMG
Apple Silicon  ·  macOS 15+

Serving Stats

TOTAL TOKENS PROCESSED
8,785,934
CACHED TOKENS
8,021,760
CACHE EFFICIENCY
91.3%
941
tok/s prompt processing
3.36x
throughput with batching
<5s
TTFT from 2nd turn
SSD KV cache (no eviction)
WHY 4MAC

Built for the way
agents actually work.

Coding agents invalidate the KV cache dozens of times per session. 4mac persists every cache block to SSD — so when the agent circles back to a previous prefix, it's restored from disk in milliseconds, not recomputed from scratch.

01 — CORE

Paged SSD KV caching

Cache blocks are persisted to disk in safetensors format. Two-tier architecture: hot blocks stay in RAM, cold blocks spill to SSD under an LRU policy. Previously seen prefixes are restored across requests and server restarts — never recomputed.

02 — THROUGHPUT

Continuous batching

Handles concurrent requests through mlx-lm's BatchGenerator. Up to 4.14x generation speedup at 8x concurrency. No more queuing behind a single request.

03 — APP

Native macOS menu bar app

Start, stop, and monitor the server from your menu bar. Web dashboard for model management, chat, and real-time metrics. Signed, notarized, with in-app auto-update. Not Electron.

04 — MODELS

Multi-model serving

LLM, VLM, embedding, and reranker models loaded simultaneously. LRU eviction when memory runs low. Browse and download models directly from the admin dashboard.

05 — API

OpenAI + Anthropic drop-in

Compatible with Claude Code, OpenClaw, Cursor, and any OpenAI-compatible client. Native /v1/messages Anthropic endpoint. Web dashboard generates the exact config command for each tool.
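A minimal sketch of what "drop-in" means, using only the Python standard library. The model name `qwen3-coder` is a placeholder — use whichever model your server has loaded.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # local 4mac server

def chat_request(prompt: str, model: str = "qwen3-coder") -> request.Request:
    """Build a standard /v1/chat/completions request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Explain KV caching in one sentence.")
# response = request.urlopen(req)  # uncomment with the server running
```

Any OpenAI SDK works the same way — point its base URL at `http://localhost:8000/v1` and leave the rest of your code unchanged.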

06 — TOOLS

Tool calling + MCP

Supports all major tool calling formats: JSON, Qwen, Gemma, GLM, MiniMax. MCP tool integration and tool result trimming for oversized outputs. Configurable per model.
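For illustration, here is the standard OpenAI-style tool schema and a tiny dispatcher for the tool call a model sends back. The `read_file` tool is hypothetical, not part of 4mac.

```python
import json

# Hypothetical tool, declared in the standard OpenAI tool-calling format
# that OpenAI-compatible servers accept in the request's "tools" field.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def dispatch(tool_call: dict, registry: dict) -> str:
    """Run the function a model's tool_call names, with its JSON arguments."""
    fn = registry[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return fn(**args)

# A tool_call as it would appear in the model's response:
call = {"function": {"name": "read_file",
                     "arguments": json.dumps({"path": "README.md"})}}
result = dispatch(call, {"read_file": lambda path: f"<contents of {path}>"})
```

The server's job is to translate each model family's native format (Qwen, Gemma, GLM, MiniMax, plain JSON) into this one shape, so the client-side loop above never changes.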

PERFORMANCE

Real numbers,
real hardware.

All benchmarks on M3 Ultra 512GB. Single-request and continuous-batching results across four popular models.

Single request performance

MiniMax-M2.5-8bit  ·  M3 Ultra 512GB

CONTEXT   PROMPT TPS   TOKEN TPS    PEAK MEM
1k        588 tok/s    34.0 tok/s   227 GB
4k        704 tok/s    30.3 tok/s   228 GB
8k        663 tok/s    26.3 tok/s   229 GB
32k       426 tok/s    14.9 tok/s   235 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

BATCH   TOKEN TPS     SPEEDUP
1x      34.0 tok/s    1.00x
2x      49.7 tok/s    1.46x
4x      109.8 tok/s   3.23x
8x      126.3 tok/s   3.71x

Single request performance

Qwen3.5-122B-A10B-4bit  ·  M3 Ultra 512GB

CONTEXT   PROMPT TPS   TOKEN TPS    PEAK MEM
1k        768 tok/s    56.6 tok/s   65.5 GB
8k        941 tok/s    54.0 tok/s   69 GB
16k       886 tok/s    48.3 tok/s   71 GB
32k       765 tok/s    42.4 tok/s   73 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

BATCH   TOKEN TPS     SPEEDUP
1x      56.6 tok/s    1.00x
2x      92.1 tok/s    1.63x
4x      135.1 tok/s   2.39x
8x      190.2 tok/s   3.36x

Single request performance

Qwen3-Coder-Next-8bit  ·  M3 Ultra 512GB

CONTEXT   PROMPT TPS     TOKEN TPS    PEAK MEM
1k        1,462 tok/s    58.7 tok/s   80 GB
8k        2,009 tok/s    54.9 tok/s   83 GB
16k       1,896 tok/s    52.3 tok/s   83 GB
32k       1,624 tok/s    45.1 tok/s   85 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

BATCH   TOKEN TPS     SPEEDUP
1x      58.7 tok/s    1.00x
2x      100.5 tok/s   1.71x
4x      164.0 tok/s   2.79x
8x      243.3 tok/s   4.14x

Single request performance

GLM-5-4bit  ·  M3 Ultra 512GB

CONTEXT   PROMPT TPS   TOKEN TPS    PEAK MEM
1k        187 tok/s    16.7 tok/s   392 GB
4k        180 tok/s    13.7 tok/s   394 GB
16k       117 tok/s    12.0 tok/s   403 GB
32k       78 tok/s     10.7 tok/s   415 GB

Continuous batching

pp1024 / tg128  ·  no cache reuse

BATCH   TOKEN TPS    SPEEDUP
1x      16.7 tok/s   1.00x
2x      23.7 tok/s   1.42x
4x      47.0 tok/s   2.81x
8x      60.3 tok/s   3.61x

"The Qwen3.5 models running on 4mac is so fast that it makes running local AI on Mac worthwhile. It is so much faster than LMStudio and the tool calling is so much more reliable."

— GitHub comment, issue #62

FAQ

Common questions.

How is 4mac different from Ollama or LM Studio?

4mac is built exclusively for Apple Silicon on the native MLX framework, maximizing Unified Memory bandwidth. Unlike Ollama and LM Studio (which rely on llama.cpp), 4mac provides a deeply integrated macOS-native experience with paged SSD KV caching for long-running coding agent sessions.

What hardware do I need?

Any Apple Silicon Mac (M1/M2/M3/M4 series). For 70B+ models, we recommend at least 64GB of Unified Memory.

Does it work with Claude Code and other OpenAI-compatible clients?

Yes. 4mac is a native drop-in replacement for the OpenAI and Anthropic APIs. Just point your client's base URL to http://localhost:8000/v1 and it works out of the box with full tool-calling support.

Do I need to re-download my models?

No! 4mac respects your existing LM Studio downloads folder. You can browse and load all your previously downloaded safetensors models effortlessly.

Which models are supported?

We support virtually all modern architectures uploaded to Hugging Face in MLX format, including Qwen, Llama 3, Mistral, GLM, MiniMax, and DeepSeek.

GET STARTED

Up and running
in two minutes.

Download the DMG or install from source. Reuses your existing LM Studio model directory — no re-download needed.

macOS App

Recommended

Drag to Applications. The welcome screen walks you through model directory, server start, and first model download. Signed and notarized.

Download DMG