| Family | Generation | Chip | CPU cores | GPU cores | Neural Engine | RAM | Mem. Bandwidth | Max Storage | Launch Price | Current Price | Refurb Price | Used (Market) | Released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation | Chips | Released | Discontinued | Max RAM | Max GPU cores | Max Bandwidth | Fastest chip for LLMs |
|---|---|---|---|---|---|---|---|
| 2022 | M1 Max / M1 Ultra | Mar 2022 | Jun 2023 | 128GB | 64 | 800GB/s | M1 Ultra (64c GPU) |
| 2023 | M2 Max / M2 Ultra | Jun 2023 | Mar 2025 | 192GB | 76 | 800GB/s | M2 Ultra (76c GPU) |
| 2025 | M4 Max / M3 Ultra | Mar 2025 | current | 512GB* | 80 | 819GB/s | M3 Ultra (80c GPU) |
| Model | Params | Best runnable quant | Est. footprint | Est. speed | Max context | Fit | Notes |
|---|---|---|---|---|---|---|---|
sysctl iogpu.wired_limit_mb) and use a short context
Won't fit at any practical quant
| Model (total / active params) |
|---|
| Metric | A | B |
|---|---|---|
| Machine | Price | / GB RAM | / GB/s bandwidth | / tok/s (8B Q4) | Largest @ Q4 | / B params it runs | 3-yr TCO / 1M tok |
|---|---|---|---|---|---|---|---|
Every figure in this guide comes from the dataset embedded in this single file, last verified against primary sources on 2 July 2026. This self-test validates the dataset's internal integrity — useful after any future data edit.
Memory footprint for a model at a given quantization is estimated as:
footprint_GB ≈ params_billions × bytes_per_param + overhead_GB
Bytes-per-parameter by quantization (blended GGUF/MLX averages):
| Quant | Bits/param | Bytes/param | Typical use |
|---|---|---|---|
| FP16 | 16 | 2.00 | Reference / training-quality inference |
| Q8_0 | ~8.5 | 1.06 | Near-lossless, roughly half the size of FP16 |
| Q6_K | ~6.6 | 0.75 | Very close to Q8 quality |
| Q5_K_M | ~5.7 | 0.69 | Good balance |
| Q4_K_M | ~4.8 | 0.58 | The default "sweet spot" for local inference |
| Q3_K_M | ~3.9 | 0.45 | Noticeable quality loss, last resort for big models |
Overhead: KV-cache + runtime buffers, approximated as max(2GB, 12% of weight size) — enough for a moderate (8K–32K token) context window. Long-context use (100K+ tokens) needs meaningfully more.
Usable memory: Apple Silicon's unified memory is shared between CPU, GPU and OS. macOS reserves a chunk (more on lower-RAM machines) and by default limits how much the GPU can "wire" for its own use. The community rule of thumb — and what this tool uses as the "comfortable" line — is ~75% of total RAM. Power users raise the GPU's ceiling with sudo sysctl iogpu.wired_limit_mb=N to push past that, which is what the "tight" tier (up to ~90%) assumes.
MoE models (Mixtral, Qwen3-MoE, Llama 4, DeepSeek-V3/R1, GPT-OSS) must keep every expert resident in memory even though only a fraction activate per token — so footprint uses total params, not active params. Active params instead drive tokens/sec, not whether it fits at all.
These are engineering estimates for planning purposes, not a guarantee — actual footprint varies by inference engine (llama.cpp vs MLX vs vLLM), context length, batch size, and OS version. Always leave more headroom if you're also running other apps.
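The footprint formula, the overhead rule and the two memory tiers above can be combined into a short sketch. This is an illustration under stated assumptions, not the tool's actual code: the function names and the tier labels are hypothetical, and the bytes-per-parameter values are copied from the table above.

```python
# Bytes per parameter, copied from the quantization table above.
BYTES_PER_PARAM = {
    "FP16": 2.00,
    "Q8_0": 1.06,
    "Q6_K": 0.75,
    "Q5_K_M": 0.69,
    "Q4_K_M": 0.58,
    "Q3_K_M": 0.45,
}


def footprint_gb(total_params_b: float, quant: str) -> float:
    """Weights plus overhead, where overhead = max(2 GB, 12% of weight size).

    For MoE models pass TOTAL params, since every expert stays resident.
    """
    weights_gb = total_params_b * BYTES_PER_PARAM[quant]
    overhead_gb = max(2.0, 0.12 * weights_gb)
    return weights_gb + overhead_gb


def fit_tier(total_params_b: float, quant: str, ram_gb: float) -> str:
    """Classify against the ~75% 'comfortable' and ~90% 'tight' lines.

    The returned labels are hypothetical names chosen for this sketch.
    """
    need = footprint_gb(total_params_b, quant)
    if need <= 0.75 * ram_gb:
        return "comfortable"
    if need <= 0.90 * ram_gb:
        return "tight"
    return "does not fit"


if __name__ == "__main__":
    for quant in ("Q5_K_M", "Q4_K_M"):
        need = footprint_gb(70, quant)
        print(f"70B {quant}: {need:.1f} GB -> {fit_tier(70, quant, 64)} on 64 GB")
```

Under these numbers a 70B model at Q4_K_M sits inside the comfortable line on a 64GB machine, while the same model at Q5_K_M only fits in the tight tier.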
Single-user token generation on Apple Silicon is memory-bandwidth-bound, not compute-bound: producing each token requires streaming the model's active parameters through memory once. That gives a simple estimate:
tokens/sec ≈ efficiency × bandwidth_GBps ÷ (active_params_billions × bytes_per_param)
The efficiency constant (0.48) is calibrated against two published real-world benchmarks rather than guessed:
Solving both benchmarks for a shared constant lands close to 0.48, and using it to back-check a third, independent case (M1 Max/400GB/s running a 7B model) lands in the same ballpark as commonly-reported numbers — reasonable cross-validation for a single-constant model. Real throughput still varies ±30% by inference engine, context length, batch size, and thermal state, and this estimate covers decode/generation speed only.
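As a minimal sketch of the decode-speed estimate, the formula can be written as a single function. The function name is hypothetical, and the example assumes a Q4_K_M quant (0.58 bytes/param) for the M1 Max back-check, which the text does not specify.

```python
EFFICIENCY = 0.48  # calibrated constant quoted in the text


def decode_tok_per_sec(
    bandwidth_gbps: float,
    active_params_b: float,
    bytes_per_param: float,
    efficiency: float = EFFICIENCY,
) -> float:
    """tokens/sec ~= efficiency * bandwidth / (active params * bytes per param).

    Use ACTIVE params here (for MoE models this is smaller than total params).
    Covers decode/generation only, not prompt processing.
    """
    return efficiency * bandwidth_gbps / (active_params_b * bytes_per_param)


if __name__ == "__main__":
    # M1 Max (400 GB/s) on a 7B model; Q4_K_M is an assumption of this sketch.
    estimate = decode_tok_per_sec(400, 7, 0.58)
    low, high = 0.7 * estimate, 1.3 * estimate  # the +/-30% band from the text
    print(f"{estimate:.1f} tok/s (plausible range {low:.1f} to {high:.1f})")
```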
Prompt processing (time-to-first-token), shown in each model's deep-dive, follows different math: prefill is compute-bound, scaling with GPU cores and per-core throughput rather than memory bandwidth, and is nearly independent of quantization. It is modeled as prefill tok/s ≈ 130 × gpu_cores × generation_factor ÷ active_params_B, calibrated against llama.cpp pp512 community benchmarks (M1 Max 32-core ≈ 590 tok/s on a 7B; M2 Ultra 76-core ≈ 1,600 tok/s; both inside reported ranges). The generation_factor is ×1.15 for M2, ×1.3 for M3 and ×1.5 for M4 relative to M1, plus a deliberately conservative ×2.2 for M5's neural-accelerator GPUs versus Apple's ~3.5× marketing claim. Treat the result as ±40%: this is the roughest estimate in the tool, but it makes the long-prompt (RAG, whole-codebase) wait visible instead of hidden.
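The prefill model can be sketched the same way. The function names are hypothetical, and the M1 factor of 1.0 is an assumption inferred from the M1 Max calibration point rather than stated in the text.

```python
# Per-generation GPU factors from the text; M1 = 1.0 is an inferred baseline.
GENERATION_FACTOR = {"M1": 1.0, "M2": 1.15, "M3": 1.3, "M4": 1.5, "M5": 2.2}


def prefill_tok_per_sec(gpu_cores: int, generation: str, active_params_b: float) -> float:
    """Prompt-processing speed: 130 * gpu_cores * generation_factor / active params (B)."""
    return 130 * gpu_cores * GENERATION_FACTOR[generation] / active_params_b


def time_to_first_token_s(
    prompt_tokens: int, gpu_cores: int, generation: str, active_params_b: float
) -> float:
    """Seconds spent on prefill before the first generated token appears."""
    return prompt_tokens / prefill_tok_per_sec(gpu_cores, generation, active_params_b)


if __name__ == "__main__":
    # The two calibration points quoted in the text, both on a 7B model.
    print(round(prefill_tok_per_sec(32, "M1", 7)))  # M1 Max 32-core
    print(round(prefill_tok_per_sec(76, "M2", 7)))  # M2 Ultra 76-core
```

Dividing a long prompt by this rate shows why a RAG or whole-codebase prompt can mean a wait of many seconds before generation starts.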
Switching the currency selector (top right) does not apply a flat FX rate. Instead each chip variant (M1 Max, M1 Ultra, M2 Max, M2 Ultra, M4 Max, M3 Ultra) has its own multiplier, computed as regional MSRP ÷ US MSRP, averaged across that chip's two GPU-core tiers, using Apple's own historical regional store prices (not a generic EUR/GBP/CAD/AUD exchange rate). This matters because Apple's international pricing bakes in local VAT/GST and rounds to local psychological price points — it isn't just FX math, and the ratio drifts across generations (e.g. AUD went from ~1.55× USD in 2022 to ~1.72–1.76× by 2025 as the currency weakened and Apple's margin assumptions shifted).
Known limitation: Apple doesn't publish a per-RAM-tier regional breakdown, so RAM/storage upcharges are estimated by applying the same chip-level ratio rather than an independently-sourced regional upcharge — treat non-base configurations as a close approximation, and the base (cheapest) configuration per chip as the most accurate figure.
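One reading of the multiplier rule is an average of per-tier price ratios, applied unchanged to upgraded configurations. The sketch below assumes that reading; the function names are hypothetical and no real Apple prices are embedded in it.

```python
def chip_multiplier(us_msrps: list[float], regional_msrps: list[float]) -> float:
    """Average of (regional MSRP / US MSRP) across a chip's GPU-core tiers.

    Both lists must be in the same tier order (e.g. lower tier, higher tier).
    """
    if not us_msrps or len(us_msrps) != len(regional_msrps):
        raise ValueError("need one regional price per US price")
    ratios = [regional / us for us, regional in zip(us_msrps, regional_msrps)]
    return sum(ratios) / len(ratios)


def regional_price(us_price: float, multiplier: float) -> int:
    """Apply the chip-level ratio to any configuration's US price.

    For non-base configurations this is the approximation described above,
    since RAM/storage upcharges are not sourced per region.
    """
    return round(us_price * multiplier)
```

Averaging ratios rather than dividing averaged prices keeps each tier's contribution equal regardless of its absolute price.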
The Compare table's Current Price column reflects the June 2026 Apple price increase, driven by a global DRAM/NAND shortage:
Why the Refurb Price column can only ever be an estimate: Apple's Certified Refurbished store (apple.com/shop/refurbished/mac/mac-studio) is a client-side-rendered storefront — its listings and prices are injected by JavaScript, not present in the page's HTML, so there is no static "official refurb price list" to read or hardcode. It's also not a catalog: it's a rotating pool of returned/open-box units that sells out per-configuration constantly — one source noted the store had zero Mac Studio units in stock at all as of June 2026. So unlike Launch Price (a fixed historical fact) or Current Price (a live number Apple actually publishes for new units), there is no ground-truth "the refurb price" to look up — it fluctuates with whatever inventory happens to exist right now.
What this tool does instead: it models Refurb Price off each generation's launch price (not the shortage-inflated current price) using an age-based discount, calibrated against real spot-checked listings found via search — e.g. an M1 Max 64GB/1TB unit listed at $1,909 (≈20% off its $2,399 launch price), an M2 Max 64GB/1TB unit at $2,139 (≈11% off), and an M4 Max 64GB/1TB unit at $2,459 (≈5% off its pre-hike $2,599 launch price). Notably, refurb prices tracked the original launch price, not the 2026 hiked price — depreciation from age dominates over the shortage-driven price hike. This is a small sample, not a full dataset, so treat every Refurb Price figure as directional, and check Apple's live store for what's actually in stock and its real price.
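The spot-check percentages quoted above can be reproduced with simple arithmetic. This sketch only covers the two listings whose launch price is stated in the text; the function names are hypothetical, and the age-based discount curve itself is not specified in this guide, so it is left as an input.

```python
def percent_off(listing_price: float, launch_price: float) -> float:
    """Discount of a refurb listing relative to the launch price, in percent."""
    return 100.0 * (1.0 - listing_price / launch_price)


def refurb_estimate(launch_price: float, discount_pct: float) -> float:
    """Estimated refurb price given an age-based discount (supplied by the caller)."""
    return launch_price * (1.0 - discount_pct / 100.0)


if __name__ == "__main__":
    # (listing, launch) pairs quoted in the text.
    spot_checks = {
        "M1 Max 64GB/1TB": (1909, 2399),
        "M4 Max 64GB/1TB": (2459, 2599),
    }
    for name, (listing, launch) in spot_checks.items():
        print(f"{name}: {percent_off(listing, launch):.1f}% off launch")
```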
Once a model's weights are loaded, whatever unified memory remains within the 75% "comfortable" budget is available for KV-cache — i.e. actual usable context length. The Max context column in "Find LLMs for a Config" shows the smaller of (a) how many tokens fit in the remaining RAM, and (b) the model's own trained maximum context — labeled RAM-bound or model max accordingly.
KV-cache-per-token is modeled from each model's typical published attention architecture (layers × KV-heads × head-dimension), not independently audited for all 27 catalog entries — treat it as directional. Three families are called out specifically because their whole design point is KV-cache efficiency, so the gap vs. a naive per-parameter estimate is real and large: DeepSeek-V3/R1 uses Multi-head Latent Attention (compressing the cache far below standard attention), Qwen3.6-27B uses a hybrid linear-attention design built explicitly to cut long-context KV cost, and Falcon 180B uses multi-query attention (cheap per token, but paired with a short native 2K context regardless).
Llama 4 Scout/Maverick's headline 10M/1M-token context windows are architecturally real but essentially unreachable on any machine in this guide — the KV cache required would demand far more RAM than any configuration offers. This tool's RAM-bound figure will show the realistic ceiling, not the marketing number.
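A rough sketch of the Max context calculation follows. It assumes the standard KV-cache layout (one key and one value vector per layer per KV head, stored at 16 bits) and 1 GB = 10^9 bytes; the text gives only the layers × KV-heads × head-dimension product, so those two details and the function names are assumptions of this sketch.

```python
def kv_bytes_per_token(
    layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2
) -> int:
    """Bytes of KV cache per token: K and V (hence the factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_context(
    ram_gb: float,
    weights_gb: float,
    kv_bytes_tok: int,
    model_max_ctx: int,
    budget: float = 0.75,
) -> tuple[int, str]:
    """Smaller of (tokens that fit in leftover RAM) and (the model's trained maximum)."""
    free_bytes = max(0.0, budget * ram_gb - weights_gb) * 1e9
    ram_tokens = int(free_bytes // kv_bytes_tok)
    if ram_tokens < model_max_ctx:
        return ram_tokens, "RAM-bound"
    return model_max_ctx, "model max"


if __name__ == "__main__":
    # Llama-3.1-8B-style attention shape: 32 layers, 8 KV heads, head dim 128.
    per_tok = kv_bytes_per_token(32, 8, 128)
    print(per_tok, max_context(16, 4.64, per_tok, 131072))
```

Architectures such as Multi-head Latent Attention or hybrid linear attention do not follow this per-token formula, which is why the text calls them out separately.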
The picker panel's Running Cost card multiplies each chip's idle/load wattage by your chosen hours-at-load and electricity rate. Wattage figures are per-chip (RAM tier barely moves power draw) and represent realistic sustained LLM-inference draw: token generation is memory-bandwidth/GPU bound, not the kind of all-cores-plus-GPU synthetic stress test that produces the much higher "max wattage" figures sometimes quoted in reviews (e.g. 330W+ on M4 Max, 370W+ on M2 Ultra). The one direct real-workload anchor found is an M3 Ultra confirmed running DeepSeek R1 671B at under 200W, which is used to calibrate the Ultra-tier load figure. Electricity rate defaults are rough per-currency ballparks, not verified per-country figures, so edit the field to your real local rate.
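The running-cost arithmetic might look like the sketch below. It assumes the machine stays powered on and idles for the hours it is not under load, and uses a 30-day month; the function name and any wattages passed in are hypothetical inputs, not figures from the dataset.

```python
def monthly_energy_cost(
    load_w: float,
    idle_w: float,
    hours_at_load_per_day: float,
    rate_per_kwh: float,
    days: int = 30,
) -> float:
    """Electricity cost per month in the currency of rate_per_kwh.

    Assumes the remaining (24 - hours_at_load) hours each day are spent idle.
    """
    if not 0 <= hours_at_load_per_day <= 24:
        raise ValueError("hours_at_load_per_day must be between 0 and 24")
    idle_hours = 24 - hours_at_load_per_day
    kwh_per_day = (load_w * hours_at_load_per_day + idle_w * idle_hours) / 1000
    return kwh_per_day * days * rate_per_kwh
```

For example, placeholder values of 200W under load, 10W idle, 4 hours a day at load and 0.15 per kWh work out to 1 kWh per day.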
The "Local vs cloud API" section answers the pre-purchase question directly: if you generated the same tokens through a hosted API instead of buying the machine, what would it cost, and how long until the hardware pays for itself?
How it's computed: your hours-at-load are assumed to be spent generating tokens at this machine's estimated tok/s for the chosen model (the same bandwidth-bound speed model documented above). Local cost per million tokens is the electricity burned at load wattage over the time it takes to produce them; the monthly local figure additionally includes idle draw for the rest of the day. Cloud cost uses typical hosted output-token prices per model. Payback = today's hardware price (current store price if orderable, used-market midpoint otherwise) ÷ monthly savings.
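The payback calculation described above can be sketched as follows. The function names, the 30-day month and the choice to return None when local inference never saves money are assumptions of this sketch; all prices and wattages are caller-supplied.

```python
from typing import Optional


def local_cost_per_million(load_w: float, tok_per_sec: float, rate_per_kwh: float) -> float:
    """Electricity burned at load wattage while generating one million tokens."""
    hours = 1_000_000 / tok_per_sec / 3600
    return load_w / 1000 * hours * rate_per_kwh


def payback_months(
    hardware_price: float,
    tok_per_sec: float,
    hours_at_load_per_day: float,
    load_w: float,
    idle_w: float,
    rate_per_kwh: float,
    cloud_price_per_million: float,
    days: int = 30,
) -> Optional[float]:
    """Months until cloud spend avoided equals the hardware price.

    Output tokens only; input-token costs are ignored, as in the text.
    """
    tokens_millions = tok_per_sec * 3600 * hours_at_load_per_day * days / 1e6
    cloud_monthly = tokens_millions * cloud_price_per_million
    kwh_per_day = (
        load_w * hours_at_load_per_day + idle_w * (24 - hours_at_load_per_day)
    ) / 1000
    local_monthly = kwh_per_day * days * rate_per_kwh
    savings = cloud_monthly - local_monthly
    if savings <= 0:
        return None  # local electricity costs at least as much as the API
    return hardware_price / savings
```

A low duty cycle or a very cheap hosted model can push savings to zero or below, in which case the hardware never pays for itself on token cost alone.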
Cloud price confidence: three anchors are directly verified as of July 2026 — DeepSeek R1 at $2.50/M output and Llama 3.3 70B at $0.32/M (both OpenRouter), and GPT-OSS 120B at $0.60/M (Together AI and Fireworks, identical). Every other model is a tier estimate interpolated by size/class, and open-model hosting prices genuinely vary 2–5× between providers and variants (Qwen3-235B spans $0.10–$1.82/M depending on variant and host) — so treat non-anchor comparisons as directional. Cloud USD prices are converted to the display currency with rough fixed FX rates, separate from the regional-MSRP multipliers used for hardware.
What the payback number deliberately ignores: input-token costs (often the larger share for RAG/long-context workloads — including them would favor local even more), model quality differences between what fits locally and frontier hosted models, the privacy/offline value of local inference, hardware resale value, and macOS utility beyond LLM duty. It's a planning ballpark for the "should I buy this machine at all?" question, not an accounting model.
When the "Recommend a Config" wizard finds that no single machine has enough RAM for your selected models, it suggests clustering multiple units of the cheapest $/GB configuration using a tool like EXO, llama.cpp's RPC backend, or MLX's distributed inference support. This is a real, community-used technique — unified memory really does pool additively across networked machines — but it is not Apple-supported, requires real setup work, and generation speed drops well below this tool's single-box tok/s estimates once inference is bound by the Thunderbolt/network link between machines rather than one machine's internal memory bus. The wizard deliberately does not attempt to estimate clustered tok/s, since it depends heavily on network topology and software stack rather than hardware specs this tool can model.
Coverage beyond the Mac Studio is limited to LLM-relevant configurations (16GB+) and carries a lower confidence tier than the Studio data:
The Value panel ranks every machine by price-per-unit on the basis you choose. Most metrics are straight division (price ÷ RAM, ÷ bandwidth, ÷ estimated tok/s on the Llama 3.1 8B Q4 reference, ÷ parameters of the largest model that fits comfortably at Q4). The 3-year TCO per 1M tokens goes further: (price + 3 years of electricity at your Running Cost inputs) ÷ (3 years of tokens generated at that duty cycle on the reference model) — a genuinely comparable cost-per-token against cloud APIs, inheriting the speed model's ±30% and the power model's ±30% error bars, so treat it as directional.
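A sketch of the 3-year TCO metric, assuming idle draw is counted for the hours not spent at load and a 365-day year. The function name is hypothetical; tok/s would come from the speed model on the 8B Q4 reference.

```python
def tco_per_million_tokens(
    price: float,
    tok_per_sec: float,
    hours_at_load_per_day: float,
    load_w: float,
    idle_w: float,
    rate_per_kwh: float,
    years: int = 3,
) -> float:
    """(price + electricity over `years`) / (millions of tokens generated over `years`)."""
    days = 365 * years
    kwh_per_day = (
        load_w * hours_at_load_per_day + idle_w * (24 - hours_at_load_per_day)
    ) / 1000
    energy_cost = kwh_per_day * days * rate_per_kwh
    tokens_millions = tok_per_sec * 3600 * hours_at_load_per_day * days / 1e6
    if tokens_millions == 0:
        raise ValueError("no tokens generated at this duty cycle")
    return (price + energy_cost) / tokens_millions
```

Because the purchase price dominates at low duty cycles, the result is highly sensitive to hours-at-load: doubling the hours nearly halves the cost per token.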
The questionnaire's score is deliberately simple enough to print: score = (largest-fit params at Q4 + w × tok/s on the 8B reference) ÷ (3-yr TCO ÷ 1000), where w weights speed higher for interactive uses (chat, creative: 1.5; coding: 1.0) and model size higher for agentic/reasoning/long-context work (0.7/0.5/0.3). It is a value heuristic, not a benchmark — two machines within ~15% of each other's score are effectively tied, and every price caveat flagged elsewhere (refurb/used estimates, modeled RAM-tier upcharges) applies here too.
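The score formula is short enough to sketch directly. The mapping of 0.7/0.5/0.3 to agentic/reasoning/long-context follows the order in which the text lists them, and the tie test (relative gap measured against the larger score) is one possible reading of "within ~15%"; both are assumptions, as are the function and key names.

```python
# Speed weight w per use case, as listed in the text (order-based mapping assumed).
SPEED_WEIGHT = {
    "chat": 1.5,
    "creative": 1.5,
    "coding": 1.0,
    "agentic": 0.7,
    "reasoning": 0.5,
    "long-context": 0.3,
}


def score(
    largest_fit_params_b: float, tok_per_sec_8b: float, tco_3yr: float, use_case: str
) -> float:
    """(largest-fit params at Q4 + w * tok/s on the 8B reference) / (3-yr TCO / 1000)."""
    w = SPEED_WEIGHT[use_case]
    return (largest_fit_params_b + w * tok_per_sec_8b) / (tco_3yr / 1000)


def effectively_tied(score_a: float, score_b: float, tolerance: float = 0.15) -> bool:
    """True when the two scores differ by no more than ~15% of the larger one."""
    return abs(score_a - score_b) <= tolerance * max(score_a, score_b)
```

With the same machine and TCO, a chat user and a long-context user get different scores because the speed term is weighted 1.5 versus 0.3.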
Everything — leaderboards, questionnaire scoring, all 75 machines × 35 models — computes locally in your browser from the dataset embedded in this single HTML file. There is no server, no API, and no network request beyond the two web fonts.
This tool is fact-checked against primary sources (Apple's own spec pages and regional stores) as of July 2026, but a few things would make it more trustworthy over time:
support.apple.com tech-spec page (linked below) rather than a secondary aggregator, since aggregators occasionally mix up generations or GPU tiers.
Mac Studio specifications and USD pricing:
Regional pricing (used to derive currency multipliers):
Selected LLM catalog sources:
Speed-estimate calibration benchmarks: