How much RAM do I need to run a 70B model on a Mac?

About 62GB of unified memory runs a 70B-class model comfortably at Q4_K_M quantization (weights plus KV-cache overhead, keeping ~25% of RAM free for macOS). A 64GB machine is tight; 96GB or more is comfortable.

Can a Mac Studio run DeepSeek?

A 512GB M3 Ultra Mac Studio runs DeepSeek-V3/R1 (671B) at ~Q3 quantization around 24 tok/s, and DeepSeek V4 Flash (284B) at Q8 around 29 tok/s. The 1.6T-parameter V4 Pro does not fit any single Mac at practical quantizations.

What is the best value Mac for local LLMs?

On the used market, a used M1 Ultra Mac Studio 64GB delivers roughly the cheapest tokens over a 3-year life (~$2 per million tokens including electricity), while a used 512GB M3 Ultra is the cheapest per GB of RAM and per billion parameters of model capability. The site's Value panel computes this live across 75 configurations.

Is a Mac cheaper than cloud APIs for running LLMs?

For steady daily use, often yes: hardware amortized over three years plus electricity can undercut per-token cloud pricing for comparable open-weight models, especially on used machines. For occasional use, cloud APIs stay cheaper. The guide computes the break-even from your usage hours and electricity rate.

Mac Studio × Local LLMs

Every machine, every configuration

Each row is one buildable configuration as originally offered by Apple (family × chip variant × RAM tier) — Mac Studio, Mac mini, and 14" MacBook Pro. Click column headers to sort; deeper orange = higher tier within that column.

Hardware

Family	Generation	Chip	CPU cores	GPU cores	Neural Engine	RAM	Mem. Bandwidth	Max Storage	Launch Price	Current Price	Refurb Price	Used (Market)	Released

verified directly confirmed on Apple's current store est. modeled from the verified base-config price hike for that chip Discontinued— no longer sold new by Apple Pulled (2026)— this exact RAM tier was withdrawn during the 2026 memory shortage

Refurb prices are modeled off each generation's launch price (not the shortage-inflated current price) with an age-based discount calibrated from real spot-checked listings — Apple doesn't publish a fixed refurb price list, and inventory rotates constantly (it was reportedly empty for Mac Studio entirely in June 2026), so treat these as rough planning estimates and check apple.com/shop/refurbished/mac/mac-studio for live stock and real prices. Used (Market) shows a range modeled from real eBay/Back Market/Swappa spot-checks — this is open-market private-party pricing with no certification, so it's the widest, least reliable range in the table; treat it as a ballpark for negotiation, not a quote.

Generation overview

Generation	Chips	Released	Discontinued	Max RAM	Max GPU cores	Max Bandwidth	Fastest chip for LLMs
2022	M1 Max / M1 Ultra	Mar 2022	Jun 2023	128GB	64	800GB/s	M1 Ultra (64c GPU)
2023	M2 Max / M2 Ultra	Jun 2023	Mar 2025	192GB	76	800GB/s	M2 Ultra (76c GPU)
2025	M4 Max / M3 Ultra	Mar 2025	current	512GB*	80	819GB/s	M3 Ultra (80c GPU)

*512GB was the original 2025 launch ceiling on M3 Ultra (32-core CPU/80-core GPU tier). As of mid-2026 Apple temporarily limits new orders to 96GB due to a memory-chip shortage — see banner above.

What can your Mac actually run?

Pick a configuration — Mac Studio, Mac mini, or MacBook Pro. The calculator estimates real memory footprint per quantization level (weights + KV-cache/runtime overhead) against ~75% of unified memory kept free for macOS + apps — the commonly recommended safety margin for llama.cpp / MLX / Ollama.

Hardware

Generation

Chip / GPU config

Unified Memory

Running cost

Rough electricity cost for this configuration, based on realistic sustained LLM-inference power draw (not synthetic max-load figures).

Hours/day at load

Electricity rate

Local vs cloud API

If you generated the same tokens through a hosted API instead of buying this machine, what would it cost — and how long until the hardware pays for itself?

Compare against hosted

Load/idle wattage is per-chip, calibrated from real reviews (M3 Ultra confirmed <200W running DeepSeek R1 671B; other chips from published idle/max-load teardowns) — see Methodology. Default electricity rate is a rough residential ballpark for the selected currency; edit it to your real local rate. The cloud comparison assumes your "hours/day at load" are spent generating tokens at this machine's estimated speed for the chosen model, priced against typical hosted output rates (three directly verified anchors, the rest tier estimates — see Methodology). Cloud USD prices are converted at rough fixed FX rates, and the payback figure uses what you'd pay for the machine today (current price if orderable, used-market midpoint otherwise). Payback ignores input-token costs, quality differences, privacy value, and resale value — it's a planning ballpark, not an accounting model.

Popular picks by use case

Model	Params	Best runnable quant	Est. footprint	Est. speed	Max context	Fit	Notes

Comfortable — fits in ~75% of RAM, room for decent context Tight — needs >75%, raise the GPU wired-memory limit (sysctl iogpu.wired_limit_mb) and use a short context Won't fit at any practical quant

Full compatibility matrix

Best quantization each RAM tier can run "comfortably" for every model in the catalog, one hardware family at a time. Purple params = mixture-of-experts (MoE) — memory is driven by total parameters even though only the active subset computes per token, so MoE models need more RAM than their speed suggests. ★ marks the most widely-used models.

Hardware

Model (total / active params)

fits comfortably fits tight (raise wired limit) — won't fit

Which Mac should you buy?

Check off the models you want to run comfortably. This finds the cheapest configuration — across every generation, ever sold — that has enough unified memory for all of them at once.

Mode

Minimum quality floor

Cheapest by

Config duel

Pin any two configurations and compare them head-to-head. The stronger value in each row is marked ▲ with the margin; price rows favor the cheaper machine.

Configuration A

Configuration B

Metric	A	B

Derived rows (largest comfortable model, est. speed) use the same fit/speed models documented in Method — treat them with the same caveats. Current/refurb/used prices carry their usual verified/est. flags.

Value analysis

Cheapest machine per unit of everything that matters for local LLMs — computed live in your browser from the same dataset as every other panel (no server involved). Pick a price basis; the podium shows the outright winner per metric, and the table ranks all machines (click headers to sort).

Price basis

Machine	Price	/ GB RAM	/ GB/s bandwidth	/ tok/s (8B Q4)	Largest @ Q4	/ B params it runs	3-yr TCO / 1M tok

TCO = price on the chosen basis + 3 years of electricity at your Running Cost inputs (hours/day at load, rate — set on the Find LLMs panel), divided by 3 years of tokens generated at that duty cycle on the 8B reference model. Refurb/used prices carry their usual est. caveats; ▲ marks the best value in each column.

Best value for you

Six questions, scored entirely in your browser against all machines. The score is transparent: capability (largest model it runs at Q4, weighted by bandwidth for your use case) per unit of 3-year total cost.

Data diagnostics

Every figure in this guide comes from the dataset embedded in this single file, last verified against primary sources on 2 July 2026. This self-test validates the dataset's internal integrity — useful after any future data edit.

How the numbers are computed

Memory footprint for a model at a given quantization is estimated as:

footprint_GB ≈ params_billions × bytes_per_param + overhead_GB

Bytes-per-parameter by quantization (blended GGUF/MLX averages):

Quant	Bits/param	Bytes/param	Typical use
FP16	16	2.00	Reference / training-quality inference
Q8_0	~8.5	1.06	Near-lossless, 2x smaller than FP16
Q6_K	~6.6	0.75	Very close to Q8 quality
Q5_K_M	~5.7	0.69	Good balance
Q4_K_M	~4.8	0.58	The default "sweet spot" for local inference
Q3_K_M	~3.9	0.45	Noticeable quality loss, last resort for big models

Overhead: KV-cache + runtime buffers, approximated as max(2GB, 12% of weight size) — enough for a moderate (8K–32K token) context window. Long-context use (100K+ tokens) needs meaningfully more.

Usable memory: Apple Silicon's unified memory is shared between CPU, GPU and OS. macOS reserves a chunk (more on lower-RAM machines) and by default limits how much the GPU can "wire" for its own use. The community rule of thumb — and what this tool uses as the "comfortable" line — is ~75% of total RAM. Power users raise the GPU's ceiling with sudo sysctl iogpu.wired_limit_mb=N to push past that, which is what the "tight" tier (up to ~90%) assumes.

MoE models (Mixtral, Qwen3-MoE, Llama 4, DeepSeek-V3/R1, GPT-OSS) must keep every expert resident in memory even though only a fraction activate per token — so footprint uses total params, not active params. Active params instead drive tokens/sec, not whether it fits at all.

These are engineering estimates for planning purposes, not a guarantee — actual footprint varies by inference engine (llama.cpp vs MLX vs vLLM), context length, batch size, and OS version. Always leave more headroom if you're also running other apps.

Speed estimates (tokens/sec)

Single-user token generation on Apple Silicon is memory-bandwidth-bound, not compute-bound: producing each token requires streaming the model's active parameters through memory once. That gives a simple estimate:

tokens/sec ≈ efficiency × bandwidth_GBps ÷ (active_params_billions × bytes_per_param)

The efficiency constant (0.48) is calibrated, not guessed, against two published real-world benchmarks:

M2 Ultra (800GB/s) running Llama-family 70B dense @ Q4_K_M → ~8–12 tok/s community-reported
M3 Ultra (819GB/s) running DeepSeek R1 671B (37B active, MoE) @ Q4 → ~16–18 tok/s (llama.cpp and MLX benchmarks)

Solving both benchmarks for a shared constant lands close to 0.48, and using it to back-check a third, independent case (M1 Max/400GB/s running a 7B model) lands in the same ballpark as commonly-reported numbers — reasonable cross-validation for a single-constant model. Real throughput still varies ±30% by inference engine, context length, batch size, and thermal state, and this estimate covers decode/generation speed only.

Prompt processing (time-to-first-token), shown in each model's deep-dive, follows different math: prefill is compute-bound, scaling with GPU cores and per-core throughput rather than memory bandwidth, and is nearly independent of quantization. Modeled as 130 × gpu_cores × generation_factor ÷ active_params_B, calibrated against llama.cpp pp512 community benchmarks (M1 Max 32-core ≈ 590 tok/s on a 7B; M2 Ultra 76-core ≈ 1,600 tok/s — both inside reported ranges), with per-generation GPU gains of ×1.15/1.3/1.5 for M2/M3/M4 and a deliberately conservative ×2.2 for M5's neural-accelerator GPUs versus Apple's ~3.5× marketing claim. Treat as ±40% — this is the roughest estimate in the tool, but it makes the long-prompt (RAG, whole-codebase) wait visible instead of hidden.

Currency conversion

Switching the currency selector (top right) does not apply a flat FX rate. Instead each chip variant (M1 Max, M1 Ultra, M2 Max, M2 Ultra, M4 Max, M3 Ultra) has its own multiplier, computed as regional MSRP ÷ US MSRP, averaged across that chip's two GPU-core tiers, using Apple's own historical regional store prices (not a generic EUR/GBP/CAD/AUD exchange rate). This matters because Apple's international pricing bakes in local VAT/GST and rounds to local psychological price points — it isn't just FX math, and the ratio drifts across generations (e.g. AUD went from ~1.55× USD in 2022 to ~1.72–1.76× by 2025 as the currency weakened and Apple's margin assumptions shifted).

Known limitation: Apple doesn't publish a per-RAM-tier regional breakdown, so RAM/storage upcharges are estimated by applying the same chip-level ratio rather than an independently-sourced regional upcharge — treat non-base configurations as a close approximation, and the base (cheapest) configuration per chip as the most accurate figure.

Current & refurbished pricing

The Compare table's Current Price column reflects the June 2026 Apple price increase, driven by a global DRAM/NAND shortage:

verified — the M4 Max 14-core/32-core/36GB ($1,999→$2,499) and M3 Ultra 28-core/60-core/96GB ($3,999→$5,299) base configurations are directly confirmed on Apple's current store.
est. — every other 2025-generation RAM/GPU tier applies that same chip's verified hike percentage (M4 Max ≈ +25.0%, M3 Ultra ≈ +32.5%) to its original launch price, since Apple hasn't published a full itemized current price list mid-shortage.
Pulled (2026) — M4 Max 128GB and M3 Ultra 256GB/512GB were confirmed withdrawn from sale entirely (March–May 2026) and are not priced.
Discontinued — all 2022 (M1) and 2023 (M2) configurations are no longer sold new by Apple; their Launch Price is the last price they were ever sold at.

Why the Refurb Price column can only ever be an estimate: Apple's Certified Refurbished store (apple.com/shop/refurbished/mac/mac-studio) is a client-side-rendered storefront — its listings and prices are injected by JavaScript, not present in the page's HTML, so there is no static "official refurb price list" to read or hardcode. It's also not a catalog: it's a rotating pool of returned/open-box units that sells out per-configuration constantly — one source noted the store had zero Mac Studio units in stock at all as of June 2026. So unlike Launch Price (a fixed historical fact) or Current Price (a live number Apple actually publishes for new units), there is no ground-truth "the refurb price" to look up — it fluctuates with whatever inventory happens to exist right now.

What this tool does instead: it models Refurb Price off each generation's launch price (not the shortage-inflated current price) using an age-based discount, calibrated against real spot-checked listings found via search — e.g. an M1 Max 64GB/1TB unit listed at $1,909 (≈20% off its $2,399 launch price), an M2 Max 64GB/1TB unit at $2,139 (≈11% off), and an M4 Max 64GB/1TB unit at $2,459 (≈5% off its pre-hike $2,599 launch price). Notably, refurb prices tracked the original launch price, not the 2026 hiked price — depreciation from age dominates over the shortage-driven price hike. This is a small sample, not a full dataset, so treat every Refurb Price figure as directional, and check Apple's live store for what's actually in stock and its real price.

Context window calculator

Once a model's weights are loaded, whatever unified memory remains within the 75% "comfortable" budget is available for KV-cache — i.e. actual usable context length. The Max context column in "Find LLMs for a Config" shows the smaller of (a) how many tokens fit in the remaining RAM, and (b) the model's own trained maximum context — labeled RAM-bound or model max accordingly.

KV-cache-per-token is modeled from each model's typical published attention architecture (layers × KV-heads × head-dimension), not independently audited for all 27 catalog entries — treat it as directional. Three families are called out specifically because their whole design point is KV-cache efficiency, so the gap vs. a naive per-parameter estimate is real and large: DeepSeek-V3/R1 uses Multi-head Latent Attention (compressing the cache far below standard attention), Qwen3.6-27B uses a hybrid linear-attention design built explicitly to cut long-context KV cost, and Falcon 180B uses multi-query attention (cheap per token, but paired with a short native 2K context regardless).

Llama 4 Scout/Maverick's headline 10M/1M-token context windows are architecturally real but essentially unreachable on any machine in this guide — the KV cache required would demand far more RAM than any configuration offers. This tool's RAM-bound figure will show the realistic ceiling, not the marketing number.

Running cost (electricity) estimate

The picker panel's Running Cost card multiplies each chip's idle/load wattage by your chosen hours-at-load and electricity rate. Wattage figures are per-chip (RAM tier barely moves power draw) and represent realistic sustained LLM-inference draw — token generation is memory-bandwidth/GPU bound, not the kind of all-cores-plus-GPU synthetic stress test that produces the much higher "max wattage" figures sometimes quoted in reviews (e.g. 330W+ on M4 Max, 370W+ on M2 Ultra). The one direct real-workload anchor found: M3 Ultra confirmed running DeepSeek R1 671B at under 200W — used to calibrate the Ultra-tier load figure. Electricity rate defaults are rough per-currency ballparks, not verified per-country figures — edit the field to your real local rate.

Local vs cloud API comparison

The "Local vs cloud API" section answers the pre-purchase question directly: if you generated the same tokens through a hosted API instead of buying the machine, what would it cost, and how long until the hardware pays for itself?

How it's computed: your hours-at-load are assumed to be spent generating tokens at this machine's estimated tok/s for the chosen model (the same bandwidth-bound speed model documented above). Local cost per million tokens is the electricity burned at load wattage over the time it takes to produce them; the monthly local figure additionally includes idle draw for the rest of the day. Cloud cost uses typical hosted output-token prices per model. Payback = today's hardware price (current store price if orderable, used-market midpoint otherwise) ÷ monthly savings.

Cloud price confidence: three anchors are directly verified as of July 2026 — DeepSeek R1 at $2.50/M output and Llama 3.3 70B at $0.32/M (both OpenRouter), and GPT-OSS 120B at $0.60/M (Together AI and Fireworks, identical). Every other model is a tier estimate interpolated by size/class, and open-model hosting prices genuinely vary 2–5× between providers and variants (Qwen3-235B spans $0.10–$1.82/M depending on variant and host) — so treat non-anchor comparisons as directional. Cloud USD prices are converted to the display currency with rough fixed FX rates, separate from the regional-MSRP multipliers used for hardware.

What the payback number deliberately ignores: input-token costs (often the larger share for RAG/long-context workloads — including them would favor local even more), model quality differences between what fits locally and frontier hosted models, the privacy/offline value of local inference, hardware resale value, and macOS utility beyond LLM duty. It's a planning ballpark for the "should I buy this machine at all?" question, not an accounting model.

Multi-Mac clustering

When the "Recommend a Config" wizard finds that no single machine has enough RAM for your selected models, it suggests clustering multiple units of the cheapest $/GB configuration using a tool like EXO, llama.cpp's RPC backend, or MLX's distributed inference support. This is a real, community-used technique — unified memory really does pool additively across networked machines — but it is not Apple-supported, requires real setup work, and generation speed drops well below this tool's single-box tok/s estimates once inference is bound by the Thunderbolt/network link between machines rather than one machine's internal memory bus. The wizard deliberately does not attempt to estimate clustered tok/s, since it depends heavily on network topology and software stack rather than hardware specs this tool can model.

Mac mini & MacBook Pro coverage

Coverage beyond the Mac Studio is limited to LLM-relevant configurations (16GB+) and carries a lower confidence tier than the Studio data:

Base configuration prices are verified against launch coverage (Mac mini M1 $699/M2 $599/M2 Pro $1,299/M4 $599/M4 Pro $1,399; MacBook Pro 14" M1 Pro $1,999, M3 Pro $1,999, M4 $1,599, M4 Pro 14c $2,499). Non-base RAM tiers apply Apple's standard published upcharges for that era rather than an itemized verified price per tier — treat them as modeled.
MacBook Pro rows use 14-inch pricing (the cheapest chassis for each chip); the 16-inch versions of the same chip/RAM cost ~$200–500 more.
Bandwidth figures are chip-accurate and matter for speed estimates: note the M3 Pro's cut to 150GB/s (below the M1/M2 Pro's 200) and the M3 Max's split (300GB/s on the 14-core bin, 400 on the 16-core).
Laptop power draw is the roughest figure here — no direct LLM-workload measurements were found for MBP chips, and laptop chassis throttle sustained load below desktop levels; treat running-cost numbers as ±30%.
Regional-currency multipliers reuse the era-matched Mac Studio ratios (only the Studio line's regional MSRPs were researched); the mini M4's 2026 price rise to $799 is verified, other current-price hikes for these lines are estimates flagged in-table.

Value analysis & questionnaire

The Value panel ranks every machine by price-per-unit on the basis you choose. Most metrics are straight division (price ÷ RAM, ÷ bandwidth, ÷ estimated tok/s on the Llama 3.1 8B Q4 reference, ÷ parameters of the largest model that fits comfortably at Q4). The 3-year TCO per 1M tokens goes further: (price + 3 years of electricity at your Running Cost inputs) ÷ (3 years of tokens generated at that duty cycle on the reference model) — a genuinely comparable cost-per-token against cloud APIs, inheriting the speed model's ±30% and the power model's ±30% error bars, so treat it as directional.

The questionnaire's score is deliberately simple enough to print: score = (largest-fit params at Q4 + w × tok/s on the 8B reference) ÷ (3-yr TCO ÷ 1000), where w weights speed higher for interactive uses (chat, creative: 1.5; coding: 1.0) and model size higher for agentic/reasoning/long-context work (0.7/0.5/0.3). It is a value heuristic, not a benchmark — two machines within ~15% of each other's score are effectively tied, and every price caveat flagged elsewhere (refurb/used estimates, modeled RAM-tier upcharges) applies here too.

Everything — leaderboards, questionnaire scoring, all 75 machines × 35 models — computes locally in your browser from the dataset embedded in this single HTML file. There is no server, no API, and no network request beyond the two web fonts.

How to improve data accuracy further

This tool is fact-checked against primary sources (Apple's own spec pages and regional stores) as of July 2026, but a few things would make it more trustworthy over time:

Re-verify against Apple's live configurator before a purchase decision — RAM/GPU tiers and prices have already shifted once in 2026 due to the memory-chip shortage (see banner), and could shift again.
Treat LLM parameter counts as publisher-reported, not independently audited — official model cards (Hugging Face / Meta / Qwen / DeepSeek / OpenAI) are the ground truth if a figure looks off, especially for newer releases.
MoE active-parameter counts are the least standardized figure in this catalog — different publishers define "active" slightly differently (per-token vs per-forward-pass), so treat those as directional, not exact.
Quantization byte-per-parameter figures are blended averages across common GGUF/MLX quant schemes — the actual bytes/param for a specific quant file can vary a few percent depending on which layers are kept at higher precision.
Regional prices are modeled, not itemized — see the Currency conversion note above for the exact limitation.
If you spot a figure that's drifted from Apple's current published specs, the most reliable fix is to re-pull the relevant support.apple.com tech-spec page (linked below) rather than a secondary aggregator, since aggregators occasionally mix up generations or GPU tiers.

Sources

Mac Studio specifications and USD pricing:

Regional pricing (used to derive currency multipliers):

Selected LLM catalog sources:

Speed-estimate calibration benchmarks: