The Local LLM Hardware Calculator estimates how much GPU VRAM and system RAM you need to run any open-source language model on your own hardware. Select a model size and quantization level to see which consumer and professional GPUs can handle it.
Model Configuration
Hardware Requirements
Configure model and click Calculate
GPU Compatibility
Run a calculation to see GPU compatibility.
How to Use the Local LLM Hardware Calculator
Running a local LLM lets you keep data private, avoid API costs, and iterate quickly without network latency. The biggest constraint is GPU VRAM — models must fit in GPU memory to run at full speed. This local LLM hardware calculator tells you exactly how much VRAM each model configuration requires and which GPUs can handle it.
Step 1: Choose Your Model Size
Select from common sizes like 7B, 13B, or 70B, or enter a custom parameter count. Larger models generally produce better output but need more VRAM. For most home use cases, a 7B–13B model at Q4_K_M quantization hits the sweet spot between quality and hardware requirements.
Step 2: Pick a Quantization Level
Quantization compresses the model weights, trading a small amount of quality for dramatically less VRAM. Q4_K_M is the most popular choice — it averages about 4.5 bits (~0.56 bytes) per parameter, cutting memory usage to roughly one-seventh of FP32 with minimal quality degradation (typically <2% on benchmarks). FP16 gives the best quality but doubles VRAM needs versus Q8.
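The bytes-per-parameter figures above translate directly into a weight-memory estimate. A minimal sketch (the per-quant values are approximate averages for GGUF-style quants, not exact file sizes):

```python
# Approximate bytes per parameter for common precision/quantization levels.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.56,  # ~4.5 bits per weight on average (approximation)
}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Approximate weight memory in GiB for a given model size and quant."""
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] / 1024**3

for quant in BYTES_PER_PARAM:
    print(f"7B at {quant}: {weight_memory_gb(7, quant):.1f} GB")
```

This covers weights only; KV cache and runtime overhead come on top, as the next steps describe.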
Step 3: Set Context Length and Batch Size
The KV cache grows linearly with both context length and batch size. A 7B-class model with grouped-query attention at 4K context uses about 0.5 GB for KV cache, but at 32K context that jumps to 4 GB. Batch size multiplies KV cache linearly — serving 8 concurrent users needs 8× the KV cache VRAM of a single user.
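These numbers follow from the KV cache formula: two tensors (K and V) per layer, per token, per batch element. A rough sketch, assuming an FP16 cache and a Llama-3-8B-style configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128 — illustrative assumptions):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int = 1,
                bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: K and V tensors per layer, per token, per batch."""
    total = (2 * n_layers * n_kv_heads * head_dim
             * bytes_per_elem * context_len * batch_size)
    return total / 1024**3

# 32 layers, 8 KV heads (grouped-query attention), head dim 128:
print(kv_cache_gb(32, 8, 128, 4096))                 # 0.5 GB at 4K context
print(kv_cache_gb(32, 8, 128, 32768))                # 4.0 GB at 32K context
print(kv_cache_gb(32, 8, 128, 4096, batch_size=8))   # 8 users -> 4.0 GB
```

Note that older models without grouped-query attention (e.g. Llama 2 7B, with 32 KV heads) need 4× this cache at the same context length.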
Reading the GPU Compatibility Table
Green checkmarks show GPUs with sufficient VRAM. Red marks indicate GPUs that won't fit the model. If no single GPU works, consider CPU offloading with llama.cpp (much slower) or multi-GPU setups using tensor parallelism. Apple Silicon unified memory acts as both RAM and VRAM, making M2 Ultra (192 GB) uniquely capable for large models.
Practical Quick Reference
7B model at Q4_K_M: ~4.5 GB VRAM — fits an RTX 3060 (12 GB). 13B at Q4_K_M: ~8 GB VRAM — fits an RTX 4070 (12 GB). 70B at Q4_K_M: ~38 GB VRAM — requires A100 80GB or multi-GPU setup. All estimates include ~500 MB system overhead.
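Combining the pieces (weights, KV cache, and fixed overhead) reproduces the quick-reference numbers above. A sketch; the 0.56 bytes/parameter figure for Q4_K_M and the 0.5 GB overhead are approximations:

```python
def total_vram_gb(params_billion: float, bytes_per_param: float,
                  kv_cache_gb: float, overhead_gb: float = 0.5) -> float:
    """Total VRAM estimate in GiB: weights + KV cache + system overhead."""
    weights_gb = params_billion * 1e9 * bytes_per_param / 1024**3
    return weights_gb + kv_cache_gb + overhead_gb

# 7B at Q4_K_M (~0.56 bytes/param) with a 0.5 GB KV cache:
print(f"{total_vram_gb(7, 0.56, 0.5):.1f} GB")  # close to the ~4.5 GB estimate
```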
FAQ
How much VRAM do I need for Llama 3 8B?
At Q4_K_M quantization (the sweet spot), Llama 3 8B needs roughly 5-6 GB of VRAM. At FP16, you'd need about 16 GB. A 12 GB GPU like the RTX 3060 or RTX 4070 comfortably handles Q4_K_M, while FP16 requires a 24 GB card like the RTX 4090 or A5000.
How much VRAM do I need for a 70B model?
At Q4_K_M quantization, a 70B model needs around 38-40 GB of VRAM. This exceeds all consumer GPUs, so you'd need a data center GPU (A100 80GB or H100 80GB) or split across multiple consumer GPUs. Alternatively, use Apple Silicon M2 Ultra (192 GB unified memory) for full FP16 inference.
What is quantization and how does it affect performance?
Quantization reduces model precision to save memory. FP32 uses 4 bytes per parameter, FP16 uses 2 bytes, Q8 uses 1 byte, and Q4 uses about 0.5 bytes. Lower precision means less VRAM but slightly lower quality. Q4_K_M is the recommended balance: roughly 75% VRAM savings versus FP16 with minimal quality loss, typically under 1-2% on benchmarks.
Can I run a large model split across multiple GPUs?
Yes. Tools like llama.cpp and vLLM can split layers across multiple GPUs via tensor parallelism. Two RTX 4090s (48 GB combined) can run Llama 70B at Q4_K_M. Adding GPUs raises total capacity and throughput, but inter-GPU communication adds overhead, so NVLink or other high-bandwidth interconnects help significantly where available.
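A back-of-the-envelope check on the two-GPU example, assuming weights split evenly across GPUs plus a fixed per-GPU runtime overhead (the 1 GB overhead figure is an assumption):

```python
def per_gpu_vram_gb(model_gb: float, n_gpus: int,
                    per_gpu_overhead_gb: float = 1.0) -> float:
    """Even tensor-parallel split of model memory, plus per-GPU overhead."""
    return model_gb / n_gpus + per_gpu_overhead_gb

# 70B at Q4_K_M (~38 GB total) across two 24 GB RTX 4090s:
need = per_gpu_vram_gb(38, 2)
print(f"{need:.0f} GB per GPU; fits a 24 GB card: {need <= 24}")
```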
How accurate are these VRAM estimates?
These are close approximations. Actual usage varies by implementation (llama.cpp vs vLLM vs Transformers), batch size, and system overhead. Expect real usage to be within 10-15% of these estimates. Always leave 1-2 GB headroom for OS and driver overhead.
Can I use system RAM instead of VRAM?
Yes, if your GPU doesn't have enough VRAM, frameworks like llama.cpp can offload layers to CPU RAM (using the -ngl flag to control how many layers go to GPU). This works but is much slower — expect 5-20x slower inference depending on how much is offloaded to RAM.
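You can estimate a reasonable -ngl value from the weight size and layer count. A rough sketch assuming weights divide evenly across layers (the 13B figures and the 1.5 GB reserve are illustrative assumptions; llama.cpp's actual per-layer sizes vary):

```python
def max_gpu_layers(total_layers: int, weights_gb: float,
                   vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM (the rest stay in CPU RAM),
    assuming the model's weights spread evenly across its layers."""
    per_layer_gb = weights_gb / total_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(total_layers, fit))

# 13B at Q4_K_M (~7.3 GB weights, 40 layers) on an 8 GB GPU:
print(max_gpu_layers(40, 7.3, 8.0))  # candidate value to pass via -ngl
```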
Is this tool free?
Yes, completely free with no signup required. All calculations run in your browser.