RTX 4090

24GB VRAM  ·  Ada Lovelace (CUDA)

ℹ️ VRAM estimates assume GGUF format and 2048-token context. Running at 8k+ context adds 2–4GB. Tokens/sec = generation (decode) speed only.
| Model | Quant | VRAM | Tok/s | Data | Fits |
|---|---|---|---|---|---|
| llama3.2-3b | Q4_K_M | 2GB | 210 | verified | |
| gemma3-4b | Q4_K_M | 3GB | 190 | verified | |
| qwen2.5-7b | Q4_K_M | 4.7GB | 155 | verified | |
| mistral-7b | Q4_K_M | 4.7GB | 152 | verified | |
| llama3.1-8b | Q4_K_M | 5GB | 148 | verified | |
| deepseek-r1-7b | Q4_K_M | 4.7GB | 145 | estimate | |
| gemma3-12b | Q4_K_M | 7.5GB | 95 | verified | |
| qwen2.5-14b | Q4_K_M | 9GB | 92 | verified | |
| phi-4-14b | Q4_K_M | 9GB | 88 | verified | |
| mistral-22b | Q4_K_M | 14GB | 58 | verified | |
| gemma3-27b | Q4_K_M | 16.5GB | 48 | verified | |
| qwen2.5-32b | Q4_K_M | 19.5GB | 44 | verified | Tight |
| nemotron-34b | Q4_K_M | 20.7GB | 38 | estimate | Tight |
| llama3.1-70b | Q4_K_M | 42GB | | estimate | ✗ No |
| qwen2.5-72b | Q4_K_M | 43.5GB | | estimate | ✗ No |
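
The Fits column and the long-context note can be reproduced with a little arithmetic. The Python sketch below is illustrative only: the page does not say how it decides that a model is "Tight", so the 80% threshold (`TIGHT_FRACTION`) is an assumption chosen to agree with the rows above, and the function names `classify_fit` and `long_context_range` are hypothetical. The 24GB capacity and the 2–4GB long-context overhead come from the page itself.

```python
# Sketch: classify whether a model's estimated VRAM fits on a 24GB card.
GPU_VRAM_GB = 24.0                   # RTX 4090 capacity, from the page header
TIGHT_FRACTION = 0.80                # ASSUMPTION: the page does not state its "Tight" rule
LONG_CONTEXT_EXTRA_GB = (2.0, 4.0)   # from the note: 8k+ context adds 2-4GB


def classify_fit(model_vram_gb, gpu_vram_gb=GPU_VRAM_GB,
                 tight_fraction=TIGHT_FRACTION):
    """Return 'no', 'tight', or 'yes' for a given VRAM estimate in GB."""
    if model_vram_gb > gpu_vram_gb:
        return "no"      # the model does not fit in VRAM at all
    if model_vram_gb > tight_fraction * gpu_vram_gb:
        return "tight"   # fits, but with little headroom left
    return "yes"


def long_context_range(model_vram_gb, extra=LONG_CONTEXT_EXTRA_GB):
    """Return the (low, high) VRAM estimate in GB at 8k+ context."""
    low, high = extra
    return (model_vram_gb + low, model_vram_gb + high)


if __name__ == "__main__":
    # VRAM figures copied from the table (Q4_K_M, 2048-token context).
    rows = [("gemma3-27b", 16.5), ("qwen2.5-32b", 19.5),
            ("nemotron-34b", 20.7), ("llama3.1-70b", 42.0)]
    for name, vram in rows:
        low, high = long_context_range(vram)
        print(f"{name}: {classify_fit(vram)} at 2048 tokens, "
              f"{low:.1f}-{high:.1f}GB at 8k+ context "
              f"({classify_fit(high)} at the high end)")
```

Under these assumptions, a model listed without a warning at 2048 tokens can still become tight once the long-context overhead is added, so it is worth rechecking the estimate for the context length you plan to run.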
Last updated: 2026-06-11