Question 1

What LLMs can run on RTX 4090?

Accepted Answer

12 of the 15 Q4_K_M models in this dataset are marked as fitting on RTX 4090: 11 are marked ✓ and 1 (qwen2.5-32b) is marked Tight. Start with the table rows marked as fitting, then compare VRAM and decode tok/s.
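As a quick check on that count, here is a minimal Python sketch that tallies the Fits column. The `fits_column` list is copied by hand from the table on this page, and all names are illustrative rather than part of any site tooling.

```python
from collections import Counter

# Fits column copied by hand from the table on this page, top to bottom.
# "yes" stands for the check mark, "tight" for Tight, "no" for the cross.
fits_column = ["yes"] * 11 + ["tight"] + ["no"] * 3

counts = Counter(fits_column)

# The answer above counts the tight fit among the models marked as fitting.
marked_as_fitting = counts["yes"] + counts["tight"]

print(counts["yes"], counts["tight"], counts["no"], marked_as_fitting)
```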

Question 2

What is the fastest local LLM on RTX 4090?

Accepted Answer

llama3.2-3b is the fastest listed model on RTX 4090 at 210 decode tokens per second.
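To reproduce that lookup yourself, a small Python sketch along these lines should work. It uses only a few rows copied from the table on this page; the `rows` structure and variable names are assumptions made for illustration.

```python
# A few rows copied from the table on this page: (model, decode tok/s).
# None marks a row with no decode speed listed.
rows = [
    ("llama3.2-3b", 210),
    ("gemma3-4b", 190),
    ("qwen2.5-7b", 155),
    ("llama3.1-70b", None),
]

# Skip rows without a listed speed, then take the highest tok/s.
with_speed = [(model, tps) for model, tps in rows if tps is not None]
fastest_model, fastest_tps = max(with_speed, key=lambda row: row[1])

print(fastest_model, fastest_tps)
```

Filtering out the rows without a speed first matters, because comparing `None` with a number would raise a `TypeError` in Python.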

Question 3

How much VRAM does RTX 4090 have for local LLMs?

Accepted Answer

RTX 4090 has 24GB of VRAM. Models that need about 80% or more of that capacity (roughly 19.2GB and up) are marked as tight fits. In the current table that is qwen2.5-32b at 19.5GB.
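The tight-fit rule can be sketched in Python as below. The page only says "near 80%", so the exact 0.80 cutoff, the `classify_fit` helper, and the three labels are assumptions for illustration, not the site's actual logic.

```python
GPU_VRAM_GB = 24.0      # RTX 4090 capacity, from the answer above
TIGHT_FRACTION = 0.80   # assumed cutoff; the page only says "near 80%"

def classify_fit(model_vram_gb: float) -> str:
    """Return 'fits', 'tight', or 'no' for a model's listed VRAM need."""
    if model_vram_gb > GPU_VRAM_GB:
        return "no"
    if model_vram_gb >= TIGHT_FRACTION * GPU_VRAM_GB:
        return "tight"
    return "fits"

# VRAM figures taken from the table on this page.
for name, vram in [("gemma3-27b", 16.5), ("qwen2.5-32b", 19.5), ("llama3.1-70b", 42.0)]:
    print(name, classify_fit(vram))
```

Under these assumptions the three sample rows come out the same way the Fits column marks them.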

Question 4

Can RTX 4090 run 70B local LLMs?

Accepted Answer

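RTX 4090 is not marked as fitting the listed 70B-class Q4_K_M models in this dataset: llama3.1-70b is listed at 42GB and qwen2.5-72b at 43.5GB, both well above the card's 24GB. Check the fit column before trying larger context windows, because longer contexts increase memory use.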

Question 5

What quantization should I use on RTX 4090?

Accepted Answer

Use Q4_K_M as the baseline for this site. It is the comparison format used across the current benchmark table and keeps VRAM requirements predictable.

Model	Quant	VRAM	Decode tok/s	Data source	Fits
llama3.2-3b	Q4_K_M	2GB	210	Community benchmark	✓
gemma3-4b	Q4_K_M	3GB	190	Community benchmark	✓
qwen2.5-7b	Q4_K_M	4.7GB	155	Community benchmark	✓
mistral-7b	Q4_K_M	4.7GB	152	Community benchmark	✓
llama3.1-8b	Q4_K_M	5GB	148	Community benchmark	✓
deepseek-r1-7b	Q4_K_M	4.7GB	145	estimate	✓
gemma3-12b	Q4_K_M	7.5GB	95	Community benchmark	✓
qwen2.5-14b	Q4_K_M	9GB	92	Community benchmark	✓
phi-4-14b	Q4_K_M	9GB	88	Community benchmark	✓
mistral-24b	Q4_K_M	14.4GB	58	Community benchmark	✓
gemma3-27b	Q4_K_M	16.5GB	48	Community benchmark	✓
qwen2.5-32b	Q4_K_M	19.5GB	44	Community benchmark	Tight
llama3.1-70b	Q4_K_M	42GB	—	estimate	✗ No
nemotron-51b	Q4_K_M	30.6GB	—	estimate	✗ No
qwen2.5-72b	Q4_K_M	43.5GB	—	estimate	✗ No
