LLM VRAM Calculator
This is the Hugging Face version of the calculator found at https://novaml.ai/vram/.
Inputs:
  HuggingFace Model Path (Load Model populates the specifications below)
  HF Token (Optional)
  Your Hardware (Optional)
Model Specifications (each field shows "-" until a model is loaded):
  Parameters
  Hidden Size
  Layers
  Attn Heads
Inference Configuration
🔒 = recalculate while keeping this parameter fixed
Quantization Method (see the sketch after this list for how bpw maps to weight memory):
  ✓ Optimal
  FP16 (16.0 bpw)
  Q8_0 (8.5 bpw)
  Q6_K (6.59 bpw)
  Q5_K_M (5.69 bpw)
  Q5_K_S (5.54 bpw)
  Q4_K_M (4.85 bpw)
  Q4_K_S (4.58 bpw)
  Q4_0 (4.55 bpw)
  Q3_K_M (3.91 bpw)
  Q3_K_S (3.5 bpw)
  Q2_K (3.35 bpw)
  IQ3_XXS (3.06 bpw)
  IQ2_XXS (2.06 bpw)
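As a rough guide, a bits-per-weight figure translates into weight memory as parameters × bpw / 8 bytes. A minimal Python sketch of that arithmetic, where the helper name and the 7B example are illustrative, not the calculator's actual code:

```python
# Weight-memory estimate: parameters * bits-per-weight / 8 = bytes.
# Illustrative sketch; the calculator's exact accounting may differ.
def weight_memory_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9  # bytes -> GB (decimal)

# Example: a 7B-parameter model at Q4_K_M (4.85 bpw)
print(f"{weight_memory_gb(7e9, 4.85):.2f} GB")  # ~4.24
```

The same 7B model at Q8_0 (8.5 bpw) would need roughly 7.44 GB, which is why stepping down one quantization tier is often what decides whether a model fits on a given card.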
KV Cache Precision (combined with context length and batch size in the sketch below):
  ✓ Optimal
  FP16 (Standard)
  Q8_0 (Compressed)
  Q4_0 (Highly Compressed)
Context Length (✓ Optimal by default)
Batch Size
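KV Cache Precision, Context Length, and Batch Size together determine the Context portion of the estimate. A minimal sketch, assuming plain multi-head attention where the cache width equals the hidden size (models with grouped-query attention cache proportionally less); the shapes in the example are Llama-2-7B-like and purely illustrative:

```python
# KV-cache estimate: 2 tensors (K and V) per layer, one entry per token.
# Assumes plain multi-head attention, i.e. cache width == hidden size.
def kv_cache_gb(n_layers: int, hidden_size: int, context: int,
                batch_size: int = 1, bytes_per_elem: float = 2.0) -> float:
    # bytes_per_elem: FP16 = 2.0; assuming the bpw figures above carry
    # over, Q8_0 ~ 8.5/8 = 1.06 and Q4_0 ~ 4.55/8 = 0.57.
    return 2 * n_layers * hidden_size * context * batch_size * bytes_per_elem / 1e9

# Example: 32 layers, 4096 hidden size, 8192-token context, FP16 cache
print(f"{kv_cache_gb(32, 4096, 8192):.2f} GB")  # ~4.29
```

Note that the cache grows linearly in both context length and batch size, so long contexts can easily cost more VRAM than the weights themselves.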
Framework (each adds its own overhead; see the sketch under Estimation Results):
  llama.cpp (Efficient)
  ExLlamaV2 (Very Efficient)
  vLLM (Production)
  HuggingFace Transformers (Heavy)
Flash Attention
Vision Adapter
Estimation Results
Estimated Usage: 0.0 GB
Breakdown: Model, Context, Overhead, Overload
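A sketch of how the breakdown terms might combine. The per-framework overhead figures here are invented placeholders that only preserve the ordering implied by the labels above (Very Efficient < Efficient < Production < Heavy); they are not the calculator's actual numbers:

```python
# Total-usage sketch: Model + Context + framework Overhead.
# Overhead values are hypothetical placeholders for illustration only.
FRAMEWORK_OVERHEAD_GB = {
    "ExLlamaV2": 0.5,                 # Very Efficient
    "llama.cpp": 0.8,                 # Efficient
    "vLLM": 1.5,                      # Production
    "HuggingFace Transformers": 3.0,  # Heavy
}

def estimated_usage_gb(model_gb: float, context_gb: float,
                       framework: str = "llama.cpp") -> float:
    return model_gb + context_gb + FRAMEWORK_OVERHEAD_GB[framework]

# Example: 4.24 GB of weights + 4.29 GB of KV cache under llama.cpp
print(f"{estimated_usage_gb(4.24, 4.29):.2f} GB")  # ~9.33
```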
Compatibility Matrix
Filters: All GPUs / Consumer / Datacenter
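The compatibility check itself reduces to comparing the estimate against each GPU's VRAM under the selected filter. A sketch with a tiny hand-picked GPU table; the VRAM figures are the cards' public specs, but the entries are examples, not the calculator's full list:

```python
# Compatibility-matrix sketch: which GPUs fit the estimate?
# Example entries only; the real matrix covers many more cards.
GPUS = {
    "RTX 3060": (12, "Consumer"),
    "RTX 4090": (24, "Consumer"),
    "A100 80GB": (80, "Datacenter"),
    "H100 80GB": (80, "Datacenter"),
}

def compatible(estimate_gb: float, filter_: str = "All GPUs") -> list[str]:
    return [name for name, (vram_gb, category) in GPUS.items()
            if estimate_gb <= vram_gb and filter_ in ("All GPUs", category)]

print(compatible(9.33, "Consumer"))  # ['RTX 3060', 'RTX 4090']
```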