I've been told that 1.5 times the model size on disk is the general rule of thumb for VRAM requirements when running models, but today I found that it can actually be trickier than that.
[Actual VRAM Used] = [Model Overhead Multiplier] x [Model Size on Disk] + [System Idle Constant]
A = M*S + C
With my RTX 5090 sitting around minding its own business, it idles away at ~4.4/31.5 GB.
So let 4.4 be our Constant, C.
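Just to make the bookkeeping explicit, here's that estimate as a tiny Python sketch (the estimate_vram helper name and the 1.5 default are my own placeholders for illustration, not anything official):

# Rough VRAM estimate: A = M*S + C
# M = overhead multiplier (the rumored 1.5), S = size on disk in GB,
# C = idle VRAM on this machine (~4.4 GB on my 5090).
def estimate_vram(size_on_disk_gb, multiplier=1.5, idle_gb=4.4):
    return multiplier * size_on_disk_gb + idle_gb

print(round(estimate_vram(2.2), 1))   # phi3:latest -> 7.7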
Now, here are some models to test out.
Okay. Let's test our theory first with phi3:latest. S = 2.2 GB.
A = 1.5*2.2 + 4.4
A = 7.7
Okay. So we should observe the 5090 load up to 7.7 GB of VRAM utilization with phi3:latest.
What's this? 10.2 GB of VRAM is its actual requirement? That's a bit more than x1.5. What about others?
Estimated A = 7.7
Actual A = 10.2?
Clearly, 1.5*model size is incorrect then… let's see, if A = M*S + C, then in that case
Actual M = ([Actual VRAM Used] − [System Idle Constant]) / [Model Size on Disk]
Actual M = (10.2 − 4.4)/2.2 = 2.6364
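The same rearrangement in Python, just to have it handy (again a sketch, plugging in the phi3 numbers from above):

# Rearranged: M = (A - C) / S, backing the overhead multiplier out of
# what the GPU actually reported.
def actual_multiplier(actual_vram_gb, size_on_disk_gb, idle_gb=4.4):
    return (actual_vram_gb - idle_gb) / size_on_disk_gb

print(round(actual_multiplier(10.2, 2.2), 4))   # phi3:latest -> 2.6364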
And here I was promised efficiency? Wait, let's check some others.
What about llama2:13b then?
Estimated A = 1.5*7.4 + 4.4
Estimated A = 15.5
Huh? 18.7? But in that case the multiplier was
Actual M = (18.7 − 4.4)/7.4 = 1.93
So the multiplier is more like x1.9 there?
Okay, let's go big with Gemma3:12b then.
Estimated A = 1.5*8.1 + 4.4 = 16.55
Actual A = 14.3
Actual M = (14.3 − 4.4)/8.1 = 1.222
Huh. I guess Gemma3:12b is pretty efficient then.
Okay, let's go even bigger with Gemma3:27b-it-qat then.
Estimated A = 1.5*18 + 4.4 = 31.4
Interesting. ASSUMING its hidden overhead bloats it by 1.5, this will BARELY fit into our state-of-the-art GPU. If its overhead multiplier is MORE than that, it's about to brick this machine.
Actual A = 24
Actual M = (24 − 4.4)/18 = 1.0889
Huh. It's actually very efficient! Must be the magic of quantization?
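Putting the four measured models side by side (these are just my numbers from above, recomputed in one place as a quick Python sketch):

# (size on disk in GB, actual VRAM in GB) as observed above; idle ~4.4 GB.
measurements = {
    "phi3:latest":       (2.2, 10.2),
    "llama2:13b":        (7.4, 18.7),
    "gemma3:12b":        (8.1, 14.3),
    "gemma3:27b-it-qat": (18.0, 24.0),
}
IDLE_GB = 4.4
for name, (size_gb, actual_gb) in measurements.items():
    print(f"{name:20s} multiplier ~ {(actual_gb - IDLE_GB) / size_gb:.2f}")
# phi3 ~2.64, llama2:13b ~1.93, gemma3:12b ~1.22, gemma3:27b-it-qat ~1.09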
For the record, I also tested yi:34b, mistral:latest, and deepseek-coder:33b for their actual VRAM usage against their size on disk and their efficiency as well.
So clearly there's more to it than a simple adage of "take 1.5 and multiply it by the size on disk." A model is going to take up its own size plus some kind of overhead, but there's obviously more to understand before I can make an accurate estimate. It's definitely a metric to consider; efficiency in VRAM usage will ultimately lead to efficiencies in cost, after all. More to explore later.
QUICK TIP:
If you want to explore this for yourself, be sure to keep this Linux command in mind when working with Ollama:
curl http://localhost:11434/api/generate -d '{"model": "MODEL_NAME", "keep_alive": 0}'
Example response: {"model":"yi:34b","created_at":"2025-05-23T03:57:25.9162788Z","response":"","done":true,"done_reason":"unload"}
Normally Ollama will free the memory after five minutes. The command above is how to manually unload a model right away by setting "keep_alive" to 0.
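If you'd rather script the unload, the same call works from Python (assuming the requests library and the default Ollama port; the endpoint and payload are identical to the curl above):

import requests

# Ask Ollama to unload MODEL_NAME immediately by setting keep_alive to 0.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "MODEL_NAME", "keep_alive": 0},
)
print(resp.json())   # look for "done_reason": "unload" in the response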