How to calculate the cost per output token of a local model compared to enterprise model API access
-
You're good. I'm trying to get larger context windows on my models, so I'm working out how to balance that against token throughput. I do appreciate your insights into the different use cases.
Have you tried larger 70B models? Or compared against larger MoE models?
I have not tried any models larger than a very low quant Qwen 32B. My personal limit for partial offloading speed is 1 tps, and the 32B models encroach on that. Once I get my VRAM upgraded from 8 GB to 16-24 GB I'll test the waters with higher parameter counts and hit some new limits to benchmark.
I haven't tried out MoE models either, though I keep hearing about them. AFAIK they're popular because you can do advanced partial offloading strategies between the different experts to really bump token generation speed. So playing around with them has been on my ML bucket list for a while.
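As a very rough feel for why the VRAM jump matters, here's a back-of-the-envelope sketch. The quant density and overhead numbers are assumptions, not measurements:

```python
# Back-of-the-envelope: how much of a quantized 32B model fits in VRAM.
# Quant density (bits per weight) and overhead are assumed, not measured.

def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate quantized model size: parameters x bits per weight."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def gpu_fraction(params_b: float, bits_per_weight: float, vram_gb: float,
                 overhead_gb: float = 1.5) -> float:
    """Rough fraction of the model that fits on GPU after reserving some
    VRAM for context and runtime overhead."""
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(usable / model_size_gb(params_b, bits_per_weight), 1.0)

for vram in (8, 16, 24):
    frac = gpu_fraction(params_b=32, bits_per_weight=4.5, vram_gb=vram)
    print(f"{vram:>2} GB VRAM -> roughly {frac:.0%} of a 32B ~Q4 model on GPU")
```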
-
I do all my local LLM-ing on an M1 Max MacBook Pro with a power draw of around 40-60 watts (which for my use cases is probably about 10 minutes a day in total). I definitely believe we can be more efficient running these models at home.
I wish I'd sprung for the Max when I bought my M1 Pro, but I am glad I splurged on memory. Really, aside from LLM workloads, this thing is still excellent.
Agree we can be doing a lot more; the recent generation of local models is fantastic.
Gemma 3n and Phi 4 (non-reasoning) are my local workhorses lately.
-
Neat, I'd like to toss the numbers for my 3090 and 3080 in there.
I would recommend you get a cheap wattage meter that plugs in between the wall outlet and the PSU powering your cards, for $10-15 (the $30 name-brand Kill A Watts are overpriced and unneeded IMO). You can try to get rough approximations by doing some math with your cards' listed TDP specs added together, but that doesn't account for the motherboard, CPU, RAM, drives, and so on, or the real change between idle and load. With a meter you can just watch the total power draw with all that stuff factored in, and note the increase and the maximum as your rig inferences a bit. Then you have the comfort of being reasonably confident in the actual numbers, and you can plug the values into a calculation.
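Here's a minimal sketch of that calculation in Python. The wattage delta, generation speed, electricity rate, and API price below are placeholders, so substitute your own meter readings and current pricing:

```python
# All numbers here are placeholders; plug in your own meter readings and rates.

def local_cost_per_million_tokens(watts_under_load: float,
                                  tokens_per_second: float,
                                  electricity_usd_per_kwh: float) -> float:
    """Electricity cost to generate one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts_under_load * seconds / 3600 / 1000
    return kwh * electricity_usd_per_kwh

local = local_cost_per_million_tokens(
    watts_under_load=350,          # wall-meter delta: load minus idle
    tokens_per_second=10,          # measured generation speed
    electricity_usd_per_kwh=0.15,  # your utility rate
)
api = 0.60  # example API price per million output tokens; check current pricing
print(f"local: ${local:.2f} per 1M tokens vs API: ${api:.2f} per 1M tokens")
```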
-
I have not tried any models larger than a very low quant Qwen 32B. My personal limit for partial offloading speed is 1 tps, and the 32B models encroach on that. Once I get my VRAM upgraded from 8 GB to 16-24 GB I'll test the waters with higher parameter counts and hit some new limits to benchmark.
I haven't tried out MoE models either, though I keep hearing about them. AFAIK they're popular because you can do advanced partial offloading strategies between the different experts to really bump token generation speed. So playing around with them has been on my ML bucket list for a while.
Dude! That's so dope. I would really like your insights into how you tuned MoE. That would be a game changer, as you can swap out unnecessary layers from the GPU and still get the benefit of using a bigger model and stuff.
Yeah, it's a little hard to do inference in these limited-VRAM situations with larger contexts. That's a massive pain.
-
I don't have a lot of knowledge on the topic, but I'm happy to point you in a good direction for reference material. I first heard about tensor layer offloading here a few months ago. That post links to another on MoE expert layer offloading, which it was based off. I highly recommend you read through both posts.
The gist of the tensor override strategy is: instead of offloading entire layers with --gpulayers, you use --overridetensors to keep specific large tensors (particularly FFN tensors) on the CPU while moving everything else to the GPU.
This works because:
- Attention tensors: Small, benefit greatly from GPU parallelization
- FFN tensors: Large, can be efficiently processed on CPU with basic matrix multiplication
You need to figure out exactly which tensors to offload for your model by looking at the weights and cooking up a regex, as described in the post.
Here's an example of KoboldCpp startup flags for doing this. The key part is the --overridetensors flag and the regex contained in it:
```
python ~/koboldcpp/koboldcpp.py --threads 10 --usecublas --contextsize 40960 --flashattention --port 5000 --model ~/Downloads/MODELNAME.gguf --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU"
...
[18:44:54] CtxLimit:39294/40960, Amt:597/2048, Init:0.24s, Process:68.69s (563.34T/s), Generate:56.27s (10.61T/s), Total:124.96s
```
The exact specifics of how you determine which tensors to target for each model, and the associated regex, are a little beyond my knowledge, but the people who wrote the tensor post did a good job explaining that process in detail. Hope this helps.
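To make the regex part a bit more concrete, here's a quick sanity check in Python of which tensor names that example pattern would pin to the CPU. The tensor names are just illustrative, following the usual GGUF blk.N.* naming; the real list depends on your model:

```python
import re

# The pattern portion of the --overridetensors example above (before "=CPU").
# It targets ffn_up tensors of odd-numbered layers up to 39.
pattern = re.compile(r"\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up")

sample_tensors = [
    "blk.0.attn_q.weight",
    "blk.1.ffn_up.weight",
    "blk.2.ffn_up.weight",
    "blk.13.ffn_up.weight",
    "blk.24.ffn_up.weight",
    "blk.39.ffn_up.weight",
]

for name in sample_tensors:
    placement = "CPU" if pattern.search(name) else "GPU"
    print(f"{name:24s} -> {placement}")
```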
-
Damn! Thank you so much. This is very helpful and a great starting point for me to mess about with and make the most of my LLM setup. Appreciate it!!
-
Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090.
If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze in 128K context from a 32B in 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.
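To get a feel for why quantized KV cache matters at that context length, here's a rough sizing sketch. The model dimensions (64 layers, 8 KV heads, head dim 128) are assumptions for a typical ~32B GQA model, not the exact config of any particular one:

```python
# Rough KV-cache sizing; model dimensions are assumed, check your model's config.

def kv_cache_gb(context_len: int, n_layers: int = 64, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """K + V cache: 2 tensors x layers x KV heads x head dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

for label, bpe in (("FP16 cache", 2.0), ("Q4-ish cache", 0.5)):
    print(f"{label}: ~{kv_cache_gb(131072, bytes_per_elem=bpe):.0f} GB at 128K context")
```

With a ~4 bpw 32B taking roughly 16-18 GB of weights, a full-precision cache alone would blow well past 24 GB, while a heavily quantized cache leaves just enough room, which is consistent with the numbers above.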
-
I need to mess with TabbyAPI. It doesn't help that there are like two Tabbys: one is TabbyAPI and the other is TabbyML.
I am guessing tool support is still in its infancy.
-
Tabby supports tool usage. It's all just prompting to the underlying LLM, so you can get some frontend to hit the API and do whatever is needed, but I think it does have some kind of native prompt wrapper too.
It is confusing because there are two TabbyAPI formats now: exl2, which is older and more mature (but now unsupported) and optimal around 4-5 bpw, and exl3, which is optimal down to ~3 bpw (and usable even below that) but slower on some GPUs.
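As a concrete example of "some frontend hitting the API", here's a minimal sketch of posting a tool definition to a local OpenAI-compatible chat completions endpoint. The port, model name, and get_weather tool are placeholders, and whether you get proper structured tool calls back depends on the loaded model and the server's config:

```python
import requests

# Placeholder endpoint/model/tool; adjust to your local server's host, port,
# and loaded model. Tool-call quality depends entirely on the model.
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

resp = requests.post("http://localhost:5000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"])
```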
-
Thank you for all your insight!! This is really helpful