zai-org/GLM-4.5-Air · Hugging Face
LocalLLaMA
I'm just gonna try vLLM; seems like ik_llama.cpp doesn't have a quick Docker method.
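(For anyone else trying the same thing, here's a rough sketch of the vLLM offline API. The model name is just the one from the thread title, and `tensor_parallel_size` is a placeholder for however many GPUs you actually have; this isn't from the thread, just an illustration.)

```python
# Minimal vLLM offline-inference sketch (illustrative only).
# Assumes vLLM is installed and your GPUs can actually hold the weights.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.5-Air",  # repo from the thread title
    tensor_parallel_size=4,       # placeholder: set to your GPU count
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain mixture-of-experts models in one paragraph."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```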
It should work in any generic CUDA container, but yeah, it's more of a hobbyist engine. Honestly I just run it raw since it's dependency-free, except for system CUDA.
vLLM absolutely cannot CPU-offload AFAIK, but small models will fit in your VRAM with room to spare.
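(If you do go the vLLM route with something small, the knobs that mostly decide whether it fits in VRAM are `gpu_memory_utilization` and `max_model_len`. Rough sketch below; the model name is just an example of a "small model", not something anyone in the thread mentioned.)

```python
# Sketch: keeping a small model entirely in VRAM with vLLM (illustrative).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # hypothetical small model, not from the thread
    gpu_memory_utilization=0.90,       # fraction of VRAM vLLM reserves for weights + KV cache
    max_model_len=8192,                # shorter context -> smaller KV cache -> more headroom
)

print(llm.generate(["Hello"])[0].outputs[0].text)
```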