Consumer GPUs to run LLMs

[email protected]

I would prefer to have GPUs for under $600 if possible

Unfortunately that's not possible for a new Nvidia card (you want CUDA) with 16GB VRAM. You can get them for ~$750 if you're patient. This deal was available for a while earlier today:
https://us-store.msi.com/Graphics-Cards/NVIDIA-GPU/GeForce-RTX-50-Series/GeForce-RTX-5070-Ti-16G-SHADOW-3X-OC
Or you could try to find a 16GB 4070 Ti Super like I got. It runs DeepSeek 14B and stuff like Stable Diffusion no problem.

[email protected]

I got it working with my 6800 XT. I'm running DeepSeek R1 14B (somewhere around there) and DeepSeek Coder V2. I have a link to a blog with those instructions:

https://gotosocial.michaeldileo.org/@mdileo/statuses/01JQA4M4Q33PMCADH9M2AWQSS8

[email protected]

Using a 7900 XTX with LMS. Speeds are everywhere, driver dependent. With QwQ-32B-Q4_K_M, I got about 20 tok/s, with all VRAM filled.

[email protected]

I tried to run Gemma 3 27B Q4K and was surprised how quickly the VRAM requirements blew up in proportion to the context window, especially compared to other models (all quantized) of similar size like QwQ 32B.
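Most of that growth is usually the K/V cache, which scales linearly with context length and also with the layer count and the number of K/V heads. Here is a rough sketch of the arithmetic, assuming a plain transformer with an fp16 cache and no sliding-window attention or cache quantization. The two configurations are hypothetical placeholders for illustration, not verified specs of Gemma 3 or QwQ.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate K/V cache size in bytes.

    One K vector and one V vector (hence the factor 2) are stored
    per layer, per K/V head, per token in the context.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical configurations (assumptions, not real model specs).
configs = {
    "model A (8 KV heads)": dict(n_layers=64, n_kv_heads=8, head_dim=128),
    "model B (16 KV heads)": dict(n_layers=64, n_kv_heads=16, head_dim=128),
}

if __name__ == "__main__":
    for name, cfg in configs.items():
        for ctx in (4096, 16384, 32768):
            size = kv_cache_bytes(context_len=ctx, **cfg)
            print(f"{name}, context {ctx}: {size / GIB:.1f} GiB")
```

Under these assumptions, doubling the number of K/V heads doubles the cache at every context length, which is one way two models of similar parameter count can end up with very different VRAM needs.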

[email protected]

Do you have 2 PCIe x16 slots on your motherboard (speaking in terms of electrical connections)?

[email protected]

Thank you. Are 14B models the biggest you can run comfortably?

[email protected]

I am OK with either Nvidia or AMD, especially if Ollama supports it. With that said, I have heard that AMD takes some manual effort whilst Nvidia is easier. It depends on how difficult ROCm is.

[email protected]

I don't mind multiple GPUs, but my motherboard doesn't have 2+ electrically connected x16 slots. I could build a new homeserver (I've been thinking about it), but consumer platforms simply don't have the PCIe lanes for 2 actual x16 slots. I'd have to go back to Broadwell Xeons for that, which are really power hungry. Oh well, I don't think it matters considering how power hungry GPUs are now.

[email protected]

The 7900 XTX was $1000 when it launched. I wouldn't mind a used one either.

[email protected]

The coder model has only that one. The ones bigger than that are like 20GB+, and my GPU has 16GB. I've only tried two models, but it looked like the size balloons after that, so those may be the biggest models that I can run.

[email protected]

Wait how does that work? How is 24GB enough for a 38B model?

[email protected]

Do you have any recommendations for running the Mistral Small model? I'm very interested in it alongside CodeLlama, Oobabooga and others.

[email protected]

I didn't know that. I thought it was just one ROCm binary to install, then run Ollama and that's it. Thanks for the explanation.

[email protected]

I haven't looked into the issue of PCIe lanes and the GPU.

I don't think it should matter with a smaller PCIe bus, in theory, if I understand correctly (unlikely). The only time a lot of data is transferred is when the model layers are initially loaded. Like with Oobabooga when I load a model, most of the time my desktop RAM monitor widget does not even have the time to refresh and tell me how much memory was used on the CPU side. What is loaded in the GPU is around 90% static. I have a script that monitors this so that I can tune the maximum number of layers. I leave overhead room for the context to build up over time but there are no major changes happening aside from initial loading. One just sets the number of layers to offload on the GPU and loads the model. However many seconds that takes is irrelevant startup delay that only happens once when initiating the server.

So assuming the kernel modules and hardware support the narrower link, it should work... I think. There are laptops that have options for an external Thunderbolt GPU too, so I don't think the PCIe bus is too baked in.
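The layer-tuning step described above can be sketched roughly as follows. This is not the poster's script, just a minimal illustration that assumes all layers are about the same size and that a fixed amount of VRAM is held back for the context. The function name and the example numbers are hypothetical.

```python
GIB = 1024 ** 3


def max_offload_layers(model_bytes, n_layers, vram_bytes, headroom_bytes):
    """Estimate how many layers fit on the GPU.

    Assumes every layer takes model_bytes / n_layers and that
    headroom_bytes of VRAM is reserved for the context and buffers.
    """
    budget = vram_bytes - headroom_bytes
    if budget <= 0:
        return 0
    # Integer form of budget // (model_bytes / n_layers), capped at n_layers.
    return min(n_layers, budget * n_layers // model_bytes)


if __name__ == "__main__":
    # Hypothetical 19 GiB model with 64 layers, 3 GiB kept free for context.
    for vram_gib in (16, 24):
        layers = max_offload_layers(19 * GIB, 64, vram_gib * GIB, 3 * GIB)
        print(f"{vram_gib} GiB card: offload about {layers} of 64 layers")
```

In practice the result is only a starting point. Watching actual VRAM use while the context fills up, as described above, is what settles the final number.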

[email protected]

I haven't tried those, so not really, but with Open WebUI you can download and run anything. Just make sure it fits in your VRAM so it doesn't run on the CPU. The DeepSeek one is decent. I find that I like ChatGPT 4o better, but it's still good.

[email protected]

In general how much VRAM do I need for 14B and 24B models?

[email protected]

It really depends on how you quantize the model and the K/V cache as well. This is a useful calculator: https://smcleod.net/vram-estimator/ I can comfortably fit most 32B models quantized to 4-bit (usually Q4_K_M or IQ4_XS) on my 3090's 24 GB of VRAM with a reasonable context size. If you're going to need a much larger context window to input large documents etc., then you'd need to go smaller with the model size (14B, 27B etc.) or get a multi-GPU setup or something with unified memory and a lot of RAM (like the Mac Minis others are mentioning).
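For a quick sanity check without the calculator, the weights alone take roughly parameters times bits per weight divided by 8. The sketch below assumes about 4.5 bits per weight for a typical 4-bit quant and a flat 4 GB reserve for the K/V cache and runtime buffers. Both numbers are rough assumptions rather than measured values, and the function names are made up for the example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, reserved_gb=4.0):
    """True if weights plus a flat reserve (an assumption) fit in VRAM."""
    return weight_gb(params_billion, bits_per_weight) + reserved_gb <= vram_gb


if __name__ == "__main__":
    # Bits-per-weight values are approximations; real quant formats vary.
    for params in (14, 24, 32):
        for label, bpw in (("4-bit", 4.5), ("8-bit", 8.5), ("fp16", 16)):
            size = weight_gb(params, bpw)
            print(f"{params}B {label}: ~{size:.1f} GB weights, "
                  f"fits 16 GB: {fits_in_vram(params, bpw, 16)}, "
                  f"fits 24 GB: {fits_in_vram(params, bpw, 24)}")
```

With these assumptions, a 32B model at 4-bit is about 18 GB of weights, which is why it fits on a 24 GB card while the same model at 8-bit does not.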

[email protected]

Oh, and I typically get 16-20 tok/s running a 32B model on Ollama using Open WebUI. Also, I have experienced issues with 4-bit quantization for the K/V cache on some models myself, so just FYI.

[email protected]

Hopefully once Trump crashes the economy we will see some bankruptcies and markets flooded with commercial GPUs as AI companies go under.

[email protected]

They would each run at x8 speed. That should not be too much of a bottleneck though; I don't expect the performance to suffer noticeably more than 5% from this. Annoying, but getting a CPU+board with 32 lanes or more would throw off the price/performance ratio.
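To put x8 versus x16 in numbers, here is a small sketch of the theoretical link bandwidth and the time to copy a model over it. It assumes the nominal per-lane rates for PCIe 3.0 to 5.0 with 128b/130b encoding and ignores protocol overhead, so real transfers will be slower. It only models the one-time load of the weights, not traffic during inference.

```python
# Nominal per-lane transfer rate in GT/s for PCIe generations
# that use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gb_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s, ignoring protocol overhead."""
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes


def load_seconds(model_gb, gen, lanes):
    """Best-case time to copy model_gb of weights across the link."""
    return model_gb / pcie_gb_per_s(gen, lanes)


if __name__ == "__main__":
    model_gb = 20  # hypothetical model size
    for gen in (3, 4):
        for lanes in (8, 16):
            print(f"PCIe {gen}.0 x{lanes}: {pcie_gb_per_s(gen, lanes):.1f} GB/s, "
                  f"{model_gb} GB loads in ~{load_seconds(model_gb, gen, lanes):.1f} s")
```

Halving the lanes doubles the best-case load time, but for a model that stays resident in VRAM that cost is paid once at startup.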
