Consumer GPUs to run LLMs
-
Anything under 16 GB of VRAM is a no-go. Your number of CPU cores is important too. Use Oobabooga Textgen for an advanced llama.cpp setup that splits layers between the CPU and GPU. You'll need at least 64 GB of system RAM, or be willing to offload layers to NVMe with DeepSpeed. I can run up to a 72b model with 4-bit GGUF quantization on a 12700 laptop with a mobile 3080Ti, which has 16GB of VRAM (mobile cards are like that).
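If it helps, here is a minimal sketch of that kind of CPU/GPU split using llama-cpp-python (the same backend Oobabooga wraps). The GGUF path and layer count are placeholders to tune for your own card:

```python
# Minimal llama-cpp-python sketch of a CPU/GPU split load. The GGUF path
# and n_gpu_layers value are placeholders to tune for your own hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-72b-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=20,   # layers that fit in VRAM; the rest run on the CPU
    n_ctx=8192,        # context window; larger contexts eat more VRAM
    n_threads=12,      # physical cores matter here, hence the CPU advice
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```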
I prefer to run an 8×7b mixture-of-experts model, because only 2 of the 8 experts are ever active at the same time. I run it as a 4-bit quantized GGUF and it takes 56 GB total to load. Once loaded it is about as fast as a 13b model, but has ~90% of the capabilities of a 70b. The streaming speed is faster than my fastest reading pace.
A 70b model streams at my slowest tenable reading pace.
Both of these options are far more capable than any of the smaller model sizes, even if you screw around with training the small ones. Unfortunately, this streaming speed is still pretty slow for most advanced agentic stuff. Maybe if I had 24 to 48 GB it would be different; I can't say. If I were building now, I would be looking at which hardware options have the largest L1 cache and the most cores with the most advanced AVX instructions. Generally, anything with efficiency cores loses the advanced AVX instructions, and because the CPU schedulers in kernels are usually unable to handle this asymmetry, consumer junk has poor AVX support. It is quite likely that many of the problems Intel has had in recent years came from how they tried to block consumer parts from accessing the advanced P-core instructions, which were only disabled in microcode. Using them requires disabling the e-cores or setting up CPU set isolation in Linux or BSD distros.
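For the CPU-set isolation route on Linux, something like this sketch works. The core IDs are an assumption for a 12700-class chip (P-core hyperthreads are usually enumerated first), so verify yours with `lscpu --extended` before trusting them:

```python
# Sketch: pin this process (and anything it spawns) to P-cores only so the
# AVX-heavy inference threads never land on e-cores. The core IDs are an
# assumption for a 12700-class chip; verify yours with `lscpu --extended`.
import os

P_CORE_THREADS = set(range(0, 16))        # assumed: P-core hyperthreads 0-15
os.sched_setaffinity(0, P_CORE_THREADS)   # pid 0 = the calling process

print("running on CPUs:", sorted(os.sched_getaffinity(0)))
# Launching llama.cpp from here (e.g. via subprocess) inherits the mask.
```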
You need good Linux support even if you run Windows. Most good and advanced stuff with AI will be done with WSL if you haven't ditched Windows for whatever reason. Use https://linux-hardware.org/ to check support for devices.
The reason I said to avoid consumer e-cores is that there have been some articles popping up lately about all-P-core hardware.
The main constraint for the CPU is the L2-to-L1 cache bus width. Researching this deeply may be beneficial.
Splitting the load between multiple GPUs may be an option too. As of a year ago, the cheapest option for a 16 GB GPU in a machine was a second-hand 12th-gen Intel laptop with a 3080Ti, by a considerable margin when all of it is added up. It is noisy, gets hot, and I have hated it many times, wishing I had gotten a server-like setup for AI, but I have something and that is what matters.
I don't mind multiple GPUs, but my motherboard doesn't have 2+ electrically connected x16 slots. I could build a new home server (I've been thinking about it), but consumer platforms simply don't have the PCIe lanes for 2 actual x16 slots. I'd have to go back to Broadwell Xeons for that, which are really power hungry. Oh well, I don't think it matters considering how power hungry GPUs are now.
-
I recommend a used 3090, as that has 24 GB of VRAM and generally can be found for $800ish or less (at least when I last checked, in February). It’s much cheaper than a 4090 and while admittedly more expensive than the inexpensive 24GB Nvidia Tesla card (the P40?) it also has much better performance and CUDA support.
I have dual 3090s so my performance won’t translate directly to what a single GPU would get, but it’s pretty easy to find stats on 3090 performance.
The 7900XTX was $1000 when it launched, I wouldn't mind it used either.
-
Thank you. Are 14B models the biggest you can run comfortably?
The coder model only has that one. The ones bigger than that are 20GB+, and my GPU has 16GB. I've only tried two models, but it looked like the sizes balloon after that, so those may be the biggest models that I can run.
-
A 3090 has 24GB and rolls the 38b models beautifully.
Wait how does that work? How is 24GB enough for a 38B model?
-
The coder model only has that one. The ones bigger than that are 20GB+, and my GPU has 16GB. I've only tried two models, but it looked like the sizes balloon after that, so those may be the biggest models that I can run.
Do you have any recommendations for running the Mistral small model? I'm very interested in it alongside CodeLlama, Oobabooga and others.
-
Using a 7900XTX with LMS. Speeds are all over the place, driver dependent. With QwQ-32B-Q4_K_M, I got about 20 tok/s with all VRAM filled.
I didn't know that. I thought it was just one ROCm binary to install, then run Ollama and that's it. Thanks for the explanation.
-
I don't mind multiple GPUs, but my motherboard doesn't have 2+ electrically connected x16 slots. I could build a new home server (I've been thinking about it), but consumer platforms simply don't have the PCIe lanes for 2 actual x16 slots. I'd have to go back to Broadwell Xeons for that, which are really power hungry. Oh well, I don't think it matters considering how power hungry GPUs are now.
I haven't looked into the issue of PCIe lanes and the GPU.
I don't think it should matter much with a smaller PCIe bus, in theory, if I understand correctly (unlikely). The only time a lot of data is transferred is when the model layers are initially loaded. With Oobabooga, when I load a model, most of the time my desktop RAM monitor widget does not even have time to refresh and tell me how much memory was used on the CPU side. What is loaded in the GPU is around 90% static. I have a script that monitors this so that I can tune the maximum number of layers (sketched below). I leave headroom for the context to build up over time, but there are no major changes happening aside from the initial loading. One just sets the number of layers to offload to the GPU and loads the model. However many seconds that takes is an irrelevant startup delay that only happens once when starting the server.
So assuming the kernel modules and hardware support the narrower bandwidth, it should work... I think. There are laptops with options for an external GPU over Thunderbolt too, so I don't think the PCIe bus is too baked in.
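The monitoring script mentioned above does not need to be fancy. A rough sketch that just polls nvidia-smi for peak usage while the model loads (the 60-second window and single-GPU assumption are placeholders):

```python
# Rough sketch of the layer-tuning monitor: poll nvidia-smi while the model
# loads and report peak VRAM use, so n_gpu_layers can be raised or lowered.
import subprocess
import time

def vram_used_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip().splitlines()[0])  # first GPU only

peak = 0
for _ in range(60):  # watch for ~60 seconds while the model loads
    peak = max(peak, vram_used_mib())
    time.sleep(1)

print(f"peak VRAM: {peak} MiB")  # leave headroom for the growing context
```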
-
Do you have any recommendations for running the Mistral small model? I'm very interested in it alongside CodeLlama, Oobabooga and others.
I haven't tried those, so not really, but with Open WebUI you can download and run anything; just make sure it fits in your VRAM so it doesn't run on the CPU. The DeepSeek one is decent. I find that I like ChatGPT-4o better, but it's still good.
-
I haven't tried those, so not really, but with Open WebUI you can download and run anything; just make sure it fits in your VRAM so it doesn't run on the CPU. The DeepSeek one is decent. I find that I like ChatGPT-4o better, but it's still good.
In general how much VRAM do I need for 14B and 24B models?
-
In general how much VRAM do I need for 14B and 24B models?
It really depends on how you quantize the model, and on the K/V cache as well. This is a useful calculator: https://smcleod.net/vram-estimator/ I can comfortably fit most 32b models quantized to 4-bit (usually Q4_K_M or IQ4_XS) on my 3090's 24 GB of VRAM with a reasonable context size. If you're going to need a much larger context window to input large documents etc., then you'd need to go smaller with the model size (14b, 27b etc.) or get a multi-GPU setup or something with unified memory and a lot of RAM (like the Mac Minis others are mentioning).
-
It really depends on how you quantize the model, and on the K/V cache as well. This is a useful calculator: https://smcleod.net/vram-estimator/ I can comfortably fit most 32b models quantized to 4-bit (usually Q4_K_M or IQ4_XS) on my 3090's 24 GB of VRAM with a reasonable context size. If you're going to need a much larger context window to input large documents etc., then you'd need to go smaller with the model size (14b, 27b etc.) or get a multi-GPU setup or something with unified memory and a lot of RAM (like the Mac Minis others are mentioning).
Oh, and I typically get 16-20 tok/s running a 32b model on Ollama using Open WebUI. Also, I have experienced issues with 4-bit quantization for the K/V cache on some models myself, so just FYI.
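If you want the back-of-the-envelope version of what that calculator does, this sketch adds quantized weights to the K/V cache. The architecture numbers are illustrative assumptions for a 32b-class model with grouped-query attention, not exact values for any specific checkpoint:

```python
# Back-of-the-envelope VRAM estimate: quantized weights + K/V cache.
# All architecture numbers below are illustrative assumptions, not exact.
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8        # params are in billions

def kv_cache_gb(n_layers: int, n_ctx: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int) -> float:
    # 2x for keys and values, per layer, per cached token.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

w = weights_gb(32, 4.85)                          # ~Q4_K_M effective bits
kv = kv_cache_gb(n_layers=64, n_ctx=16384,        # fp16 cache at 16k context
                 n_kv_heads=8, head_dim=128, bytes_per_elem=2)
print(f"weights ~{w:.1f} GB + KV ~{kv:.1f} GB = ~{w + kv:.1f} GB")
# -> roughly 19.4 + 4.3 = ~23.7 GB, which is why 24 GB cards are the target
```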
-
Not sure if this is the right place, if not please let me know.
GPU prices in the US have been a horrific bloodbath with the scalpers recently. So for this discussion, let's keep it to MSRP and the lucky people who actually managed to afford those insane MSRPs + managed to actually find the GPU they wanted.
Which GPU are you using to run what LLMs? How is the performance of the LLMs you have selected? On average, what size of LLMs are you able to run smoothly on your GPU (7B, 14B, 20-24B etc.)?
What GPU do you recommend for a decent amount of VRAM vs price (MSRP)? If you're using the TOTL RX 7900XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.
My use-case: code assistants for Terraform + general shell and YAML, plain chat, some image generation. And to be able to still pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA, YOU JERK). I would prefer GPUs under $600 if possible, but I also want to run models like Mistral Small, so I suppose I don't have a choice but to spend a huge sum of money.
Thanks
You can probably tell that I'm not very happy with the current PC consumer market but I decided to post in case we find any gems in the wild.
Hopefully once Trump crashes the economy we will see some bankruptcies and markets flooded with commercial GPUs as AI companies go under.
-
Do you have 2 PCIE X16 slots on your motherboard (speaking in terms of electrical connections)?
They would run at x8 speed each. That should not be too much of a bottleneck, though; I don't expect performance to suffer by noticeably more than 5% from this. Annoying, but getting a CPU + board with 32 lanes or more would throw off the price/performance ratio.
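As a rough sanity check on why x8 barely matters for inference, assuming PCIe 4.0 and that the only heavy traffic is the one-time model load:

```python
# Rough sanity check: one-time model load over PCIe 4.0 at x16 vs x8.
# ~2 GB/s per lane is an idealized assumption; real throughput is lower.
MODEL_GB = 24  # e.g. a 4-bit 70b-class model split across two cards

for lanes in (16, 8):
    bandwidth = 2.0 * lanes  # idealized GB/s for PCIe 4.0
    print(f"x{lanes}: ~{MODEL_GB / bandwidth:.2f} s to load {MODEL_GB} GB")
# After loading, per-token inference moves comparatively little data.
```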
-
Wait how does that work? How is 24GB enough for a 38B model?
Look up “LLM quantization.” The idea is that each parameter is a number; by default they use 16 bits of precision, but if you scale them down to smaller sizes, you use less space and have less precision, but you still have the same parameters. There's not much quality loss going from 16 bits to 8, but it gets more noticeable as you go lower and lower. (That said, there are ternary models being trained from scratch that use 1.58 bits per parameter and are allegedly just as good as fp16 models of the same parameter count.)
If you're using a 4-bit quantization, then you need roughly half the parameter count in GB of VRAM (so ~19 GB for a 38B model, plus room for context). Q4_K_M is better than Q4, but also a bit larger. Ollama generally defaults to Q4_K_M. If you can handle a higher quantization, Q6_K is generally best. If you can't quite fit it, Q5_K_M is generally better than any other option, followed by Q5_K_S.
For example, Llama3.3 70B, which has 70.6 billion parameters, has the following sizes for some of its quantizations:
- q4_K_M (the default): 43 GB
- fp16: 141 GB
- q8: 75 GB
- q6_K: 58 GB
- q5_K_M: 50 GB
- q4: 40 GB
- q3_K_M: 34 GB
- q2_K: 26 GB
This is why I run a lot of Q4_K_M 70B models on two 3090s.
Generally speaking, there's not a perceptible quality drop going from 8-bit quantization to Q6_K (though I have heard this is less true with MoE models). Below Q6, there's a bit of a drop going to 5 and then 4, but the model's still decent. Below 4-bit quantizations you can generally get better results from a smaller parameter model at a higher quantization.
TheBloke on Huggingface has a lot of GGUF quantization repos, and most, if not all of them, have a blurb about the different quantization types and which are recommended. When Ollama.com doesn’t have a model I want, I’m generally able to find one there.
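As a sanity check, those sizes fall out of simple params × bits-per-weight arithmetic. A quick sketch using rough effective-bpw figures for llama.cpp quants (my estimates, not official numbers):

```python
# Reproduce the size list above from params x bits-per-weight / 8.
# The effective bpw values are rough estimates for llama.cpp quants.
PARAMS_B = 70.6  # Llama 3.3 70B

BPW = {
    "fp16": 16.0, "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
    "q4_K_M": 4.85, "q3_K_M": 3.9, "q2_K": 2.9,
}
for name, bpw in BPW.items():
    print(f"{name:>7}: ~{PARAMS_B * bpw / 8:.0f} GB")
# Output lands within a GB or two of the sizes listed above.
```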
-
Not sure if this is the right place, if not please let me know.
GPU prices in the US have been a horrific bloodbath with the scalpers recently. So for this discussion, let's keep it to MSRP and the lucky people who actually managed to afford those insane MSRPs + managed to actually find the GPU they wanted.
Which GPU are you using to run what LLMs? How is the performance of the LLMs you have selected? On average, what size of LLMs are you able to run smoothly on your GPU (7B, 14B, 20-24B etc.)?
What GPU do you recommend for a decent amount of VRAM vs price (MSRP)? If you're using the TOTL RX 7900XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.
My use-case: code assistants for Terraform + general shell and YAML, plain chat, some image generation. And to be able to still pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA, YOU JERK). I would prefer GPUs under $600 if possible, but I also want to run models like Mistral Small, so I suppose I don't have a choice but to spend a huge sum of money.
Thanks
You can probably tell that I'm not very happy with the current PC consumer market but I decided to post in case we find any gems in the wild.
With an AMD RX 6800 + 32 GB DDR4, I can run up to a 34b model at an acceptable speed.
-
I am OK with either Nvidia or AMD, especially if Ollama supports it. With that said, I have heard that AMD takes some manual effort whilst Nvidia is easier. Depends on how difficult ROCm is.
With Ollama, all you have to do is copy an extra folder of ROCm files. Not hard at all.
-
They would run at x8 speed each. That should not be too much of a bottleneck, though; I don't expect performance to suffer by noticeably more than 5% from this. Annoying, but getting a CPU + board with 32 lanes or more would throw off the price/performance ratio.
I have an alternative for you if your power bills are cheap: X99 motherboard + CPU combos from China
-
Not sure if this is the right place, if not please let me know.
GPU prices in the US have been a horrific bloodbath with the scalpers recently. So for this discussion, let's keep it to MSRP and the lucky people who actually managed to afford those insane MSRPs + managed to actually find the GPU they wanted.
Which GPU are you using to run what LLMs? How is the performance of the LLMs you have selected? On average, what size of LLMs are you able to run smoothly on your GPU (7B, 14B, 20-24B etc.)?
What GPU do you recommend for a decent amount of VRAM vs price (MSRP)? If you're using the TOTL RX 7900XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.
My use-case: code assistants for Terraform + general shell and YAML, plain chat, some image generation. And to be able to still pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA, YOU JERK). I would prefer GPUs under $600 if possible, but I also want to run models like Mistral Small, so I suppose I don't have a choice but to spend a huge sum of money.
Thanks
You can probably tell that I'm not very happy with the current PC consumer market but I decided to post in case we find any gems in the wild.
You could also look into high-core-count CPUs.
-
You didn't, I did. The starting models cap at 24, but you can spec the biggest one up to 64GB. I should have clicked through to the customization page before reporting what was available.
That is still cheaper than a 5090, so it's not that clear cut. I think it depends on what you're trying to set up and how much money you're willing to burn. Sometimes literally, the Mac will also be more power efficient than a honker of an Nvidia 90 class card.
Honestly, all I have for recommendations is that I'd rather scale up than down. I mean, unless you also want to play kickass games at insane framerates with path tracing or something. Then go nuts with your big boy GPUs, who cares.
But for LLM stuff strictly, I'd start by repurposing what I have around, hitting a speed limit and then scaling up to maybe something with a lot of shared RAM (including a Mac Mini if you're into those), and keep rinsing and repeating. I don't know that I'm personally in the market for AI-specific multi-thousand-dollar APUs with a hundred-plus gigs of RAM yet.
Just FYI: The "Mac Studio", when equipped with the 32-core M3 Ultra processor, can have up to 512GB of RAM.
It costs like 15k after taxes, so not exactly the scope of this thread, but it exists.
-
Just FYI: The "Mac Studio", when equipped with the 32-core M3 Ultra processor, can have up to 512GB of RAM.
It costs like 15k after taxes, so not exactly the scope of this thread, but it exists.
Yeah, for sure. That I was aware of.
We were focusing on the Mini instead because... well, if the OP is fretting about going for a big GPU I'm assuming we're talking user-level costs here. The Mini's reputation comes from starting at 600 bucks for 16 gigs of fast shared RAM, which is competitive with consumer GPUs as a standalone system. I wanted to correct the record about the 24Gig starter speccing up to 64 because the 64 gig one is still in the 2K range, which is lower than the realistic market prices of 4090s and 5090s, so if my priority was running LLMs there would be some thinking to do about which option makes most sense in the 500-2K price range.
I am much less aware of larger options and their relative cost to performance because... well, I may not hate LLMs as much as is popular around the Internet, but I'm no roaming cryptobro, either, and I assume neither is anybody else in this conversation.