agnos.is Forums


Consumer GPUs to run LLMs

Category: Selfhosted · Tag: selfhosted · 39 posts, 17 posters, 96 views

  • U [email protected]

    Using a 7900XTX with LMS. Speeds are all over the place and driver dependent. With QwQ-32B-Q4_K_M, I got about 20 tok/s with all VRAM filled.

    [email protected] wrote (#23):

    I didn't know that. I thought it was just one ROCm binary to install, run Ollama, and that's it. Thanks for the explanation.

    • M [email protected]

      I don't mind multiple GPUs, but my motherboard doesn't have 2+ electrically connected x16 slots. I could build a new homeserver (I've been thinking about it), but consumer platforms simply don't have the PCIe lanes for 2 actual x16 slots. I'd have to go back to Broadwell Xeons for that, which are really power hungry. Oh well, I don't think it matters considering how power hungry GPUs are now.

      [email protected] wrote (#24):

      I haven't looked into the issue of PCIe lanes and the GPU.

      I don't think it should matter much with a narrower PCIe bus, in theory, if I understand correctly (unlikely). The only time a lot of data is transferred is when the model layers are initially loaded. With Oobabooga, for example, when I load a model, most of the time my desktop RAM monitor widget does not even have time to refresh and tell me how much memory was used on the CPU side. What is loaded in the GPU is around 90% static. I have a script that monitors this so that I can tune the maximum number of layers. I leave overhead room for the context to build up over time, but there are no major changes happening aside from the initial loading. One just sets the number of layers to offload to the GPU and loads the model. However many seconds that takes is an irrelevant startup delay that only happens once when initiating the server.

      So assuming the kernel modules and hardware support the narrower bandwidth, it should work... I think. There are laptops that have options for an external GPU over Thunderbolt too, so I don't think the PCIe bus is too baked in.
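
      For reference, a minimal sketch of that kind of VRAM monitor, assuming an NVIDIA card with nvidia-smi on the PATH (an AMD setup would need rocm-smi or similar instead), could look like this:

          # Minimal VRAM monitor sketch: poll used/total memory once a second
          # while tuning how many layers to offload. Assumes nvidia-smi is available.
          import subprocess
          import time

          def vram_usage_mib():
              out = subprocess.check_output(
                  ["nvidia-smi", "--query-gpu=memory.used,memory.total",
                   "--format=csv,noheader,nounits"],
                  text=True,
              )
              # Only look at the first GPU if there are several.
              used, total = (int(x) for x in out.strip().splitlines()[0].split(", "))
              return used, total

          while True:
              used, total = vram_usage_mib()
              print(f"VRAM: {used} / {total} MiB ({100 * used / total:.1f}%)")
              time.sleep(1)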

      • M [email protected]

        Do you have any recommendations for running the Mistral Small model? I'm very interested in it, alongside CodeLlama, Oobabooga and others.

        [email protected] wrote (#25):

        I haven't tried those, so not really, but with Open WebUI you can download and run anything; just make sure it fits in your VRAM so it doesn't run on the CPU. The DeepSeek one is decent. I find that I like ChatGPT-4o better, but it's still good.
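
        If you'd rather script it than click through the UI, Open WebUI usually sits on top of an Ollama server, and Ollama's HTTP API can pull and prompt a model directly. A rough sketch, assuming the default port 11434 and using a model tag that is only an example:

            # Rough sketch: pull a model and run one prompt against a local Ollama
            # server (the backend Open WebUI typically talks to).
            import json
            import requests

            OLLAMA = "http://localhost:11434"

            # /api/pull streams progress as one JSON object per line.
            with requests.post(f"{OLLAMA}/api/pull",
                               json={"name": "mistral-small"}, stream=True) as r:
                for line in r.iter_lines():
                    if line:
                        print(json.loads(line).get("status", ""))

            # One non-streaming generation request.
            resp = requests.post(f"{OLLAMA}/api/generate", json={
                "model": "mistral-small",
                "prompt": "Write a Terraform variable block for an S3 bucket name.",
                "stream": False,
            })
            print(resp.json()["response"])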

        • R [email protected]


          [email protected] wrote (#26):

          In general how much VRAM do I need for 14B and 24B models?

          • M [email protected]

            In general how much VRAM do I need for 14B and 24B models?

            [email protected] wrote (#27):

            It really depends on how you quantize the model and the K/V cache as well. This is a useful calculator: https://smcleod.net/vram-estimator/ I can comfortably fit most 32B models quantized to 4-bit (usually Q4_K_M or IQ4_XS) on my 3090's 24 GB of VRAM with a reasonable context size. If you're going to need a much larger context window to feed in large documents etc., then you'd need to go smaller with the model size (14B, 27B, etc.), or get a multi-GPU setup, or something with unified memory and a lot of RAM (like the Mac Minis others are mentioning).
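
            If you'd rather eyeball it than use the calculator, the arithmetic is roughly parameter count times bits per weight, plus the K/V cache. A back-of-the-envelope sketch (the layer/head numbers in the example are illustrative, not any specific model's):

                # Back-of-the-envelope VRAM estimate: weights + K/V cache.
                # Real GGUF files carry some overhead, so treat this as a ballpark.
                def weights_gb(params_billion, bits_per_weight):
                    return params_billion * bits_per_weight / 8

                def kv_cache_gb(layers, kv_heads, head_dim, context_len, cache_bits=16):
                    # K and V tensors per layer, per token in the context window.
                    return 2 * layers * kv_heads * head_dim * context_len * (cache_bits / 8) / 1e9

                # Example: a 32B model at ~4.5 bits/weight (Q4_K_M-ish) with an 8K context.
                print(round(weights_gb(32, 4.5), 1), "GB of weights")              # ~18.0 GB
                print(round(kv_cache_gb(64, 8, 128, 8192), 1), "GB of K/V cache")  # ~2.1 GB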

            • F [email protected]


              [email protected] wrote (#28):

              Oh, and I typically get 16-20 tok/s running a 32B model on Ollama using Open WebUI. Also, I have experienced issues with 4-bit quantization for the K/V cache on some models myself, so just FYI.
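
              If you want to check your own tok/s number, the final JSON from Ollama's /api/generate includes eval_count and eval_duration (in nanoseconds), so the throughput is one division away. A small sketch, with the model tag as a placeholder:

                  # Measure generation speed from one Ollama request: the final
                  # response reports eval_count (tokens) and eval_duration (ns).
                  import requests

                  resp = requests.post("http://localhost:11434/api/generate", json={
                      "model": "qwen2.5:32b",   # placeholder tag, use whatever you run
                      "prompt": "Explain PCIe lanes in two sentences.",
                      "stream": False,
                  }).json()

                  seconds = resp["eval_duration"] / 1e9
                  print(f"{resp['eval_count']} tokens in {seconds:.1f} s "
                        f"-> {resp['eval_count'] / seconds:.1f} tok/s")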

              • M [email protected]

                Not sure if this is the right place, if not please let me know.

                GPU prices in the US have been a horrific bloodbath with the scalpers recently. So for this discussion, let's keep it to MSRP and the lucky people who actually managed to afford those insane MSRPs + managed to actually find the GPU they wanted.

                Which GPU are you using to run which LLMs? How is the performance of the LLMs you have selected? On average, what size of LLMs are you able to run smoothly on your GPU (7B, 14B, 20-24B, etc.)?

                What GPU do you recommend for a decent amount of VRAM vs. price (MSRP)? If you're using a TOTL RX 7900 XTX/4090/5090 with 24+ GB of VRAM, comment below with some performance estimates too.

                My use-case: code assistants for Terraform + general shell and YAML, plain chat, some image generation. And to be able to still pay rent after spending all my savings on a GPU with a pathetic amount of VRAM (LOOKING AT BOTH OF YOU, BUT ESPECIALLY YOU NVIDIA, YOU JERK). I would prefer GPUs under $600 if possible, but I also want to run models like Mistral Small, so I suppose I don't have a choice but to spend a huge sum of money.

                Thanks


                You can probably tell that I'm not very happy with the current PC consumer market but I decided to post in case we find any gems in the wild.

                [email protected] wrote (#29):

                Hopefully once Trump crashes the economy we will see some bankruptcies and the market flooded with commercial GPUs as AI companies go under.

                • M [email protected]

                  Do you have 2 PCIe x16 slots on your motherboard (speaking in terms of electrical connections)?

                  [email protected] wrote (#30):

                  They would run at x8 speed each. That shouldn't be too much of a bottleneck though; I don't expect performance to suffer noticeably more than 5% from this. Annoying, but getting a CPU + board with 32 lanes or more would throw off the price/performance ratio.

                  • M [email protected]

                    Wait, how does that work? How is 24 GB enough for a 38B model?

                    [email protected] wrote (#31):

                    Look up “LLM quantization.” The idea is that each parameter is a number; by default they use 16 bits of precision, but if you scale them down to smaller sizes, you use less space and have less precision, yet you still have the same parameters. There's not much quality loss going from 16 bits to 8, but it gets more noticeable as you go lower and lower. (That said, there are ternary models being trained from scratch that use 1.58 bits per parameter and are allegedly just as good as fp16 models of the same parameter count.)

                    If you're using a 4-bit quantization, then you need roughly half the parameter count in GB of VRAM (so a 24B model needs about 12 GB, plus room for context). Q4_K_M is better than Q4, but also a bit larger. Ollama generally defaults to Q4_K_M. If you can handle a higher quantization, Q6_K is generally best. If you can't quite fit it, Q5_K_M is generally better than any other option, followed by Q5_K_S.

                    For example, Llama3.3 70B, which has 70.6 billion parameters, has the following sizes for some of its quantizations:

                    • q4_K_M (the default): 43 GB
                    • fp16: 141 GB
                    • q8: 75 GB
                    • q6_K: 58 GB
                    • q5_k_m: 50 GB
                    • q4: 40 GB
                    • q3_K_M: 34 GB
                    • q2_K: 26 GB

                    This is why I run a lot of Q4_K_M 70B models on two 3090s.

                    Generally speaking, there's no perceptible quality drop going from 8-bit quantization to Q6_K (though I have heard this is less true with MoE models). Below Q6 there's a bit of a drop going to Q5 and then Q4, but the model's still decent. Below 4-bit quantizations you can generally get better results from a smaller-parameter model at a higher quantization.

                    TheBloke on Huggingface has a lot of GGUF quantization repos, and most, if not all of them, have a blurb about the different quantization types and which are recommended. When Ollama.com doesn’t have a model I want, I’m generally able to find one there.
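
                    As a sanity check, the sizes listed above follow almost directly from parameter count times the effective bits per weight of each quantization. A quick sketch (the bits-per-weight figures are approximate community numbers, so expect a GB or two of deviation):

                        # Approximate GGUF size = parameters (billions) * effective bits/weight / 8.
                        PARAMS_B = 70.6  # Llama 3.3 70B

                        approx_bits_per_weight = {
                            "fp16": 16.0, "q8_0": 8.5, "q6_K": 6.6, "q5_K_M": 5.7,
                            "q4_K_M": 4.8, "q4_0": 4.5, "q3_K_M": 3.9, "q2_K": 2.9,
                        }

                        for name, bits in approx_bits_per_weight.items():
                            print(f"{name:8s} ~{PARAMS_B * bits / 8:5.1f} GB")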

                    • M [email protected]


                      [email protected] wrote (#32):

                      With an AMD RX 6800 + 32 GB DDR4, I can run up to a 34B model at an acceptable speed.

                      • M [email protected]

                        I am OK with either Nvidia or AMD, especially if Ollama supports it. That said, I have heard that AMD takes some manual effort whilst Nvidia is easier. Depends on how difficult ROCm is.

                        [email protected] wrote (#33):

                        With Ollama, all you have to do is copy an extra folder of ROCm files. Not hard at all.

                        • In reply to [email protected]:


                          [email protected] wrote (#34):

                          I have an alternative for you if your power bills are cheap: X99 motherboard + CPU combos from China.

                          • M [email protected]


                            [email protected] wrote (#35):

                            You could also look into high-core-count CPUs.

                            • In reply to [email protected]:

                              You didn't, I did. The starting models cap at 24, but you can spec up the biggest one up to 64GB. I should have clicked through to the customization page before reporting what was available.

                              That is still cheaper than a 5090, so it's not that clear cut. I think it depends on what you're trying to set up and how much money you're willing to burn. Sometimes literally, the Mac will also be more power efficient than a honker of an Nvidia 90 class card.

                              Honestly, all I have for recommendations is that I'd rather scale up than down. I mean, unless you also want to play kickass games at insane framerates with path tracing or something. Then go nuts with your big boy GPUs, who cares.

                              But for LLM stuff strictly, I'd start by repurposing what I have around, hitting a speed limit and then scaling up to maybe something with a lot of shared RAM (including a Mac Mini if you're into those), and keep rinsing and repeating. I don't know that I personally am in the market for AI-specific multi-thousand-dollar APUs with a hundred-plus gigs of RAM yet.

                              [email protected] wrote (#36):

                              Just FYI: the Mac Studio, when equipped with the 32-core M3 Ultra, can be configured with up to 512 GB of RAM.

                              It costs like 15k after taxes, so not exactly within the scope of this thread, but it exists.

                              • S [email protected]


                                [email protected] wrote (#37):

                                Yeah, for sure. That I was aware of.

                                We were focusing on the Mini instead because... well, if the OP is fretting about going for a big GPU, I'm assuming we're talking user-level costs here. The Mini's reputation comes from starting at 600 bucks for 16 gigs of fast shared RAM, which is competitive with consumer GPUs as a standalone system. I wanted to correct the record about the 24 GB starter speccing up to 64, because the 64 GB one is still in the 2K range, which is lower than the realistic market prices of 4090s and 5090s. So if my priority were running LLMs, there would be some thinking to do about which option makes the most sense in the 500-2K price range.

                                I am much less aware of larger options and their relative cost to performance because... well, I may not hate LLMs as much as is popular around the Internet, but I'm no roaming cryptobro, either, and I assume neither is anybody else in this conversation.

                                • In reply to [email protected]:


                                  [email protected] wrote (#38):

                                  4090s are what price now? I didn't keep track; I'm astonished. Never thought I'd see the day when Apple's RAM pricing is seen as competitive.

                                  • H [email protected]


                                    [email protected] wrote (#39):

                                    Thank you for your comment; I will save it. This really cleared things up.
