Can you self-host AI at parity with ChatGPT?
-
Have you ever tried? I run LLMs on my ancient ThinkPad.
-
-
Excellent, I'll try the $8 option
-
You didn't specify what type of model you're trying to run. DeepSeek R1, for example, comes in various sizes; if you're trying to run the massive ones it's not gonna work. You need to use a smaller model. I have an RX 6600 and run the 14B parameter model, and it does well.
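If it helps, here's a rough sketch of what talking to a 14B model looks like from Python, assuming you're running it through Ollama on the default port (11434) and have already pulled the model; the model tag is just an example, swap in whatever you actually downloaded:

```python
import requests

# Minimal sketch: ask a locally running Ollama server for a completion.
# Assumes `ollama pull deepseek-r1:14b` has already been run and that
# Ollama is listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt: str, model: str = "deepseek-r1:14b") -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,  # smaller cards can take a while on long answers
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("Explain the difference between VRAM and system RAM in one paragraph."))
```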
-
To be clear, btw, your CPU basically doesn't matter as far as I know. Only the GPU should be getting used, so any old CPU works. You CAN run it on a CPU, but it's gonna be very slow. But yeah, the RX 6600 was decently cheap. I got it for like $150, so it's not super expensive to run one of these models.
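To make the GPU-vs-CPU point concrete: with llama.cpp (here via the llama-cpp-python bindings, assuming a GPU-enabled build, e.g. the ROCm/HIP backend for an AMD card) you choose how many layers get offloaded. The model path below is just a placeholder. Set n_gpu_layers=0 and rerun if you want to see for yourself how much slower pure CPU is:

```python
import time
from llama_cpp import Llama

# Placeholder path: point this at whatever GGUF file you actually have.
MODEL_PATH = "models/deepseek-r1-distill-qwen-14b-q4_k_m.gguf"

# n_gpu_layers=-1 offloads every layer to the GPU (requires a GPU build of
# llama.cpp). Change it to 0 to force pure CPU inference for comparison.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=2048, verbose=False)

start = time.time()
out = llm("Q: Why does VRAM matter for LLM inference?\nA:", max_tokens=128)
elapsed = time.time() - start

tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```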
-
-
-
-
96 GB+ of RAM is relatively easy, but for LLM inference you want VRAM. You can achieve that on a consumer PC by using multiple GPUs, although performance will not be as good as having a single GPU with 96GB of VRAM. Swapping out to RAM during inference slows it down a lot.
On archs with unified memory (like Apple's latest machines), the CPU and GPU share memory, so you could actually find a system with very high memory directly accessible to the GPU. Mac Pros can be configured with up to 192GB of memory, although I doubt it'd be worth it as the GPU probably isn't powerful enough.
Also, the 83GB number I gave was with a hypothetical 1-bit quantization of DeepSeek R1, which (if it's even possible) would probably be really shitty, maybe even shittier than Llama 7B.
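For anyone who wants to check where those numbers come from, the back-of-the-envelope math is just parameter count times bits per weight (this only counts the weights; the KV cache and activations need extra memory on top):

```python
# Napkin math for DeepSeek R1's ~671B parameters: memory = params * bits / 8.
PARAMS = 671e9

for bits in (32, 16, 8, 4, 1):
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:,.0f} GB")

# 32-bit comes out around 2,684 GB (the ~2.6TB unquantized figure below),
# and a hypothetical 1-bit quant lands around 84 GB (the ~83GB above).
```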
But how can one enter the TB zone?
Data centers use NVLink to connect multiple Nvidia GPUs. Idk what the limits are, but it lets you pool resources much more efficiently and at a much larger scale than would be possible on consumer hardware. A single Nvidia H200 GPU has 141 GB of VRAM, so you could link them up to build some monster data centers.
Nvidia also sells prebuilt machines like the HGX B200, which can have 1.4TB of memory in a single system. That's less than the 2.6TB for unquantized DeepSeek, but for inference-only applications you could definitely quantize it enough to fit within that limit with little to no quality loss... so if you're really interested and really rich, you could probably buy one of those for your home lab.
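Same napkin math as before, but divided by per-card VRAM to see how many GPUs it takes just to hold the weights (ignoring the KV cache and activations, which push the real number higher):

```python
import math

# How many 141 GB H200s are needed just to hold DeepSeek R1's weights.
PARAMS = 671e9
H200_VRAM_GB = 141

for bits in (32, 16, 8, 4):
    weights_gb = PARAMS * bits / 8 / 1e9
    cards = math.ceil(weights_gb / H200_VRAM_GB)
    print(f"{bits:>2}-bit: ~{weights_gb:,.0f} GB of weights -> {cards} x H200")

# 8-bit is roughly 671 GB, which is why a 1.4TB box like the HGX B200
# has headroom to spare for an inference-only deployment.
```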
-
Also no ROCm support afaik, so it's running completely on CPU
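If you want to sanity-check whether an AMD card is actually being used rather than silently falling back to CPU, here's one quick way (this assumes a ROCm build of PyTorch, which is just one possible backend, not necessarily what OP's setup uses):

```python
import torch

# On ROCm builds of PyTorch, AMD GPUs are exposed through the torch.cuda API.
print("GPU visible:", torch.cuda.is_available())
print("HIP/ROCm version:", torch.version.hip)  # None on CPU-only or CUDA builds

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; everything will run on the CPU.")
```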
-
This is the point everyone downvoting me seems to be missing. OP wanted something comparable to the responsiveness of chat.chatgpt.com... Which is simply not possible without insane hardware. Like sure, if you don't care about token generation you can install an LLM on incredibly underpowered hardware and it technically works, but that's not at all what OP was asking for. They wanted a comparable experience. Which requires a lot of money.
-
Yeah I definitely get your point (and I didn’t downvote you, for the record). But I will note that ChatGPT generates text way faster than most people can read, and 4 tokens/second, while perhaps slower than reading speed for some people, is not that bad in my experience.
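For a rough sense of scale (200 to 300 words per minute and ~1.3 tokens per English word are just common ballpark figures, not anything measured here):

```python
# Rough comparison of generation speed vs. silent reading speed.
TOKENS_PER_WORD = 1.3  # ballpark for English with typical tokenizers

for wpm in (200, 250, 300):
    print(f"Reading at {wpm} wpm ~= {wpm / 60 * TOKENS_PER_WORD:.1f} tokens/second")

gen_tok_per_sec = 4
print(f"Generating {gen_tok_per_sec} tok/s ~= {gen_tok_per_sec / TOKENS_PER_WORD * 60:.0f} wpm of output")
# So ~4 tok/s lands in the same ballpark as a slower reading pace.
```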