
LocalLLaMA

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, and get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms from <over 10 years ago>".

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

62 Topics 407 Posts
  • Running Local LLMs with Ollama on openSUSE Tumbleweed

    localllama
    1
    1
    0 Votes
    1 Posts
    0 Views
    No one has replied
  • 2 Votes
    3 Posts
    0 Views
    eyekaytee@aussie.zone
    yep, it would pretty much move the only major AI company in Europe over to America
  • 30 Votes
    13 Posts
    1 Views
    B
    Late reply, but if you are looking into this, ik_llama.cpp is explicitly optimized for expert offloading. I can get like 16 t/s with a Hunyuan 70B on a 3090. If you want long context for models that fit in VRAM, your last stop is TabbyAPI. I can squeeze 128K context from a 32B into 24GB VRAM, easy… I could probably do 96K with 2 parallel slots, though unfortunately most models are pretty terrible past 32K.
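    A quick way to sanity-check a TabbyAPI or ik_llama.cpp setup like the one described above is to ask the server's OpenAI-compatible API what it has loaded. A minimal sketch in Python; the host, port, and lack of an API key are assumptions to adjust to your own launch settings:

        import requests

        # Assumed base URL: TabbyAPI commonly listens on port 5000, llama.cpp-based
        # servers on 8080. If your server enforces an API key, add an
        # "Authorization: Bearer <key>" header to the request.
        BASE_URL = "http://localhost:5000/v1"

        # /v1/models lists whatever model the server currently has loaded.
        resp = requests.get(f"{BASE_URL}/models", timeout=10)
        resp.raise_for_status()
        for model in resp.json().get("data", []):
            print(model["id"])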
  • Need help understanding if this setup is even feasible.

    localllama
    19
    10 Votes
    19 Posts
    3 Views
    B
    The LLM “engine” is mostly detached from the UI. kobold.cpp is actually pretty great, and you can still use it with TabbyAPI (what you run for exllama) and the llama.cpp server. I personally love this for writing and testing, though: https://github.com/lmg-anon/mikupad, and Open Web UI for more general usage. There’s a big backlog of poorly documented knowledge too, heh; just ask if you’re wondering how to cram a specific model in. But the “gist” of the optimal engine rules is: For MoE models (like Qwen3 30B), try ik_llama.cpp, which is a fork specifically optimized for big MoEs partially offloaded to CPU. For Gemma 3 specifically, use the regular llama.cpp server, since it seems to be the only thing supporting the sliding window attention (which makes long context easy). For pretty much anything else, if it’s supported by exllamav3 and you have a 3060, it's optimal to use that (via its server, which is called TabbyAPI), and you can use its quantized cache (try Q6/Q5) to easily get long context.
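    To illustrate the "engine is detached from the UI" point above: any frontend or script that speaks the OpenAI API can drive TabbyAPI, the llama.cpp server, or kobold.cpp interchangeably. A hedged sketch using the openai Python client; the base URL, port, API key, and model name are placeholders for your own setup:

        from openai import OpenAI

        client = OpenAI(
            base_url="http://localhost:8080/v1",  # assumed: llama.cpp server default; TabbyAPI often uses 5000
            api_key="sk-local",                   # placeholder; some servers ignore it, others expect their own key
        )

        reply = client.chat.completions.create(
            model="local-model",  # llama.cpp ignores this field; TabbyAPI matches it against the loaded model
            messages=[{"role": "user", "content": "In two sentences, what does a quantized KV cache trade away?"}],
            max_tokens=128,
        )
        print(reply.choices[0].message.content)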
  • 26 Votes
    2 Posts
    0 Views
    B
    More open models are good! Granite needs a competitor. I do hope they try an 'exotic' architecture. It doesn't have to be novel; another bitnet or Jamba/Falcon hybrid model would be sick. Is there anywhere I can submit suggestions, heh?
  • 2 Votes
    1 Posts
    0 Views
    No one has replied
  • 5 Votes
    1 Posts
    0 Views
    No one has replied
  • LLMs and their efficiency: can they really replace humans?

    localllama
    11
    1
    2 Votes
    11 Posts
    0 Views
    H
    LLMs are great at automating tasks where we know the solution. And there are a lot of workflows that fall in this category. They are horrible at solving new problems, but that is not where the opportunity for LLMs is anyway.
  • Current best local models for tool use?

    localllama
    8
    9 Votes
    8 Posts
    1 Views
    H
    For VLMs I love Moondream2. It's a tiny model that packs a punch way above its size. Llama.cpp supports it.
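    For reference, here is a hedged sketch of querying a small VLM like Moondream2 through a llama.cpp server that was started with the model plus its mmproj projector file. The port, the image path, and the exact level of vision support in the OpenAI-style message format depend on your build and version:

        import base64
        import requests

        # Hypothetical local image; swap in your own file.
        with open("photo.jpg", "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()

        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",  # assumed llama.cpp server address
            json={
                "model": "moondream2",
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe this image in one sentence."},
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                    ],
                }],
                "max_tokens": 64,
            },
            timeout=300,
        )
        resp.raise_for_status()
        print(resp.json()["choices"][0]["message"]["content"])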
  • Local Voiceover/Audiobook generation

    localllama
    7
    17 Votes
    7 Posts
    2 Views
    smokeydope@lemmy.world
    Nice post, Hendrik! Thanks for sharing your knowledge and helping people out.
  • I'm using Open WebUI, does anybody else have a better interface?

    localllama
    2
    16 Votes
    2 Posts
    0 Views
    C
    jan.ai?
  • 28 Votes
    5 Posts
    0 Views
    B
    I don’t want to be ungrateful, complaining that they don’t give us everything. For sure. But I guess it’s still kinda… interesting? Like you’d think Qwen3, Gemma 3, Falcon H1, Nemotron 49B and such would pressure them to release Medium, but I guess there are factors that help them sell it. As stupid as this is, they’re European and specifically not Chinese. In the business world, there’s this mostly irrational fear that the Deepseek or Qwen weights by themselves will jump out of their cage and hack you, heh.
  • My AI Skeptic Friends Are All Nuts

    localllama
    17
    1
    12 Votes
    17 Posts
    2 Views
    I
    Apart from the arguments that, yes, vibe coders exist and will be cheaper to employ, creating huge long-term problems with a generational gap in senior programmers (who are the ones maintaining open source projects), there's the heinous environmental impact, and I mean heinous. This is my biggest problem, honestly. You're betting that LLMs will improve faster than programmers forget "the craft". LLMs are wide, not deep, and the less programmers care about boilerplate and how things actually work, the less material for the LLMs -> feedback loop -> worse LLMs, etc. I use LLMs; hell, I designed a workshop for my employer on how programmers can use LLMs, Cursor, etc. But I don't think we're quite aware how we are screwing ourselves long term.
  • Updated guidelines for c/LocaLLama (new rules)

    localllama
    10
    24 Votes
    10 Posts
    0 Views
    B
    Agreed, agreed, agreed. Thanks. Some may seem arbitrary, but things like the NFT/crypto comparison are so politically charged and ripe for abuse that it's good to nip that in the bud. The only one I have mixed feelings on is: Rule: No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms from <over 10 years ago>". Reason: There are grains of truth to the reductionist statement that LLMs rely on mathematical statistics and probability for their outputs. The reasoning is true; I agree. But it does feel a bit uninclusive to outsiders who, to be frank, know nothing about LLMs. Commenters shouldn't drive by and drop reductionist hate, but that's also kinda the nature of Lemmy, heh. So... maybe be a little lax with that rule, I guess? Like give people a chance to be corrected unless they're outright abusive.
  • I'm excited for dots.llm (142BA14B)!

    localllama
    4
    1
    11 Votes
    4 Posts
    0 Views
    B
    This is like a perfect model for a Strix Halo mini PC. Man, I really want one of those Framework Desktops now...
  • What to Integrate With My AI

    localllama
    8
    16 Votes
    8 Posts
    0 Views
    swelter_spark@reddthat.com
    I use Kobold as a backend for the FluentRead browser plugin, so I can do local language translation.
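    A translation workflow like the one above can also be scripted directly against KoboldCpp's native generate endpoint, which is essentially what a browser plugin like FluentRead does behind the scenes. A minimal sketch, assuming the usual default port (5001) and that the request/response fields match current KoboldCpp versions:

        import requests

        text = "Bonjour, comment allez-vous ?"
        prompt = f"Translate the following French text to English:\n\n{text}\n\nEnglish:"

        resp = requests.post(
            "http://localhost:5001/api/v1/generate",  # KoboldCpp usually defaults to port 5001
            json={"prompt": prompt, "max_length": 80, "temperature": 0.2},
            timeout=120,
        )
        resp.raise_for_status()
        print(resp.json()["results"][0]["text"].strip())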
  • 32 Votes
    16 Posts
    0 Views
    B
    That’s a premade 8x 7900 XTX PC, all standard and off the shelf. I dunno anything about Geohot; all I know is people have been telling me how cool Tinygrad is for years with seemingly nothing to show for it other than social media hype, while other, sometimes newer, PyTorch alternatives like TVM, GGML, the MLIR efforts and such are running real workloads.
  • Don't overlook llama.cpp's rpc-server feature.

    localllama
    5
    4 Votes
    5 Posts
    0 Views
    T
    I have to correct myself: it loads the RPC machine’s part of the model across the network every time you start the server. It appears newer versions of rpc-server have a cache option, and you can point them to a locally stored version of the model to avoid the network cost.
  • Noob experience using local LLM as a D&D style DM.

    localllama
    18
    12 Votes
    18 Posts
    0 Views
    T
    Mistral (24B) models are really bad at long context, but this is not the case for every model; I find that Qwen 32B and Gemma 27B are solid at 32K. It looks like the Harbinger RPG model I'm using (from Latitude Games) is based on Mistral 24B, so maybe it inherits that limitation? I like it in other ways. It was trained on RPG games, which seems to help it for my use case. I did try some general purpose / vanilla models and felt they were not as good at D&D type scenarios. It looks like Latitude also has a 70B Wayfarer model. Maybe it would do better at bigger contexts. I have several networked machines with 40GB VRAM between all of them, and I can just squeak an IQ4_XS 70B into that unholy assembly if I run 24000 context (before the SWA patch, so maybe more now). I will try it! The drawback is speed: 70B models are slow on my setup, about 8 t/s at startup.
  • 6 Votes
    2 Posts
    0 Views
    T
    At this point, I hope Nvidia and the rest realize that even if selling AI cards to data centers gets them 10 times the profit per unit, it really is best for them in the long run to have a healthy and vibrant gamer and enthusiast market too. It's never good to have all your eggs in one basket.