Faster Ollama alternative
-
I'm currently shopping around for something a bit faster than Ollama, partly because I could not get it to use a different context and output length, which seems to be a known and long-ignored issue. Somehow everything I’ve tried so far was missing one or more critical features, such as:
- "Hot" model replacement, so loading and unloading models on demand
- Function calling
- Support of most models
- OpenAI API compatibility (to work well with Open WebUI)
I'd be happy about any recommendations!
-
[email protected]replied to [email protected] last edited by
Ummm... did you try
/set parameter num_ctx #
and
/set parameter num_predict #
? Are you using a model that actually supports the context length that you desire...?
-
[email protected]replied to [email protected] last edited by
I don't think you are going to find anything faster. Ollama is pretty much as fast as it gets
-
[email protected]replied to [email protected] last edited by
I don't think it's OpenAI compatible, but deepseek is faster.
-
[email protected]replied to [email protected] last edited by
Yeah, but there are many open issues on GitHub about these settings not working right. I’m using the API and just couldn’t get it to work. I sent a request to generate a JSON file, and the output was never longer than about 500 lines. With the same model on vLLM, it worked right away and generated about 2000 lines.
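
For anyone comparing notes, here is a minimal sketch of how the context and output limits can be passed per request through Ollama's native `/api/generate` endpoint, using the `options` field with `num_ctx` and `num_predict`. The URL assumes a default local install on port 11434, and the helper names, model name, and limit values are placeholders for illustration, not anything taken from the posts above. Whether the server honors the values is exactly what the open issues are about, so treat this as the request shape only.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=8192, num_predict=4096):
    """Build a non-streaming generate request with explicit length limits.

    num_ctx is the context window in tokens; num_predict caps the number
    of generated tokens. Both values here are illustrative defaults.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,
            "num_predict": num_predict,
        },
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, **limits):
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(model, prompt, **limits)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("some-model", "Write a long JSON file ...", num_ctx=16384, num_predict=8192)` would then send both limits with that single request; `"some-model"` is a hypothetical name standing in for whatever model is installed.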
-
[email protected]replied to [email protected] last edited by
There are many projects out there that optimize speed significantly. Ollama is unbeaten in convenience, though.
-
[email protected]replied to [email protected] last edited by
It's not, by far. But vLLM and SGLang unfortunately don't support a lot of the requested features.
-
[email protected]replied to [email protected] last edited by
Btw, Ollama is software for running AI models. DeepSeek is just a company, or a model file, or a service. But that's not what OP is looking for. They want to run a model, and that needs software like Ollama.
-
[email protected]replied to [email protected] last edited by
I'm also aware of LocalAI with automatic model swapping and OpenAI compatible API.
But unless I'm mistaken, they all use ggml behind the scenes? So you might want to look for something built on vLLM or ExLlama if you want a completely different backend.
-
[email protected]replied to [email protected] last edited by
Try llamafile from Mozilla.
-
[email protected]replied to [email protected] last edited by
vLLM unfortunately doesn't support switching the model without a restart.
-
[email protected]replied to [email protected] last edited by
Are you using a tiny model (1.5B-7B parameters)? Ollama pulls a 4-bit quant by default. It looks like vLLM does not use quantized models by default, so this is likely the difference.
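
To put rough numbers on the quantization gap, here is a back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. It assumes exactly `bits_per_weight` bits per parameter and ignores the KV cache, activations, and quantization overhead, so real figures will be somewhat higher. The function name is made up for illustration.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    # params * bits / 8 gives bytes; the factor of 1e9 cancels out.
    return params_billions * bits_per_weight / 8


for params in (7, 32, 70):
    q4 = weight_memory_gb(params, 4)     # 4-bit quantized weights
    fp16 = weight_memory_gb(params, 16)  # unquantized 16-bit weights
    print(f"{params}B: ~{q4} GB at 4-bit vs ~{fp16} GB at 16-bit")
```

Under these assumptions a 32B model needs roughly 16 GB at 4-bit versus 64 GB at 16-bit, so the two setups may be running quite different workloads even with the "same" model.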
-
[email protected]replied to [email protected] last edited by
I would not recommend LocalAI. Their documentation is somewhat lacking, and it’s an all-in-one utility with many moving parts. The parts also tend to break quite often.
-
[email protected]replied to [email protected] last edited by
It was multiple models, mainly 32-70B