agnos.is Forums

Specialize LLM

LocalLLaMA · 7 Posts · 5 Posters · 29 Views
#1 · [email protected]

Hi, I'm not too informed about LLMs, so I'll appreciate any corrections to what I might be getting wrong.
I have a collection of books I would like to train an LLM on, so I could use it as a quick source of information on the topics covered by the books.
Is this feasible?


#2 · [email protected] (replying to #1)

The easiest option for a layperson is retrieval-augmented generation, or RAG. Basically, you encode your books and load them into a special kind of database (a vector store), then tell a regular base-model LLM to check that data when making an answer. I know ChatGPT has a built-in UI for this (and maybe Anthropic too), but you can also build something out using LangChain or Open WebUI and the model of your choice.

The next step up from there is fine-tuning, where you kinda retrain a base model on your books. This is more complex and time-consuming but can give more nuanced answers. It's often done in combination with RAG for particularly large bodies of information.
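
To make the RAG flow concrete, here is a minimal sketch in Python. It assumes the sentence-transformers package and a plain-text copy of one book; the model name, chunk sizes, and file name are all illustrative stand-ins, not recommendations, and the final prompt can go to whatever chat model you like.

    # Minimal RAG sketch: chunk -> embed -> retrieve -> prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def chunk(text, size=1000, overlap=200):
        # Naive fixed-size chunks; real setups usually split on
        # paragraphs or sections instead.
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    book = open("my_book.txt", encoding="utf-8").read()  # hypothetical file
    chunks = chunk(book)

    # Encode every chunk once; this array plays the role of the
    # "special kind of database".
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = embedder.encode(chunks, normalize_embeddings=True)

    # At question time, embed the question and take the closest chunks.
    question = "What does the author say about X?"
    q = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(vectors @ q)[-3:][::-1]  # cosine similarity on unit vectors

    # Paste the retrieved passages into the prompt of any chat model.
    context = "\n---\n".join(chunks[i] for i in best)
    print(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

In practice a dedicated vector database (Chroma, Qdrant, and so on) replaces the in-memory array, but the retrieve-then-prompt shape stays the same.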


#3 · [email protected] (replying to #2)

And as far as I know, people do fine-tuning so the model picks up on the style of writing and things like that, for example to mimic an author or the conventions of a genre. To just fetch facts from a pile of text, I'd say RAG would be the easier approach, though it depends on the use case and the collection of books. Fine-tuning is definitely a thing people do as well.


#4 · [email protected] (replying to #2)

Umm, fine-tuning the model that makes the embeddings, right? Or is there an API for messing with the generative AI somewhere? Or are we assuming the newbie has a lot of compute resources? And they would have to use the generative model to create queries for their passages as well, right?

          I would try something like

          Guides | RAGFlow - https://ragflow.io/docs/dev/category/guides

          or a similar tool.

Edit: not for fine-tuning, just to get started. Local models, RAG, your books are your knowledge base.


#5 · [email protected] (replying to #1)

It is indeed possible! The nerd speak for what you want to do is 'finetune training with a dataset', the dataset being your books. It's a non-trivial task that takes setup and money to pay a training provider for their compute, and there are no guarantees it will come out the way you want on the first bake, either.

A softer version of this that's the big talk right now is RAG, which is essentially a way for your LLM to call and reference an external dataset to recall information into its active context. It's a useful tool worth looking into, and much easier and cheaper than model training, but while your model can recall information with RAG, it won't really build an internal understanding of that information within its abstraction space. It's the difference between being able to recall a piece of information and internally understanding the concepts it's trying to convey: RAG is for rote memorization, training is for deeper abstraction-space mapping.
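
For a sense of what 'finetune training with a dataset' physically looks like, here is an illustrative sketch of building a training file from your books. The chat-style JSONL shape below is one common convention; the exact format depends on the trainer or provider you pick (axolotl, Hugging Face TRL, a hosted service), and every question/answer pair is a made-up placeholder you would have to write or generate yourself.

    # Illustrative only: writing a chat-style JSONL finetuning dataset.
    import json

    examples = [
        {"messages": [
            {"role": "user", "content": "What does <book title> argue in chapter 3?"},
            {"role": "assistant", "content": "It argues that <summary you wrote>."},
        ]},
        # ...hundreds to thousands more pairs covering the books
    ]

    with open("train.jsonl", "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

The hard part is not the file format but producing enough good pairs; that curation effort is a large share of why finetuning is non-trivial.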


#6 · [email protected] (replying to #5)

Would you recommend fine-tuning over RAG to improve domain-specific performance? My end goal would be a small, efficient and very specialised LLM to help get info on the contents of the books (all of them are about the same topic, from different POVs and authors).


#7 · [email protected] (replying to #6)

I would recommend you read over the work of the person who finetuned a Mistral model on many US Army field guides, to understand what fine-tuning on a lot of books to bake in knowledge looks like.

If you are a newbie just learning how this technology works, I would suggest trying to get RAG working with a small model and one or two books converted to a big text file, just to see how it works. Once you have a little more experience, and if you are financially well off to the point that one or two thousand dollars to train a model is who-cares play money to you, then go for finetuning.
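
As a concrete starting point for the 'RAG with a small model' route, here is a hypothetical snippet that sends a retrieval-augmented prompt to a local model served by Ollama. It assumes the Ollama server is running and a small model such as llama3.2 has been pulled; the prompt string stands in for one built from retrieved book passages, as in the RAG sketch earlier in the thread.

    # Hypothetical: query a small local model through Ollama's REST API.
    import requests

    # Stand-in for a prompt assembled from retrieved passages.
    prompt = "Answer using only this context:\n<retrieved passages>\n\nQuestion: <your question>"

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=300,
    )
    print(resp.json()["response"])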
