agnos.is Forums

How to run LLaMA (and other LLMs) on Android.

19 Posts 7 Posters 93 Views
  • llama@lemmy.dbzer0.com wrote:

    cross-posted from: https://lemmy.dbzer0.com/post/36841328

    Hello, everyone! I wanted to share my experience of successfully running LLaMA on an Android device. The model that performed the best for me was llama3.2:1b on a mid-range phone with around 8 GB of RAM. I was also able to get it up and running on a lower-end phone with 4 GB RAM. I also tested several other models that worked quite well, including qwen2.5:0.5b, qwen2.5:1.5b, qwen2.5:3b, smallthinker, tinyllama, deepseek-r1:1.5b, and gemma2:2b. I hope this helps anyone looking to experiment with these models on mobile devices!
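Since the post ties model choice to how much RAM the phone has, it can help to check what the device reports before picking a model. A minimal sketch, assuming /proc/meminfo is readable from Termux (it usually is, but this may vary by device); the helper name meminfo_mib is made up for illustration:

```shell
# meminfo_mib FIELD [FILE]: print a /proc/meminfo field (reported in kB) in MiB.
# FILE defaults to /proc/meminfo. Hypothetical helper, not part of Termux.
meminfo_mib() {
  awk -v field="$1:" '$1 == field { print int($2 / 1024) }' "${2:-/proc/meminfo}"
}

# Example calls (values depend on the device):
#   meminfo_mib MemTotal
#   meminfo_mib MemAvailable
```

MemAvailable is usually the more useful number, since Android itself keeps a good share of the total occupied.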


    Step 1: Install Termux

    1. Download and install Termux from the Google Play Store or F-Droid.

    Step 2: Set Up proot-distro and Install Debian

    1. Open Termux and update the package list:

      pkg update && pkg upgrade
      
    2. Install proot-distro:

      pkg install proot-distro
      
    3. Install Debian using proot-distro:

      proot-distro install debian
      
    4. Log in to the Debian environment:

      proot-distro login debian
      

      You will need to log in to Debian every time you want to run Ollama. In later sessions, repeat this login step, then start the server (Step 4.2) and run the model (Step 5). The one-time installs in Step 3 and Step 4.1 do not need to be repeated.


    Step 3: Install Dependencies

    1. Update the package list in Debian:

      apt update && apt upgrade
      
    2. Install curl:

      apt install curl
      

    Step 4: Install Ollama

    1. Run the following command to download and install Ollama:

      curl -fsSL https://ollama.com/install.sh | sh
      
    2. Start the Ollama server:

      ollama serve &
      

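      The trailing & starts the server as a background job. Its startup messages may cover your prompt; press Ctrl+C (or Enter) to get a clean prompt back. The server will continue to run in the background.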
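To confirm that the background server is actually up before running a model, one option is to poll it. This is a sketch, assuming Ollama's default listen address 127.0.0.1:11434 and that curl (installed in Step 3) is available; the name wait_for_ollama and the 30-try limit are illustrative choices, not part of Ollama:

```shell
# Default address the Ollama server listens on; change it if you configured another.
OLLAMA_URL="http://127.0.0.1:11434"

# wait_for_ollama [TRIES]: poll once per second until the server answers.
# Returns 0 when the server responds, 1 if it never does. Hypothetical helper.
wait_for_ollama() {
  tries="${1:-30}"
  while [ "$tries" -gt 0 ]; do
    if curl -fsS "$OLLAMA_URL/" > /dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage: wait_for_ollama && echo "server is up"
```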


    Step 5: Download and run the Llama3.2:1B Model

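    1. Use the following command to download and run the Llama3.2:1B model:

      ollama run llama3.2:1b
      

      On first use this fetches the lightweight 1-billion-parameter version of the Llama 3.2 model, then opens an interactive chat with it. To try one of the other models listed above, substitute its tag, for example qwen2.5:0.5b.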
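As the note in Step 2 says, later sessions skip the installs. Once you are logged in to Debian (proot-distro login debian), the remaining per-session steps could be wrapped roughly like this. A sketch only: the function name start_llm and the log file path are assumptions, and the fixed sleep is a crude stand-in for waiting until the server is ready:

```shell
# start_llm [MODEL]: start the Ollama server in the background, then chat.
# Run inside the Debian environment. MODEL defaults to the tag used in the post.
start_llm() {
  model="${1:-llama3.2:1b}"
  # Send server logs to a file so they do not clutter the prompt.
  ollama serve > "$HOME/ollama.log" 2>&1 &
  sleep 5                # crude wait for the server to come up
  ollama run "$model"    # downloads the model on first use
}

# Usage: start_llm                (runs llama3.2:1b)
#        start_llm qwen2.5:0.5b   (any other tag from the list above)
```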

    Running LLaMA and other similar models on Android devices is definitely achievable, even with mid-range hardware. The performance varies depending on the model size and your device's specifications, but with some experimentation, you can find a setup that works well for your needs. I’ll make sure to keep this post updated if there are any new developments or additional tips that could help improve the experience. If you have any questions or suggestions, feel free to share them below!

    – llama

    K wrote (#5):

    You'll only fry your phone with this. Very bad idea.

    • P wrote:

      Most open/local models require a fraction of the resources of ChatGPT. They are usually not as good in a general sense, but they are often good enough, and can sometimes surpass ChatGPT in specific domains.

      C wrote (#6), replying to P:

      Do you know about anything libre? I'm curious to try something. Better if self-hosted (?)

        P wrote (#7), replying to #6:

        They're probably referring to the 671b parameter version of DeepSeek. You can indeed self-host it. But unless you've got a server rack full of data center class GPUs, you'll probably set your house on fire before it generates a single token.

        If you want a fully open source model, I recommend Qwen 2.5 or maybe DeepSeek V2. There's also OLMo 2, but I haven't really tested it.

        Mistral Small 24b also just came out and is Apache licensed. That is something I'm testing now.

          llama@lemmy.dbzer0.com wrote (#8), replying to #5:

          Not true. If you load a model that is beyond your phone's hardware capabilities, it simply won't open. Stop spreading FUD.

            C wrote (#9), replying to #7:

            "But unless you've got a server rack full of data center class GPUs, you'll probably set your house on fire before it generates a single token."

            It's cold outside and I don't want to spend money on keeping my house warm, so I could... try.

            I'll check them out! Thank you.

              P wrote (#10), replying to #9:

              Lol, there are smaller versions of DeepSeek-R1. These aren't the "real" DeepSeek model; they are other foundation models (Qwen2.5 and Llama3 in this case) distilled from DeepSeek-R1.

              For the 671b parameter file, the medium-quality version weighs in at 404 GB. That means you need 404 GB of RAM/VRAM just to load the thing. Then you preferably need ALL of that in VRAM (i.e. GPU memory) to get it to generate anything fast.

              For comparison, I have 16 GB of VRAM and 64 GB of RAM on my desktop. If I run the 70b parameter version of Llama3 at Q4 quant (medium quality-ish), it's a 40 GB file. It'll run, but mostly on the CPU. It generates ~0.85 tokens per second, so a good response will take 10-30 minutes. That is fine if you have time to wait, but not if you want an immediate response. If I had two beefy GPUs with 24 GB VRAM each, that'd be 48 GB total and I could run the whole model in VRAM and it'd be very fast.
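The figures in this post can be sanity-checked with a little arithmetic. The sketch below is not from the post: it assumes 1 GB = 10^9 bytes, the helper names are made up, and the token counts are simply back-calculated from the quoted rate. It shows that both files work out to a bit under 5 bits per parameter, and how ~0.85 tokens per second turns into a 10-30 minute wait:

```shell
# bits_per_param SIZE_GB PARAMS_BILLION: average bits stored per parameter.
bits_per_param() {
  awk -v gb="$1" -v p="$2" 'BEGIN { printf "%.2f\n", gb * 8 / p }'
}

# minutes_for TOKENS TOKENS_PER_SEC: generation time in minutes.
minutes_for() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r / 60 }'
}

bits_per_param 404 671   # 671b file at 404 GB  -> 4.82
bits_per_param 40 70     # 70b file at 40 GB    -> 4.57
minutes_for 510 0.85     # ~500-token answer    -> 10.0
minutes_for 1530 0.85    # ~1500-token answer   -> 30.0
```

The same file-size-versus-memory comparison applies on a phone: the model file has to fit in available RAM with room to spare.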

                C wrote (#11), replying to #10:

                No house on fire 😞

                Thanks! I'll check it out

                • C wrote:

                  And what's the purpose of running it locally? Just curious. Is there anything really libre or better?

                  Is there any difference between LLaMA or any libre model and ChatGPT (the first and most popular one I know)?

                  llama@lemmy.dbzer0.com wrote (#12), replying to C:

                  For me the biggest benefits are:

                  • Your queries don't ever leave your computer
                  • You don't have to trust a third party with your data
                  • You know exactly what you're running
                  • You can tweak most models to your liking
                  • You can upload sensitive information to it and not worry about it
                  • It works entirely offline
                  • You can run several models
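
To illustrate the points about queries staying on the device and working offline: with the server from Step 4 running, a query is just an HTTP request to localhost. A sketch using Ollama's /api/generate endpoint on its default port 11434; the helper names are hypothetical, and the naive JSON building assumes the prompt contains no double quotes or backslashes:

```shell
# generate_payload MODEL PROMPT: build the JSON body for /api/generate.
# Naive: assumes PROMPT has no double quotes or backslashes.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$1" "$2"
}

# ask_local MODEL PROMPT: send the request to the local server only.
ask_local() {
  curl -s http://127.0.0.1:11434/api/generate -d "$(generate_payload "$1" "$2")"
}

# Usage: ask_local llama3.2:1b "Why is the sky blue?"
```

Because the request targets 127.0.0.1, it works with mobile data and Wi-Fi switched off once the model has been downloaded.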

                    K wrote (#13), replying to #8:

                    That's not how it works. Your phone can easily overheat if you use it too much, even if your device can handle it. Smartphones don't have cooling like PCs and laptops (except some ROG Phones and the like). If you don't want to fry your processor, only run LLMs on high-end gaming PCs with all-in-one water cooling.

                      projectmoon wrote (#14), replying to #8:

                      Depends on the inference engine. Some of them will try to load the model until it blows up and runs out of memory, which can cause its own problems. But it won't overheat the phone, no. If you DO use a model that the phone can run, though, then like any intense computation it can cause the phone to heat up. Best not run a long inference prompt while the phone is in your pocket, I think.
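Since the thread turns on how hot the phone gets, it may be worth watching temperatures during a long prompt. A rough sketch, assuming the kernel exposes thermal zones under /sys/class/thermal with values in millidegrees Celsius (the usual Linux convention). Many Android devices restrict these files for unprivileged apps, in which case nothing is printed. The helper names are made up:

```shell
# millic_to_c VALUE: convert millidegrees Celsius (sysfs unit) to degrees.
millic_to_c() {
  awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 1000 }'
}

# show_temps: print every readable thermal zone; unreadable ones are skipped.
show_temps() {
  for zone in /sys/class/thermal/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    raw="$(cat "$zone/temp" 2>/dev/null)" || continue
    [ -n "$raw" ] || continue
    echo "$(cat "$zone/type" 2>/dev/null): $(millic_to_c "$raw") C"
  done
}

# Usage: show_temps
```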

                        llama@lemmy.dbzer0.com wrote (#15), replying to #13:

                        Of course that is something to be mindful of, but that's not what the person in the original comment said. It does run, but you need to be aware of the limitations and potential consequences. That goes without saying, though.

                        Don't overdo it and your phone will be just fine.

                          llama@lemmy.dbzer0.com wrote (#16), replying to #14:

                          Thanks for your comment. That for sure is something to look out for. It is really important to know what you're running and what possible limitations there could be.

                            C wrote (#17), replying to #12:

                            The biggest problem:

                            • I don't have enough RAM/GPU to run it on a server

                            But it looks interesting

                              M wrote (#18), replying to #13:

                              This is so horrifically wrong, I don't even know where to start.

                              The short version is that phone and computer makers aren't stupid, and their devices will kill processes or shut down when overheating happens. If you were a phone maker, why tf would you allow someone to fry their own phone?

                              My laptop has shut itself off when I was trying to compile code while playing video games with Twitch playing. My Android phone has killed apps when I try to do too much as well.

                                K wrote (#19), replying to #15:

                                My phone was fried last week and needed SoC reballing, just from watching videos and browsing the web at the same time. Most hardware developers don't pay attention to cooling, and this stuff runs on hopes and dreams. Plus, automatic shutoff is only a software solution, and software can have bugs.
