Tomáš Repčík - 7. 12. 2025

Running LLMs on Smartphones

How to leverage on-device models for mobile development


Large language models have traditionally been intended for large servers with powerful GPUs.

However, not every use case calls for a server-side model, especially when it comes to mobile applications.

That is why many companies have released smaller open-source models that can run on-device.

For example, Google’s Gemma models, the Qwen 3 series, IBM’s Granite 4 series, and many others.

The easiest way to get a feel for them is to try them out on your computer by downloading them with Ollama.

If you are a developer, your computer is most probably powerful enough to run these smaller models.

Unless you have a really budget-friendly Android phone, you can even try to run them on your phone.

Let’s see how to use them in mobile applications.

This is not a full guide, as this topic changes rapidly, but rather a look at how to overcome the initial hurdles. For specific implementation details, please refer to the official documentation of the respective libraries.


Google Edge AI

Google shares AI/ML implementations via Edge AI, an SDK that allows you to run different models directly on your device.

With no compilation needed, you can integrate on-device models into your mobile applications.

Even though it sounds easy, I had a few issues during my first tries.

Compatibility of Models

With Edge AI, the biggest issue for me was figuring out the differences between the models.

I was searching for a broader selection of supported models, and I found models with different file extensions.

Some of them have .task and some .litertlm extensions.

I had the impression that both should be compatible with Edge AI, but I was not able to load them properly.

As it turns out, only .task models are supported by Edge AI.

LiteRT models are actually the former TensorFlow Lite models and use a completely different API. It is a more general API for running models on-device, not only LLMs. You can convert PyTorch or TensorFlow models to the LiteRT format quite easily.

To run .litertlm models, you need to use the LiteRT API instead.
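
To make this concrete, here is a minimal sketch of loading a .task model with the LLM Inference API from the MediaPipe Tasks GenAI library that Edge AI ships. The model file name, path, and option values are my assumptions; check the official documentation for the exact artifacts and current API.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load a .task model and run a single prompt.
// The model path/file name is an assumption - point it at the .task file you downloaded.
fun runTaskModel(context: Context): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma3-1b-it-int4.task") // assumed location
        .setMaxTokens(512) // cap on prompt + response tokens (illustrative value)
        .build()

    // Loading a .litertlm file here fails - only .task models are supported.
    val llmInference = LlmInference.createFromOptions(context, options)
    return llmInference.generateResponse("Explain on-device LLMs in one sentence.")
}
```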

Inference issues

When I was trying to run certain models on my Pixel 7a, I was running into inference issues.

Mainly, when GPU acceleration was enabled, I was getting weird results.

For example, running Gemma 3n 4B with GPU acceleration was giving me gibberish output with <pad><pad><pad>... tokens.

When I disabled GPU acceleration, the model produced correct output, but inference was much slower.
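
A rough sketch of switching between backends is below. The setPreferredBackend option exists in recent versions of the tasks-genai artifact, so treat the exact option name as an assumption and verify it against the SDK version you use.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: prefer the GPU backend, but fall back to CPU if the output degrades
// (e.g. endless <pad> tokens) or initialization fails. The backend option is
// available in recent tasks-genai versions; older versions may not expose it.
fun createEngine(context: Context, modelPath: String, useGpu: Boolean): LlmInference {
    val backend = if (useGpu) LlmInference.Backend.GPU else LlmInference.Backend.CPU
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath(modelPath)
        .setPreferredBackend(backend)
        .build()
    return LlmInference.createFromOptions(context, options)
}
```

If the GPU build of the model keeps producing gibberish, recreating the engine with useGpu = false is the simplest workaround, at the cost of slower inference.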

Another issue is the battery drain, because running models on the device is a power-intensive task.

Session Management

Be aware that when you run models on-device, you need to manage sessions properly.

If you create a conversation session and do not dispose of it properly, you might run into memory leaks and app crashes.

Or the model might carry over context from previous sessions, which you do not want.
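
A minimal sketch of a per-conversation session that is disposed of when the conversation ends might look like this (the sampling parameters are illustrative assumptions):

```kotlin
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession

// Sketch: one session per conversation, closed when the conversation ends,
// so no context leaks into the next chat and native memory is released.
fun askOnce(llmInference: LlmInference, prompt: String): String {
    val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
        .setTemperature(0.7f) // illustrative values
        .setTopK(40)
        .build()

    val session = LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
    return try {
        session.addQueryChunk(prompt)
        session.generateResponse()
    } finally {
        session.close() // dispose of the session; keep llmInference around for reuse
    }
}
```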

Cactus

If you have a multi-platform application, for example in Flutter, React Native, or KMP, the easiest way to use LLMs is Cactus.

Cactus supports a specific subset of models, because they optimize them for mobile devices.

They run their own model format, which uses their own kernels to do all the calculations.

All the inference runs purely on the CPU, so no GPU acceleration is used. Surprisingly, the performance can match GPU-accelerated inference on some devices.

That is because they optimize the calculations at the CPU level and use zero-copy techniques to speed up inference.

The integration is quite straightforward: you just need to add their SDK to your project and download/load the model. With a few lines of code, you have a running model in your app.

During the installation, I did not encounter any issues. My recommendation is to run the model in a separate isolate/thread so it does not block the main UI thread, as sketched below.
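
Here is a minimal Kotlin sketch of that idea. The generateBlocking function is a hypothetical placeholder for whatever blocking call your SDK (Cactus, Edge AI, or anything else) exposes; the point is only that the work is moved off the main thread onto a background dispatcher.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// Sketch: keep blocking inference off the main thread.
suspend fun generateOffMainThread(prompt: String): String =
    withContext(Dispatchers.Default) {
        generateBlocking(prompt) // CPU-bound work runs on a background worker pool
    }

// Hypothetical placeholder standing in for the real SDK call.
fun generateBlocking(prompt: String): String = TODO("call your on-device LLM SDK here")
```

In Flutter, the equivalent is spawning an isolate for the inference call so the UI isolate stays responsive.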

Llama.cpp on Mobile

I have been trying to work with Llama.cpp models on mobile devices as well. It is possible to compile Llama.cpp for Android and iOS, but it requires some effort.

Unfortunately, I have not been able to get it working properly, but that is mostly on me. After compiling the library for mobile devices, I did not manage to create bindings for the app (a rough sketch of what they could look like is below).
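
For the record, the Kotlin side of hand-rolled bindings could look something like this. It assumes you have compiled your own small JNI wrapper around Llama.cpp (called llama_jni here); the native function names are hypothetical, must match C functions you write yourself, and are not part of Llama.cpp’s API.

```kotlin
// Sketch of the Kotlin side of hypothetical hand-rolled llama.cpp bindings.
object LlamaBridge {
    init {
        System.loadLibrary("llama_jni") // loads libllama_jni.so packaged in the APK
    }

    external fun loadModel(modelPath: String): Long        // returns a native handle
    external fun generate(handle: Long, prompt: String): String
    external fun freeModel(handle: Long)                   // release native memory
}
```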

Maybe if I have more time, I will try to get it working properly.

I did manage to run Llama.cpp on my desktop computer, and it works quite well for smaller models. The mobile failure is mostly on me, as I might have missed some steps during the compilation/binding process.

Good practices

If you plan to build an app with on-device models, keep a few best practices in mind: check that the model format matches your SDK, create and dispose of sessions properly, run inference off the main UI thread, and account for battery drain and backend quirks on real devices.

Conclusion

Running LLMs on mobile devices is becoming more feasible with advancements in model optimization and mobile SDKs.

However, it still requires careful consideration of model selection, performance optimization, and user experience.

If you are not constrained to on-device-only inference, consider a hybrid approach, where smaller models run on the device for quick responses and larger models run on the server for complex tasks; a rough sketch of such routing is below.
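
A sketch of such a routing decision, with purely illustrative placeholder functions and thresholds:

```kotlin
// Sketch: quick or offline-friendly prompts go to the on-device model,
// heavier ones to a server model. The threshold and the two generate
// functions are illustrative assumptions, not a specific API.
suspend fun answer(prompt: String, isOnline: Boolean): String {
    val looksComplex = prompt.length > 500 || prompt.lines().size > 10
    return if (looksComplex && isOnline) {
        generateOnServer(prompt)   // larger model, higher quality, needs network
    } else {
        generateOnDevice(prompt)   // smaller model, fast, private, works offline
    }
}

suspend fun generateOnServer(prompt: String): String = TODO("call your backend")
suspend fun generateOnDevice(prompt: String): String = TODO("call your on-device SDK")
```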

Socials

Thanks for reading this article!

For more content like this, follow me here or on X or LinkedIn.
