
LocalChatbot - On-Device AI Assistant for Arm

๐Ÿ† Arm AI Developer Challenge 2025 Submission

Download APK

โš ๏ธ Before Installing APK: Disable Google Play Protect temporarily: Settings โ†’ Google โ†’ Security โ†’ Google Play Protect โ†’ Turn off "Scan apps with Play Protect" (You can re-enable it after installation)

Your AI, Your Model, Your Device: a fully offline, privacy-first AI chatbot that lets you run ANY large language model locally on Arm-powered Android devices. Choose the perfect model for your hardware and task.

📺 Demo Video

Watch Demo Video

👆 Click the image above to watch the demo on YouTube

LocalChatbot Logo

🎯 Project Overview

LocalChatbot brings the power of generative AI directly to your pocket with unprecedented flexibility. Unlike other mobile AI apps that lock you into a single model, LocalChatbot empowers users to:

  • 📱 Choose models that match your device: run lightweight 0.5B models on budget phones or powerful 7B+ models on flagships
  • 🔄 Switch models for different tasks: use a coding-optimized model for programming, a creative model for writing, or a general assistant for everyday questions
  • 💾 Optimize for your resources: select quantization levels (Q4, Q5, Q8) based on available RAM and desired quality
  • 🌐 Stay completely offline: all inference happens on-device with zero data transmission

The app leverages Arm's NEON SIMD instructions for optimized inference, making it possible to run quantized LLMs on mobile devices with impressive performance.

Why This Project Should Win

Criteria                     | How LocalChatbot Delivers
-----------------------------|---------------------------
Technological Implementation | Native C++/JNI integration with llama.cpp, ARM NEON SIMD optimizations, multi-engine architecture supporting GGUF and ExecuTorch formats
User Experience              | Flexible model selection lets users choose models based on their device resources; Material 3 design, streaming responses, floating system-wide assistant
Potential Impact             | Democratizes on-device AI: users with ANY Arm phone can run AI by selecting appropriate models for their hardware
WOW Factor                   | One app, unlimited models: load a tiny 0.5B model on a budget phone or a powerful 7B model on a flagship, all offline!

Key Innovations

  1. Flexible Model Selection ⭐: Users can load ANY GGUF model based on their device capabilities: lightweight models for older phones, powerful models for flagship devices. Switch models anytime for different tasks (coding, writing, Q&A)
  2. True Privacy: All AI processing happens locally; your conversations never leave your device
  3. Arm-Native Optimization: Built from the ground up to leverage Arm NEON SIMD for maximum performance
  4. System-Wide AI Access: Floating assistant and text selection integration bring AI to any app
  5. Multi-Engine Support: Supports both GGUF (llama.cpp) and ExecuTorch (.pte) model formats
  6. Production-Ready UX: Polished Material 3 UI with streaming responses, resource monitoring, and intuitive controls

✨ Key Features

🔄 Flexible Model Selection (Key Feature!)

  • Load any GGUF model: no hardcoded models, full user control
  • Resource-aware choices: pick models that fit YOUR device's RAM and CPU (see the sketch after this list)
  • Task-specific models: use a coding model for programming, a chat model for conversations
  • Hot-swap models: change models without reinstalling the app
  • Wide compatibility: from 0.5B models on budget phones to 7B+ on flagships
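
To make "resource-aware" concrete, the snippet below sketches one way an app could compare a model file against available RAM before loading it. This is an illustrative sketch, not code from this repository; the helper name and the 1.5x headroom factor (KV cache plus runtime overhead) are assumptions.

import android.app.ActivityManager
import android.content.Context
import java.io.File

// Hypothetical helper (not from the app's source): compare free RAM
// against the model file size before attempting to load it.
// The 1.5x headroom factor for KV cache and runtime overhead is a guess.
fun modelLikelyFits(context: Context, modelFile: File): Boolean {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    return info.availMem > (modelFile.length() * 1.5).toLong()
}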

🤖 Local LLM Inference

  • Run quantized GGUF models (Q4, Q5, Q8) directly on device
  • ExecuTorch support for Meta's optimized mobile models
  • Streaming token generation for responsive UX
  • Conversation context management

🎈 Floating AI Assistant

  • System-wide floating button accessible from any app (a minimal overlay sketch follows this list)
  • Draggable chat window overlay
  • Drag-to-close gesture for easy dismissal
  • Persistent across app switches
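
Conceptually, the floating bubble is an Android overlay window. The sketch below shows the core system call involved; it is a simplified illustration, not the actual FloatingAssistantService, which also handles dragging, persistence, and the chat window.

import android.graphics.PixelFormat
import android.view.Gravity
import android.view.View
import android.view.WindowManager

// Minimal sketch: attach a view as a system overlay (API 26+).
// Requires the overlay permission the app requests, i.e.
// SYSTEM_ALERT_WINDOW (Settings.canDrawOverlays must return true).
fun addFloatingBubble(windowManager: WindowManager, bubble: View) {
    val params = WindowManager.LayoutParams(
        WindowManager.LayoutParams.WRAP_CONTENT,
        WindowManager.LayoutParams.WRAP_CONTENT,
        WindowManager.LayoutParams.TYPE_APPLICATION_OVERLAY,
        WindowManager.LayoutParams.FLAG_NOT_FOCUSABLE,
        PixelFormat.TRANSLUCENT
    )
    params.gravity = Gravity.TOP or Gravity.START
    windowManager.addView(bubble, params)
}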

๐Ÿ“ Text Selection Integration

  • Select text anywhere → "Ask AI" appears in context menu (see the sketch after this list)
  • Instant AI analysis of selected content
  • Copy response to clipboard
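
The "Ask AI" entry uses Android's standard text-processing hook (Intent.ACTION_PROCESS_TEXT). Here is a minimal sketch of the receiving side with a hypothetical class name; the app's real handler is ProcessTextActivity:

import android.app.Activity
import android.content.Intent
import android.os.Bundle

// Sketch: an activity registered for Intent.ACTION_PROCESS_TEXT in the
// manifest receives the current selection via EXTRA_PROCESS_TEXT.
class AskAiActivity : Activity() {  // hypothetical name
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        val selected = intent.getCharSequenceExtra(Intent.EXTRA_PROCESS_TEXT)?.toString()
        // Hand `selected` to the loaded model, then show the reply in a
        // dialog with a copy-to-clipboard action.
    }
}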

📊 Real-Time Resource Monitoring

  • Live CPU usage tracking
  • Memory consumption display
  • Native heap monitoring for model memory
  • Toggle stats on/off for performance (a sampling sketch follows this list)
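
These stats map onto standard Android APIs. The sketch below shows one plausible sampling routine; the real ResourceMonitor.kt may read additional counters (e.g. per-core CPU load):

import android.app.ActivityManager
import android.content.Context
import android.os.Debug

// Sketch: sample system memory and the native heap (where llama.cpp
// keeps model weights) for the on-screen stats toggle.
data class ResourceSample(val availMemMb: Long, val nativeHeapMb: Long)

fun sampleResources(context: Context): ResourceSample {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)
    return ResourceSample(
        availMemMb = info.availMem / (1024 * 1024),
        nativeHeapMb = Debug.getNativeHeapAllocatedSize() / (1024 * 1024)
    )
}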

⚡ Arm Architecture Optimizations

  • ARM64-v8a native build with NEON SIMD enabled
  • Compiler flags: -O3 -ffast-math -march=armv8-a+simd
  • Greedy sampling (temperature=0) for fastest inference
  • Minimal context window for mobile efficiency

๐Ÿ› ๏ธ Technical Implementation

Architecture

┌──────────────────────────────────────────────────────────────┐
│                      LocalChatbot App                        │
├──────────────────────────────────────────────────────────────┤
│  UI Layer (Jetpack Compose + Material 3)                     │
│  ├── ChatScreen - Main chat interface                        │
│  ├── SettingsScreen - Model & inference settings             │
│  ├── FloatingAssistantService - System overlay               │
│  └── ProcessTextActivity - Text selection handler            │
├──────────────────────────────────────────────────────────────┤
│  Business Logic                                              │
│  ├── ChatViewModel - UI state management                     │
│  ├── ModelRunner - Inference orchestration                   │
│  └── ResourceMonitor - Performance tracking                  │
├──────────────────────────────────────────────────────────────┤
│  Inference Layer                                             │
│  ├── EngineManager - Multi-engine abstraction                │
│  ├── GGUFEngine - llama.cpp integration                      │
│  └── ExecuTorchEngine - Meta ExecuTorch support              │
├──────────────────────────────────────────────────────────────┤
│  Native Layer (C++ / JNI)                                    │
│  ├── llama-android.cpp - JNI bindings                        │
│  └── llama.cpp - Optimized inference engine                  │
│      └── GGML with ARM NEON SIMD                             │
└──────────────────────────────────────────────────────────────┘
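
As a rough sketch of how the inference layer's abstraction might look in Kotlin (the names mirror the files listed in the project structure below, but the actual interfaces may differ):

// Hypothetical sketch of the engine abstraction; the real
// InferenceEngine.kt and EngineManager.kt may differ in detail.
interface InferenceEngine {
    suspend fun loadModel(path: String): Boolean
    fun generate(prompt: String, onToken: (String) -> Unit)
    fun unload()
}

class EngineManager(
    private val ggufEngine: InferenceEngine,
    private val execuTorchEngine: InferenceEngine,
) {
    // Route by file extension: .pte -> ExecuTorch, otherwise llama.cpp/GGUF.
    fun engineFor(modelPath: String): InferenceEngine =
        if (modelPath.endsWith(".pte")) execuTorchEngine else ggufEngine
}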

Arm-Specific Optimizations

Why Arm NEON Matters: ARM NEON is a SIMD (Single Instruction, Multiple Data) architecture extension that allows parallel processing of multiple data elements. For LLM inference, this means:

  • Matrix multiplications run 4-8x faster
  • Quantized model operations are hardware-accelerated
  • Memory bandwidth is used more efficiently

CMake Configuration:

# ARM NEON SIMD optimizations (critical for mobile performance)
set(GGML_NEON ON CACHE BOOL "Enable ARM NEON" FORCE)

# Disable unnecessary features for mobile
set(LLAMA_CURL OFF)
set(GGML_OPENMP OFF)  # Single-threaded for battery efficiency

# Performance flags
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -ffast-math -fno-finite-math-only")

Gradle NDK Configuration:

ndk {
    abiFilters += listOf("arm64-v8a")  // ARM64 only - no x86 bloat
}
externalNativeBuild {
    cmake {
        arguments += "-DGGML_NEON=ON"
        cppFlags += listOf("-O3", "-ffast-math", "-march=armv8-a+simd")
    }
}

Inference Optimizations:

  • Greedy sampling (temperature=0, top_k=1) for fastest token selection
  • 2048 token context window optimized for mobile memory
  • Single-turn conversation history to minimize prompt size
  • Streaming generation for perceived responsiveness (sketched below)
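
To illustrate the streaming path, the sketch below exposes token generation as a Kotlin Flow, reusing the hypothetical InferenceEngine interface from the architecture section; the actual plumbing in ModelRunner and LlamaCpp.kt may differ:

import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Sketch: surface token-by-token generation as a cold Flow that the
// Compose UI can collect. With temperature=0 and top_k=1 the native
// side simply picks the argmax token, so no sampling parameters
// appear at this layer.
fun streamTokens(engine: InferenceEngine, prompt: String): Flow<String> =
    callbackFlow {
        engine.generate(prompt) { token -> trySend(token) }
        close() // generation finished; complete the flow
        awaitClose { /* cancel native generation if collection stopped early */ }
    }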

📱 Setup Instructions

Prerequisites

  • Android Studio Hedgehog (2023.1.1) or newer
  • Android NDK 25.0 or newer (installed via SDK Manager)
  • CMake 3.22.1+ (installed via SDK Manager)
  • An Arm64 Android device (arm64-v8a) with Android 8.0+
  • ~2GB free storage for models

Build Steps

  1. Clone the repository:

    git clone https://github.com/GGUFloader/Mobile-AI-Assistant.git
    cd Mobile-AI-Assistant
  2. Open in Android Studio:

    • Open the Mobile-AI-Assistant folder in Android Studio
    • Wait for Gradle sync to complete (llama.cpp is automatically downloaded by Gradle)
    • If prompted, install any missing SDK components
  3. Build the project:

    # On Windows
    gradlew.bat assembleRelease
    
    # On macOS/Linux
    ./gradlew assembleRelease

    Or use Android Studio: Build โ†’ Build Bundle(s) / APK(s) โ†’ Build APK(s)

  4. Install on device:

    adb install app/build/outputs/apk/release/app-release.apk

Running the App

  1. Download a GGUF model (choose based on your device RAM):

    Model                      | Size   | RAM Required | Download
    ---------------------------|--------|--------------|------------
    Dhi Qwen2.5 0.5B Q4_K_M ⭐ | ~400MB | 512MB        | Download
    TinyLlama 1.1B Q4_K_M      | 637MB  | 1GB          | HuggingFace
    Phi-2 Q4_K_M               | 1.6GB  | 2GB          | HuggingFace
    Gemma 2B Q4_K_M            | 1.4GB  | 2GB          | HuggingFace

    💡 Recommended for Testing: The Dhi Qwen2.5 0.5B model is lightweight and perfect for quick testing on most devices.

  2. Transfer model to device:

    # Recommended model for testing
    adb push Dhi_Qwen2P5_0_5B_Q4_K_M.gguf /sdcard/Download/
    
    # Or any other model you downloaded
    adb push TinyLlama-1.1B-Chat-v1.0-Q4_K_M.gguf /sdcard/Download/
  3. Load model in app:

    • Open LocalChatbot
    • Tap the folder icon to select your model file
    • Wait for model to load (progress shown)
  4. Start chatting!

    • Type your message and tap send
    • Enable floating button for system-wide access

🎮 Usage Guide

Main Chat Interface

  • Type messages in the input field
  • Tap send or press enter to generate response
  • Watch streaming tokens appear in real-time
  • Tap stop button to cancel generation mid-stream
  • Toggle stats icon to see CPU/memory usage

Floating Assistant (System-Wide AI)

  1. Enable "Floating Button" toggle in main screen
  2. Grant overlay permission when prompted
  3. Tap floating bubble from any app to open chat
  4. Drag bubble to reposition
  5. Drag bubble to top of screen to dismiss

Text Selection AI

  1. Select text in any app (browser, email, notes, etc.)
  2. Tap "Ask AI" from the context menu
  3. View AI response in popup dialog
  4. Copy response to clipboard with one tap

Settings

  • Adjust inference parameters (temperature, top_k, top_p); a sketch of these settings follows this list
  • Configure context window size
  • Switch between inference engines
  • Manage loaded models
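
A plausible shape for the values these settings screens persist, with hypothetical field names and defaults taken from the inference optimizations described earlier in this README:

// Hypothetical settings holder; field names are illustrative, and the
// defaults mirror the inference optimizations quoted above.
enum class Engine { GGUF, EXECUTORCH }

data class InferenceSettings(
    val temperature: Float = 0.0f, // 0 = greedy sampling
    val topK: Int = 1,
    val topP: Float = 1.0f,
    val contextSize: Int = 2048,   // mobile-friendly context window
    val engine: Engine = Engine.GGUF,
)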

📊 Performance Benchmarks

Tested on various Arm devices:

Device      | SoC                   | Model              | Tokens/sec | Memory
------------|-----------------------|--------------------|------------|-------
Pixel 7     | Tensor G2 (Cortex-X1) | TinyLlama 1.1B Q4  | 8-12 t/s   | ~800MB
Pixel 7     | Tensor G2             | Phi-2 Q4           | 4-6 t/s    | ~1.8GB
Samsung S23 | Snapdragon 8 Gen 2    | TinyLlama 1.1B Q4  | 10-15 t/s  | ~800MB
OnePlus 12  | Snapdragon 8 Gen 3    | Gemma 2B Q4        | 8-10 t/s   | ~1.6GB

Performance varies based on device thermal state and background processes

🔧 Project Structure

LocalChatbot/
├── app/
│   ├── src/main/
│   │   ├── java/com/example/localchatbot/
│   │   │   ├── MainActivity.kt              # App entry point
│   │   │   ├── ModelRunner.kt               # Inference orchestration
│   │   │   ├── ChatApplication.kt           # App-wide state
│   │   │   ├── data/
│   │   │   │   ├── ChatMessage.kt           # Message data class
│   │   │   │   └── ChatHistoryRepository.kt # Persistence
│   │   │   ├── inference/
│   │   │   │   ├── InferenceEngine.kt       # Engine interface
│   │   │   │   ├── EngineManager.kt         # Multi-engine support
│   │   │   │   ├── GGUFEngine.kt            # llama.cpp wrapper
│   │   │   │   ├── ExecuTorchEngine.kt      # ExecuTorch wrapper
│   │   │   │   └── LlamaCpp.kt              # JNI interface
│   │   │   ├── overlay/
│   │   │   │   ├── FloatingAssistantService.kt  # Floating bubble
│   │   │   │   ├── AssistantChatActivity.kt     # Overlay chat UI
│   │   │   │   └── ProcessTextActivity.kt       # Text selection
│   │   │   ├── ui/
│   │   │   │   ├── ChatScreen.kt            # Main chat UI
│   │   │   │   ├── ChatViewModel.kt         # UI state management
│   │   │   │   ├── SettingsScreen.kt        # Settings UI
│   │   │   │   └── theme/Theme.kt           # Material 3 theming
│   │   │   ├── settings/
│   │   │   │   ├── InferenceSettings.kt     # Inference params
│   │   │   │   └── EngineSettings.kt        # Engine config
│   │   │   └── util/
│   │   │       └── ResourceMonitor.kt       # CPU/memory tracking
│   │   ├── cpp/
│   │   │   ├── CMakeLists.txt               # Native build config
│   │   │   ├── llama-android.cpp            # JNI bindings
│   │   │   └── llama.cpp/                   # Inference engine (submodule)
│   │   ├── res/                             # Android resources
│   │   └── AndroidManifest.xml              # App manifest
│   └── build.gradle.kts                     # App build config
├── build.gradle.kts                         # Root build config
├── settings.gradle.kts                      # Gradle settings
├── LICENSE                                  # MIT License
└── README.md                                # This file

🔒 Privacy & Security

LocalChatbot is designed with privacy as a core principle:

  • 100% Offline: No internet permission required, no data transmission
  • Local Processing: All AI inference happens on-device
  • No Analytics: Zero tracking, telemetry, or data collection
  • Open Source: Full transparency; audit the code yourself
  • Your Data Stays Yours: Conversations are stored locally and never leave your device

🚀 Future Roadmap

  • Voice input/output support (on-device speech recognition)
  • Multiple conversation threads with history
  • In-app model download manager
  • Prompt templates library
  • Home screen widget for quick access
  • Wear OS companion app
  • RAG support for document Q&A

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

MIT License - See LICENSE file for details.

This project is open source and free to use, modify, and distribute.

๐Ÿ™ Acknowledgments

  • llama.cpp - The incredible inference engine that makes this possible
  • ExecuTorch - Meta's mobile inference framework
  • Arm - For the amazing mobile architecture and NEON SIMD
  • Jetpack Compose - Modern Android UI toolkit

Built with ❤️ for the Arm AI Developer Challenge 2025
