Stream multimodal AI inference at 5–10 tokens/sec on Arm SME2-enabled devices. Watch Gemma 3N vision models convert math to LaTeX in real time. Zero cloud APIs. Pure ARM64.
Img2Latex converts handwritten math, lecture notes, and problem sets into professional PDFs—100% on-device with ARM-optimized AI. Built for the Arm AI Developer Challenge 2025, this app demonstrates production-grade multimodal LLM deployment with real-time performance monitoring on iOS.
- 100% On-Device Processing: Zero cloud dependencies, all AI inference runs locally using MediaPipe Tasks GenAI
- Multimodal Vision AI: Gemma 3N with vision encoder/adapter for image-to-LaTeX conversion
- ARM64 Optimizations: Accelerate framework (vImage) for image preprocessing, INT4 quantized models
- ARM-optimized backends (via MediaPipe): MediaPipe integrates ARM kernels (KleidiAI/XNNPACK); SME2-compatible on supported devices
- Client-Side PDF Rendering: WKWebView + latex.js for zero-dependency PDF generation
- Real-Time Metrics: Live CPU, memory, battery, and thermal monitoring during AI inference
- Model Management: Download and switch between Gemma 3N 2B and 4B models at runtime
- Persistent History: SwiftData storage with search, favorites, and refinement tracking
- LaTeX Refinement: Iterative improvements with user feedback and history
Img2Latex demonstrates ARM-optimized on-device AI for multimodal document understanding:
AI/ML Stack:
- Gemma 3N (2B/4B) [1]: INT4 quantized vision-language models running entirely on ARM CPU
- MediaPipe Tasks GenAI 0.10.24 [2]: Inference runtime with integrated ARM backends (KleidiAI/XNNPACK)
- SME2 Compatibility: Automatic use of Scalable Matrix Extension 2 on iPhone 16/M4; NEON fallback on older devices
- Vision Pipeline: Extracts TFLite vision encoder/adapter from `.task` files; processes up to 5 images per inference
ARM64 Optimizations:
- Accelerate Framework [3]: vImage SIMD operations for parallel image downscaling (ARM NEON)
- INT4 Quantization: 4-bit weights optimized for ARM integer pipelines
- Thermal Management: ProcessInfo monitoring with adaptive performance tuning
- Zero GPU Dependency: Pure CPU inference leveraging ARM's efficient matrix operations via KleidiAI [4]
Architecture: SwiftUI app with MediaPipe inference service, client-side PDF rendering (WKWebView + latex.js), and SwiftData persistence
| Component | Technology | Purpose |
|---|---|---|
| AI Inference | MediaPipe Tasks GenAI 0.10.24 [2] | On-device LLM runtime (ARM-optimized via KleidiAI [4]/XNNPACK; SME2-compatible on supported devices) |
| Vision Processing | Gemma 3N Vision Encoder/Adapter [1] | Multimodal image understanding (extracted from .task files) |
| ARM Optimization | Accelerate Framework (vImage) [3] | SIMD-accelerated image downscaling |
| Model Format | INT4 Quantized TFLite | 4-bit quantization for memory efficiency |
| PDF Rendering | WKWebView + latex.js [5] | Client-side LaTeX compilation to PDF |
| Data Persistence | SwiftData | Type-safe local storage |
| UI Framework | SwiftUI + Charts | Declarative UI with real-time performance visualization |
Gemma 3N E2B and E4B models use selective parameter activation, achieving effective parameter counts of 2B and 4B while containing more total parameters. This technique reduces memory bandwidth and improves inference speed on mobile devices by activating only the necessary parameters for each token prediction.
The Gemma 3N models include vision encoder and adapter components for multimodal input:
- Vision Encoder: TFLite model that processes images into embeddings (extracted from the `.task` archive)
- Vision Adapter: Bridges vision embeddings to the language model's input space
- Multi-Image Support: Handles up to 5 images per generation for document conversion
Our implementation extracts these components from the .task file using ZIPFoundation and configures MediaPipe's LlmInference with vision modality enabled:
```swift
options.visionEncoderPath = extractedVisionEncoderPath.path
options.visionAdapterPath = extractedVisionAdapterPath.path
options.maxImages = 5
options.enableVisionModality = true
```

Img2Latex leverages ARM64-specific features throughout the pipeline for maximum performance and energy efficiency. MediaPipe integrates ARM-optimized backends (KleidiAI [4]/XNNPACK) and uses SME2 automatically on supported hardware; older devices fall back to NEON.
Images are downscaled in parallel using the Accelerate framework's vImage API, which uses ARM NEON SIMD instructions:
```swift
private func downscaleCGImageAccelerate(_ src: CGImage, maxDimension: Int) -> CGImage? {
    // ... vImage_Buffer (srcBuf/dstBuf) and vImage_CGImageFormat (format) setup ...
    vImageScale_ARGB8888(&srcBuf, &dstBuf, nil, vImage_Flags(kvImageHighQualityResampling))
    return vImageCreateCGImageFromBuffer(&dstBuf, &format, nil, nil,
                                         vImage_Flags(kvImageNoAllocate), nil)?.takeRetainedValue()
}
```

Benefits:
- 30-70% size reduction before inference
- Parallel processing via Swift TaskGroup (up to 10 images simultaneously; sketched below)
- SIMD-accelerated resampling (ARM NEON intrinsics)
- Lower memory footprint during vision encoding
Note: Performance improvements enabled by KleidiAI integration through MediaPipe's XNNPACK backend [4].
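Below is a minimal sketch of the parallel preprocessing step. The `transform` parameter stands in for the vImage downscaling routine shown above, and the function name is illustrative rather than the app's actual API:

```swift
import UIKit

/// Run a downscaling transform over up to 10 images concurrently,
/// preserving the original input order (illustrative sketch).
func preprocessInParallel(_ images: [UIImage],
                          transform: @escaping @Sendable (UIImage) -> UIImage) async -> [UIImage] {
    await withTaskGroup(of: (Int, UIImage).self) { group in
        for (index, image) in images.prefix(10).enumerated() {
            group.addTask { (index, transform(image)) }
        }
        var results: [(Int, UIImage)] = []
        for await item in group { results.append(item) }
        // TaskGroup completion order is arbitrary, so sort back into input order.
        return results.sorted { $0.0 < $1.0 }.map(\.1)
    }
}
```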
Models are quantized to INT4 precision for deployment:
- Weights stored as 4-bit integers
- Dynamic dequantization during inference
- ARM64 optimized INT8/INT4 matrix multiplication kernels in MediaPipe
Adaptive parameters based on thermal state and user preference:
| Parameter | Normal Mode | Performance Mode |
|---|---|---|
| Max Tokens | 2000 | 1200 |
| Image Size | 1536px | 1024px |
| Top K | 40 | 60 |
| Top P | 0.90 | 0.95 |
| Temperature | 0.7 | 0.6 |
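A minimal sketch of how these presets could be selected, assuming a `GenerationPreset` type and a `performanceModeEnabled` user toggle (both names are illustrative, not the app's actual API):

```swift
import Foundation

/// Parameter bundle mirroring the table above (illustrative type).
struct GenerationPreset {
    let maxTokens: Int
    let maxImageDimension: Int
    let topK: Int
    let topP: Float
    let temperature: Float

    static let normal = GenerationPreset(maxTokens: 2000, maxImageDimension: 1536,
                                         topK: 40, topP: 0.90, temperature: 0.7)
    static let performance = GenerationPreset(maxTokens: 1200, maxImageDimension: 1024,
                                              topK: 60, topP: 0.95, temperature: 0.6)
}

/// Choose a preset from the user toggle and the current thermal state.
func currentPreset(performanceModeEnabled: Bool) -> GenerationPreset {
    let thermal = ProcessInfo.processInfo.thermalState
    // Under serious/critical thermal pressure, fall back to the lighter preset.
    if performanceModeEnabled || thermal == .serious || thermal == .critical {
        return .performance
    }
    return .normal
}
```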
Real-time thermal monitoring with adaptive behavior:
```swift
NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { [weak self] _ in
    self?.thermalState = ProcessInfo.processInfo.thermalState
}
```

Thermal states: `.nominal`, `.fair`, `.serious`, `.critical` → the UI displays warnings and suggests enabling Performance Mode.
- Image Input: PhotosPicker or Camera capture (up to 10 images)
- Preprocessing: Parallel vImage downscaling (1024-1536px max dimension)
- Vision Encoding: TFLite vision encoder extracts image embeddings
- Vision Adaptation: Adapter projects embeddings to LLM input space
- Gemma 3N Inference: INT4 quantized model generates LaTeX (streaming, 30fps UI updates)
- LaTeX Extraction: Strip markdown formatting, validate structure (see the sketch after this list)
- PDF Rendering: WKWebView loads latex.js CDN, compiles LaTeX client-side
- Storage: SwiftData saves images, LaTeX, PDF data, and metadata
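As a concrete illustration of the LaTeX Extraction step above, here is a minimal, self-contained sketch that strips a surrounding Markdown code fence from the model output; the helper name `extractLaTeX` is hypothetical, and the app's actual extraction logic may differ:

```swift
import Foundation

/// Strip a leading/trailing Markdown code fence from the model's raw
/// output before validation and rendering.
func extractLaTeX(from modelOutput: String) -> String {
    var text = modelOutput.trimmingCharacters(in: .whitespacesAndNewlines)
    // Drop an opening fence line such as "```latex" or "```".
    if text.hasPrefix("```"), let newline = text.firstIndex(of: "\n") {
        text = String(text[text.index(after: newline)...])
    }
    // Drop a trailing closing fence.
    if text.hasSuffix("```") {
        text = String(text.dropLast(3))
    }
    return text.trimmingCharacters(in: .whitespacesAndNewlines)
}
```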
Real-time token generation with throttled UI updates:
```swift
for try await chunk in stream {
    fullResponse += chunk
    let now = Date()
    if now.timeIntervalSince(lastUIUpdate) >= (1.0 / 30.0) {
        await MainActor.run {
            streamingLaTeX = fullResponse
            currentTokensPerSecond = estimatedTokens / elapsedTime
        }
        lastUIUpdate = now
    }
}
```

Benefits:
- User sees progress immediately during streaming
- 30fps UI updates prevent main thread saturation
- Real-time metrics displayed during generation
The core service managing Gemma 3N inference, implemented as a singleton with @MainActor isolation.
Responsibilities:
- Model initialization and lifecycle management
- Image preprocessing with vImage (parallel TaskGroup)
- Streaming LaTeX generation with 30fps UI updates
- Performance metrics collection (generation time, tokens/sec, memory, battery)
- Model switching (2B ↔ 4B) at runtime
- Real-time system monitoring (CPU, memory, thermal state)
Key Methods:
```swift
func generateLaTeX(from images: [UIImage],
                   additionalPrompt: String?,
                   status: GenerationStatus) async throws -> String

func refineLaTeX(currentLaTeX: String,
                 userFeedback: String,
                 status: GenerationStatus) async throws -> String

func switchModel(to modelIdentifier: GemmaModelIdentifier) async
```

Performance Tracking:
- `GenerationMetrics`: Captures timestamp, tokens, generation time, memory, and battery impact
- Historical metrics (last 50 generations) for analytics
- Live metrics updated every 1 second via Timer
WKWebView-based renderer using latex.js CDN for client-side PDF generation.
Features:
- Strips unsupported LaTeX packages (tikz, graphicx, geometry, fancyhdr)
- Injects HTML with latex.js CDN and LaTeX source
- Waits for JavaScript compilation (2s timeout)
- Uses `WKWebView.createPDF()` for native PDF generation
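A minimal sketch of that final step, bridging `WKWebView.createPDF(configuration:completionHandler:)` into async/await; it assumes the web view has already finished rendering the latex.js output, and error handling is elided:

```swift
import WebKit

/// Snapshot the rendered page into PDF data (illustrative helper).
@MainActor
func makePDF(from webView: WKWebView) async throws -> Data {
    try await withCheckedThrowingContinuation { continuation in
        let config = WKPDFConfiguration() // default: capture the full page
        webView.createPDF(configuration: config) { result in
            continuation.resume(with: result) // Result<Data, Error>
        }
    }
}
```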
Supported LaTeX:
- `\documentclass{article}`, `\usepackage{amsmath}`, `\usepackage{amssymb}`
- Math environments: `\[ \]`, `$ $`, `equation`, `align`, `gather`
- Text formatting: `\textbf{}`, `\textit{}`, `\underline{}`
- Lists: `itemize`, `enumerate`
Limitations (due to latex.js):
- No `tikz` or graphics imports
- No custom page geometry or headers/footers
- No complex table environments
Manages downloading, storage, and deletion of Gemma models from Cloudflare R2.
Features:
- Progress tracking with `URLSessionDownloadDelegate` (a sketch follows this list)
- Resume support for interrupted downloads
- Local storage in Documents directory
- Model size reporting and validation
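A minimal sketch of the progress-tracking piece, assuming a plain delegate class; the names `DownloadProgressDelegate` and `onProgress` are illustrative, not the app's actual API:

```swift
import Foundation

/// Report fractional download progress and receive the finished file.
final class DownloadProgressDelegate: NSObject, URLSessionDownloadDelegate {
    var onProgress: ((Double) -> Void)?

    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didWriteData bytesWritten: Int64, totalBytesWritten: Int64,
                    totalBytesExpectedToWrite: Int64) {
        guard totalBytesExpectedToWrite > 0 else { return }
        onProgress?(Double(totalBytesWritten) / Double(totalBytesExpectedToWrite))
    }

    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        // Move the temporary file into Documents/models/ before returning;
        // `location` is deleted once this method exits.
    }
}
```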
Model Configuration (ModelConfig.swift):
```swift
static let availableModels: [GemmaModelIdentifier: DownloadableModelConfig] = [
    .gemma2B: DownloadableModelConfig(
        identifier: .gemma2B,
        downloadURL: URL(string: "https://r2-url/gemma-3n-E2B-it-int4.task")!,
        expectedSizeMB: 500.0
    ),
    .gemma4B: DownloadableModelConfig(
        identifier: .gemma4B,
        downloadURL: URL(string: "https://r2-url/gemma-3n-E4B-it-int4.task")!,
        expectedSizeMB: 900.0
    )
]
```

SwiftData-based persistence layer for generation history and refinements.
Models:
- `Generation`: Stores images, LaTeX, PDF data, timestamp, title, favorites, and refinement history
- `RefinementEntry`: Tracks user feedback and previous LaTeX for each refinement iteration
Key Methods:
```swift
func saveGeneration(images: [UIImage], latex: String, pdfDocument: PDFDocument?, title: String?)
func loadGenerations(searchTerm: String?, favoritesOnly: Bool) -> [Generation]
func addRefinement(to generation: Generation, feedback: String, previousLaTeX: String, newLaTeX: String)
func getStatistics() -> StorageStatistics
func searchGenerations(query: String) -> [Generation]
```

Storage Statistics:
- Total generations count
- Total storage size (images + PDFs)
- Average generation time
- Favorite count
Real-time and historical performance analytics UI using Charts framework.
Real-Time Metrics:
- Memory usage gauge (actual resident memory via `ProcessMetrics`)
- Battery level (`UIDevice` monitoring)
- Thermal state (`ProcessInfo`)
- Current tokens/second (live during generation)
Historical Charts:
- Line chart: Generation time over last 20 generations (sketched after this list)
- Bar chart: Tokens/second per generation
- Area chart: Memory usage during generation
- Sparklines: Real-time memory history (last 30 data points)
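A minimal sketch of the generation-time line chart with the Charts framework; the point type is illustrative, while the real StatsView binds to the stored `GenerationMetrics`:

```swift
import SwiftUI
import Charts

/// One data point per generation (illustrative type).
struct GenerationTimePoint: Identifiable {
    let id = UUID()
    let index: Int
    let seconds: Double
}

/// Line chart of generation time across recent generations.
struct GenerationTimeChart: View {
    let points: [GenerationTimePoint]

    var body: some View {
        Chart(points) { point in
            LineMark(x: .value("Generation", point.index),
                     y: .value("Seconds", point.seconds))
        }
        .frame(height: 160)
    }
}
```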
System Information:
- Model name and size
- Total generations
- Average tokens/second
- Peak memory usage
- System uptime
Timer-Based Polling (1-second intervals):
```swift
metricsTimer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: true) { [weak self] _ in
    Task { @MainActor in
        self?.updateRealTimeMetrics()
    }
}
```

Tracked Metrics:
- Memory: Resident memory in MB via `task_info(TASK_VM_INFO)` (sketched below)
- Battery: `UIDevice.current.batteryLevel` with change notifications
- Thermal: `ProcessInfo.processInfo.thermalState` with change notifications
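A minimal sketch of the memory probe, assuming `ProcessMetrics` wraps something like the standard `task_info(TASK_VM_INFO)` call (the helper name is illustrative):

```swift
import Darwin

/// Resident memory footprint of the current process in MB, or nil on failure.
func residentMemoryMB() -> Double? {
    var info = task_vm_info_data_t()
    var count = mach_msg_type_number_t(MemoryLayout<task_vm_info_data_t>.stride
                                        / MemoryLayout<integer_t>.stride)
    let kr = withUnsafeMutablePointer(to: &info) { ptr in
        ptr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(TASK_VM_INFO), $0, &count)
        }
    }
    guard kr == KERN_SUCCESS else { return nil }
    // phys_footprint is Apple's recommended "memory used" figure.
    return Double(info.phys_footprint) / 1_048_576.0
}
```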
Instruments Integration:
```swift
os_signpost(.begin, log: signpostLog, name: "ModelInit")
os_signpost(.event, log: signpostLog, name: "FirstToken")
os_signpost(.end, log: signpostLog, name: "PreprocessImages")
```

Use Xcode Instruments to profile generation latency, memory pressure, and energy impact.
Generation metrics are stored for the last 50 generations:
```swift
struct GenerationMetrics: Identifiable {
    let id = UUID() // required for Identifiable conformance
    let timestamp: Date
    let modelIdentifier: GemmaModelIdentifier
    let inputImageCount: Int
    let outputTokenCount: Int
    let generationTime: TimeInterval
    let tokensPerSecond: Double
    let memoryUsageMB: Double
    let batteryLevelBefore: Int
    let batteryLevelAfter: Int
    let thermalState: ProcessInfo.ThermalState
}
```

Aggregate statistics computed on-demand:
- Average generation time
- Average tokens/second
- Peak memory usage
- Total tokens generated
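A minimal sketch of those on-demand aggregates, computed directly over an array of the `GenerationMetrics` values shown above:

```swift
import Foundation

extension Array where Element == GenerationMetrics {
    var averageGenerationTime: TimeInterval {
        isEmpty ? 0 : map(\.generationTime).reduce(0, +) / Double(count)
    }
    var averageTokensPerSecond: Double {
        isEmpty ? 0 : map(\.tokensPerSecond).reduce(0, +) / Double(count)
    }
    var peakMemoryUsageMB: Double { map(\.memoryUsageMB).max() ?? 0 }
    var totalTokensGenerated: Int { map(\.outputTokenCount).reduce(0, +) }
}
```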
Generation flow:
- User selects images via PhotosPicker or Camera
- `OnDeviceLLMService.generateLaTeX()` processes images
- LaTeX string returned and displayed in preview
- User triggers PDF generation (on-demand)
- `LaTeXRenderer.renderLaTeXToPDF()` compiles to PDF
- `StorageManager.saveGeneration()` persists all data
Refinement flow:
- User provides feedback text (e.g., "Fix equation 3", "Add more detail")
- `OnDeviceLLMService.refineLaTeX()` performs text-only inference
- New LaTeX returned and displayed
- `StorageManager.addRefinement()` creates a `RefinementEntry` linked to the `Generation`
- Previous LaTeX and feedback stored for history
Model download and initialization flow:
- User taps download in SettingsView
- `ModelDownloadManager.downloadModel()` fetches from R2
- The `.task` file is saved to the Documents directory
- On initialization, `OnDeviceGemmaModel` extracts vision components using ZIPFoundation
- `TF_LITE_VISION_ENCODER` and `TF_LITE_VISION_ADAPTER` are extracted to cache
- MediaPipe `LlmInference` configured with all model paths
History flow:
- User navigates to HistoryView
- `StorageManager.loadGenerations()` queries SwiftData (a query sketch follows this list)
- Optional search/filter applied
- Results sorted by timestamp (descending)
- Tapping a generation shows detail view with LaTeX, PDF preview, refinement history
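A minimal sketch of that SwiftData query, assuming the `Generation` model exposes `title`, `timestamp`, and `isFavorite` properties (property and function names are illustrative):

```swift
import Foundation
import SwiftData

func loadGenerations(in context: ModelContext,
                     searchTerm: String? = nil,
                     favoritesOnly: Bool = false) throws -> [Generation] {
    // Newest first, matching the flow above.
    var descriptor = FetchDescriptor<Generation>(
        sortBy: [SortDescriptor(\.timestamp, order: .reverse)]
    )
    if let term = searchTerm, !term.isEmpty {
        descriptor.predicate = #Predicate<Generation> { $0.title.localizedStandardContains(term) }
    } else if favoritesOnly {
        descriptor.predicate = #Predicate<Generation> { $0.isFavorite }
    }
    return try context.fetch(descriptor)
}
```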
- macOS: Sonoma (14.0) or later
- Xcode: 15.0 or later
- iOS Target: 17.0 or later (iPhone/iPad with A12 Bionic or newer recommended)
- No GPU required: CPU-only inference on ARM via MediaPipe backends
- CocoaPods: Install via `sudo gem install cocoapods`
```bash
# 1. Clone the repository
git clone https://github.com/youneslaaroussi/Pic2PDF.git
cd Pic2PDF

# 2. Install dependencies
pod install

# 3. Open the workspace (not the .xcodeproj)
open Pic2PDF.xcworkspace
```

You have two options for obtaining the Gemma 3N models:
- Build and run the app
- Navigate to Settings tab
- Tap Download next to Gemma 3N 2B or 4B
- Wait for download to complete
- Model is automatically initialized after download
Download pre-converted models from HuggingFace [1]:
- Gemma 3N 2B (INT4): google/gemma-3n-E2B-it-litert-preview
- Gemma 3N 4B (INT4): google/gemma-3n-E4B-it-litert-preview
Download the .task file and either:
- Place in app bundle (add to Xcode project)
- Copy to the Documents directory:
  `~/Library/Developer/CoreSimulator/Devices/<UUID>/data/Containers/Data/Application/<UUID>/Documents/models/`
- Select target device or simulator (iOS 17.0+)
- Build the project: `Cmd+B`
- Run: `Cmd+R`
- Wait for model initialization (first launch takes ~5–10 seconds)
- Select photos and tap Generate to test
Model not found error:
- Ensure you've downloaded a model via the in-app downloader
- Check Settings tab for model status
- Look for `.task` files in the Documents/models/ directory
Slow inference:
- Enable Performance Mode in Settings
- Use Gemma 2B instead of 4B for faster inference
- Ensure device is not thermally throttled (thermal state shown in Stats)
PDF rendering fails:
- Check that LaTeX uses only supported packages
- Inspect LaTeX output for unsupported commands (tikz, graphicx, etc.)
- Try simplifying the LaTeX code
Managed via CocoaPods (see Podfile):
```ruby
platform :ios, '17.0'

target 'Img2Latex' do
  use_frameworks!

  # MediaPipe for on-device AI inference
  pod 'MediaPipeTasksGenAI', '0.10.24'
  pod 'MediaPipeTasksGenAIC', '0.10.24'

  # ZIPFoundation for extracting vision models from .task files
  pod 'ZIPFoundation', '~> 0.9'
end
```

Dependency Details:
| Dependency | Version | Purpose |
|---|---|---|
| MediaPipeTasksGenAI | 0.10.24 | Swift API for LLM inference (ARM-optimized backends via KleidiAI [4]/XNNPACK; SME2 on supported devices) |
| MediaPipeTasksGenAIC | 0.10.24 | C implementation (required by GenAI) |
| ZIPFoundation | ~> 0.9 | Extract vision encoder/adapter from .task archives |
```
Img2Latex/
├── Img2LatexApp.swift                # App entry point, manages AppState and StorageManager
├── AppState.swift                    # Onboarding state singleton
├── ContentView.swift                 # Main TabView with Generate/History/Analytics/Settings
│
├── OnDeviceLLMService.swift          # Core LLM inference service (Gemma 3N)
├── LaTeXRenderer.swift               # WKWebView + latex.js PDF generation
├── ModelDownloadManager.swift        # R2 model downloads with progress tracking
├── StorageManager.swift              # SwiftData persistence layer
├── ProcessMetrics.swift              # System metrics utilities
│
├── Generation.swift                  # SwiftData models (Generation, RefinementEntry)
├── GenerationStatus.swift            # Published state for generation progress
├── ModelConfig.swift                 # Downloadable model configuration (R2 URLs)
│
├── OnboardingView.swift              # First-run onboarding flow
├── SettingsView.swift                # Settings UI (performance mode, model management)
├── StatsView.swift                   # Analytics UI with Charts framework
├── HistoryView.swift                 # Generation history with search and favorites
├── LaTeXPreviewWithActionsView.swift # LaTeX editor with refinement and PDF export
├── LaTeXWebView.swift                # LaTeX rendering preview component
├── PDFViewer.swift                   # PDF preview component
│
└── Podfile                           # CocoaPods dependency specification
```
Benchmarks depend on device (chip, thermal state) and input complexity. To measure on your hardware:
- Build a Release configuration
- Enable signposts (already in code: ModelInit, FirstToken, PreprocessImages)
- Profile with Xcode Instruments: Points of Interest + Allocations + Energy Log
- Record:
- Model initialization time
- Time to first token and tokens/second (estimated in UI)
- Peak resident memory (shown in Stats)
- Battery delta per generation
- LoRA Support: Fine-tuned models for specific document types (lecture notes, problem sets, diagrams)
- Batch Processing: Process multiple document sets in sequence
- OCR Fallback: Hybrid approach with Vision framework for non-handwritten text
- Export Formats: Markdown, HTML, DOCX alongside PDF
- iCloud Sync: Automatic backup of generation history
- Shortcuts Integration: Generate PDFs from Shortcuts app
- Document Scanner: Built-in multi-page scanning with perspective correction
- Google MediaPipe Team: For the incredible MediaPipe Tasks GenAI framework and pre-converted Gemma 3N models
- Google Gemma Team: For open-sourcing the Gemma model family with vision capabilities
- Arm: For hosting the AI Developer Challenge 2025 and promoting on-device AI innovation
- Michael Bui (latex.js): For the client-side LaTeX compiler that powers PDF rendering
For issues, questions, or contributions, open an issue or pull request on the GitHub repository.
Built with ❤️ for the Arm AI Developer Challenge 2025
1. Gemma 3N Models - Google DeepMind's multimodal vision-language models. https://huggingface.co/google/gemma-3n-E2B
2. MediaPipe Tasks GenAI - Google AI Edge's on-device inference framework. https://ai.google.dev/edge/mediapipe/framework/getting_started/ios
3. Apple Accelerate Framework - High-performance vector and matrix operations. https://developer.apple.com/documentation/accelerate
4. KleidiAI - Arm's optimized micro-kernels for AI workloads on ARM CPUs. GitHub: https://github.com/ARM-software/kleidiai. Integration announcement: https://newsroom.arm.com/blog/kleidiai-integration-mediapipe
5. LaTeX.js - Client-side LaTeX to HTML compiler. https://latex.js.org/usage.html#library
