
Conversation

@deepshnv commented on Dec 5, 2025

Efficient VLM inference using llama-mtmd-cli for high-resolution images while keeping GPU VRAM requirements low. Three optimizations enable this (a rough sketch of the streaming/tiling idea is shown after the list):

i) offload the vision model weights (and only those) to CPU, streaming them to the device at runtime
ii) reorder LLM model initialization so that the CLIP model has finished encoding the image and freed its VRAM before the LLM weights are loaded
iii) tile the flash-attention copies so that larger images avoid the 2 GB / INT_MAX limit of ggml_cuda_cpy
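As a rough illustration of how (i) relates to (iii), here is a minimal CUDA sketch of streaming CPU-resident weights to the device in fixed-size chunks so that no single transfer approaches the 2 GB / INT_MAX limit. This is not the PR's actual code: the function name, chunk size, and error macro are assumptions, and it uses only the plain CUDA runtime API rather than ggml's internal copy paths.

```cpp
// Minimal sketch (illustrative only): stream a large CPU-resident weight
// buffer to the GPU in chunks, so each individual transfer stays well below
// the 2 GB / INT_MAX limit that int-indexed copy kernels impose.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(1);                                                       \
        }                                                                  \
    } while (0)

// Hypothetical helper: copy `nbytes` from host to device in 256 MiB chunks
// on the given stream. The caller synchronizes the stream before use.
static void stream_weights_to_device(void * dst_dev, const void * src_host,
                                     size_t nbytes, cudaStream_t stream) {
    const size_t CHUNK_BYTES = 256ull * 1024 * 1024; // tunable chunk size
    for (size_t off = 0; off < nbytes; off += CHUNK_BYTES) {
        const size_t n = std::min(CHUNK_BYTES, nbytes - off);
        CUDA_CHECK(cudaMemcpyAsync((char *) dst_dev + off,
                                   (const char *) src_host + off,
                                   n, cudaMemcpyHostToDevice, stream));
    }
}
```

In the PR itself the copies would go through ggml's CUDA backend; the sketch only shows the general idea that chunking a transfer keeps each copy within int-indexed limits while still letting the weights live in host memory between uses.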

@ngxson changed the title from "Efficient inference using llama-mtmd-cli for high resolution images with reduced GPU VRAM usage (#17801)" to "(CUDA-only) Efficient inference using llama-mtmd-cli for high resolution images with reduced GPU VRAM usage (#17801)" on Dec 6, 2025