@infil00p commented Dec 7, 2025

This commit adds support for the R-4B model (YannQi/R-4B), a multimodal large language model with auto-thinking capabilities.

Changes:

- `convert_hf_to_gguf.py`: added `RVisionModel` and `RTextModel` classes to handle the R model architecture (`RForConditionalGeneration`)
  - `RVisionModel` uses the LFM2 projector type with `scale_factor=1` (no patch merging)
  - `RTextModel` extends `Qwen3Model` for the language component
  - Proper tensor name mapping for the projector (`pre_norm`, `linear_1`, `linear_2`); a rough sketch of the mapping follows this list
- `tools/mtmd/clip.cpp`: modified `build_patch_merge_permute()` to support `scale_factor=1`, which skips patch merging for models that don't need it
  - The R model uses a SigLIP vision encoder producing 729 tokens (27x27 patches)
  - Projector: LayerNorm → Linear → GELU → Linear (no patch downsampling)
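To illustrate the projector tensor renaming mentioned above, here is a minimal, standalone sketch. The left-hand HF names follow the `multi_modal_projector.*` naming used by the model on the Hugging Face side; the right-hand GGUF names are assumptions for illustration only and are not copied from the actual converter code in this PR.

```python
# Illustrative only: maps R-4B projector tensor names from the HF checkpoint to
# mmproj-style GGUF names. The target names on the right are assumptions; the
# real mapping lives inside the RVisionModel class added to convert_hf_to_gguf.py.
from typing import Optional

PROJECTOR_NAME_MAP = {
    "multi_modal_projector.pre_norm.weight": "mm.pre_norm.weight",
    "multi_modal_projector.pre_norm.bias":   "mm.pre_norm.bias",
    "multi_modal_projector.linear_1.weight": "mm.1.weight",
    "multi_modal_projector.linear_1.bias":   "mm.1.bias",
    "multi_modal_projector.linear_2.weight": "mm.2.weight",
    "multi_modal_projector.linear_2.bias":   "mm.2.bias",
}

def map_projector_name(hf_name: str) -> Optional[str]:
    """Return the GGUF tensor name for an HF projector tensor, or None if unmapped."""
    return PROJECTOR_NAME_MAP.get(hf_name)

if __name__ == "__main__":
    print(map_projector_name("multi_modal_projector.linear_1.weight"))  # -> mm.1.weight
```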

Architecture:

- Base text model: Qwen3-4B
- Vision encoder: SigLIP (384x384 input, patch size 14)
- Projector: 2-layer MLP with pre-normalization (no patch merging); a rough PyTorch sketch follows this list
- Feature selection: full (keeps all 729 vision tokens)
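A minimal PyTorch sketch of the projector described above, assuming a SigLIP hidden size of 1152 and a Qwen3-4B hidden size of 2560 (both dimensions are assumptions, not values taken from this PR). With `scale_factor=1` there is no patch merging, so all 729 vision tokens (27x27 patches) pass straight through the MLP with their count unchanged:

```python
# Hypothetical sketch, not the clip.cpp implementation: LayerNorm -> Linear -> GELU -> Linear
# applied per token; with scale_factor=1 no patches are merged, so 729 tokens in, 729 out.
import torch
import torch.nn as nn

class RProjectorSketch(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2560):  # assumed dims
        super().__init__()
        self.pre_norm = nn.LayerNorm(vision_dim)      # pre-normalization of SigLIP features
        self.linear_1 = nn.Linear(vision_dim, text_dim)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_dim, text_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, 729, vision_dim] -> [batch, 729, text_dim]; token count is unchanged
        return self.linear_2(self.act(self.linear_1(self.pre_norm(x))))

if __name__ == "__main__":
    feats = torch.randn(1, 27 * 27, 1152)             # 729 SigLIP patch embeddings
    print(RProjectorSketch()(feats).shape)            # torch.Size([1, 729, 2560])
```

In contrast, the existing LFM2-style path in `clip.cpp` merges neighboring patches to reduce the token count; allowing `scale_factor=1` makes that permute step a no-op for models like R-4B that keep the full grid.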

Tested with llama-mtmd-cli; the model successfully generates English responses with Chinese internal reasoning inside `<think>` tags.
