@infil00p commented Dec 7, 2025

This commit adds support for the R-4B model (YannQi/R-4B), a multimodal large language model with auto-thinking capabilities.

Changes:

- `convert_hf_to_gguf.py`: added `RVisionModel` and `RTextModel` classes to handle the R model architecture (`RForConditionalGeneration`)
  - `RVisionModel` uses the LFM2 projector type with `scale_factor=1` (no patch merging)
  - `RTextModel` extends `Qwen3Model` for the language component
  - Proper tensor name mapping for the projector (`pre_norm`, `linear_1`, `linear_2`); a rough sketch of the mapping follows this list
- `tools/mtmd/clip.cpp`: modified `build_patch_merge_permute()` to support `scale_factor=1`, which skips patch merging for models that don't need it
  - The R model uses a SigLIP vision encoder producing 729 tokens (27x27 patches)
  - Projector: LayerNorm → Linear → GELU → Linear (no patch downsampling)
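To illustrate the projector tensor renaming mentioned above, here is a minimal, standalone sketch. The left-hand HF names follow the `multi_modal_projector.*` naming used by the model on the Hugging Face side; the right-hand GGUF names are assumptions for illustration only and are not copied from the actual converter code in this PR.

```python
# Illustrative only: maps R-4B projector tensor names from the HF checkpoint to
# mmproj-style GGUF names. The target names on the right are assumptions; the
# real mapping lives inside the RVisionModel class added to convert_hf_to_gguf.py.
from typing import Optional

PROJECTOR_NAME_MAP = {
    "multi_modal_projector.pre_norm.weight": "mm.pre_norm.weight",
    "multi_modal_projector.pre_norm.bias":   "mm.pre_norm.bias",
    "multi_modal_projector.linear_1.weight": "mm.1.weight",
    "multi_modal_projector.linear_1.bias":   "mm.1.bias",
    "multi_modal_projector.linear_2.weight": "mm.2.weight",
    "multi_modal_projector.linear_2.bias":   "mm.2.bias",
}

def map_projector_name(hf_name: str) -> Optional[str]:
    """Return the GGUF tensor name for an HF projector tensor, or None if unmapped."""
    return PROJECTOR_NAME_MAP.get(hf_name)

if __name__ == "__main__":
    print(map_projector_name("multi_modal_projector.linear_1.weight"))  # -> mm.1.weight
```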

Architecture:

- Base text model: Qwen3-4B
- Vision encoder: SigLIP (384x384 input, patch size 14)
- Projector: 2-layer MLP with pre-normalization (no patch merging); a rough PyTorch sketch follows this list
- Feature selection: full (keeps all 729 vision tokens)
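A minimal PyTorch sketch of the projector described above, assuming a SigLIP hidden size of 1152 and a Qwen3-4B hidden size of 2560 (both dimensions are assumptions, not values taken from this PR). With `scale_factor=1` there is no patch merging, so all 729 vision tokens (27x27 patches) pass straight through the MLP with their count unchanged:

```python
# Hypothetical sketch, not the clip.cpp implementation: LayerNorm -> Linear -> GELU -> Linear
# applied per token; with scale_factor=1 no patches are merged, so 729 tokens in, 729 out.
import torch
import torch.nn as nn

class RProjectorSketch(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2560):  # assumed dims
        super().__init__()
        self.pre_norm = nn.LayerNorm(vision_dim)      # pre-normalization of SigLIP features
        self.linear_1 = nn.Linear(vision_dim, text_dim)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_dim, text_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, 729, vision_dim] -> [batch, 729, text_dim]; token count is unchanged
        return self.linear_2(self.act(self.linear_1(self.pre_norm(x))))

if __name__ == "__main__":
    feats = torch.randn(1, 27 * 27, 1152)             # 729 SigLIP patch embeddings
    print(RProjectorSketch()(feats).shape)            # torch.Size([1, 729, 2560])
```

In contrast, the existing LFM2-style path in `clip.cpp` merges neighboring patches to reduce the token count; allowing `scale_factor=1` makes that permute step a no-op for models like R-4B that keep the full grid.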

Tested with llama-mtmd-cli; the model successfully generates English responses with Chinese internal reasoning inside `<think>` tags.
