/ˈɑː.pri.əl/
Apriel-H1 inference: a vLLM plugin for the Apriel-H1 family of hybrid reasoning models.
Apriel-H1-15b-Thinker-SFT is a 15B-parameter hybrid reasoning model combining Transformer attention and Mamba State Space layers for high efficiency and scalability. Derived from Apriel-Nemotron-15B-Thinker through progressive distillation, Apriel-H1 replaces less critical attention layers with linear Mamba blocks, achieving over 2× higher inference throughput in vLLM with minimal loss in reasoning, math, and coding performance.
- Model Size: 15B parameters
- Context Length: 65K (target; runtime dependent)
- Languages: English (best)
- Hybrid Transformer–SSM architecture
- ~2× throughput improvement over the base Thinker model
- Retains strong reasoning, math, and coding capabilities
- Built via efficient distillation; no training from scratch required
Technical report: [Apriel-H1 Report](https://arxiv.org/abs/2511.02651)
Training stack: Fast-LLM
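
The sketch below shows one way the model might be run through vLLM's offline Python API once this plugin is installed. The model ID comes from the model card referenced below; `trust_remote_code` and the sampling settings are illustrative assumptions, not values prescribed by this repository.

```python
# Minimal sketch, assuming the Apriel-H1 plugin is installed and registered
# with vLLM. trust_remote_code and the sampling settings below are
# illustrative assumptions, not prescribed configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ServiceNow-AI/Apriel-H1-15b-Thinker-SFT",  # model ID from the model card
    trust_remote_code=True,  # assumption: the hybrid architecture may require this
)

params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(
    ["Explain why replacing attention layers with Mamba blocks can speed up decoding."],
    params,
)
print(outputs[0].outputs[0].text)
```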
All models were evaluated against vLLM server endpoints using the FlashInfer attention backend (except AI21-Jamba-Reasoning-3B, which used FlashAttention-2). For NVIDIA-Nemotron-Nano-9B-v2 and AI21-Jamba-Reasoning-3B, the Mamba cache (mamba_cache) was set to fp32.
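
As a hedged illustration of how such a server endpoint is typically queried, the snippet below uses the OpenAI-compatible client that vLLM's server exposes. The port, API key, and prompt are placeholder assumptions; backend choices like FlashInfer and the Mamba cache dtype are configured when launching the server, not by the client.

```python
# Sketch of querying a vLLM server endpoint like those used in evaluation.
# Assumptions: a server is already running locally on port 8000 (e.g. via
# `vllm serve`) and accepts any API key; the prompt is illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ServiceNow-AI/Apriel-H1-15b-Thinker-SFT",
    messages=[{"role": "user", "content": "Solve step by step: what is 17 * 24?"}],
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```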
@misc{apriel_h1_2025,
title = {Apriel-H1: Towards Efficient Enterprise Reasoning Models},
author = {ServiceNow Language Models Lab},
archivePrefix = {arXiv},
eprint = {2511.02651},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2511.02651},
note = {Model available at \url{https://huggingface.co/ServiceNow-AI/Apriel-H1-15b-Thinker-SFT}},
year = {2025}
}
- Model Card: ServiceNow-AI/Apriel-H1-15b-Thinker-SFT


