Introduce 8da4w quant for decoder-only text models #62
Conversation
@kimishpatel @metascroy @jerryzh168 for review.
Tagging @tarun292 for review as we start adding quantization recipes for native HF models.
Fixed the executorch version check issue on Linux, which returns '0.6.0+cpu', causing …
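The Linux CPU wheel reports a PEP 440 local version ("0.6.0+cpu"), which presumably trips a naive version comparison. A minimal sketch of a robust check using the `packaging` library; this is an assumed reconstruction, not necessarily the exact check in this PR:

```python
from packaging.version import parse

installed = "0.6.0+cpu"  # what executorch reports on Linux CPU builds

# Naive string comparison mismatches because of the "+cpu" local segment.
assert installed != "0.6.0"

# Parsing with packaging and comparing base_version strips the local
# segment, so the check passes.
assert parse(installed).base_version == "0.6.0"
```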
Quantization-related code LGTM.
Initial effort to introduce quantization for native Hugging Face models that are already supported in optimum-executorch, starting with decoder-only text models using "8da4w" (int8 dynamic activations, int4 weights) for linear layers and int8 for embeddings. The quantization configs were experimented with on the following models (a minimal quantization sketch follows the list):

- Qwen3-0.6B
- gemma-3-1b
- HuggingFaceTB/SmolLM2-135M
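To illustrate what the recipe applies to the linear layers, here is a minimal 8da4w sketch using torchao; this is an assumed reconstruction rather than the PR's exact wiring, and `group_size=32` is an illustrative choice:

```python
import torch.nn as nn
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

# Toy stand-in for a decoder MLP block; real usage targets the full model.
model = nn.Sequential(nn.Linear(256, 1024), nn.SiLU(), nn.Linear(1024, 256))

# "8da4w": activations are dynamically quantized to int8 at runtime, and
# linear weights are quantized to int4 with per-group scales.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))

# The PR additionally applies int8 weight-only quantization to the
# embedding table (the --qembedding flag below).
```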
Example usage via `optimum-cli`:

```
optimum-cli export executorch --model Qwen/Qwen3-0.6B --task text-generation --recipe xnnpack --use_custom_sdpa --qlinear --qembedding --output_dir qwen3_8da4w_8we
```

or use `ExecuTorchModelForCausalLM.from_pretrained`:

```python
et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
```

.pte size comparison:
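Finally, a hedged end-to-end sketch of running the quantized model; it assumes the `text_generation` helper shown in the optimum-executorch README, and the prompt and `max_seq_len` values are illustrative:

```python
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

# Load the exported, quantized .pte artifact and the matching tokenizer.
et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Generate text with the quantized model.
print(et_model.text_generation(
    tokenizer=tokenizer,
    prompt="Give me a short introduction to large language models.",
    max_seq_len=64,
))
```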