# [OMNIML-2244] Create the MXFP8 quant exporter #634
## Conversation
**Codecov Report** ❌

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #634      +/-   ##
==========================================
- Coverage   74.57%   74.50%   -0.07%
==========================================
  Files         183      183
  Lines       18451    18400      -51
==========================================
- Hits        13759    13709      -50
+ Misses       4692     4691       -1
```
`.github/workflows/gpu_tests.yml` (outdated):

```diff
 timeout-minutes: 120
 container: &gpu_container
-  image: nvcr.io/nvidia/pytorch:25.06-py3
+  image: nvcr.io/nvidia/pytorch:25.08-py3
```
25.08 is a CUDA 13 container. The ort-gpu installation defaults to CUDA 12; is that fine? We also install cupy-cuda12x instead of cupy-cuda13x in setup.py for INT4 ONNX quantization.
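As a hedged illustration of the alternative this comment raises, a build script could select the cupy wheel from the detected CUDA toolkit version. `detect_cuda_major` is a hypothetical helper; the repo's actual setup.py simply pins `cupy-cuda12x`.

```python
# Hypothetical sketch: choose the cupy wheel by detected CUDA major version.
# The real setup.py pins cupy-cuda12x; this only illustrates the alternative
# raised in the comment above.
import re
import subprocess

def detect_cuda_major(default: int = 12) -> int:
    """Return the CUDA major version reported by nvcc, falling back to a default."""
    try:
        out = subprocess.check_output(["nvcc", "--version"], text=True)
        match = re.search(r"release (\d+)\.", out)
        if match:
            return int(match.group(1))
    except (OSError, subprocess.CalledProcessError):
        pass
    return default

# e.g. ["cupy-cuda12x"] on a CUDA 12 toolchain, ["cupy-cuda13x"] on CUDA 13
onnx_quant_requires = [f"cupy-cuda{detect_cuda_major()}x"]
```

Pinning a single wheel keeps builds reproducible, at the cost of a mismatch on CUDA 13 containers such as 25.08.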
There are issues with using autocast with this TensorRT version. I will disable it for mxfp8 for now.
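A minimal sketch of that guard, with hypothetical names (the exporter's real API may differ): skip the FP16 autocast pass whenever the quantize mode is mxfp8.

```python
# Illustrative only: bypass the FP16 autocast pass for MXFP8 exports, since
# autocast currently conflicts with the TensorRT version in the container.
def maybe_autocast_to_fp16(onnx_model, quantize_mode: str, autocast_fn):
    """Apply `autocast_fn` to the model unless exporting in mxfp8 mode."""
    if quantize_mode == "mxfp8":
        return onnx_model  # autocast disabled for mxfp8 for now
    return autocast_fn(onnx_model)
```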
Please rebase onto the latest main branch so all tests can run in GitHub for this PR. CI/CD has now been migrated to .github/workflows (except the onnx_ptq bash test).
Done
*Force-pushed from f0103e0 to 06fc4df.*
Can you please add the accuracy of the baseline model for comparison? FP32 or FP16 should be okay. Thanks.
*Force-pushed from 06fc4df to b685019.*
## What does this PR do?

**Type of change:** New feature

**Overview:**
- Implemented functions for the MXFP8 quant exporter
- Integrated autocast for converting the model to FP16
- Deprecated `quantize_weights_to_mxfp8`
- Updated tests

## Usage

```
python torch_quant_to_onnx.py --quantize_mode=mxfp8 --onnx_save_path=vit_base_patch16_224.mxfp8.onnx --calibration_data_size 64 --batch_size 128
```

## Testing

```
python evaluate.py --onnx_path=vit_base_patch16_224.mxfp8.onnx --model_name=vit_base_patch16_224 --results_path=./results.txt --batch_size 128
```

### Accuracy and latency results

```
The top1 accuracy of the model is 85.07%
The top5 accuracy of the model is 97.558%
Inference latency of the model is 6.65451 ms
```

### Reference accuracy for fp16

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: No; `quantize_weights_to_mxfp8` is deprecated.
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update the [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
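As background on the format this exporter targets: MXFP8, per the OCP microscaling (MX) specification, stores each 32-element block with one shared power-of-two (E8M0) scale and FP8 E4M3 elements. Below is a hedged fake-quantization sketch of that scheme in PyTorch. It illustrates the format's numerics only; it is not this PR's implementation, and the scale rule ("smallest power of two covering the block amax") is one common convention.

```python
# A minimal MXFP8 fake-quantization sketch (values only, no packing/export).
# Assumes PyTorch >= 2.1 for the float8_e4m3fn dtype; not this repo's code.
import torch
import torch.nn.functional as F

BLOCK = 32          # MX block size: 32 elements share one scale
E4M3_MAX = 448.0    # largest finite FP8 E4M3 magnitude

def fake_quantize_mxfp8(w: torch.Tensor) -> torch.Tensor:
    """Round a float tensor through MXFP8 (E8M0 block scale + E4M3 elements)."""
    flat = w.flatten()
    pad = (-flat.numel()) % BLOCK
    blocks = F.pad(flat, (0, pad)).view(-1, BLOCK)
    # E8M0 shared scale: smallest power of two so the block fits in E4M3 range.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=2.0**-127)
    scale = 2.0 ** torch.ceil(torch.log2(amax / E4M3_MAX))
    # Cast through E4M3 to emulate element quantization, then rescale.
    q = (blocks / scale).clamp(-E4M3_MAX, E4M3_MAX)
    q = q.to(torch.float8_e4m3fn).to(torch.float32) * scale
    return q.flatten()[: flat.numel()].view_as(w)
```

An actual exporter additionally has to emit the block scales and FP8 initializers into the ONNX graph; the sketch stops at the numerics.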