Provide an efficient inference implementation using sparsification/quantization #206

Closed
jpata opened this issue Sep 14, 2023 · 3 comments
Labels
enhancement New feature or request hard

Comments

@jpata
Owner

jpata commented Sep 14, 2023

Goal: reduce inference time of the model using quantization

We made some CPU inference performance results from CMS public in 2021 (https://cds.cern.ch/record/2792320/files/DP2021_030.pdf, slide 16): “For context, on a single CPU thread (Intel i7-10700 @ 2.9GHz), the baseline PF requires approximately (9 ± 5) ms, the MLPF model approximately 320 ± 50 ms for Run 3 ttbar MC events”.

Now is a good time to make the inference as fast as possible, while minimizing any physics impact.

Resources:

@jpata jpata changed the title Provide an efficient inference implementation using sparsification/quantization Provide an efficient GNN inference implementation using sparsification/quantization Sep 14, 2023
@jpata jpata changed the title Provide an efficient GNN inference implementation using sparsification/quantization Provide an efficient GNN inference implementation using sparsification/quantization with ONNX Sep 29, 2023
@jpata
Owner Author

jpata commented Sep 29, 2023

adding @raj2022

@jpata jpata added hard enhancement New feature or request labels Oct 12, 2023
@jpata jpata changed the title Provide an efficient GNN inference implementation using sparsification/quantization with ONNX Provide an efficient inference implementation using sparsification/quantization Apr 11, 2024
@jpata
Owner Author

jpata commented Apr 30, 2024

Also related: #315

@jpata
Owner Author

jpata commented May 27, 2024

Basically, to summarize:

  • with @raj2022 we saw that it's possible to quantize the model to int8 in PyTorch using post-training static quantization, following the recipe in https://github.com/jpata/particleflow/blob/main/notebooks/clic/mlpf-pytorch-transformer-standalone.ipynb
  • the important features were a custom attention layer (in the notebook), and introducing per-feature quantization stubs
  • we also showed that a very performant model can be trained using only ReLU activations, so this work improved the compute budget
  • however, the exported int8 model was not faster on either CPU or GPU
  • this most likely requires a more informed approach to make sure the int8 attention is actually computed using efficient ops on the hardware
  • the summary notebook was added in #297 (“normalize loss, reparametrize network”)
  • ONNX may be a better path for performant quantization in the end, but this requires more study.
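
To illustrate the arithmetic behind the per-feature quantization mentioned above, here is a minimal pure-Python sketch of affine int8 quantization calibrated on sample data. It is not the notebook's recipe and does not use the PyTorch or ONNX APIs; the helper names (`calibrate`, `quantize`, `dequantize`), the toy calibration values, and the reading of "per-feature" as "one scale and zero point per input feature" are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

QMIN, QMAX = -128, 127  # signed int8 range


def calibrate(values: Sequence[float]) -> Tuple[float, int]:
    """Derive an affine (scale, zero_point) pair from calibration values."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the range so it maps to an exact integer
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for an all-zero feature
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to int8, clamping values outside the calibrated range."""
    return max(QMIN, min(QMAX, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an int8 value back to an approximate float."""
    return (q - zero_point) * scale


def roundtrip_error(row: Sequence[float], params: Sequence[Tuple[float, int]]) -> List[float]:
    """Absolute error after quantizing and dequantizing each feature of one row."""
    return [abs(x - dequantize(quantize(x, s, z), s, z)) for x, (s, z) in zip(row, params)]


# Hypothetical calibration set: two input features with very different ranges.
calibration = [[0.10, 120.0], [0.50, 980.0], [0.90, 15.0], [0.30, 400.0]]

# Per-feature: one (scale, zero_point) per column.
per_feature = [calibrate(column) for column in zip(*calibration)]
# Per-tensor: a single (scale, zero_point) shared by all features.
per_tensor = calibrate([v for row in calibration for v in row])

sample = [0.40, 250.0]
err_feature = roundtrip_error(sample, per_feature)
err_tensor = roundtrip_error(sample, [per_tensor] * len(sample))

print("per-feature error:", err_feature)
print("per-tensor error: ", err_tensor)
```

With a single shared scale, the small-range feature collapses to zero after the round trip, while per-feature parameters keep its error below half a quantization step. This only shows why separate quantization parameters per input feature can matter for accuracy; it says nothing about speed, which depends on whether the int8 ops are actually executed by efficient kernels on the target hardware.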

I'm closing this issue, and putting it on the roadmap to study ONNX post-training static quantization separately.
Many thanks to @raj2022 for your contributions!
