Provide an efficient inference implementation using sparsification/quantization #206
Comments
jpata changed the title from "Provide an efficient inference implementation using sparsification/quantization" to "Provide an efficient GNN inference implementation using sparsification/quantization" on Sep 14, 2023
jpata changed the title from "Provide an efficient GNN inference implementation using sparsification/quantization" to "Provide an efficient GNN inference implementation using sparsification/quantization with ONNX" on Sep 29, 2023
adding @raj2022
jpata changed the title from "Provide an efficient GNN inference implementation using sparsification/quantization with ONNX" to "Provide an efficient inference implementation using sparsification/quantization" on Apr 11, 2024
Also related: #315
Basically, to summarize: I'm closing this issue and putting it on the roadmap to study ONNX post-training static quantization separately.
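For reference, a minimal sketch of what ONNX Runtime post-training static quantization could look like. The model paths (`mlpf_fp32.onnx`, `mlpf_int8.onnx`), the input tensor name (`Xfeat_normed`), and the calibration input shape are placeholders for illustration, not actual artifacts of this repository.

```python
# Sketch: ONNX Runtime post-training static quantization (placeholder paths/names).
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static


class MLPFCalibrationReader(CalibrationDataReader):
    """Feeds a few representative event batches to the calibrator."""

    def __init__(self, batches, input_name):
        self._iter = iter(batches)
        self._input_name = input_name

    def get_next(self):
        batch = next(self._iter, None)
        if batch is None:
            return None
        return {self._input_name: batch.astype(np.float32)}


# Placeholder calibration data shaped like the model input.
batches = [np.random.rand(1, 256, 25) for _ in range(16)]
reader = MLPFCalibrationReader(batches, input_name="Xfeat_normed")

quantize_static(
    model_input="mlpf_fp32.onnx",   # exported FP32 model (assumed path)
    model_output="mlpf_int8.onnx",  # quantized INT8 model
    calibration_data_reader=reader,
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
)
```

The calibration batches would in practice come from real events, so the activation ranges reflect physics inputs rather than random numbers.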
Goal: reduce inference time of the model using quantization
We made some CPU inference performance results public for 2021 in CMS (https://cds.cern.ch/record/2792320/files/DP2021_030.pdf, slide 16): “For context, on a single CPU thread (Intel i7-10700 @ 2.9GHz), the baseline PF requires approximately (9 ± 5) ms, the MLPF model approximately 320 ± 50 ms for Run 3 ttbar MC events”.
Now is a good time to make the inference as fast as possible, while minimizing any physics impact.
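As a rough way to quantify any gain, one could time the FP32 and quantized models on a single CPU thread with ONNX Runtime, mirroring the single-thread setup quoted above. The model paths and input shape below are illustrative placeholders.

```python
# Sketch: single-thread CPU latency comparison of FP32 vs. quantized ONNX models.
import time
import numpy as np
import onnxruntime as ort


def time_model(path, x, n_runs=20):
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1  # single CPU thread, matching the quoted setup
    opts.inter_op_num_threads = 1
    sess = ort.InferenceSession(path, opts, providers=["CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    sess.run(None, {name: x})  # warm-up run
    t0 = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, {name: x})
    return (time.perf_counter() - t0) / n_runs * 1000.0  # average ms per event


x = np.random.rand(1, 256, 25).astype(np.float32)  # placeholder event
print("fp32: %.1f ms" % time_model("mlpf_fp32.onnx", x))
print("int8: %.1f ms" % time_model("mlpf_int8.onnx", x))
```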
Resources: