Discussion: Investigate Perf Boosts Through Pruning (DeepSparse) #931

MillionthOdin16 · 2023-04-13T02:05:10Z

Just saw this and it seems pretty crazy. I don't know exactly where to put it, but figured it's worth discussing. They claim significant performance gains and pretty crazy model compression capabilities. A lot of the interesting information is right on the README page that I linked.

Neural Magic Repo Link

Our MLPerf Inference v3.0 submission contains the following results for the BERT-Large SQuAD v1.1 question answering task:

| Benchmark | Engine | Precision | Compressed File Size | SQuAD v1.1 F1 Score (R=X% of Base Accuracy) | Offline Throughput [samples/sec] |
| --- | --- | --- | --- | --- | --- |
| BERT-Large Baseline | ONNXRuntime | FP32 | 1.3 GB | 90.874 (R=100.00%) | 4.60 |
| oBERT-Large 99% | DeepSparse | INT8 | 38.2 MB | 90.03 (R=99.07%) | 1367.14 |
| oBERT-MobileBERT 99.9% | DeepSparse | INT8 | 19.45 MB | 90.80 (R=99.92%) | 3275.62 |
| oBERT-MobileBERT 99% | DeepSparse | INT8 | 9.56 MB | 90.41 (R=99.49%) | 5578.73 |

https://github.com/mlcommons/inference_results_v3.0/blob/main/open/NeuralMagic/README.md
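To put the claimed gains in perspective, the ratios against the baseline row can be worked out directly from the table. This is only arithmetic on the quoted figures: the variable names are made up for illustration, 1.3 GB is assumed to mean 1300 MB, and the ratios mix the effects of engine, precision, and model architecture, so they should not be read as the gain from pruning alone.

```python
# Figures copied from the quoted table: (throughput in samples/sec, file size in MB).
# Assumption: "1.3 GB" is treated as 1300 MB.
results = {
    "BERT-Large Baseline": (4.60, 1300.0),
    "oBERT-Large 99%": (1367.14, 38.2),
    "oBERT-MobileBERT 99.9%": (3275.62, 19.45),
    "oBERT-MobileBERT 99%": (5578.73, 9.56),
}

base_tput, base_size = results["BERT-Large Baseline"]

# For each entry: throughput relative to baseline, and how many times smaller the file is.
ratios = {
    name: (tput / base_tput, base_size / size)
    for name, (tput, size) in results.items()
}

for name, (speedup, shrink) in ratios.items():
    print(f"{name}: {speedup:.0f}x throughput, {shrink:.0f}x smaller")
```

Note that the MobileBERT rows use a different, smaller architecture than the BERT-Large baseline, so part of their ratio comes from the model swap rather than from sparsity.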

jon-chuang · 2023-04-13T03:01:57Z

From the linked repo:

unstructured gradual pruning, quantization-aware training, and structural distillation

I think the model layout would be very different, and further, not comparable to llama. But definitely interesting.
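For readers unfamiliar with the first technique in that quote, here is a minimal sketch of unstructured gradual pruning on a toy weight list. It is not the method used in the linked submission: magnitude is used as the pruning criterion because it is the simplest one, the cubic sparsity ramp is an assumed schedule, and the function names `sparsity_at` and `magnitude_prune` are hypothetical. A real pipeline would also fine-tune the remaining weights between pruning steps, which is omitted here.

```python
def sparsity_at(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Target sparsity at a given step, ramping up along a cubic curve (assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_prune(weights, sparsity):
    """Return a copy of a flat weight list with the smallest-magnitude fraction zeroed."""
    n_prune = int(len(weights) * sparsity)
    pruned = list(weights)
    # Indices ordered by |w|, smallest first; already-zeroed weights stay at the front.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Toy example: prune 8 weights to 75% sparsity over 4 steps.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
for step in range(5):
    target = sparsity_at(step, total_steps=4, final_sparsity=0.75)
    # A real training loop would fine-tune the surviving weights here.
    weights = magnitude_prune(weights, target)

print(weights)
```

"Unstructured" means individual weights are zeroed wherever they fall, so the tensor shapes stay the same and the speedup depends on a runtime that can exploit the resulting sparsity.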

slaren · 2023-04-15T19:35:45Z

This may be interesting: https://github.com/horseee/LLaMA-Pruning

Pruning: The following script globally removes 50% of the dimensions of the LLaMA-7B model, resulting in a lightweight model with 1.72B parameters.
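A rough parameter count shows why halving the dimensions cuts the model to about a quarter rather than a half. The sketch below uses the LLaMA-7B shape (hidden size 4096, MLP width 11008, 32 layers, vocabulary 32000) and assumes, purely for illustration, that pruning halves both the hidden size and the MLP width uniformly in every layer. The function `llama_param_count` is a hypothetical helper, not part of the linked repository, and the script there may prune differently.

```python
def llama_param_count(hidden, mlp_width, layers, vocab):
    """Approximate parameter count for a LLaMA-style decoder (no biases, untied embeddings)."""
    attention = 4 * hidden * hidden      # q, k, v, and output projections
    mlp = 3 * hidden * mlp_width         # gate, up, and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden      # input embedding plus output head
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm


full = llama_param_count(hidden=4096, mlp_width=11008, layers=32, vocab=32000)

# Assumption: every prunable dimension is cut in half.
halved = llama_param_count(hidden=2048, mlp_width=5504, layers=32, vocab=32000)

print(f"full:   {full / 1e9:.2f}B parameters")
print(f"halved: {halved / 1e9:.2f}B parameters")
```

Weight matrices scale with the product of two dimensions, so halving both quarters them, while the embedding tables only halve. Under these assumptions the result lands in the same ballpark as the quoted 1.72B; the remaining gap suggests the actual script does not prune every dimension exactly as assumed here.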

github-actions · 2024-04-11T01:06:35Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

MillionthOdin16 changed the title ~~Investigate Perf Boosts Through Pruning~~ Discussion: Investigate Perf Boosts Through Pruning Apr 13, 2023

MillionthOdin16 changed the title ~~Discussion: Investigate Perf Boosts Through Pruning~~ Discussion: Investigate Perf Boosts Through Pruning (DeepSparse) Apr 13, 2023

github-actions bot added the stale label Mar 25, 2024

github-actions bot closed this as completed Apr 11, 2024

Bearsaerker mentioned this issue Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
