@PaliC PaliC commented Oct 2, 2025

This PR integrates operator benchmarking into the Model Suite by having it inherit from TorchBenchTestSuite. The suite now extracts operator lists from model configs and benchmarks those operators using TorchBench data before running end-to-end model tests.
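A minimal sketch of the shape this integration could take (the constructor signature, import path, and config field are assumptions for illustration; only `TorchBenchTestSuite` is named by this PR):

```python
# Hypothetical sketch: the model suite narrows TorchBench's per-op tests
# to exactly the ops each model config declares, then keeps the configs
# around for the end-to-end model tests that run afterwards.
from BackendBench.suite import TorchBenchTestSuite  # import path assumed

class ModelSuite(TorchBenchTestSuite):
    def __init__(self, model_configs):
        # Union of all ops declared across the model configs.
        ops = sorted({op for cfg in model_configs for op in cfg["ops"]})
        super().__init__(filter=ops)  # parent benchmarks these ops with TorchBench data
        self.model_configs = model_configs  # consumed by the model-level tests
```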

This approach aligns with the core goal of BackendBench: testing operators. The Model Suite is designed with the assumption that for a given set of ops, users can provide kernel implementations, and the suite will benchmark both the individual ops and the full model using those implementations.

The long-term vision is to make this process seamless, allowing users to run both operator and model benchmarking with a single command.

TorchBench is used here because it provides the strongest guarantee that running the suite benchmarks all operators required for a specific model configuration. Its dataset is easily extensible and includes realistic tensor shapes derived from actual models.

The main design drawback is that this integration makes supporting kernel fusions with models more complex. However, it is preferable to handle kernel fusions in a separate suite regardless.

Testing

Running `uv run python BackendBench/scripts/main.py --suite model --backend directory --topn 1` with a working mm kernel and all other kernels watermarked yields the expected result (below):

```bash
Successfully registered 36 custom operators
[2025-10-02 07:21:23][INFO][main.py] ============================================================
[2025-10-02 07:21:23][INFO][main.py] MODEL EVALUATION RESULTS
[2025-10-02 07:21:23][INFO][main.py] ============================================================
[2025-10-02 07:21:23][INFO][model.py]
Model: ToyCoreOpsModel
[2025-10-02 07:21:23][INFO][model.py] Status: ✗ Failed (0/3 tests)
[2025-10-02 07:21:23][INFO][model.py]   ✗ small_batch
[2025-10-02 07:21:23][INFO][model.py]     Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8
[2025-10-02 07:21:23][INFO][model.py]   ✗ medium_batch
[2025-10-02 07:21:23][INFO][model.py]     Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8
[2025-10-02 07:21:23][INFO][model.py]   ✗ large_input
[2025-10-02 07:21:23][INFO][model.py]     Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8
[2025-10-02 07:21:23][INFO][model.py]
Model: SmokeTestModel
[2025-10-02 07:21:23][INFO][model.py] Status: ✓ Passed (3/3 tests)
[2025-10-02 07:21:23][INFO][model.py]   ✓ small_batch
[2025-10-02 07:21:23][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:21:23][INFO][model.py]   ✓ medium_batch
[2025-10-02 07:21:23][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:21:23][INFO][model.py]   ✓ large_batch
[2025-10-02 07:21:23][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:21:23][INFO][main.py] ============================================================
[2025-10-02 07:21:23][INFO][output.py] Full results saved to generated_kernels/full_results.json
[2025-10-02 07:21:23][INFO][output.py] Operator summary CSV saved to generated_kernels/operator_summary.csv
[2025-10-02 07:21:23][INFO][output.py] Failed operations log saved to generated_kernels/failed_tests.json
[2025-10-02 07:21:23][INFO][output.py] Overall summary saved to generated_kernels/OVERALL_SUMMARY.md
[2025-10-02 07:21:23][INFO][output.py] Results saved to directory: /home/dev/sapling_repos/BackendBench/generated_kernels
Results saved to directory: /home/dev/sapling_repos/BackendBench/generated_kernels
Overall summary saved to: /home/dev/sapling_repos/BackendBench/generated_kernels/OVERALL_SUMMARY.md
```

Future work with Model Suite

#181


Stack created with Sapling. Best reviewed with ReviewStack.

PaliC added 5 commits October 2, 2025 08:29
Summary:

Here we introduce the model suite (model.py). The idea is to start codifying the ideas from jiannanWang/BackendBenchExamples. Specifically, this PR adds some example models and configs to be loaded, plus a README. (It may be useful to look at the PR above this as well, since it contains the model loading logic.)

This PR adds two toy models to the model suite:

- SmokeTestModel - A simple model that uses aten.ops.mm, since we can implement a correct version of this op (a sketch of a model along these lines follows this list).
- ToyCoreOpsModel - A model that explicitly exercises backward passes for ops that appear in both TorchBench and core.
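For illustration, a model along SmokeTestModel's lines could be as small as this (a hypothetical sketch, not the PR's actual code):

```python
import torch
import torch.nn as nn

class SmokeTestModel(nn.Module):
    """Tiny model whose forward pass is a single matrix multiply, so a
    custom aten.mm kernel is exercised on every call."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.mm(x, self.weight)  # dispatches to aten.mm
```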

Test Plan:
The test infra is in the PR above, so tests passing on that PR should be sufficient here.

### Future work with Model Suite
#181
### Model Registration
This PR creates a way of adding models to the suite and automatically validates them through CI. It also loads the models. The way these models are added is detailed in the README. The tl;dr is that we use a format similar to kernelbench and SakanaAI/robust-kbench, where we pair model code with a config. Importantly, the configs contain initialization code, forward pass arguments (both in a format similar to TorchBench's), and a list of ops in the forward and backward passes. These ops are fairly important, as they are what we want to point out to the researcher when they are optimizing a model. There is a README.md to help folks set up proper model code and configs; a sketch of a config is shown below.
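A sketch of what such a paired config could contain (every field name here is hypothetical; the README in this PR defines the actual schema):

```python
# Hypothetical config paired with the SmokeTestModel sketch above.
smoke_test_config = {
    "model_name": "SmokeTestModel",
    "init_args": {"dim": 64},              # constructor arguments
    "test_cases": {                        # forward-pass args, TorchBench-style
        "small_batch": {"x": "torch.randn(2, 64)"},
        "medium_batch": {"x": "torch.randn(8, 64)"},
    },
    "forward_ops": ["aten.mm.default"],    # ops hit in the forward pass
    "backward_ops": ["aten.mm.default"],   # ops hit in the backward pass
}
```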

We further verify that these registrations are correct through CI. Specifically, we run test/test_model_ops_configs.py to ensure the configs are formatted correctly.

### Small Things
- Added a --model-filter flag to the CLI. The model suite chooses what to test based on the model rather than a set of ops, so it needs its own filtering (a sketch of the flag follows).
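A minimal sketch of that flag, assuming an argparse-style CLI (the actual plumbing in main.py may differ):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-filter",
        default=None,
        help="Comma-separated model names to run, e.g. 'SmokeTestModel'",
    )
    return parser.parse_args()

def filter_models(models, model_filter):
    # Keep only the models whose class names were requested; no flag means all.
    if model_filter is None:
        return models
    wanted = set(model_filter.split(","))
    return [m for m in models if type(m).__name__ in wanted]
```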
### Testing
New tests are added so that pytest picks up and exercises the changes here.

### Future work with Model Suite
#181
This PR adds another unit test to the model loading / config system from the last PR. Specifically, here we ensure that the ops specified in the config are actually run by the model itself. This is important, as updates to torch could change how backward passes are computed. Furthermore, if we expect folks to write kernels for a set of ops and then run the model, we should guarantee those ops are used.
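One way such a test can work is to record every op the model actually dispatches and compare against the config list; a sketch using `TorchDispatchMode` (the real test in test/ may be implemented differently):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpRecorder(TorchDispatchMode):
    """Records the name of every aten op dispatched while active."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.seen.add(str(func))  # e.g. "aten.mm.default"
        return func(*args, **(kwargs or {}))

def assert_config_ops_are_used(model, inputs, config_ops):
    recorder = OpRecorder()
    with recorder:
        out = model(*inputs)
        out.sum().backward()  # capture backward-pass ops too
    missing = set(config_ops) - recorder.seen
    assert not missing, f"ops in config but never dispatched: {missing}"
```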

### Future work with Model Suite
#181
This PR adds end-to-end model correctness testing to the model suite by comparing the outputs and gradients (after a backward pass) over one iteration of the model.

We also integrate it into CI.
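A minimal sketch of that comparison, assuming a reference model run with stock PyTorch ops and a test model run with the custom kernels (the helper name and tolerances are assumptions):

```python
import torch

def check_one_iteration(ref_model, test_model, inputs, rtol=1e-4, atol=1e-4):
    # Forward both models on identical inputs and compare outputs.
    ref_out, test_out = ref_model(*inputs), test_model(*inputs)
    assert torch.allclose(ref_out, test_out, rtol=rtol, atol=atol), "output mismatch"

    # One backward pass each, then compare parameter gradients pairwise
    # (assumes the two models were constructed identically, so their
    # named_parameters() iterate in the same order).
    ref_out.sum().backward()
    test_out.sum().backward()
    for (name, p_ref), (_, p_test) in zip(
        ref_model.named_parameters(), test_model.named_parameters()
    ):
        assert torch.allclose(p_ref.grad, p_test.grad, rtol=rtol, atol=atol), (
            f"gradient mismatch on {name}"
        )
```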

### Testing

Running `uv run python BackendBench/scripts/main.py --suite model --backend directory` with a working mm kernel and a watermarked kernel for everything else yields:
```bash
[2025-10-02 07:16:13][INFO][main.py] ============================================================
[2025-10-02 07:16:13][INFO][main.py] MODEL EVALUATION RESULTS
[2025-10-02 07:16:13][INFO][main.py] ============================================================
[2025-10-02 07:16:13][INFO][model.py]
Model: ToyCoreOpsModel
[2025-10-02 07:16:13][INFO][model.py] Status: ✗ Failed (0/3 tests)
[2025-10-02 07:16:13][INFO][model.py]   ✗ small_batch
[2025-10-02 07:16:13][INFO][model.py]     Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8
[2025-10-02 07:16:13][INFO][model.py]   ✗ medium_batch
[2025-10-02 07:16:13][INFO][model.py]     Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8
[2025-10-02 07:16:13][INFO][model.py]   ✗ large_input
[2025-10-02 07:16:13][INFO][model.py]     Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8
[2025-10-02 07:16:13][INFO][model.py]
Model: SmokeTestModel
[2025-10-02 07:16:13][INFO][model.py] Status: ✓ Passed (3/3 tests)
[2025-10-02 07:16:13][INFO][model.py]   ✓ small_batch
[2025-10-02 07:16:13][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:16:13][INFO][model.py]   ✓ medium_batch
[2025-10-02 07:16:13][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:16:13][INFO][model.py]   ✓ large_batch
[2025-10-02 07:16:13][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:16:13][INFO][main.py] ============================================================
```

### Future work with Model Suite
#181
@PaliC PaliC changed the title Refactor TorchBench for ModelSuite inheritance [ModelSuite] Refactor TorchBench for ModelSuite inheritance Oct 2, 2025
@PaliC PaliC closed this Oct 2, 2025
@PaliC PaliC reopened this Oct 2, 2025
@PaliC PaliC marked this pull request as ready for review October 2, 2025 08:33
@PaliC PaliC requested review from jiannanWang and msaroufim October 2, 2025 09:55

meta-cla bot commented Oct 17, 2025

Hi @PaliC!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

