@PaliC PaliC commented Oct 2, 2025

This PR integrates operator benchmarking into the Model Suite by having it inherit from TorchBenchTestSuite. The suite now extracts operator lists from model configs and benchmarks those operators using TorchBench data before running end-to-end model tests.
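A minimal sketch of the shape this integration could take (the constructor signature, import path, and config field are assumptions for illustration; only `TorchBenchTestSuite` is named by this PR):

```python
# Hypothetical sketch: the model suite narrows TorchBench's per-op tests
# to exactly the ops each model config declares, then keeps the configs
# around for the end-to-end model tests that run afterwards.
from BackendBench.suite import TorchBenchTestSuite  # import path assumed

class ModelSuite(TorchBenchTestSuite):
    def __init__(self, model_configs):
        # Union of all ops declared across the model configs.
        ops = sorted({op for cfg in model_configs for op in cfg["ops"]})
        super().__init__(filter=ops)  # parent benchmarks these ops with TorchBench data
        self.model_configs = model_configs  # consumed by the model-level tests
```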

This approach aligns with the core goal of BackendBench: testing operators. The Model Suite is designed with the assumption that for a given set of ops, users can provide kernel implementations, and the suite will benchmark both the individual ops and the full model using those implementations.

The long-term vision is to make this process seamless, allowing users to run both operator and model benchmarking with a single command.

TorchBench is used here because it provides the strongest guarantee that running the suite benchmarks all operators required for a specific model configuration. Its dataset is easily extensible and includes realistic tensor shapes derived from actual models.

The main design drawback is that this integration makes supporting kernel fusions with models more complex. However, it is preferable to handle kernel fusions in a separate suite regardless.

Testing

Running `uv run python BackendBench/scripts/main.py --suite model --backend directory --topn 1` with a working mm kernel and all other kernels watermarked yields the expected result (below):

```bash
Successfully registered 36 custom operators
[2025-10-02 07:21:23][INFO][main.py] ============================================================
[2025-10-02 07:21:23][INFO][main.py] MODEL EVALUATION RESULTS
[2025-10-02 07:21:23][INFO][main.py] ============================================================
[2025-10-02 07:21:23][INFO][model.py]
Model: ToyCoreOpsModel
[2025-10-02 07:21:23][INFO][model.py] Status: ✗ Failed (0/3 tests)
[2025-10-02 07:21:23][INFO][model.py]   ✗ small_batch
[2025-10-02 07:21:23][INFO][model.py]     Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8
[2025-10-02 07:21:23][INFO][model.py]   ✗ medium_batch
[2025-10-02 07:21:23][INFO][model.py]     Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8
[2025-10-02 07:21:23][INFO][model.py]   ✗ large_input
[2025-10-02 07:21:23][INFO][model.py]     Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8
[2025-10-02 07:21:23][INFO][model.py]
Model: SmokeTestModel
[2025-10-02 07:21:23][INFO][model.py] Status: ✓ Passed (3/3 tests)
[2025-10-02 07:21:23][INFO][model.py]   ✓ small_batch
[2025-10-02 07:21:23][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:21:23][INFO][model.py]   ✓ medium_batch
[2025-10-02 07:21:23][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:21:23][INFO][model.py]   ✓ large_batch
[2025-10-02 07:21:23][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:21:23][INFO][main.py] ============================================================
[2025-10-02 07:21:23][INFO][output.py] Full results saved to generated_kernels/full_results.json
[2025-10-02 07:21:23][INFO][output.py] Operator summary CSV saved to generated_kernels/operator_summary.csv
[2025-10-02 07:21:23][INFO][output.py] Failed operations log saved to generated_kernels/failed_tests.json
[2025-10-02 07:21:23][INFO][output.py] Overall summary saved to generated_kernels/OVERALL_SUMMARY.md
[2025-10-02 07:21:23][INFO][output.py] Results saved to directory: /home/dev/sapling_repos/BackendBench/generated_kernels
Results saved to directory: /home/dev/sapling_repos/BackendBench/generated_kernels
Overall summary saved to: /home/dev/sapling_repos/BackendBench/generated_kernels/OVERALL_SUMMARY.md
```

Future work with Model Suite

#181


Stack created with Sapling. Best reviewed with ReviewStack.

PaliC added 5 commits October 2, 2025 08:29
Summary:

Here we introduce the model suite (model.py). The idea is to start codifying the ideas from jiannanWang/BackendBenchExamples. Specifically, this PR adds some example models and configs to be loaded, plus a README. (It may be useful to look at the PR above this as well, since it contains the model loading logic.)

This PR adds two toy models to the model suite:

- SmokeTestModel - A simple model that uses aten.ops.mm, since we can implement a correct version of this op (a sketch of a model along these lines follows this list).
- ToyCoreOpsModel - A model that explicitly exercises backward passes for ops that appear in both TorchBench and core.
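For illustration, a model along SmokeTestModel's lines could be as small as this (a hypothetical sketch, not the PR's actual code):

```python
import torch
import torch.nn as nn

class SmokeTestModel(nn.Module):
    """Tiny model whose forward pass is a single matrix multiply, so a
    custom aten.mm kernel is exercised on every call."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.mm(x, self.weight)  # dispatches to aten.mm
```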

Test Plan:
The test infra is in the PR above, so tests passing on that PR should be sufficient here.

### Future work with Model Suite
#181
### Model Registration
This PR creates a way of adding models to the suite and automatically validates them through CI. It also loads the models. The way these models are added is detailed in the README. The tl;dr is that we use a format similar to kernelbench and SakanaAI/robust-kbench, where we pair model code with a config. Importantly, the configs contain initialization code, forward pass arguments (both in a format similar to TorchBench's), and a list of ops in the forward and backward passes. These ops are fairly important, as they are what we want to point out to the researcher when they are optimizing a model. There is a README.md to help folks set up proper model code and configs; a sketch of a config is shown below.
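A sketch of what such a paired config could contain (every field name here is hypothetical; the README in this PR defines the actual schema):

```python
# Hypothetical config paired with the SmokeTestModel sketch above.
smoke_test_config = {
    "model_name": "SmokeTestModel",
    "init_args": {"dim": 64},              # constructor arguments
    "test_cases": {                        # forward-pass args, TorchBench-style
        "small_batch": {"x": "torch.randn(2, 64)"},
        "medium_batch": {"x": "torch.randn(8, 64)"},
    },
    "forward_ops": ["aten.mm.default"],    # ops hit in the forward pass
    "backward_ops": ["aten.mm.default"],   # ops hit in the backward pass
}
```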

We further verify that these registrations are correct through CI. Specifically, we run test/test_model_ops_configs.py to ensure the configs are formatted correctly.

### Small Things
- Added a --model-filter flag to the CLI. The model suite chooses what to test based on the model rather than a set of ops, so it needs its own filtering (a sketch of the flag follows).
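A minimal sketch of that flag, assuming an argparse-style CLI (the actual plumbing in main.py may differ):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-filter",
        default=None,
        help="Comma-separated model names to run, e.g. 'SmokeTestModel'",
    )
    return parser.parse_args()

def filter_models(models, model_filter):
    # Keep only the models whose class names were requested; no flag means all.
    if model_filter is None:
        return models
    wanted = set(model_filter.split(","))
    return [m for m in models if type(m).__name__ in wanted]
```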
### Testing
New tests are added so that pytest picks up and exercises the changes here.

### Future work with Model Suite
#181
This PR adds another unit test to the model loading / config system from the last PR. Specifically, here we ensure that the ops specified in the config are actually run by the model itself. This is important, as updates to torch could change how backward passes are computed. Furthermore, if we expect folks to write kernels for a set of ops and then run the model, we should guarantee those ops are used.
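One way such a test can work is to record every op the model actually dispatches and compare against the config list; a sketch using `TorchDispatchMode` (the real test in test/ may be implemented differently):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class OpRecorder(TorchDispatchMode):
    """Records the name of every aten op dispatched while active."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        self.seen.add(str(func))  # e.g. "aten.mm.default"
        return func(*args, **(kwargs or {}))

def assert_config_ops_are_used(model, inputs, config_ops):
    recorder = OpRecorder()
    with recorder:
        out = model(*inputs)
        out.sum().backward()  # capture backward-pass ops too
    missing = set(config_ops) - recorder.seen
    assert not missing, f"ops in config but never dispatched: {missing}"
```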

### Future work with Model Suite
#181
This PR adds end-to-end model correctness testing to the model suite by comparing the outputs and gradients (after a backward pass) over one iteration of the model.

We also integrate it into CI.
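A minimal sketch of that comparison, assuming a reference model run with stock PyTorch ops and a test model run with the custom kernels (the helper name and tolerances are assumptions):

```python
import torch

def check_one_iteration(ref_model, test_model, inputs, rtol=1e-4, atol=1e-4):
    # Forward both models on identical inputs and compare outputs.
    ref_out, test_out = ref_model(*inputs), test_model(*inputs)
    assert torch.allclose(ref_out, test_out, rtol=rtol, atol=atol), "output mismatch"

    # One backward pass each, then compare parameter gradients pairwise
    # (assumes the two models were constructed identically, so their
    # named_parameters() iterate in the same order).
    ref_out.sum().backward()
    test_out.sum().backward()
    for (name, p_ref), (_, p_test) in zip(
        ref_model.named_parameters(), test_model.named_parameters()
    ):
        assert torch.allclose(p_ref.grad, p_test.grad, rtol=rtol, atol=atol), (
            f"gradient mismatch on {name}"
        )
```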

### Testing

Running `uv run python BackendBench/scripts/main.py --suite model --backend directory` with a working mm kernel and a watermarked kernel for everything else yields:
```bash
[2025-10-02 07:16:13][INFO][main.py] ============================================================
[2025-10-02 07:16:13][INFO][main.py] MODEL EVALUATION RESULTS
[2025-10-02 07:16:13][INFO][main.py] ============================================================
[2025-10-02 07:16:13][INFO][model.py]
Model: ToyCoreOpsModel
[2025-10-02 07:16:13][INFO][model.py] Status: ✗ Failed (0/3 tests)
[2025-10-02 07:16:13][INFO][model.py]   ✗ small_batch
[2025-10-02 07:16:13][INFO][model.py]     Error: Model ToyCoreOpsModel::small_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 32, 32] and num_groups=8
[2025-10-02 07:16:13][INFO][model.py]   ✗ medium_batch
[2025-10-02 07:16:13][INFO][model.py]     Error: Model ToyCoreOpsModel::medium_batch failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [4, 3, 64, 64] and num_groups=8
[2025-10-02 07:16:13][INFO][model.py]   ✗ large_input
[2025-10-02 07:16:13][INFO][model.py]     Error: Model ToyCoreOpsModel::large_input failed: Expected number of channels in input to be divisible by num_groups, but got input of shape [2, 3, 128, 128] and num_groups=8
[2025-10-02 07:16:13][INFO][model.py]
Model: SmokeTestModel
[2025-10-02 07:16:13][INFO][model.py] Status: ✓ Passed (3/3 tests)
[2025-10-02 07:16:13][INFO][model.py]   ✓ small_batch
[2025-10-02 07:16:13][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:16:13][INFO][model.py]   ✓ medium_batch
[2025-10-02 07:16:13][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:16:13][INFO][model.py]   ✓ large_batch
[2025-10-02 07:16:13][INFO][model.py]     Output match: ✓  Gradients match: ✓ (4 gradients)
[2025-10-02 07:16:13][INFO][main.py] ============================================================
```

### Future work with Model Suite
#181
@PaliC PaliC changed the title Refactor TorchBench for ModelSuite inheritance [ModelSuite] Refactor TorchBench for ModelSuite inheritance Oct 2, 2025
@PaliC PaliC closed this Oct 2, 2025
@PaliC PaliC reopened this Oct 2, 2025
@PaliC PaliC marked this pull request as ready for review October 2, 2025 08:33
@PaliC PaliC requested review from jiannanWang and msaroufim October 2, 2025 09:55

meta-cla bot commented Oct 17, 2025

Hi @PaliC!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

