
added gpu benchmarking script #192

Merged 5 commits into main on Aug 20, 2024
Conversation

@jcaip (Contributor) commented Apr 30, 2024

Add combined GPU sparsity benchmarking script.

This is really a combination of two scripts: https://gist.github.com/cpuhrsch/7fec60079cbe2daeff59c0577f933320 for BSR benchmarking and https://github.com/pytorch/pytorch/blob/8db72a430d0c3a7d3388749d5d438fb805f53407/benchmarks/sparse/benchmark_semi_structured_sparsity.py for semi-structured sparse benchmarking.

We're planning on releasing superblock soon, so I want to point the benchmarks here, with the idea being that we can farm out consumer-card benchmarks for block sparsity like we did in #174.

For superblock benchmarks, run:

python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.8 --block-size 64 --dtype fp32
python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.9 --block-size 64 --dtype fp32
python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.8 --block-size 32 --dtype fp32
python benchmarks/benchmark_gpu_sparsity.py --mode sam-vitb-shapes --sparsity block-sparse --sparsity-level 0.9 --block-size 32 --dtype fp32

@facebook-github-bot added the CLA Signed label on Apr 30, 2024 (authors need to sign the CLA before a PR can be reviewed).
if args.save:
    save_file = f"{args.mode}_{args.dtype}_{args.backend}.csv"
    df.to_csv(save_file)
    print(f"Finished benchmark: {args.mode} saved results to {save_file}")
Member:
wanna also recommend people post their results on a central issue here?

import torch.utils.benchmark as benchmark
import torch.nn.functional as F
from torch import nn
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured
Member:

does this require nightlies?

Contributor (Author):

We should probably log torch.__version__, but this doesn't require nightlies. Is there a way we can track the torchao version as well?
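
A minimal sketch of that version logging (assuming the installed torchao build exposes __version__, hence the getattr fallback):

import torch
import torchao

# Record the library versions alongside the benchmark results.
print(f"torch version: {torch.__version__}")
# Assumption: released torchao builds expose __version__; fall back if not.
print(f"torchao version: {getattr(torchao, '__version__', 'unknown')}")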

    return sparse_weight


def benchmark_in_us(f, *args, **kwargs):
Contributor:

There are a couple of caveats to this function. I somewhat trust it less than using cuda synchronize and a for loop. Also, I'd report the standard deviation as well: if you measure 5us but it's +/- 20us, something went wrong. blocked_autorange is supposed to help with that, but it's better to verify and print it.
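
For reference, a minimal sketch of the synchronize-and-loop style of timing referred to above, reporting both mean and standard deviation in microseconds (the function name, warmup, and iteration counts are illustrative, not from this PR):

import statistics
import time

import torch

def manual_benchmark_us(f, *args, warmup=10, iters=100, **kwargs):
    # Warm up so one-time costs (compilation, caching) don't pollute the timing.
    for _ in range(warmup):
        f(*args, **kwargs)
    torch.cuda.synchronize()
    times_us = []
    for _ in range(iters):
        start = time.perf_counter()
        f(*args, **kwargs)
        torch.cuda.synchronize()  # wait for the kernel to finish before reading the clock
        times_us.append((time.perf_counter() - start) * 1e6)
    return statistics.mean(times_us), statistics.stdev(times_us)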

Contributor (Author):

Yeah, I think blocked autorange is not great. Actually, for these benchmarks a lot of the time it's just running once, likely as @HDCharles highlighted here.

I think we can use adaptive autorange instead, wrapped with torch.cuda.synchronize(), to minimize the variability.
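
A hedged sketch of that approach (the name adaptive_benchmark_us and the min_run_time value are illustrative, not the PR's final code):

import statistics

import torch
import torch.utils.benchmark as benchmark

def adaptive_benchmark_us(f, *args, **kwargs):
    # Put a device synchronize inside the timed statement so the measurement
    # covers the full kernel execution, then let adaptive_autorange decide how
    # many runs are needed.
    t = benchmark.Timer(
        stmt="f(*args, **kwargs); torch.cuda.synchronize()",
        globals={"f": f, "args": args, "kwargs": kwargs, "torch": torch},
    )
    m = t.adaptive_autorange(min_run_time=0.1)
    mean_us = m.mean * 1e6
    # Report the spread as well, per the review comment above.
    std_us = statistics.stdev(m.times) * 1e6 if len(m.times) > 1 else 0.0
    return mean_us, std_us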

def run_gpu_sparse_benchmark(m, k, n, args):
    dtype = DTYPE_LOOKUP[args.dtype]

    x = torch.randn(n, k).to(dtype).cuda()
Contributor:

Since we don't care about accuracy here, I assume (subnormal-number performance aside) you could also try torch.empty(n, k, dtype=dtype, device='cuda'), which might be faster to allocate and doesn't require calling randn. Especially if you run a lot of benchmarks in a row, this can become annoying to wait for.

Contributor (Author):

I think we want to avoid this because it will bias the numbers: https://www.thonking.ai/p/strangely-matrix-multiplications

Contributor:

Yes, zeros will be an issue, and also subnormal numbers and such. But yes, we can't rely on empty not to give us just all zeros.

Hm, I guess if this ever really becomes a bottleneck we can write a simpler random number generator (e.g. arange plus a mod with a prime number).
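
A purely illustrative sketch of that idea (the helper name and the prime are made up for this example; the script itself keeps using torch.randn):

import torch

def cheap_pseudo_random(n, k, dtype=torch.float32, device="cuda", prime=9973):
    # arange modulo a prime gives a varied, mostly non-repeating pattern
    # without a full RNG pass; scale into [0, 1) to avoid huge values.
    vals = torch.arange(n * k, device=device) % prime
    return (vals.to(dtype) / prime).reshape(n, k)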

elif args.eval_fn == "mm":
    dense_output = torch.mm(x, A)
    sparse_output = torch.mm(x, A_sparse)
    correct = torch.allclose(dense_output, sparse_output, rtol=1e-3, atol=1e-3)
Contributor:

Alright, so we do care about correctness. It seems like maybe something to turn on/off. Morally this should be covered by unit tests, but I also never mind more sources of verification.

Contributor:

You can use torch.testing.assert_allclose to have it raise an exception that reports the location and the error.
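
A small illustration of that suggestion, using torch.testing.assert_close (the non-deprecated successor to assert_allclose); the tensors here are stand-ins for the dense_output and sparse_output computed above:

import torch

dense_output = torch.randn(128, 128)
sparse_output = dense_output + 1e-5 * torch.randn(128, 128)
# Raises an AssertionError that pinpoints the mismatched elements and their
# absolute/relative error, instead of just returning a bool like allclose.
torch.testing.assert_close(sparse_output, dense_output, rtol=1e-3, atol=1e-3)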

Contributor (Author):

I think when I first wrote this script we didn't have tests. Let's just remove this correctness check, since we have better testing now.

}


if __name__ == "__main__":
Contributor:

There are pros and cons to creating a new process for each benchmark, but it seems like in general this script will need a default setting that runs all the relevant or interesting configurations. If someone runs python benchmark_gpu_sparsity.py and then posts the result, is that enough to be useful?

pytorch-bot bot commented May 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/192

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7f8b773 with merge base f2c908b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jcaip merged commit 0991ba9 into main on Aug 20, 2024
16 checks passed