added gpu benchmarking script #192
Conversation
if args.save:
    save_file = f"{args.mode}_{args.dtype}_{args.backend}.csv"
    df.to_csv(save_file)
    print(f"Finished benchmark: {args.mode} saved results to {save_file}")
wanna also recommend people post their results on a central issue here?
import torch.utils.benchmark as benchmark
import torch.nn.functional as F
from torch import nn
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured
does this require nightlies?
We should probably log torch.__version__, but this doesn't require nightlies. Is there a way we can track the torchao version as well?
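To make runs comparable, the version logging could be a small helper. This is a minimal sketch, assuming only that both packages are installed via pip; `log_versions` is a hypothetical name, not part of the script:

```python
from importlib import metadata


def log_versions(packages=("torch", "torchao")):
    """Print and return the installed version of each package, if present."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
        print(f"{pkg}: {versions[pkg]}")
    return versions
```

Using importlib.metadata avoids importing torchao just to read a version string, and degrades gracefully when the package is absent.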
benchmarks/benchmark_gpu_sparsity.py
Outdated
    return sparse_weight


def benchmark_in_us(f, *args, **kwargs):
There are a couple of caveats to this function. I somewhat trust it less than using cuda synchronize and a for loop. I'd also report the standard deviation: if you measure 5us but it's +/- 20us, something went wrong. blocked_autorange is supposed to help with that, but better to verify and print it.
Yeah, I think blocked_autorange is not great - for these benchmarks, a lot of the time it's just running once, likely as @HDCharles highlighted here.
I think we can use adaptive_autorange instead, wrapped in torch.cuda.synchronize() calls, to minimize the variability.
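A sketch of the adaptive_autorange variant, assuming a recent PyTorch; the standard deviation is computed from the Measurement's per-run times, and `benchmark_in_us` here is an illustrative reimplementation, not the script's actual code:

```python
import statistics

import torch
import torch.utils.benchmark as benchmark


def benchmark_in_us(stmt, glob):
    """Time `stmt` with adaptive_autorange; returns (mean_us, std_us).

    adaptive_autorange keeps sampling until the measurement is stable (or a
    time limit is hit), avoiding the single-run trap of blocked_autorange
    on fast kernels.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain pending kernels before timing
    m = benchmark.Timer(stmt=stmt, globals=glob).adaptive_autorange()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    mean_us = m.mean * 1e6
    std_us = statistics.stdev(m.times) * 1e6 if len(m.times) > 1 else 0.0
    return mean_us, std_us
```

The synchronize calls guard the timed region on GPU; on CPU they are no-ops, so the helper works either way.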
def run_gpu_sparse_benchmark(m, k, n, args):
    dtype = DTYPE_LOOKUP[args.dtype]

    x = torch.randn(n, k).to(dtype).cuda()
Since we don't care about accuracy here (subnormal number performance aside), you could also try torch.empty(n, k, dtype=dtype, device='cuda'), which might be faster to allocate and doesn't require calling randn. Especially if you run a lot of benchmarks in a row, the setup time can become annoying to wait for.
I think we want to avoid this because it will bias the numbers: https://www.thonking.ai/p/strangely-matrix-multiplications
Yes, zeros will be an issue, as will subnormal numbers and the like. But you're right that we can't rely on empty not to give us all zeros.
Hm, I guess if this ever really becomes a bottleneck we can write a simpler random number generator (like arange plus a mod with a prime number, etc.).
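The arange-mod-prime idea could be sketched as follows; `cheap_nonzero_fill` is a hypothetical helper, and the prime 251 is an arbitrary illustrative choice that keeps every value small, non-zero, and exactly representable in fp16:

```python
import torch


def cheap_nonzero_fill(n, k, dtype=torch.float16, device="cpu"):
    """Fill an (n, k) tensor with a varied, non-zero pattern without randn.

    arange modulo a prime, shifted by one, gives values in [1, p], which
    avoids the all-zero (and subnormal-heavy) inputs that bias matmul
    timings, while being much cheaper than sampling a normal distribution.
    """
    p = 251
    vals = torch.arange(n * k, device=device) % p + 1
    return vals.reshape(n, k).to(dtype)
```

This keeps setup fast across many benchmark runs while sidestepping the empty-tensor bias discussed above.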
benchmarks/benchmark_gpu_sparsity.py
Outdated
elif args.eval_fn == "mm":
    dense_output = torch.mm(x, A)
    sparse_output = torch.mm(x, A_sparse)
    correct = torch.allclose(dense_output, sparse_output, rtol=1e-3, atol=1e-3)
Alright, so we do care about correctness. It seems like maybe something to turn on/off. Morally this should be covered by unit tests, but I also never mind more sources of verification.
You can use torch.testing.assert_allclose to have it raise an exception that reports the location and magnitude of the error.
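A sketch of that check; note that in recent PyTorch releases torch.testing.assert_allclose is deprecated in favor of torch.testing.assert_close, which is used below. `check_outputs_match` is an illustrative wrapper, not part of the script:

```python
import torch


def check_outputs_match(sparse_output, dense_output, rtol=1e-3, atol=1e-3):
    # Unlike torch.allclose's bare bool, assert_close raises an
    # AssertionError whose message reports how many elements mismatch and
    # the greatest absolute and relative differences.
    torch.testing.assert_close(sparse_output, dense_output, rtol=rtol, atol=atol)
```

Wrapping it this way makes it easy to gate the check behind a flag, per the on/off suggestion above.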
I think when I first wrote this script, we didn't have tests. Let's just remove this correctness checking - since we have better testing now.
}


if __name__ == "__main__":
There are pros/cons to creating new processes for each benchmark, but it seems like in general this script will need a default setting to run all relevant or interesting configurations. If someone runs python benchmark_gpu_sparsity.py and then posts the result, is that enough to be useful?
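One way to give a no-arguments run a useful default is a small config sweep. The shapes, dtypes, and sparsity names below are illustrative placeholders, not the script's actual configuration:

```python
import itertools

# Hypothetical defaults for a bare `python benchmark_gpu_sparsity.py` run.
DEFAULT_SHAPES = [(3072, 3072, 3072), (4096, 4096, 4096)]
DEFAULT_DTYPES = ["fp16", "bf16"]
DEFAULT_SPARSITIES = ["semi-structured", "block-sparse"]


def default_configs():
    """Yield every (shape, dtype, sparsity) combination as a config dict."""
    for (m, k, n), dtype, sparsity in itertools.product(
        DEFAULT_SHAPES, DEFAULT_DTYPES, DEFAULT_SPARSITIES
    ):
        yield {"m": m, "k": k, "n": n, "dtype": dtype, "sparsity": sparsity}
```

The benchmark entry point could then iterate over default_configs() when no CLI arguments are given, so a posted result always covers a comparable set of cases.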
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/192
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7f8b773 with merge base f2c908b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add combined GPU sparsity benchmarking script.
This is really a combination of two scripts: https://gist.github.com/cpuhrsch/7fec60079cbe2daeff59c0577f933320 for BSR benchmarking and https://github.com/pytorch/pytorch/blob/8db72a430d0c3a7d3388749d5d438fb805f53407/benchmarks/sparse/benchmark_semi_structured_sparsity.py for semi-structured sparse benchmarking.
We're planning on releasing superblock soon, so I want to point the benchmarks here, with the idea being that we can farm out consumer-card benchmarks for block sparse like we did with #174.
For superblock benchmarks run: