port masked_select from TH to ATen and optimize perf on CPU #33269
Conversation
In order to parallelize masked_select, this PR ports it from TH to ATen. Below is a performance comparison between the original implementation and this PR; benchmark code to reproduce it is available at op_bench-py. Times are in milliseconds, reported for a single-socket run and a single-core run.
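A minimal sketch of what is being measured (illustrative only, not the actual op_bench-py script; the [128, 1000] shape mirrors the sizes in the results below):

import torch

x = torch.randn(128, 1000)            # one of the input sizes reported below
mask = x > 0                          # boolean mask selecting roughly half the elements
out = torch.masked_select(x, mask)    # 1-D tensor holding the selected elements
print(x.shape, '->', out.shape)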
💊 CI failures summary and remediations (as of commit 065825b; more details on the Dr. CI page):
🕵️ 3 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
Nice job, looks good to me overall. I think you need to rebase onto the current master to fix the conflict.
Force-pushed 3532e7c to c560e14
Rebased.
Sorry, is this related to this one?
Oops, wrong browser tab, sorry.
Hey @mingfeima, are you sure your performance comparison between your implementation and the old TH implementation is correct? I was running some measurements on my CUDA implementation, and I decided to collect the CPU performance data as well to make sure my changes won't negatively affect yours. Here's the performance I see with a single CPU core, using your commit 1e3b5c4.
My performance measurement script is here, perhaps there's some mistake in it: https://github.com/kurtamohler/pytorch-perf-test-scripts/blob/master/masked_select/masked_select-perf.py
I suspect there might be an issue with how your script is measuring time: https://github.com/mingfeima/op_bench-py/blob/master/masked_select.py#L20
The time.time() call uses a syscall, which could take significantly longer to finish than the masked_select call itself. It's usually more accurate to call time.time() only twice, surrounding the loop, and then divide the total time by the number of loop iterations, like this:
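Roughly this pattern (an illustrative sketch: the shapes and iteration count are placeholders, and torch.set_num_threads(1) is just one way to make sure the loop really runs on one core):

import time
import torch

torch.set_num_threads(1)            # ensure a true single-core run
x = torch.randn(1024, 1000)
mask = x > 0

niter = 1000
start = time.time()                 # only two time.time() calls in total
for _ in range(niter):
    torch.masked_select(x, mask)
end = time.time()
print('%.3f ms per iteration' % ((end - start) / niter * 1e3))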
Or maybe the difference is just that your script might not be ensuring a single-core run.
Actually, I think your performance measurement script is sufficient. When I run it, I get something that seems to be similar to my measurements, showing a performance decrease. TH CPU masked_select:
ATen CPU masked_select:
TH CPU masked_select (modified to use only a single core):
ATen CPU masked_select (modified to use only a single core):
@kurtamohler What type of CPU are you using? And what about the environment settings?
Here is my machine configuration:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 1000.125
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
### this refers to ATen implementation
(pytorch-mingfei) [mingfeim@mlt-skx090 op_bench-py]$ ./run.sh masked_select.py
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1
### using numactl --physcpubind=0-19 --membind=0
### using OMP_NUM_THREADS=20
input size: [128 1000]; output size: [63941]: time = 0.295 ms
input size: [256 1000]; output size: [127483]: time = 0.413 ms
input size: [512 1000]; output size: [255929]: time = 0.700 ms
input size: [1024 1000]; output size: [511179]: time = 1.238 ms
### using OMP_NUM_THREADS=1
input size: [128 1000]; output size: [63941]: time = 0.703 ms
input size: [256 1000]; output size: [127483]: time = 1.421 ms
input size: [512 1000]; output size: [255929]: time = 2.823 ms
input size: [1024 1000]; output size: [511179]: time = 5.675 ms
### this refers to original TH implementation
(pytorch-cuda) [mingfeim@mlt-skx090 op_bench-py]$ ./run.sh masked_select.py
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1
### using numactl --physcpubind=0-19 --membind=0
### using OMP_NUM_THREADS=20
input size: [128 1000]; output size: [63941]: time = 0.817 ms
input size: [256 1000]; output size: [127483]: time = 1.899 ms
input size: [512 1000]; output size: [255929]: time = 3.747 ms
input size: [1024 1000]; output size: [511179]: time = 6.696 ms
### using OMP_NUM_THREADS=1
input size: [128 1000]; output size: [63941]: time = 0.812 ms
input size: [256 1000]; output size: [127483]: time = 1.620 ms
input size: [512 1000]; output size: [255929]: time = 3.246 ms
input size: [1024 1000]; output size: [511179]: time = 6.488 ms
### this refers to ATen implementation
(pytorch-mingfei) [mingfeim@mlt-skx089 op_bench-py]$ ./run.sh masked_select.py
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1
### using numactl --physcpubind=0-19 --membind=0
### using OMP_NUM_THREADS=20
input size: [128 1000]; output size: [63941]: time = 0.320 ms
input size: [256 1000]; output size: [127483]: time = 0.447 ms
input size: [512 1000]; output size: [255929]: time = 0.706 ms
input size: [1024 1000]; output size: [511179]: time = 1.255 ms
### using OMP_NUM_THREADS=1
input size: [128 1000]; output size: [63941]: time = 0.698 ms
input size: [256 1000]; output size: [127483]: time = 1.416 ms
input size: [512 1000]; output size: [255929]: time = 2.814 ms
input size: [1024 1000]; output size: [511179]: time = 5.651 ms
### this refers to original TH implementation
(pytorch-cuda) [mingfeim@mlt-skx089 op_bench-py]$ ./run.sh masked_select.py
### using KMP_AFFINITY=granularity=fine,compact,1,0
### using KMP_BLOCKTIME=1
### using numactl --physcpubind=0-19 --membind=0
### using OMP_NUM_THREADS=20
input size: [128 1000]; output size: [63941]: time = 0.874 ms
input size: [256 1000]; output size: [127483]: time = 1.900 ms
input size: [512 1000]; output size: [255929]: time = 3.711 ms
input size: [1024 1000]; output size: [511179]: time = 6.727 ms
### using OMP_NUM_THREADS=1
input size: [128 1000]; output size: [63941]: time = 0.860 ms
input size: [256 1000]; output size: [127483]: time = 1.612 ms
input size: [512 1000]; output size: [255929]: time = 3.228 ms
input size: [1024 1000]; output size: [511179]: time = 6.450 ms

Since this is not an in-place op, it is almost a must to configure jemalloc (tcmalloc also works); otherwise page clearing will cause a lot of interference. (Also, jemalloc is not NUMA-aware, so be careful on a dual-socket machine.) At the algorithm level, it is possible that on a single core with larger tensors this ATen implementation introduces a slowdown. A more efficient way here would be to write two sets of kernels.
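For context, a rough Python sketch of the two-pass, prefix-sum style scheme that parallel masked_select implementations typically use (an illustration of the general idea, not the actual ATen kernel in this PR):

import torch

def masked_select_two_pass(x, mask):
    # Pass 1: an inclusive prefix sum over the mask gives every selected element
    # its destination offset, so pass 2 can copy elements independently (and thus
    # in parallel). A serial implementation needs only a single pass, which is one
    # reason a parallel-friendly scheme can lose on a single core.
    flat_x, flat_mask = x.reshape(-1), mask.reshape(-1)
    offsets = flat_mask.to(torch.long).cumsum(0)           # 1-based output offsets
    out = flat_x.new_empty(int(offsets[-1]) if flat_mask.numel() else 0)
    out[offsets[flat_mask] - 1] = flat_x[flat_mask]        # pass 2: scatter selected values
    return out

# agrees with the built-in op for contiguous inputs
x = torch.randn(4, 5)
m = x > 0
assert torch.equal(masked_select_two_pass(x, m), torch.masked_select(x, m))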
My CPU is an AMD Ryzen Threadripper 2970WX, which has 32 KB of L1d and 512 KB of L2 per core, and 8 MB of L3 per 3 cores. But I don't think the difference between our cache sizes could be the explanation, because the measurements I posted show this slowdown even at relatively small tensor sizes, down to 1000, which should fit within L1 for both of us. Is the TH version using a different amount of memory than your ATen version? I'm not really sure what jemalloc is. I guess it's just one of the implementations of malloc available as a kernel module, right? I'm guessing my malloc is not jemalloc, which might explain our machines' performance difference?
auto shape = _self.sizes().vec();
auto mask_long = at::empty(shape, self.options().dtype(at::kLong)).copy_(_mask);
int64_t numel = mask_long.sum().item().toLong();
While working on my CUDA version of masked_select, I realized that we can reuse the final element from the cumulative sum result to get numel, rather than running a whole other sum calculation.
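To make the suggestion concrete (a hedged sketch of the idea, not the actual patch): if the kernel already computes an inclusive prefix sum of the mask to produce output offsets, its last element is exactly the number of selected elements, so the separate sum pass can be dropped.

import torch

mask = torch.tensor([True, False, True, True, False])
prefix = mask.to(torch.long).cumsum(0)   # already needed to compute output offsets
numel = int(prefix[-1])                  # same value as mask.sum(), without another pass
assert numel == int(mask.sum())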
@VitalyFedyunin, when you get a chance, could you review this PR? I've ported the CUDA version, but it depends on some of mingfeima's changes here, so I think I have to wait until this PR is merged before creating my PR.
@anjali411 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Ok, cool, so AMD perf issues are resolved.
Generally looks good, just some small comments.
Force-pushed 8f8102f to 065825b
@ngimel thanks for the comments, all fixed!
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…ytorch#33269)" This reverts commit fe66bdb. This also makes a change to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.
…33269)" (#41828)
Summary: Pull Request resolved: #41828
This reverts commit fe66bdb. This also makes a change to THTensorEvenMoreMath because sumall was removed; see THTensor_wrap.
Test Plan: Imported from OSS
Reviewed By: orionr
Differential Revision: D22657473
Pulled By: malfet
fbshipit-source-id: 95a806cedf1a3f4df91e6a21de1678252b117489
…on CPU (pytorch#33269)" (pytorch#41828)" This reverts commit 71aad6e.
Summary: This fixes #41473 for discontiguous input, mask and out. Tests to follow. Reverting #33269 is not a great solution because I'm told masked_select was needed for printing complex tensors. cc gchanan, zou3519, ezyang
Pull Request resolved: #41841
Reviewed By: mruberry
Differential Revision: D22706943
Pulled By: ngimel
fbshipit-source-id: 413d7fd3f3308b184de04fd56b8a9aaabcad22fc
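For illustration, a hypothetical minimal case of the discontiguous inputs mentioned above (an assumed repro shape, not necessarily the exact case from #41473):

import torch

x = torch.randn(3, 4).t()                      # non-contiguous (transposed) input
mask = torch.zeros(4, 3, dtype=torch.bool)
mask[0] = True                                 # select the first (logical) row
out = torch.masked_select(x, mask)
assert torch.equal(out, x[0])                  # must hold regardless of memory layout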
This PR ports masked_select from TH to ATen and optimizes its performance on CPU with TensorIterator. #33053