
Reduce overheads on several CPU kernels by avoiding restrides. #36875

Closed
wants to merge 13 commits

Conversation


@robieta robieta commented Apr 18, 2020

Calling `t.as_strided(...)` must create a new TensorImpl to back the new tensor, which takes 300-400 ns. The reduction, scatter/gather, and comparison kernels currently restride inputs and outputs in order to handle `dim` inside the function passed to TensorIterator. Because these Tensors are created solely for consumption by the iterator, a full restride and metadata copy is surplus to requirements. Moreover, shapes are already checked by these kernels before calling `add_input` and `add_output`, so shape inference and broadcasting are unnecessary as well.

This PR adds a TensorIterator::declare_static_shape(...) method, which allows certain kernels to use a much more constrained and efficient shape path. This results in a 900-1200 ns speedup for gather / scatter / scatter_add / cumsum / cumprod and a 250-500 ns speedup for elementwise min and max.
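For context, here is a minimal sketch of how a kernel might opt in to the new path. The builder method names reflect the TensorIterator API of this PR's vintage and the surrounding kernel details (dtype dispatch, zero-dim handling, the inner loop) are omitted:

```cpp
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>

// Sketch of a gather-style kernel using the static-shape fast path.
// `declare_static_shape` is the method this PR adds; everything else
// is simplified for illustration.
void gather_kernel_sketch(at::Tensor& result, const at::Tensor& self,
                          int64_t dim, const at::Tensor& index) {
  auto iter = at::TensorIterator();
  iter.dont_compute_common_dtype();  // dtypes already validated by the caller
  iter.dont_resize_outputs();        // output shape already validated
  // Skip shape inference and broadcasting entirely: iterate over index's
  // shape and squash `dim`, which the loop body handles manually. This
  // avoids the as_strided() restride and its ~300-400 ns TensorImpl cost.
  iter.declare_static_shape(index.sizes(), /*squash_dim=*/dim);
  iter.add_output(result);
  iter.add_input(self);
  iter.add_input(index);
  iter.build();
  // ... run the serial/parallel loop over iter here ...
}
```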

Measurements were taken with this Python script, which is driven by this bash script. The general procedure for mitigating environmental skew is to repeatedly switch between an environment built with master and one built with this branch while running the Python script. Within the measurement script, the following was used to reduce variation:

  • Set the number of threads to 1.
  • Aggressively and randomly interleave task measurements, to limit correlation between a task's result and system state (when it was run, or which task preceded it).
  • Include a warmup period, dropping the first three passes through all of the tasks.

Two independent end-to-end runs are included, since there is some variation even with the above measures. Overall measurement error appears to be about +/- 100 ns.

The benchmark also includes several tasks which are not affected by this PR, both to check for a degradation in TensorIterator performance when static shapes are not set (which did happen for an earlier iteration of this optimization) and to estimate measurement variability and validate that measured improvements are significant.
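For readers who want to reproduce a similar setup, here is an illustrative sketch of the interleaved-measurement scheme in C++. This is not the author's script (that was Python); the task set, pass counts, and timing granularity are placeholders:

```cpp
// Illustrative sketch: randomly interleaved benchmarking with warmup.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <numeric>
#include <random>
#include <string>
#include <vector>

int main() {
  // Placeholder tasks; the real benchmark timed gather/scatter/cumsum/etc.
  std::vector<std::pair<std::string, std::function<void()>>> tasks = {
      {"task_a", [] { /* op under test */ }},
      {"task_b", [] { /* op under test */ }},
  };

  constexpr int kPasses = 20;  // full passes over all tasks
  constexpr int kWarmup = 3;   // drop the first three passes
  std::mt19937 rng(0);
  std::vector<std::vector<double>> times(tasks.size());

  for (int pass = 0; pass < kPasses; ++pass) {
    // Randomize task order each pass so a task's timing does not
    // correlate with what ran immediately before it.
    std::vector<size_t> order(tasks.size());
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), rng);

    for (size_t i : order) {
      auto start = std::chrono::steady_clock::now();
      tasks[i].second();
      auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::steady_clock::now() - start).count();
      if (pass >= kWarmup) times[i].push_back(static_cast<double>(ns));
    }
  }

  // Report per-task medians.
  for (size_t i = 0; i < tasks.size(); ++i) {
    auto& t = times[i];
    std::nth_element(t.begin(), t.begin() + t.size() / 2, t.end());
    std::printf("%s: median %.0f ns\n", tasks[i].first.c_str(),
                t[t.size() / 2]);
  }
  return 0;
}
```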

First run:

Task                     | Delta (median)|   Master     (25%,  75%)        |  Branch    (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     920      |    4,000     (-170, +230)       |  3,100     (-110, +140)
gather_dim0              |     910      |    4,100     (-170, +230)       |  3,200     (-110, +150)
gather_dim1              |   1,200      |    4,400     (-190, +240)       |  3,200     (-120, +150)
scatter_1D               |   1,100      |    2,800     (-120, +160)       |  1,700     (-64 , +81)
scatter_dim0             |   1,000      |    2,900     (-130, +160)       |  1,900     (-72 , +95)
scatter_dim1             |   1,200      |    3,200     (-130, +170)       |  1,900     (-67 , +87)
scatter_add_1D           |   1,100      |    2,800     (-120, +150)       |  1,700     (-68 , +89)
scatter_add_dim0         |   1,000      |    2,900     (-120, +150)       |  1,900     (-77 , +93)
scatter_add_dim1         |   1,300      |    3,100     (-140, +180)       |  1,900     (-76 , +92)
cumsum_1D                |   1,000      |    4,600     (-200, +260)       |  3,600     (-120, +170)
cumsum_dim0              |     860      |    4,500     (-190, +240)       |  3,700     (-140, +180)
cumsum_dim1              |   1,200      |    4,800     (-210, +260)       |  3,700     (-130, +180)
cumprod_1D               |   1,000      |    4,600     (-200, +270)       |  3,600     (-130, +170)
cumprod_dim0             |     910      |    4,600     (-210, +270)       |  3,700     (-130, +170)
cumprod_dim1             |   1,200      |    4,900     (-220, +290)       |  3,700     (-130, +170)
min_dim0                 |     280      |    5,900     (-220, +270)       |  5,600     (-220, +260)
min_dim1                 |     560      |    6,200     (-230, +310)       |  5,600     (-230, +270)
max_dim0                 |     320      |    5,900     (-220, +280)       |  5,600     (-200, +250)
max_dim1                 |     540      |    6,100     (-250, +310)       |  5,600     (-200, +250)
std       (reference)    |      58      |    4,300     (-180, +280)       |  4,200     (-160, +200)
clamp     (reference)    |      87      |    3,400     (-160, +220)       |  3,400     (-140, +170)
argmin    (reference)    |     -85      |    3,900     (-170, +250)       |  4,000     (-170, +200)
sum       (reference)    |     -11      |    4,200     (-180, +240)       |  4,200     (-160, +190)
x < y     (reference)    |     110      |    3,700     (-170, +290)       |  3,500     (-140, +150)
max(x, y) (reference)    |     170      |    3,600     (-170, +200)       |  3,400     (-140, +180)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.

Second run:

Task                     | Delta (median)|   Master     (25%,  75%)        |  Branch    (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     850      |    3,900     (-130, +150)       |  3,000     (-110, +130)
gather_dim0              |     860      |    4,000     (-140, +150)       |  3,200     (-110, +150)
gather_dim1              |   1,200      |    4,300     (-160, +160)       |  3,200     (-110, +150)
scatter_1D               |   1,100      |    2,700     (-98 , +110)       |  1,700     (-64 , +83)
scatter_dim0             |     950      |    2,800     (-100, +110)       |  1,900     (-67 , +88)
scatter_dim1             |   1,200      |    3,100     (-120, +140)       |  1,900     (-69 , +88)
scatter_add_1D           |   1,100      |    2,700     (-92 , +110)       |  1,700     (-65 , +95)
scatter_add_dim0         |     960      |    2,800     (-100, +100)       |  1,900     (-74 , +100)
scatter_add_dim1         |   1,200      |    3,100     (-100, +130)       |  1,900     (-72 , +100)
cumsum_1D                |     960      |    4,500     (-140, +190)       |  3,600     (-130, +170)
cumsum_dim0              |     820      |    4,500     (-140, +180)       |  3,700     (-130, +170)
cumsum_dim1              |   1,100      |    4,800     (-160, +200)       |  3,600     (-120, +170)
cumprod_1D               |     960      |    4,500     (-130, +190)       |  3,600     (-130, +180)
cumprod_dim0             |     820      |    4,500     (-150, +190)       |  3,700     (-130, +180)
cumprod_dim1             |   1,100      |    4,800     (-150, +220)       |  3,700     (-130, +180)
min_dim0                 |     260      |    5,800     (-210, +250)       |  5,500     (-200, +230)
min_dim1                 |     580      |    6,100     (-230, +270)       |  5,500     (-200, +220)
max_dim0                 |     250      |    5,800     (-210, +230)       |  5,600     (-170, +210)
max_dim1                 |     520      |    6,100     (-220, +240)       |  5,600     (-180, +210)
std       (reference)    |     170      |    4,300     (-210, +220)       |  4,100     (-160, +190)
clamp     (reference)    |     140      |    3,400     (-140, +170)       |  3,300     (-120, +170)
argmin    (reference)    |     -51      |    3,800     (-170, +190)       |  3,900     (-140, +160)
sum       (reference)    |     -58      |    4,100     (-160, +170)       |  4,200     (-170, +190)
x < y     (reference)    |      64      |    3,600     (-150, +210)       |  3,500     (-140, +180)
max(x, y) (reference)    |     120      |    3,500     (-130, +150)       |  3,400     (-130, +150)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.

CC @ilia-cher @VitalyFedyunin @glaringlee @gdankel

@robieta robieta requested a review from ngimel April 18, 2020 19:08

dr-ci bot commented Apr 18, 2020

💊 Build failures summary and remediations

As of commit 9405b4f (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.


robieta commented Apr 19, 2020

I believe the CI failures are unrelated. clang-format is complaining about test/cpp/tensorexpr/test_cuda.cpp, and I vaguely recall seeing somewhere that rebasing will fix the XLA failure. I'll rebase later and see.

I did a very long run (2 hours, 200 interleaved env switches) with the following result:

Task                     | Delta (median)|   Master     (25%,  75%)        |  Branch    (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     840      |    3,900     (-150, +180)       |  3,100     (-120, +160)
gather_dim0              |     860      |    4,100     (-150, +190)       |  3,200     (-130, +170)
gather_dim1              |   1,100      |    4,300     (-160, +190)       |  3,200     (-130, +170)
scatter_1D               |   1,100      |    2,800     (-110, +140)       |  1,700     (-70 , +100)
scatter_dim0             |   1,000      |    2,900     (-110, +150)       |  1,900     (-75 , +110)
scatter_dim1             |   1,300      |    3,100     (-120, +160)       |  1,900     (-75 , +110)
scatter_add_1D           |   1,100      |    2,700     (-110, +130)       |  1,600     (-70 , +110)
scatter_add_dim0         |   1,000      |    2,900     (-110, +140)       |  1,900     (-76 , +110)
scatter_add_dim1         |   1,200      |    3,100     (-120, +150)       |  1,800     (-76 , +110)
cumsum_1D                |     880      |    4,500     (-200, +240)       |  3,600     (-140, +190)
cumsum_dim0              |     820      |    4,500     (-220, +260)       |  3,700     (-150, +190)
cumsum_dim1              |   1,100      |    4,800     (-230, +290)       |  3,700     (-150, +190)
cumprod_1D               |     880      |    4,500     (-190, +240)       |  3,600     (-140, +180)
cumprod_dim0             |     770      |    4,500     (-210, +260)       |  3,800     (-150, +190)
cumprod_dim1             |   1,000      |    4,800     (-220, +290)       |  3,800     (-150, +190)
min_dim0                 |     140      |    5,900     (-220, +270)       |  5,700     (-230, +270)
min_dim1                 |     430      |    6,100     (-240, +290)       |  5,700     (-230, +280)
max_dim0                 |     270      |    5,900     (-220, +280)       |  5,600     (-220, +280)
max_dim1                 |     520      |    6,200     (-240, +310)       |  5,600     (-220, +280)
std       (reference)    |     170      |    4,300     (-170, +220)       |  4,100     (-170, +220)
clamp     (reference)    |      65      |    3,400     (-140, +170)       |  3,300     (-140, +190)
argmin    (reference)    |     -24      |    3,800     (-150, +200)       |  3,800     (-150, +200)
sum       (reference)    |      54      |    4,100     (-180, +220)       |  4,000     (-160, +210)
x < y     (reference)    |      93      |    3,600     (-150, +190)       |  3,500     (-150, +190)
max(x, y) (reference)    |     160      |    3,500     (-140, +180)       |  3,400     (-130, +170)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.

Master was built from the commit that this PR branched from, so meaningful differences on the reference tasks can't be chalked up to other diffs in the code. Looking at the distributions for torch.max(x, y), this is not a statistical aberration; the improvement is real despite the fact that this version of max uses TensorIterator::reduce_op rather than compare_base_kernel.

[Screenshot (2020-04-18): timing distributions for torch.max(x, y) on master vs. this branch]

This holds even when I run only max(x, y) (reference), even though it doesn't go through the shortcut. All in all, rather curious.

@robieta
Copy link
Author

robieta commented Apr 20, 2020

Ugh. Bad rebase. Sorry all.

@robieta
Copy link
Author

robieta commented Apr 20, 2020

Alright! Fast-forwarding resolved the CI failures.

@ngimel ngimel left a comment (Collaborator)

Looks good!


ngimel commented Apr 22, 2020

Interesting what's happening with max(x,y) and why it changes, but it's all good.

@facebook-github-bot facebook-github-bot left a comment (Contributor)

@robieta has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@robieta merged this pull request in 28fadfc.

@robieta robieta deleted the gh/taylorrobie/unsafe_restride branch April 22, 2020 16:38
facebook-github-bot pushed a commit that referenced this pull request Apr 23, 2020
Summary:
Fixes a safety issue (nonsense values and segfaults) introduced by #36875 when in-place gather tries to use incorrect shapes.

Consider the following block of code:
```python
k0 = 8
k1 = 8
m = 100

x = torch.rand((k0, k1))
ind = torch.randint(0, k0, (m, k1))
output = torch.empty((m, k1))

print(torch.gather(x, 0, ind, out=output))
print(torch.gather(x, 1, ind, out=output))
```

The first gather is legal; the second is not (`ind` and `output` would need to be transposed). Previously this was caught when the kernel tried to restride the inputs for TensorIterator, but we can no longer rely on those checks and must test shapes explicitly. If `m` is small, the second gather returns gibberish; if it is large enough to push the read out of the memory block, the program segfaults.
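As an illustration of the kind of explicit check the fix adds, here is a hypothetical standalone helper (not the exact code from #37102):

```cpp
#include <ATen/ATen.h>

// Hypothetical version of the shape check this fix introduces. With the
// implicit restride (and its error checking) gone, the kernel must
// validate shapes itself before handing tensors to TensorIterator.
void gather_out_shape_check(const at::Tensor& self, int64_t dim,
                            const at::Tensor& index,
                            const at::Tensor& result) {
  // With out=, the result must already have index's shape, since
  // resize_outputs is disabled on the fast path.
  TORCH_CHECK(result.sizes().equals(index.sizes()),
              "gather(out=): out must have the same shape as index");
  // Along every dimension except `dim`, index may not read past self.
  for (int64_t d = 0; d < self.dim(); ++d) {
    if (d == dim) continue;
    TORCH_CHECK(index.size(d) <= self.size(d),
                "gather(): index.size(", d, ")=", index.size(d),
                " exceeds self.size(", d, ")=", self.size(d));
  }
}
```

In the example above, such a check passes for dim=0 (`ind.size(1)` = 8 <= `x.size(1)` = 8) and fails for dim=1 (`ind.size(0)` = 100 > `x.size(0)` = 8), which is exactly the call that previously produced gibberish or a segfault.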
Pull Request resolved: #37102

Differential Revision: D21190580

Pulled By: robieta

fbshipit-source-id: 80175620d24ad3380d78995f7ec7dbf2627d2998