
Reduce overheads on several CPU kernels by avoiding restrides. #36875

Closed
wants to merge 13 commits

Conversation


@robieta robieta commented Apr 18, 2020

Calling `t.as_strided(...)` must create a new TensorImpl to back the new tensor, which takes 300-400 ns. The reduction, scatter/gather, and comparison kernels currently restride inputs and outputs in order to handle `dim` inside the function passed to TensorIterator. Because these Tensors are created solely for consumption by the iterator, a full restride and metadata copy is surplus to requirements. Moreover, shapes are already checked by these kernels before calling `add_input` and `add_output`, so shape inference and broadcasting are unnecessary as well.

This PR adds a TensorIterator::declare_static_shape(...) method, which allows certain kernels to use a much more constrained and efficient shape path. This results in a 900-1200 ns speedup for gather / scatter / scatter_add / cumsum / cumprod and a 250-500 ns speedup for elementwise min and max.
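For context, here is a minimal sketch of how a kernel might opt in to the new path. The builder method names reflect the TensorIterator API of this PR's vintage and the surrounding kernel details (dtype dispatch, zero-dim handling, the inner loop) are omitted:

```cpp
#include <ATen/ATen.h>
#include <ATen/native/TensorIterator.h>

// Sketch of a gather-style kernel using the static-shape fast path.
// `declare_static_shape` is the method this PR adds; everything else
// is simplified for illustration.
void gather_kernel_sketch(at::Tensor& result, const at::Tensor& self,
                          int64_t dim, const at::Tensor& index) {
  auto iter = at::TensorIterator();
  iter.dont_compute_common_dtype();  // dtypes already validated by the caller
  iter.dont_resize_outputs();        // output shape already validated
  // Skip shape inference and broadcasting entirely: iterate over index's
  // shape and squash `dim`, which the loop body handles manually. This
  // avoids the as_strided() restride and its ~300-400 ns TensorImpl cost.
  iter.declare_static_shape(index.sizes(), /*squash_dim=*/dim);
  iter.add_output(result);
  iter.add_input(self);
  iter.add_input(index);
  iter.build();
  // ... run the serial/parallel loop over iter here ...
}
```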

Measurements were taken with this Python script, which is driven by this bash script. The general procedure for mitigating environmental skew is to repeatedly switch between an environment built with master and one built with this branch while running the Python script. Within the measurement script, the following was used to reduce variation:

  • Set the number of threads to 1.
  • Aggressively and randomly interleave task measurements, to limit correlation between a task's result and system state (when it was run, or which task preceded it).
  • Include a warmup period, dropping the first three passes through all of the tasks.

Two independent end-to-end runs are included, since there is some variation even with the above measures. Overall measurement error appears to be about +/- 100 ns.

The benchmark also includes several tasks which are not affected by this PR, both to check for a degradation in TensorIterator performance when static shapes are not set (which did happen for an earlier iteration of this optimization) and to estimate measurement variability and validate that measured improvements are significant.
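For readers who want to reproduce a similar setup, here is an illustrative sketch of the interleaved-measurement scheme in C++. This is not the author's script (that was Python); the task set, pass counts, and timing granularity are placeholders:

```cpp
// Illustrative sketch: randomly interleaved benchmarking with warmup.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <numeric>
#include <random>
#include <string>
#include <vector>

int main() {
  // Placeholder tasks; the real benchmark timed gather/scatter/cumsum/etc.
  std::vector<std::pair<std::string, std::function<void()>>> tasks = {
      {"task_a", [] { /* op under test */ }},
      {"task_b", [] { /* op under test */ }},
  };

  constexpr int kPasses = 20;  // full passes over all tasks
  constexpr int kWarmup = 3;   // drop the first three passes
  std::mt19937 rng(0);
  std::vector<std::vector<double>> times(tasks.size());

  for (int pass = 0; pass < kPasses; ++pass) {
    // Randomize task order each pass so a task's timing does not
    // correlate with what ran immediately before it.
    std::vector<size_t> order(tasks.size());
    std::iota(order.begin(), order.end(), size_t{0});
    std::shuffle(order.begin(), order.end(), rng);

    for (size_t i : order) {
      auto start = std::chrono::steady_clock::now();
      tasks[i].second();
      auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::steady_clock::now() - start).count();
      if (pass >= kWarmup) times[i].push_back(static_cast<double>(ns));
    }
  }

  // Report per-task medians.
  for (size_t i = 0; i < tasks.size(); ++i) {
    auto& t = times[i];
    std::nth_element(t.begin(), t.begin() + t.size() / 2, t.end());
    std::printf("%s: median %.0f ns\n", tasks[i].first.c_str(),
                t[t.size() / 2]);
  }
  return 0;
}
```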

First run:

Task                     | Delta (median)|   Master     (25%,  75%)        |  Branch    (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     920      |    4,000     (-170, +230)       |  3,100     (-110, +140)
gather_dim0              |     910      |    4,100     (-170, +230)       |  3,200     (-110, +150)
gather_dim1              |   1,200      |    4,400     (-190, +240)       |  3,200     (-120, +150)
scatter_1D               |   1,100      |    2,800     (-120, +160)       |  1,700     (-64 , +81)
scatter_dim0             |   1,000      |    2,900     (-130, +160)       |  1,900     (-72 , +95)
scatter_dim1             |   1,200      |    3,200     (-130, +170)       |  1,900     (-67 , +87)
scatter_add_1D           |   1,100      |    2,800     (-120, +150)       |  1,700     (-68 , +89)
scatter_add_dim0         |   1,000      |    2,900     (-120, +150)       |  1,900     (-77 , +93)
scatter_add_dim1         |   1,300      |    3,100     (-140, +180)       |  1,900     (-76 , +92)
cumsum_1D                |   1,000      |    4,600     (-200, +260)       |  3,600     (-120, +170)
cumsum_dim0              |     860      |    4,500     (-190, +240)       |  3,700     (-140, +180)
cumsum_dim1              |   1,200      |    4,800     (-210, +260)       |  3,700     (-130, +180)
cumprod_1D               |   1,000      |    4,600     (-200, +270)       |  3,600     (-130, +170)
cumprod_dim0             |     910      |    4,600     (-210, +270)       |  3,700     (-130, +170)
cumprod_dim1             |   1,200      |    4,900     (-220, +290)       |  3,700     (-130, +170)
min_dim0                 |     280      |    5,900     (-220, +270)       |  5,600     (-220, +260)
min_dim1                 |     560      |    6,200     (-230, +310)       |  5,600     (-230, +270)
max_dim0                 |     320      |    5,900     (-220, +280)       |  5,600     (-200, +250)
max_dim1                 |     540      |    6,100     (-250, +310)       |  5,600     (-200, +250)
std       (reference)    |      58      |    4,300     (-180, +280)       |  4,200     (-160, +200)
clamp     (reference)    |      87      |    3,400     (-160, +220)       |  3,400     (-140, +170)
argmin    (reference)    |     -85      |    3,900     (-170, +250)       |  4,000     (-170, +200)
sum       (reference)    |     -11      |    4,200     (-180, +240)       |  4,200     (-160, +190)
x < y     (reference)    |     110      |    3,700     (-170, +290)       |  3,500     (-140, +150)
max(x, y) (reference)    |     170      |    3,600     (-170, +200)       |  3,400     (-140, +180)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.

Second run:

Task                     | Delta (median)|   Master     (25%,  75%)        |  Branch    (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     850      |    3,900     (-130, +150)       |  3,000     (-110, +130)
gather_dim0              |     860      |    4,000     (-140, +150)       |  3,200     (-110, +150)
gather_dim1              |   1,200      |    4,300     (-160, +160)       |  3,200     (-110, +150)
scatter_1D               |   1,100      |    2,700     (-98 , +110)       |  1,700     (-64 , +83)
scatter_dim0             |     950      |    2,800     (-100, +110)       |  1,900     (-67 , +88)
scatter_dim1             |   1,200      |    3,100     (-120, +140)       |  1,900     (-69 , +88)
scatter_add_1D           |   1,100      |    2,700     (-92 , +110)       |  1,700     (-65 , +95)
scatter_add_dim0         |     960      |    2,800     (-100, +100)       |  1,900     (-74 , +100)
scatter_add_dim1         |   1,200      |    3,100     (-100, +130)       |  1,900     (-72 , +100)
cumsum_1D                |     960      |    4,500     (-140, +190)       |  3,600     (-130, +170)
cumsum_dim0              |     820      |    4,500     (-140, +180)       |  3,700     (-130, +170)
cumsum_dim1              |   1,100      |    4,800     (-160, +200)       |  3,600     (-120, +170)
cumprod_1D               |     960      |    4,500     (-130, +190)       |  3,600     (-130, +180)
cumprod_dim0             |     820      |    4,500     (-150, +190)       |  3,700     (-130, +180)
cumprod_dim1             |   1,100      |    4,800     (-150, +220)       |  3,700     (-130, +180)
min_dim0                 |     260      |    5,800     (-210, +250)       |  5,500     (-200, +230)
min_dim1                 |     580      |    6,100     (-230, +270)       |  5,500     (-200, +220)
max_dim0                 |     250      |    5,800     (-210, +230)       |  5,600     (-170, +210)
max_dim1                 |     520      |    6,100     (-220, +240)       |  5,600     (-180, +210)
std       (reference)    |     170      |    4,300     (-210, +220)       |  4,100     (-160, +190)
clamp     (reference)    |     140      |    3,400     (-140, +170)       |  3,300     (-120, +170)
argmin    (reference)    |     -51      |    3,800     (-170, +190)       |  3,900     (-140, +160)
sum       (reference)    |     -58      |    4,100     (-160, +170)       |  4,200     (-170, +190)
x < y     (reference)    |      64      |    3,600     (-150, +210)       |  3,500     (-140, +180)
max(x, y) (reference)    |     120      |    3,500     (-130, +150)       |  3,400     (-130, +150)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.

CC @ilia-cher @VitalyFedyunin @glaringlee @gdankel

@robieta robieta requested a review from ngimel April 18, 2020 19:08

dr-ci bot commented Apr 18, 2020

💊 Build failures summary and remediations

As of commit 9405b4f (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.


robieta commented Apr 19, 2020

I believe the CI failures are unrelated. clang-format is complaining about test/cpp/tensorexpr/test_cuda.cpp, and I vaguely recall seeing somewhere that rebasing will fix the XLA failure. I'll rebase later and see.

I did a very long run (2 hours, 200 interleaved env switches) with the following result:

Task                     | Delta (median)|   Master     (25%,  75%)        |  Branch    (25%,  75%)
---------------------------------------------------------------------------------------------------------
gather_1D                |     840      |    3,900     (-150, +180)       |  3,100     (-120, +160)
gather_dim0              |     860      |    4,100     (-150, +190)       |  3,200     (-130, +170)
gather_dim1              |   1,100      |    4,300     (-160, +190)       |  3,200     (-130, +170)
scatter_1D               |   1,100      |    2,800     (-110, +140)       |  1,700     (-70 , +100)
scatter_dim0             |   1,000      |    2,900     (-110, +150)       |  1,900     (-75 , +110)
scatter_dim1             |   1,300      |    3,100     (-120, +160)       |  1,900     (-75 , +110)
scatter_add_1D           |   1,100      |    2,700     (-110, +130)       |  1,600     (-70 , +110)
scatter_add_dim0         |   1,000      |    2,900     (-110, +140)       |  1,900     (-76 , +110)
scatter_add_dim1         |   1,200      |    3,100     (-120, +150)       |  1,800     (-76 , +110)
cumsum_1D                |     880      |    4,500     (-200, +240)       |  3,600     (-140, +190)
cumsum_dim0              |     820      |    4,500     (-220, +260)       |  3,700     (-150, +190)
cumsum_dim1              |   1,100      |    4,800     (-230, +290)       |  3,700     (-150, +190)
cumprod_1D               |     880      |    4,500     (-190, +240)       |  3,600     (-140, +180)
cumprod_dim0             |     770      |    4,500     (-210, +260)       |  3,800     (-150, +190)
cumprod_dim1             |   1,000      |    4,800     (-220, +290)       |  3,800     (-150, +190)
min_dim0                 |     140      |    5,900     (-220, +270)       |  5,700     (-230, +270)
min_dim1                 |     430      |    6,100     (-240, +290)       |  5,700     (-230, +280)
max_dim0                 |     270      |    5,900     (-220, +280)       |  5,600     (-220, +280)
max_dim1                 |     520      |    6,200     (-240, +310)       |  5,600     (-220, +280)
std       (reference)    |     170      |    4,300     (-170, +220)       |  4,100     (-170, +220)
clamp     (reference)    |      65      |    3,400     (-140, +170)       |  3,300     (-140, +190)
argmin    (reference)    |     -24      |    3,800     (-150, +200)       |  3,800     (-150, +200)
sum       (reference)    |      54      |    4,100     (-180, +220)       |  4,000     (-160, +210)
x < y     (reference)    |      93      |    3,600     (-150, +190)       |  3,500     (-150, +190)
max(x, y) (reference)    |     160      |    3,500     (-140, +180)       |  3,400     (-130, +170)

* Times in nanoseconds
**Deltas: positive is improvement, negative is regression.

Master was built from the commit that this PR branched from, so meaningful differences on the reference tasks can't be chalked up to other diffs in the code. Looking at the distributions for torch.max(x, y), this is not a statistical aberration; the improvement is real despite the fact that this version of max uses TensorIterator::reduce_op rather than compare_base_kernel.

[Screenshot (2020-04-18): timing distributions for torch.max(x, y) on master vs. this branch]

This holds even when I run only max(x, y) (reference), even though it doesn't go through the shortcut. All in all, rather curious.

@robieta
Copy link
Author

robieta commented Apr 20, 2020

Ugh. Bad rebase. Sorry all.

@robieta
Copy link
Author

robieta commented Apr 20, 2020

Alright! Fast-forwarding resolved the CI failures.

@ngimel ngimel left a comment (Collaborator)

Looks good!


ngimel commented Apr 22, 2020

Interesting what's happening with max(x,y) and why it changes, but it's all good.

@facebook-github-bot facebook-github-bot left a comment (Contributor)

@robieta has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@robieta merged this pull request in 28fadfc.

@robieta robieta deleted the gh/taylorrobie/unsafe_restride branch April 22, 2020 16:38
facebook-github-bot pushed a commit that referenced this pull request Apr 23, 2020
Summary:
Fixes a safety issue (nonsense values and segfaults) introduced by #36875 when in-place gather tries to use incorrect shapes.

Consider the following block of code:
```python
k0 = 8
k1 = 8
m = 100

x = torch.rand((k0, k1))
ind = torch.randint(0, k0, (m, k1))
output = torch.empty((m, k1))

print(torch.gather(x, 0, ind, out=output))
print(torch.gather(x, 1, ind, out=output))
```

The first gather is legal; the second is not (`ind` and `output` would need to be transposed). Previously this was caught when the kernel tried to restride the inputs for TensorIterator, but we can no longer rely on those checks and must test shapes explicitly. If `m` is small, the second gather returns gibberish; if it is large enough to push the read out of the memory block, the program segfaults.
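As an illustration of the kind of explicit check the fix adds, here is a hypothetical standalone helper (not the exact code from #37102):

```cpp
#include <ATen/ATen.h>

// Hypothetical version of the shape check this fix introduces. With the
// implicit restride (and its error checking) gone, the kernel must
// validate shapes itself before handing tensors to TensorIterator.
void gather_out_shape_check(const at::Tensor& self, int64_t dim,
                            const at::Tensor& index,
                            const at::Tensor& result) {
  // With out=, the result must already have index's shape, since
  // resize_outputs is disabled on the fast path.
  TORCH_CHECK(result.sizes().equals(index.sizes()),
              "gather(out=): out must have the same shape as index");
  // Along every dimension except `dim`, index may not read past self.
  for (int64_t d = 0; d < self.dim(); ++d) {
    if (d == dim) continue;
    TORCH_CHECK(index.size(d) <= self.size(d),
                "gather(): index.size(", d, ")=", index.size(d),
                " exceeds self.size(", d, ")=", self.size(d));
  }
}
```

In the example above, such a check passes for dim=0 (`ind.size(1)` = 8 <= `x.size(1)` = 8) and fails for dim=1 (`ind.size(0)` = 100 > `x.size(0)` = 8), which is exactly the call that previously produced gibberish or a segfault.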
Pull Request resolved: #37102

Differential Revision: D21190580

Pulled By: robieta

fbshipit-source-id: 80175620d24ad3380d78995f7ec7dbf2627d2998