Target 8192 blocks instead of split to large grid for large reduction #35997
Conversation
When the number of blocks is large enough, we are already achieving balanced SM allocation, but we should still keep the number of inputs per thread large, because thread reduce is cheap. Benchmark for Half on V100: https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb On a large tensor, it is 1.37ms vs 1.25ms.
LGTM. Please update the comment in code with your commit description.
aten/src/ATen/native/cuda/Reduce.cuh (outdated)

@@ -789,15 +789,23 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id
     config.output_mult[1] = config.split_output(block_height);
   }

   if (config.input_mult[1] != 0 && config.values_per_thread() >= 256 && num_outputs <= 4096) {
     constexpr int target_grid_size = 4096;
This generally looks good, but you can probably make it even less than 4096? You should be targeting full occupancy, which will come out to less than 4096.
…pytorch#35997)

Summary:
Pull Request resolved: pytorch#35997

When the number of blocks is large enough, we are already achieving balanced SM allocation, but we should still keep the number of inputs per thread large, because thread reduce is cheap.

Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb

On a large tensor, it is 1.37ms vs 1.25ms.

Test Plan: Imported from OSS

Differential Revision: D20927533

Pulled By: ngimel

fbshipit-source-id: 40df52e439cc1c01cda66c6195b600f301c5e984
Stack from ghstack:
When the number of blocks is large enough, we are already achieving
balanced SM allocation, but we should still keep the number of inputs
per thread large, because thread reduce is cheap.
Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb
On a large tensor, it is 1.37ms vs 1.25ms.
Differential Revision: D20927533