Target 8192 blocks instead of split to large grid for large reduction #35997

Closed
wants to merge 5 commits into from


zasdfgbnm
Collaborator

@zasdfgbnm zasdfgbnm commented Apr 3, 2020

Stack from ghstack:

When the number of blocks is large enough, we are already achieving
balanced SM allocation. But we should still keep the number of inputs
per thread large, because thread reduce is cheap.
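
One way to picture this trade-off is the sketch below. It is not the code from this PR: `choose_split`, `SplitChoice`, `div_up`, and the `min_values_per_thread` floor are hypothetical names chosen for illustration. It assumes each output's reduction can be split across several blocks, and it stops splitting as soon as the grid reaches a target size, so that each thread keeps as many inputs as possible.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helpers for illustration; these are not PyTorch APIs.
constexpr int64_t div_up(int64_t a, int64_t b) { return (a + b - 1) / b; }

struct SplitChoice {
  int64_t blocks_per_output;  // how many blocks share one output's reduction
  int64_t values_per_thread;  // inputs each thread reduces after the split
};

// values_per_thread: inputs per thread with no input splitting.
// grid_blocks: blocks the launch would use with no input splitting.
// target_grid_size: block count assumed to be enough to balance the SMs.
// min_values_per_thread: assumed floor on per-thread work.
inline SplitChoice choose_split(int64_t values_per_thread, int64_t grid_blocks,
                                int64_t target_grid_size,
                                int64_t min_values_per_thread) {
  // Split factor that would just reach the target grid size.
  const int64_t by_grid = div_up(target_grid_size, grid_blocks);
  // Split factor that would shrink per-thread work down to the floor.
  const int64_t by_work = div_up(values_per_thread, min_values_per_thread);
  // Take the smaller one: once the grid is large enough, stop splitting
  // so that the cheap per-thread reduction does most of the work.
  const int64_t split = std::max<int64_t>(1, std::min(by_grid, by_work));
  return {split, div_up(values_per_thread, split)};
}
```

For example, with `grid_blocks = 64`, `target_grid_size = 8192`, and a floor of 16, a thread that would have reduced 1048576 inputs is split 128 ways and still reduces 8192 inputs, rather than being cut all the way down to the floor.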

Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb

On large tensor, it is: 1.37ms vs 1.25ms

Differential Revision: D20927533

@dr-ci

dr-ci bot commented Apr 3, 2020

💊 CircleCI build failures summary and remediations

As of commit 974fe24 (more details on the Dr. CI page):



1 failure not recognized by patterns:

Job Step Action
CircleCI pytorch_ios_11_2_1_x86_64_build Run Simulator Tests 🔁 rerun

❄️ 5 tentatively flaky failures

5 failures tentatively classified as flaky but have not triggered reruns to confirm:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/5)

Step: "Set Up CI Environment After attach_workspace" (full log | pattern match details | 🔁 rerun) ❄️

E: Failed to fetch http://ppa.launchpad.net/openjdk-r/ppa/ubuntu/dists/xenial/main/binary-amd64/Packages Unable to connect to ppa.launchpad.net:http:
Reading package lists... 99%  Reading package lists... Done  
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://cli-assets.heroku.com/apt ./ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 5DC22404A6F9F1CA 
W: The repository 'http://ppa.launchpad.net/git-core/ppa/ubuntu xenial Release' does not have a Release file. 
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use. 
N: See apt-secure(8) manpage for repository creation and user configuration details. 
W: The repository 'http://ppa.launchpad.net/openjdk-r/ppa/ubuntu xenial Release' does not have a Release file. 
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use. 
N: See apt-secure(8) manpage for repository creation and user configuration details. 
W: Failed to fetch https://cli-assets.heroku.com/apt/./InRelease  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 5DC22404A6F9F1CA 
E: Failed to fetch http://ppa.launchpad.net/git-core/ppa/ubuntu/dists/xenial/main/binary-amd64/Packages  Unable to connect to ppa.launchpad.net:http: 
E: Failed to fetch http://ppa.launchpad.net/openjdk-r/ppa/ubuntu/dists/xenial/main/binary-amd64/Packages  Unable to connect to ppa.launchpad.net:http: 
W: Some index files failed to download. They have been ignored, or old ones used instead. 

See CircleCI build pytorch_cpp_doc_push (2/5)

Step: "Set Up CI Environment After attach_workspace" (full log | pattern match details | 🔁 rerun) ❄️


See CircleCI build pytorch_linux_backward_compatibility_check_test (3/5)

Step: "Set Up CI Environment After attach_workspace" (full log | pattern match details | 🔁 rerun) ❄️


See CircleCI build pytorch_python_doc_push (4/5)

Step: "Set Up CI Environment After attach_workspace" (full log | pattern match details | 🔁 rerun) ❄️


See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_legacy_test (5/5)

Step: "Set Up CI Environment After attach_workspace" (full log | pattern match details | 🔁 rerun) ❄️



Collaborator

@jjsjann123 jjsjann123 left a comment


LGTM. Please update the comment in code with your commit description.

@zasdfgbnm zasdfgbnm changed the title Target 4096 blocks instead of split to large grid for large reduction Target 8192 blocks instead of split to large grid for large reduction Apr 9, 2020
@@ -789,15 +789,23 @@ inline void gpu_reduce_kernel(TensorIterator& iter, const ops_t& ops, ident_t id
config.output_mult[1] = config.split_output(block_height);
}

if (config.input_mult[1] != 0 && config.values_per_thread() >= 256 && num_outputs <= 4096) {
constexpr int target_grid_size = 4096;
Collaborator


This generally looks good, but probably you can make it even less than 4096? You should be targeting full occupancy, which will come out to less than 4096?
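
For a rough sense of the number this comment points at, here is a back-of-the-envelope sketch. `DeviceLimits` and `full_occupancy_grid` are hypothetical names, and the 512-thread block size used in the example is an assumption, not something stated in this thread. The estimate ignores register and shared-memory limits, so it is only an upper bound on the number of resident blocks. The V100 figures (80 SMs, 2048 resident threads per SM) are the published device limits; in CUDA they would be read from the `multiProcessorCount` and `maxThreadsPerMultiProcessor` fields of `cudaDeviceProp`.

```cpp
// Hypothetical container for the two device limits used below.
struct DeviceLimits {
  int sm_count;            // number of streaming multiprocessors
  int max_threads_per_sm;  // maximum resident threads per SM
};

// Upper bound on the number of blocks that can be resident at once,
// considering only the thread limit (not registers or shared memory).
constexpr int full_occupancy_grid(DeviceLimits dev, int threads_per_block) {
  return dev.sm_count * (dev.max_threads_per_sm / threads_per_block);
}
```

Under these assumptions, `full_occupancy_grid(DeviceLimits{80, 2048}, 512)` evaluates to 320 and a 256-thread block gives 640, both well under 4096.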

@facebook-github-bot
Contributor

@ngimel merged this pull request in d9227bb.

@zasdfgbnm zasdfgbnm deleted the gh/zasdfgbnm/33/head branch April 10, 2020 10:18
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
…pytorch#35997)

Summary:
Pull Request resolved: pytorch#35997

When the number of blocks is large enough, we are already achieving
balanced SM allocation. But we should still keep the number of inputs
per thread large, because thread reduce is cheap.

Benchmark for Half on V100:
https://github.com/zasdfgbnm/things/blob/master/2020Q2/reduction-benchmark.ipynb

On large tensor, it is: 1.37ms vs 1.25ms

Test Plan: Imported from OSS

Differential Revision: D20927533

Pulled By: ngimel

fbshipit-source-id: 40df52e439cc1c01cda66c6195b600f301c5e984