
Prevent unbounded growth of sparse tensor in add operation #36030

Closed
peterbell10 wants to merge 3 commits

Conversation

peterbell10
Collaborator

Fixes #34964

Sparse CUDA add was implemented by just concatenating the indices and values of the two tensors. If called repeatedly in a tight loop, this lets nnz grow without bound; in the worst case of x.add_(x) it grows exponentially.
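A rough reproduction sketch (not part of the PR; the tensor size and loop count are illustrative, and a CUDA build is assumed) that makes the growth visible by watching _nnz():

```python
import torch

# Small sparse COO tensor on the GPU (CUDA build assumed).
indices = torch.tensor([[0, 1, 1], [2, 0, 2]], device="cuda")
values = torch.tensor([3.0, 4.0, 5.0], device="cuda")
x = torch.sparse_coo_tensor(indices, values, (2, 3))

# Each add concatenates indices/values, so the stored entry count
# roughly doubles per iteration (x.add_(x) is the in-place worst case
# described above) even though the tensor only has three distinct
# nonzero locations.
for step in range(5):
    x = x + x
    print(step, x._nnz())

# Coalescing sums duplicate entries and restores nnz to 3.
print(x.coalesce()._nnz())
```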

@dr-ci

dr-ci bot commented Apr 4, 2020

💊 Build failures summary and remediations

As of commit cfa142e (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages (reran 1 job to discount flakiness):

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (1/2)

Step: "Test" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

Apr 30 18:58:23 [E request_callback_impl.cpp:99] Received error while processing request type 15: size mismatch, m1: [3 x 3], m2: [6 x 6] at /var/lib/jenkins/workspace/aten/src/TH/generic/THTensorMath.cpp:41
Apr 30 18:58:20   test_debug_info (__main__.DistAutogradTestWithSpawn) ... skip (0.004s) 
Apr 30 18:58:21   test_dist_autograd_profiling (__main__.DistAutogradTestWithSpawn) ... ok (1.124s) 
Apr 30 18:58:22   test_embedding_bag_with_no_grad_tensors (__main__.DistAutogradTestWithSpawn) ... [W pybind_utils.h:712] Warning: Using sparse tensors in TorchScript is experimental. Many optimization pathways have not been thoroughly tested with sparse tensors. Please include the fact that the network is running sparse tensors in any bug reports submitted. (function operator()) 
Apr 30 18:58:22 ok (1.323s) 
Apr 30 18:58:23   test_error_in_context (__main__.DistAutogradTestWithSpawn) ... [E request_callback_impl.cpp:99] Received error while processing request type 15: size mismatch, m1: [3 x 3], m2: [6 x 6] at /var/lib/jenkins/workspace/aten/src/TH/generic/THTensorMath.cpp:41 
Apr 30 18:58:23 ok (1.122s) 
Apr 30 18:58:24   test_grad_copy_sparse_indices_extra_ref (__main__.DistAutogradTestWithSpawn) ... [W pybind_utils.h:712] Warning: Using sparse tensors in TorchScript is experimental. Many optimization pathways have not been thoroughly tested with sparse tensors. Please include the fact that the network is running sparse tensors in any bug reports submitted. (function operator()) 
Apr 30 18:58:24 /opt/conda/lib/python3.6/site-packages/torch/nn/functional.py:1850: UserWarning: Argument order of nn.functional.embedding_bag was changed. Usage `embedding_bag(weight, input, ...)` is deprecated, and should now be `embedding_bag(input, weight, ...)`. 
Apr 30 18:58:24   warnings.warn("Argument order of nn.functional.embedding_bag was changed. " 

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_custom_build_static (2/2)

Step: "Build" (full log | pattern match details | 🔁 rerun) <confirmed not flaky by 2 failures>

error pulling image configuration: Get https://prod-us-east-1-starport-layer-bucket.s3.us-east-1.amazonaws.com/307d-308535385114-c0b158d2-8c22-ad64-0178-fb03bd1a4b33/a0352ba9-620b-4257-a272-ddd51dadf83c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200430T180323Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3599&X-Amz-Credential=AKIAI7KZ4NTCV2EWBNUQ%2F20200430%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=9f404b1baa536bf5b9fd85ac8dd9f08d9f95ae8b65bf3c921e47ddbdbb3fbabe: EOF
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan:8fcf46ef-4a34-480b-a8ee-b0a30a4d3e59 

This comment was automatically generated by Dr. CI.

@zou3519 added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Apr 6, 2020
zou3519
zou3519 previously approved these changes Apr 6, 2020
Contributor

seems like a reasonable heuristic. Does CPU sparse tensor add have this problem too?

@peterbell10
Collaborator Author

I wasn't getting the same growth with CPU sparse tensors. Looking at the code though, add_out_sparse_non_contiguous also just concatenates the values,

LongTensor r_indices = at::cat({t._indices(), src._indices()}, 1);
Tensor r_values = at::cat({t_values, s_values}, 0).to(r.scalar_type());
alias_into_sparse(r, r_indices, r_values);

However, it's only triggered for sparse tensors with non-contiguous indices or value tensors. The CPU add for contiguous index & value tensors seems to do full addition.
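A rough Python-level check of that distinction (illustrative only; the exact nnz values follow from the behaviour described above, and the CUDA branch is skipped if no GPU is available):

```python
import torch

i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([1.0, 2.0])
a = torch.sparse_coo_tensor(i, v, (2, 2)).coalesce()

# CPU add of coalesced (contiguous) inputs performs a real addition,
# so the result keeps 2 stored entries.
print((a + a)._nnz())

# The CUDA kernel concatenates instead, leaving duplicated indices (4 entries).
if torch.cuda.is_available():
    b = a.cuda()
    print((b + b)._nnz())
```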

@facebook-github-bot
Contributor

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Apr 6, 2020

Oh, it's the uncoalesced addition situation. I'm not fundamentally opposed to some sort of heuristic here, but I want to explore a few other options first. It sounds like part of the problem is that CUDA and CPU don't have equivalent coalescing behavior. Do you think it would be difficult for CUDA to be made to behave the same way as CPU? (I could believe this is hard, due to CUDA's model, but it would be helpful if you could confirm.)

@ezyang
Contributor

ezyang commented Apr 6, 2020

I looked over the heuristic and I think it's pretty good.

@zou3519 zou3519 dismissed their stale review April 6, 2020 20:24

Dismissing my review based on @ezyang's request to explore other options first

@peterbell10
Collaborator Author

The CPU implementation does a variation of merging two sorted lists. A similar thing could be done for coalesced inputs in CUDA using thrust::merge_by_key and then coalescing adjacent values. However, this needs multiple passes to do the final coalescing step. I had started implementing this but noticed this comment:

We deliberately choose to simply concat the indices and values tensors rather than merging them. This removes the need to synchronously fetch nnz at the end of the operation, at the cost of having a non-coalesced result. This trade-off is preferable for the common use-case of gradient accumulation.

One thought could be to concat for small nnz, where synchronisation would be more costly than the kernel itself, but do a full merge for large nnz, avoiding the memory growth issue. Although, that's not really much different from this PR, except that it does a merge_by_key instead of a sort_by_key.
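A rough Python-level sketch of that kind of threshold heuristic, for illustration only (the actual PR operates in the ATen CUDA kernels, and the dense-element-count cutoff below is an assumption rather than the PR's exact threshold):

```python
import math
import torch

def add_with_bounded_nnz(t, s):
    # Concatenate indices/values, mirroring the uncoalesced CUDA add.
    indices = torch.cat([t._indices(), s._indices()], dim=1)
    values = torch.cat([t._values(), s._values()], dim=0)
    r = torch.sparse_coo_tensor(indices, values, t.shape)

    # Assumed heuristic: once the uncoalesced result stores more entries
    # than the dense tensor would have elements, pay for a coalesce so
    # nnz cannot keep growing across repeated accumulations.
    if r._nnz() > math.prod(t.shape):
        r = r.coalesce()
    return r
```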

@peterbell10
Collaborator Author

peterbell10 commented Apr 30, 2020

@ezyang, @zou3519 is there anything left to do here?

@zou3519
Contributor

zou3519 commented Apr 30, 2020

Sorry, landing this fell through the cracks. Could you rebase the PR, @peterbell10, just to get test signal again, and then I'll land this once the tests look fine?

@facebook-github-bot
Contributor

@zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zou3519
Contributor

zou3519 commented May 1, 2020

Test failures look unrelated

@facebook-github-bot
Contributor

@zou3519 merged this pull request in 675b3fc.

Labels
Merged, open source, triaged

Successfully merging this pull request may close these issues.

Gradient update of a sparse matrix results in a memory leak
5 participants