Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add torch.logcumsumexp #36308

Closed
wants to merge 39 commits into from
Closed

Conversation

kshitij12345
Copy link
Collaborator

@kshitij12345 kshitij12345 commented Apr 9, 2020

Creating new PR as I am unable to push to @pandeykartikey 's branch as I don't have the permissions.

Closes #26411

Based on #32876 Thanks @pandeykartikey for starting this out.

Have addressed the comments.

@anjali411 @agadetsky @albanD

@dr-ci
Copy link

dr-ci bot commented Apr 9, 2020

💊 CI failures summary and remediations

As of commit c0ee7ab (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_libtorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Set Up CI Environment After attach_workspace" (full log | diagnosis details | 🔁 rerun) ❄️

E: Failed to fetch https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64/Packages 404 Not Found
                            100% [Working]                Fetched 5,129 kB in 3s (1,652 kB/s) 
Reading package lists... 99%  Reading package lists... Done  
W: The repository 'https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64  Release' does not have a Release file. 
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use. 
N: See apt-secure(8) manpage for repository creation and user configuration details. 
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://cli-assets.heroku.com/apt ./ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 5DC22404A6F9F1CA 
W: The repository 'https://packagecloud.io/circleci/trusty/ubuntu xenial Release' does not have a Release file. 
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use. 
N: See apt-secure(8) manpage for repository creation and user configuration details. 
W: Failed to fetch https://cli-assets.heroku.com/apt/./InRelease  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 5DC22404A6F9F1CA 
E: Failed to fetch https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64/Packages  404  Not Found 
W: Some index files failed to download. They have been ignored, or old ones used instead. 

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 100 times.

log-sum-exp trick doesn't seem to be working.
The gradient check doesn't pass with log-sum-exp.
@zhangguanheng66 zhangguanheng66 requested a review from albanD April 9, 2020 21:25
@zhangguanheng66 zhangguanheng66 added module: internals Related to internal abstractions in c10 and ATen triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Apr 9, 2020
@tridao
Copy link
Contributor

tridao commented Apr 10, 2020

I don't think the current implementation is numerically stable. The original issue #26411 suggests using cummax to reduce numerical error, but that has quadratic complexity (see discussion in #32876).
I still think it's better to implement it the way Tensorflow does it: as a prefix scan of the binary operation log_add_exp.

@kshitij12345
Copy link
Collaborator Author

@anjali411 @albanD Could you please review the PR. Also I don't have the rights to re-run the failed pipeline (whose failure is unrelated to the PR).

@tridao I get the point . But I am not sure that I am quite familiar with the codebase. I saw the reference for cumsum https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp
and https://github.com/pytorch/pytorch/blob/master/aten/src/THC/generic/THCTensorMathScan.cu ( CUDA Implementation which I don't think should be used as reference).
Would be great if I can get some pointers.

@tridao
Copy link
Contributor

tridao commented Apr 10, 2020

Agreed, those seem to be where cumsum is implemented.
I imagine one just needs to replace the addition (x + y) of cumsum with the operation log_add_exp(x, y)(implemented in a stable way with max and min, e.g. https://www.tensorflow.org/api_docs/python/tf/math/cumulative_logsumexp).

Re: CUDA implementation in THC: maybe cumsum will eventually be ported from THC to Aten? I also found cummax implementation in Aten (https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ReduceOpsKernel.cu). One would replace max(x, y) with log_add_exp. The cummax implementation also needs the indices but I don't think logcumsumexp will need those.

* Add TODO about code duplication.
* Fix a comment.
@kshitij12345
Copy link
Collaborator Author

@ngimel @albanD Gentle ping :).

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
The perf for the backward might not be great but it is better than nothing.
Feel free to open a new issue if you think the backward should be implemented with a special kernel to make it more efficient. But I think this should be left for a future PR anyway.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kshitij12345
Copy link
Collaborator Author

@albanD
Would be great if this lands soon into master as it would potentially unblock #36458 .
Thank You.

@albanD
Copy link
Collaborator

albanD commented May 20, 2020

The landing is in progress. But there are some flaky internal tests so I had to re-run these...

@facebook-github-bot
Copy link
Contributor

@albanD merged this pull request in 3487744.

@agadetsky
Copy link

@kshitij12345 thank you very much, great work!

@agadetsky
Copy link

@kshitij12345 seems like documentation is a bit broken. At least computation formula does not appear at the moment at the master doc (https://pytorch.org/docs/master/generated/torch.logcumsumexp.html#torch.logcumsumexp)

@kshitij12345 kshitij12345 deleted the add-logcumsumexp branch May 21, 2020 18:58
@kshitij12345
Copy link
Collaborator Author

@agadetsky Thanks for notifying. Will try to get it fixed.

facebook-github-bot pushed a commit that referenced this pull request May 26, 2020
Summary:
References: #24521 #24522 #24547 #24548 #24507

Depends on #36308

Changes related to this PR are only in file :
aten/src/ATen/Declarations.cwrap
aten/src/ATen/native/cuda/ReduceOpsKernel.cu
aten/src/ATen/native/native_functions.yaml
aten/src/THC/generic/THCTensorMathScan.cu
aten/src/THC/generic/THCTensorMathScan.h

Please Review VitalyFedyunin

Thanks.
Pull Request resolved: #36458

Differential Revision: D21718384

Pulled By: ngimel

fbshipit-source-id: 5af15164050c77be164397abd659a48c9ded2b29
facebook-github-bot pushed a commit that referenced this pull request May 26, 2020
Summary:
Reference : #36308 (comment)

After fix:

![Screenshot from 2020-05-23 15-35-09](https://user-images.githubusercontent.com/19503980/82727956-4bcabb80-9d0b-11ea-85a8-81b35012abbc.png)
Pull Request resolved: #38952

Differential Revision: D21722196

Pulled By: ezyang

fbshipit-source-id: 62b08c14e0ce9603133841940627df40d7b1e861
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged module: internals Related to internal abstractions in c10 and ATen open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LogCumsumExp