
Added check to verify xla device is TPU #3274

Merged: 23 commits into Lightning-AI:master from TPU_device_check, Oct 6, 2020

Conversation

lezwon
Contributor

@lezwon lezwon commented Aug 30, 2020

What does this PR do?

Fixes #3104

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If your PR wasn't discussed in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team August 30, 2020 19:00
@lezwon lezwon marked this pull request as draft August 30, 2020 19:01
@codecov

codecov bot commented Aug 30, 2020

Codecov Report

Merging #3274 into master will increase coverage by 4%.
The diff coverage is 59%.

@@           Coverage Diff           @@
##           master   #3274    +/-   ##
=======================================
+ Coverage      84%     87%    +4%     
=======================================
  Files         119     118     -1     
  Lines        9764    9184   -580     
=======================================
- Hits         8169    8013   -156     
+ Misses       1595    1171   -424     

@Borda Borda added the bug (Something isn't working) and accelerator: tpu (Tensor Processing Unit) labels Aug 30, 2020
tests/models/test_tpu.py (outdated, resolved)
@pep8speaks

pep8speaks commented Aug 31, 2020

Hello @lezwon! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-05 17:56:08 UTC

@lezwon lezwon force-pushed the TPU_device_check branch 4 times, most recently from a4fa477 to f315261 Compare September 1, 2020 13:32
@lezwon
Contributor Author

lezwon commented Sep 1, 2020

@Borda @justusschock @awaelchli I need some help fixing the Windows tests. It seems to have something to do with pickling.

@awaelchli
Contributor

@lezwon I think this is a common subprocess issue: when the subprocess is launched, it has to import everything before it runs, so we need to guard the call from the import. That is, you would have to wrap the TPU_AVAILABLE constant in an if __name__ == "__main__" block, remove the decorator, and apply it directly to the call. This will work:

def tpu_device_exists():
    if xm is not None:
        device = xm.xla_device()
        device_type = fetch_xla_device_type(device)
        return device_type == "TPU"
    else:
        return False


if __name__ == "__main__":
    TPU_AVAILABLE = pl_multi_process(tpu_device_exists)()
    print(TPU_AVAILABLE)

but it's obviously not how we want it.

My suggestion is to move towards a functional check, like I propose in #2877. Then we shouldn't get the pickle error, since the value is only computed at runtime.
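
For illustration, a functional check along these lines only touches the XLA device when it is actually called, so nothing TPU-related has to be evaluated (or pickled) at import time on Windows. This is a minimal sketch, assuming the usual conditional torch_xla import pattern; the helper names are illustrative and not necessarily the merged implementation:

try:
    import torch_xla.core.xla_model as xm
    XLA_AVAILABLE = True
except ImportError:
    xm = None
    XLA_AVAILABLE = False


def fetch_xla_device_type(device) -> str:
    # xm.xla_device_hw reports the hardware backing an xla device: "CPU", "GPU" or "TPU".
    if xm is not None:
        return xm.xla_device_hw(device)


def tpu_device_exists() -> bool:
    # Evaluated only when called, so importing this module never touches the XLA runtime.
    if not XLA_AVAILABLE:
        return False
    device = xm.xla_device()
    return fetch_xla_device_type(device) == "TPU"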

@lezwon lezwon force-pushed the TPU_device_check branch 3 times, most recently from dabcf7f to 17867e2 Compare September 6, 2020 02:29
@edenlightning
Contributor

Hey @lezwon, mind taking a look at this again? We would love to have this issue resolved.

@lezwon
Contributor Author

lezwon commented Sep 16, 2020

@edenafek I'll continue on it by the weekend.
Might need some help from @awaelchli on this.

@Borda
Member

Borda commented Sep 16, 2020

@lezwon mind resolving the conflicts so we know what the actual state is...

@lezwon lezwon force-pushed the TPU_device_check branch 2 times, most recently from ef11533 to 8460e0d Compare September 19, 2020 06:55
@lezwon lezwon marked this pull request as ready for review September 19, 2020 09:05
@mergify mergify bot requested a review from a team September 19, 2020 09:06
@awaelchli
Contributor

@lezwon do you still need help with this? Is there a problem with Windows?

@lezwon
Contributor Author

lezwon commented Sep 19, 2020

@lezwon do you still need help with this? Is there a problem with Windows?

Hey, I got it working with a minor workaround. Not sure about the cause of that issue though :)

@awaelchli awaelchli self-requested a review September 19, 2020 14:54
@awaelchli
Contributor

awaelchli commented Sep 19, 2020

Is it ready for review, @lezwon?

@lezwon
Contributor Author

lezwon commented Sep 19, 2020

Is it ready for review, @lezwon?

Yep, it is.

tests/utilities/test_xla_device_utils.py (outdated, resolved)
pytorch_lightning/utilities/xla_device_utils.py (outdated, resolved)
queue.put(None)


def pl_multi_process(func):
Contributor

There is a similar, if not identical, function in tests/base/develop_utils.py.
Do we need both? I cannot see an obvious difference besides inner_f being defined outside.

Contributor Author

I basically used the other function as a reference. The one in develop_utils is just for tests, right?

Member

Yeah, but couldn't you delete the one from tests and then import it from here?

Contributor Author

pl_multi_process_test varies a bit compared to this function. It has an assert statement within and returns either 1 or -1 for the test. This one is meant to return the device type or None.
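
To make the distinction concrete, here is a rough sketch of the kind of multiprocess wrapper being discussed: the wrapped probe runs in a separate process with a bounded join, and its result (or None on timeout/failure) comes back through a queue. The sketch keeps inner_f at module level so it stays picklable under Windows spawn; names and timeout are illustrative, not necessarily the exact merged code.

import functools
import multiprocessing
import queue


def inner_f(result_queue, func, *args, **kwargs):
    # Run the wrapped probe and report its result (or None on failure) via the queue.
    try:
        result_queue.put(func(*args, **kwargs))
    except Exception:
        result_queue.put(None)


def pl_multi_process(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result_queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=inner_f, args=(result_queue, func, *args), kwargs=kwargs)
        proc.start()
        proc.join(20)  # bound how long the device probe may block
        try:
            return result_queue.get_nowait()
        except queue.Empty:
            return None

    return wrapper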

@lezwon
Contributor Author

lezwon commented Sep 30, 2020

@lezwon I have rebased it, mind check if it is correct... :]

Thank you :) It looks good to me 👍

@mergify
Contributor

mergify bot commented Sep 30, 2020

This pull request is now in conflict... :(

@mergify
Contributor

mergify bot commented Oct 1, 2020

This pull request is now in conflict... :(

@Borda Borda force-pushed the TPU_device_check branch 2 times, most recently from 0a1a435 to 3a71c87 Compare October 2, 2020 10:58
@mergify mergify bot requested a review from a team October 2, 2020 11:44
@mergify mergify bot requested a review from a team October 2, 2020 11:48
Contributor

@rohitgr7 rohitgr7 left a comment

LGTM

@mergify mergify bot requested a review from a team October 2, 2020 14:40
@mergify
Contributor

mergify bot commented Oct 4, 2020

This pull request is now in conflict... :(

# Conflicts:
#	CHANGELOG.md
#	pytorch_lightning/accelerators/tpu_backend.py
#	pytorch_lightning/trainer/data_loading.py
#	tests/models/test_tpu.py
@mergify
Contributor

mergify bot commented Oct 5, 2020

This pull request is now in conflict... :(

@Borda Borda merged commit 69833da into Lightning-AI:master Oct 6, 2020
@Borda Borda added this to the 0.10.0 milestone Oct 7, 2020
Labels
accelerator: tpu (Tensor Processing Unit), bug (Something isn't working), ready (PRs ready to be merged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TPU available: true when there are no TPUs
9 participants