Remove cached torch_extensions on CI runners #18868

Merged on Sep 2, 2022 (4 commits)

@ydshieh (Collaborator) commented Sep 2, 2022

What does this PR do?

  • The test

    tests/deepspeed/test_deepspeed.py::TrainerIntegrationDeepSpeed::test_hf_scheduler_ds_optimizer
    

    has been failing for 2 weeks due to a cache issue. The error message is

    E   ImportError: /github/home/.cache/torch_extensions/py38_cu113/fused_adam/fused_adam.so: undefined symbol: _ZN3c104impl8GPUTrace13gpuTraceStateE
  • After removing the cache (on the host runner, not inside the running Docker container) with

    sudo rm -rf /home/github_actions/actions-runner/_work_temp/_github_home/.cache/torch_extensions/py38_cu113/

    the test passes.

  • This PR adds the following command to the workflow file

    rm -rf /github/home/.cache/torch_extensions/

    to avoid the same problem occurring in the future.
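For readers who want a slightly more defensive variant of the cleanup step, here is a minimal sketch. The function name `clear_torch_extensions_cache` and the guard against removing `/` are assumptions made for illustration and are not part of this PR; the default path is the one quoted above. Deleting the directory simply forces the extensions (such as `fused_adam`) to be rebuilt the next time they are needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): remove the cached, previously
# compiled torch extensions so they are rebuilt on next use.
clear_torch_extensions_cache() {
    # Default is the in-container path quoted in the PR; pass a directory to override.
    local cache_dir="${1:-/github/home/.cache/torch_extensions}"

    # Refuse an obviously dangerous target before calling rm -rf.
    if [ "$cache_dir" = "/" ]; then
        echo "refusing to remove '$cache_dir'" >&2
        return 1
    fi

    if [ -d "$cache_dir" ]; then
        rm -rf -- "$cache_dir"
        echo "removed $cache_dir"
    else
        # Nothing cached yet: treat as success so the CI step stays idempotent.
        echo "nothing to remove at $cache_dir"
    fi
}
```

In a workflow step this could be called with no argument before the DeepSpeed tests run. The plain `rm -rf` line used in the PR has the same effect.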

Remark: the host directory

/home/github_actions/actions-runner/_work_temp/_github_home/

is mapped to

/github/home/

inside the running Docker container (this can be seen on the job run page).
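To make the mapping concrete, the sketch below translates a path as seen inside the container into the corresponding path on the host. The function name `container_to_host_path` is hypothetical, and the two prefixes are the ones quoted in this remark; they are specific to these runners and would differ elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an in-container path to the host path,
# using the two directories mentioned in the remark.
container_to_host_path() {
    local container_prefix="/github/home"
    local host_prefix="/home/github_actions/actions-runner/_work_temp/_github_home"
    local path="$1"

    case "$path" in
        "$container_prefix"|"$container_prefix"/*)
            # Strip the container prefix and prepend the host prefix.
            printf '%s\n' "${host_prefix}${path#"$container_prefix"}"
            ;;
        *)
            echo "not under $container_prefix: $path" >&2
            return 1
            ;;
    esac
}
```

For example, passing the `fused_adam.so` path from the error message yields the location of that file under the host directory, which is where the manual cleanup was done.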

HuggingFaceDocBuilderDev commented Sep 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh requested review from stas00 and sgugger September 2, 2022 14:10
@sgugger (Collaborator) left a comment

Thanks for the fix!

@stas00 (Contributor) left a comment

Thank you for figuring out how to fix this, @ydshieh

@ydshieh changed the title from "remove cached torch_extensions" to "Remove cached torch_extensions on CI runners" on Sep 2, 2022
@ydshieh merged commit ecdf9b0 into main on Sep 2, 2022
@ydshieh deleted the avoid_deepspeed_test_issue branch on September 2, 2022 16:18
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>