Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move gsutil copy to condtional to avoid breakages #909

Conversation

kocchop
Copy link
Contributor

@kocchop kocchop commented Sep 23, 2024

gsutil is not in the requirements list, hence not installed by default. The copy command is breaking for the containers which don't have it by default. Moved it under a conditonal

Copy link
Collaborator

@jonb377 jonb377 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kocchop!

@gobbleturk
Copy link
Collaborator

Isn't it better to error out when a user requests profiling but the profiling fails? User can set profiler='' (default, no profiling) to avoid profiling instead of this check?

This error points to a real bug in maxtext - we should either ensure gsutil is installed or perhaps can use a different gcs API (upload_blob)

@kocchop
Copy link
Contributor Author

kocchop commented Sep 23, 2024

Hi @gobbleturk the profiling works fine as is. This line just copies the generated .nsys-rep file to some directory. Without gsutil this copy fails and the run breaks. If for some reason the profiling breaks, it should break even with this PR. In sum, this PR does not ignore profiling failures, rather it avoids the breakages caused by gsutil file copy.

@gobbleturk
Copy link
Collaborator

Hi @gobbleturk the profiling works fine as is. This line just copies the generated .nsys-rep file to some directory. Without gsutil this copy fails and the run breaks. If for some reason the profiling breaks, it should break even with this PR. In sum, this PR does not ignore profiling failures, rather it avoids the breakages caused by gsutil file copy.

Don't we need to move the profile to GCS? How are we able to use the profile otherwise?

@kocchop
Copy link
Contributor Author

kocchop commented Sep 23, 2024

@gobbleturk yes, on clusters using gcs, it is needed indeed. Hence, the nightly maxtext container made through this script installs gsutil etc. However, for other clusters it may not be necessary as there could be different file system mounts. Without any checks, it leads to breakages.

Another way could be to enforce the gsutil to be on the requirements list. Even with that, we might need to change the behavior for non-gcs scenarios.

@gobbleturk
Copy link
Collaborator

script

Ya this makes sense. For TPUs the profile will be sent to base_output_directory which is configurable (can be GCS or local), I think we should change the GPU path to also be configurable, as it won't always be GCS.

Copy link
Collaborator

@gobbleturk gobbleturk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will approve for now to unblock nsys profiling use cases, as we work on longer term fix

@gobbleturk
Copy link
Collaborator

Will approve for now to unblock nsys profiling use cases, as we work on longer term fix

Created #911 to track longer term fix

@copybara-service copybara-service bot merged commit 9a0cfd4 into AI-Hypercomputer:main Sep 24, 2024
4 of 5 checks passed
@kocchop kocchop deleted the faysal/move-gsutil-cp-2-conditional branch September 25, 2024 00:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants