-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
better core dump for debugging #184
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2104,8 +2104,12 @@ def get_main_container(args, system, docker_image, resource_type) -> str: | |
' is required but not installed. Aborting"; exit 24;};' | ||
) | ||
xpk_internal_commands += ( | ||
'set -x;' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Curious why we add |
||
'WORKER_ID=$HOSTNAME;' | ||
f'gsutil -m cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;' | ||
f'gcloud storage cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;' | ||
f'gcloud storage cp /tmp/*.dump {args.debug_dump_gcs}/$WORKER_ID;' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need additional flags set to find the /tmp/*.dump and /tmp/tpu_logs? just want to make sure these additional dumps always exist and don't crash a users job if they can't be found (like only available with additional flags) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There will be an error without finding the specific path. And the pipeline will still continue after the error since we didn't There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
SG, Should we explicitly set the flags for the core dumps? |
||
f'gcloud storage cp -r /tmp/tpu_logs {args.debug_dump_gcs}/$WORKER_ID;' | ||
'sleep 120;' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the sleep for? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a workaround, not an ideal solution. The main host pod requires a significant amount of time to upload Let me know if you have better solution. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm buffer times are scary to put into production code because you don't know if you have set the time for too long or too short! It looks like |
||
) | ||
|
||
command = args.command | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need the gsutil_test_command anymore since you move to gcloud storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let me change this part accordingly.