-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
better core dump for debugging #184
base: main
Are you sure you want to change the base?
Conversation
f'gcloud storage cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;' | ||
f'gcloud storage cp /tmp/*.dump {args.debug_dump_gcs}/$WORKER_ID;' | ||
f'gcloud storage cp -r /tmp/tpu_logs {args.debug_dump_gcs}/$WORKER_ID;' | ||
'sleep 120;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the sleep for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a workaround, not an ideal solution. The main host pod requires a significant amount of time to upload /tmp/*.dump
which are approximately 10 GiB in size.
I did find the main host pod was terminated earlier before finishing upload. I just want to add a buffer time to each pod's lifecycle, ensuring the host pod has sufficient time to finish uploading.
Let me know if you have better solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm buffer times are scary to put into production code because you don't know if you have set the time for too long or too short!
It looks like gcloud storage cp ..
a blocking call? Why would the main host terminate earlier is strange... do you mind looking into this more? Thank you Zhiyu!
@@ -2104,8 +2104,12 @@ def get_main_container(args, system, docker_image, resource_type) -> str: | |||
' is required but not installed. Aborting"; exit 24;};' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need the gsutil_test_command anymore since you move to gcloud storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Let me change this part accordingly.
'WORKER_ID=$HOSTNAME;' | ||
f'gsutil -m cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;' | ||
f'gcloud storage cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;' | ||
f'gcloud storage cp /tmp/*.dump {args.debug_dump_gcs}/$WORKER_ID;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need additional flags set to find the /tmp/*.dump and /tmp/tpu_logs? just want to make sure these additional dumps always exist and don't crash a users job if they can't be found (like only available with additional flags)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- No flags for dumping /tmp/tpu_logs.
- As for
/tmp/*.dump
, this file will be dumped in case of some specific hardware failure. We can also force core dump with some flags as well.
There will be an error without finding the specific path. And the pipeline will still continue after the error since we didn't set -e;
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also force core dump with some flags as well.
SG, Should we explicitly set the flags for the core dumps?
@@ -2104,8 +2104,12 @@ def get_main_container(args, system, docker_image, resource_type) -> str: | |||
' is required but not installed. Aborting"; exit 24;};' | |||
) | |||
xpk_internal_commands += ( | |||
'set -x;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why we add set -x;
here? Is it related to the debug_dump_gcs feature or a more general feature? If it is a more general feature, maybe it should be part of --enable-debug-logs arg?
Features
upload more detailed info to gcs bucket in failures
Switch to
gcloud storage cp
, since it around twice as fast asgsutil -m cp
.Testing / Documentation