Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better core dump for debugging #184

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ZhiyuLi-goog
Copy link
Collaborator

@ZhiyuLi-goog ZhiyuLi-goog commented Sep 8, 2024

Features

upload more detailed info to gcs bucket in failures

Switch to gcloud storage cp, since it around twice as fast as gsutil -m cp.

Testing / Documentation

  • Verified in burnin tests

f'gcloud storage cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;'
f'gcloud storage cp /tmp/*.dump {args.debug_dump_gcs}/$WORKER_ID;'
f'gcloud storage cp -r /tmp/tpu_logs {args.debug_dump_gcs}/$WORKER_ID;'
'sleep 120;'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the sleep for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a workaround, not an ideal solution. The main host pod requires a significant amount of time to upload /tmp/*.dump which are approximately 10 GiB in size.
I did find the main host pod was terminated earlier before finishing upload. I just want to add a buffer time to each pod's lifecycle, ensuring the host pod has sufficient time to finish uploading.

Let me know if you have better solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm buffer times are scary to put into production code because you don't know if you have set the time for too long or too short!

It looks like gcloud storage cp .. a blocking call? Why would the main host terminate earlier is strange... do you mind looking into this more? Thank you Zhiyu!

@@ -2104,8 +2104,12 @@ def get_main_container(args, system, docker_image, resource_type) -> str:
' is required but not installed. Aborting"; exit 24;};'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need the gsutil_test_command anymore since you move to gcloud storage

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Let me change this part accordingly.

'WORKER_ID=$HOSTNAME;'
f'gsutil -m cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;'
f'gcloud storage cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;'
f'gcloud storage cp /tmp/*.dump {args.debug_dump_gcs}/$WORKER_ID;'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need additional flags set to find the /tmp/*.dump and /tmp/tpu_logs? just want to make sure these additional dumps always exist and don't crash a users job if they can't be found (like only available with additional flags)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No flags for dumping /tmp/tpu_logs.
  2. As for /tmp/*.dump, this file will be dumped in case of some specific hardware failure. We can also force core dump with some flags as well.

There will be an error without finding the specific path. And the pipeline will still continue after the error since we didn't set -e;.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also force core dump with some flags as well.

SG, Should we explicitly set the flags for the core dumps?

@@ -2104,8 +2104,12 @@ def get_main_container(args, system, docker_image, resource_type) -> str:
' is required but not installed. Aborting"; exit 24;};'
)
xpk_internal_commands += (
'set -x;'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why we add set -x; here? Is it related to the debug_dump_gcs feature or a more general feature? If it is a more general feature, maybe it should be part of --enable-debug-logs arg?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants