better core dump for debugging #184

Open: wants to merge 1 commit into base: main
6 changes: 5 additions & 1 deletion src/xpk/core/core.py
@@ -2104,8 +2104,12 @@ def get_main_container(args, system, docker_image, resource_type) -> str:
' is required but not installed. Aborting"; exit 24;};'
Collaborator: I don't think you need the gsutil_test_command anymore since you moved to gcloud storage.

Collaborator (Author): Good point. Let me change this part accordingly.

)
xpk_internal_commands += (
'set -x;'
Collaborator: Curious why we add set -x; here. Is it related to the debug_dump_gcs feature or is it a more general feature? If it is a more general feature, maybe it should be part of the --enable-debug-logs arg?
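For context, a minimal sketch (an illustration, not xpk code) of what set -x does: each command is echoed to stderr with a "+ " prefix before it runs, so the job log shows exactly which upload step was executing when something went wrong.

```shell
#!/bin/sh
# With tracing on, the log interleaves "+ command" lines with the
# command's own output, which is useful when debugging the dump pipeline.
set -x
WORKER_ID=$HOSTNAME
echo "uploading dumps for $WORKER_ID"
```

The trace goes to stderr, so it is visible in pod logs without changing what the commands themselves print to stdout.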

'WORKER_ID=$HOSTNAME;'
- f'gsutil -m cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;'
+ f'gcloud storage cp -r /tmp/xla_dump/ {args.debug_dump_gcs}/$WORKER_ID;'
f'gcloud storage cp /tmp/*.dump {args.debug_dump_gcs}/$WORKER_ID;'
Collaborator: Do we need additional flags set to find /tmp/*.dump and /tmp/tpu_logs? I just want to make sure these additional dumps always exist and don't crash a user's job if they can't be found (e.g. if they are only produced with additional flags).

Collaborator (Author):

1. No flags are needed for dumping /tmp/tpu_logs.
2. As for /tmp/*.dump, this file is only dumped on certain specific hardware failures. We can also force a core dump with some flags.

If the specific path is not found, the command prints an error, but the pipeline still continues after the error since we didn't set -e;.

Collaborator:

> We can also force a core dump with some flags.

Sounds good. Should we explicitly set the flags for the core dumps?

f'gcloud storage cp -r /tmp/tpu_logs {args.debug_dump_gcs}/$WORKER_ID;'
'sleep 120;'
Collaborator: What is the sleep for?

Collaborator (Author): This is a workaround, not an ideal solution. The main host pod needs a significant amount of time to upload the /tmp/*.dump files, which are approximately 10 GiB in size. I found that the main host pod was terminated before it finished uploading, so I added buffer time to each pod's lifecycle to ensure the host pod has sufficient time to finish.

Let me know if you have a better solution.

Collaborator: Hmm, buffer times are scary to put into production code because you don't know whether you have set the time too long or too short!

Also, isn't gcloud storage cp a blocking call? It is strange that the main host would terminate earlier; do you mind looking into this more? Thank you, Zhiyu!
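One possible direction, offered as a hypothetical sketch rather than part of this PR: instead of a fixed sleep, run the upload as a tracked job and wait on it, relying on Kubernetes sending SIGTERM at shutdown (and a sufficiently large terminationGracePeriodSeconds) while trapping the signal so an in-flight upload is not killed mid-transfer. The upload command below is a stand-in.

```shell
#!/bin/sh
# Hypothetical alternative to `sleep 120`: wait on the upload itself
# rather than on a fixed buffer time.
trap 'echo "SIGTERM received; waiting for upload to finish" >&2' TERM

# Stand-in for: gcloud storage cp -r /tmp/xla_dump/ "$DEBUG_DUMP_GCS/$WORKER_ID"
sleep 1 &
upload_pid=$!
wait "$upload_pid"      # blocks until the upload process exits
echo "upload complete"
```

This removes the guesswork about whether the buffer is too long or too short, but it assumes the pod's grace period is configured to cover the worst-case upload time.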

)

command = args.command