Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Enable log rotation for Raylet and GCS server #26121

Closed
wants to merge 25 commits into from

Conversation

kfstorm
Copy link
Member

@kfstorm kfstorm commented Jun 27, 2022

Why are these changes needed?

This PR adds log rotation functionality for Raylet and GCS server by setting a non-empty log_dir when invoking ray::StartRayLog. This is critical for production environments to avoid logs taking too much disk space.

After this PR, raylet_{pid}.log/gcs_server_{pid}.log and potentially raylet_{pid}.{number}.log/gcs_server_{pid}.{number}.log are created as well as the existing .out and .err files. See the sample below.

dashboard_agent.log
dashboard.log
debug_state_gcs.txt
debug_state.txt
events
gcs_server_309463.log
gcs_server.err
gcs_server.out
log_monitor.log
monitor.err
monitor.log
monitor.out
old
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_309360.log
python-core-worker-28ccdc393f85a1b96961f0794c9305256c6e02d2d91d2cce1ea76aa6_309765.log
raylet_309662.log
raylet.err
raylet.out
worker-28ccdc393f85a1b96961f0794c9305256c6e02d2d91d2cce1ea76aa6-01000000-309765.err
worker-28ccdc393f85a1b96961f0794c9305256c6e02d2d91d2cce1ea76aa6-01000000-309765.out

TODO:

  • Code changes in dashboard and agent.
  • More test updates and fixes.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kfstorm kfstorm requested a review from wumuzi520 June 27, 2022 13:10
@kfstorm kfstorm changed the title [WIP] Enable log rotation for Raylet and GCS server [WIP] [Core] Enable log rotation for Raylet and GCS server Jun 27, 2022
@edoakes
Copy link
Contributor

edoakes commented Jun 27, 2022

Assigning to @rkooo567 because he looked into this in the past and has the most context.

@rkooo567
Copy link
Contributor

rkooo567 commented Jun 30, 2022

Thanks for the PR! It is indeed an important feature (and I've heard some issues regarding the disk usage!)

A couple high level comments before deep diving;

  1. We should update this doc. https://docs.ray.io/en/master/ray-observability/ray-logging.html. Can you include changes here?
  2. Why not completely removing gcs_server.out and gcs_server.err? We can't because of log monitor? (unless we remove it, log rotation has no meaning?)
  3. Why do we need {pid} for raylet? Is it because new raylet can be started on that node again? (isn't this creating a new session name anyway?)

@kfstorm
Copy link
Member Author

kfstorm commented Jul 1, 2022

We should update this doc. docs.ray.io/en/master/ray-observability/ray-logging.html. Can you include changes here?

No problem.

Why not completely removing gcs_server.out and gcs_server.err? We can't because of log monitor? (unless we remove it, log rotation has no meaning?)

In some cases, logs are printed to stderr or stdout. Cases I can think of are:

  1. An unhandled C++ exception is thrown.
  2. GRPC or ABSL assertion failure.
  3. Dashboard agent C++ logs. (Dashboard agent invokes C++ code without initializing Ray C++ logging and it shares stdour and stderr of Raylet.)
  4. ERROR or FATAL logs are always printed to stderr as well. See [Logging] Core worker ERRORs/check failures not streamed to drivers. #12893. I'm not sure if we need to expose Raylet and GCS server error logs to driver as well. If not, we can disable printing to stderr for Raylet and GCS server.

Although stderr and stdout logs are uncommon, I do think they are helpful when debugging issues.

Why do we need {pid} for raylet? Is it because new raylet can be started on that node again? (isn't this creating a new session name anyway?)

This is the default behavior of Ray C++ logging. In the case of multi-node Python tests, I guess log rotation won't work well because multiple processes share the same log filename pattern. However, I haven't read and tested the internal logic of rotating_file_sink of spdlog yet. I can test removing _{pid} suffix and see if they work well.

@kfstorm
Copy link
Member Author

kfstorm commented Jul 1, 2022

After reading the implementation of log rotation in spdlog https://github.com/gabime/spdlog/blob/v1.x/include/spdlog/sinks/rotating_file_sink-inl.h, I don't think multiple processes with the same filename pattern will work well.

@kfstorm
Copy link
Member Author

kfstorm commented Jul 6, 2022

@rkooo567 Any more high-level comments before I update the PR?

@rkooo567
Copy link
Contributor

rkooo567 commented Jul 6, 2022

Yeah I think it generally looks good. A couple comments;

  1. When we have RAY_LOG(ERROR) or RAY_LOG(FATAL), can we also write logs to .err file (so we will have 2 sinks, .log and .err)? i.e. keep the below behavior

ERROR or FATAL logs are always printed to stderr as well. See #12893. I'm not sure if we need to expose Raylet and GCS server error logs to driver as well. If not, we can disable printing to stderr for Raylet and GCS server.

  1. Right now, if you use cluster_utils.py, it will create raylet.1.out, raylet.2.out, and etc.. Can you just quickly verify how it'd look like after this PR?

I will start the full review soon!

@rkooo567
Copy link
Contributor

rkooo567 commented Jul 6, 2022

Also, @wuisawesome @ckw017 would there be any compatibility issue with Loki integration?

@rkooo567
Copy link
Contributor

rkooo567 commented Jul 6, 2022

Also for the compatibility (since there are lots of people who will still do cat gcs_server.out), can you write a comment on the top of the .out log file to see .log file instead?

@ckw017
Copy link
Member

ckw017 commented Jul 6, 2022

Also, @wuisawesome @ckw017 would there be any compatibility issue with Loki integration?

Promtail/Loki should be fine with log rotations

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I think the code itself is okay, but it is a pretty big backward incompatible change (e.g., it can easily break anyone who scrapes logs now). Let's actually do the API review here, or is it possible to keep the basic log file as raylet.out & gcs_server.out instead of raylet_[pid].log? Potential alternative file name is

raylet.out -> same as the current raylet_[pid].log
raylet.err -> same as now

and something else that's for redirecting stdout.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 2, 2022
@stale
Copy link

stale bot commented Oct 8, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 8, 2022
@stale
Copy link

stale bot commented Oct 22, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Oct 22, 2022
@jovany-wang
Copy link
Contributor

@rkooo567
I don't think it's a big backward incompatible change.

  1. raylet.log and gcs_server.log are often used by Ray developers, even most of them are Ray core developers. It's not used widely by our users.

  2. We can update the developing guide for how to grep log messages from raylet and gcs server.

@jovany-wang jovany-wang reopened this Nov 11, 2022
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 11, 2022
@jovany-wang
Copy link
Contributor

Maybe CC @scv119 @iycheng @jjyao

@jovany-wang jovany-wang self-assigned this Nov 11, 2022
@jovany-wang jovany-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 11, 2022
@rkooo567
Copy link
Contributor

raylet.log and gcs_server.log are often used by Ray developers, even most of them are Ray core developers. It's not used widely by our users.

Hmm I actually found we explicitly mentioned the file name's backward compatibility is not maintained. So it is probably okay. We still need a doc change (actually I've seen many users reading this log when they debug, and changing the name can break their log scraping depending on how it is implemented). I think we still need an API approval. cc @pcmoritz.

@rkooo567 rkooo567 added the core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals. label Nov 24, 2022
@jovany-wang
Copy link
Contributor

jovany-wang commented Nov 24, 2022

I think we still need an API approval. cc @pcmoritz.

Sure.

@jovany-wang jovany-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 24, 2022
Signed-off-by: Qing Wang <kingchin1218@gmail.com>
@jovany-wang jovany-wang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 24, 2022
@jovany-wang
Copy link
Contributor

@ray-project/ray-docs for docs owner approval.

@jovany-wang jovany-wang added the enhancement Request for new feature and/or capability label Nov 24, 2022
@jovany-wang
Copy link
Contributor

@ericl @pcmoritz for the core api approval.

Signed-off-by: Qing Wang <kingchin1218@gmail.com>
@stale
Copy link

stale bot commented Dec 26, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Dec 26, 2022
@stale
Copy link

stale bot commented Jan 10, 2023

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-interface-change-approval-required This changes the Ray core behavior / API and requires broader approvals. enhancement Request for new feature and/or capability stale The issue is stale. It will be closed within 7 days unless there are further conversation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants