
Conversation

@MengjinYan
Contributor

Why are these changes needed?

This PR adds integration tests for task status update events.

Related issue number

N/A

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@MengjinYan MengjinYan added the go add ONLY when ready to merge, run all tests label Oct 15, 2025
@MengjinYan MengjinYan marked this pull request as ready for review October 20, 2025 19:23
@MengjinYan MengjinYan requested a review from a team as a code owner October 20, 2025 19:23
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Oct 21, 2025
event["taskDefinitionEvent"]["taskType"] == "DRIVER_TASK"
and event["taskDefinitionEvent"]["jobId"] != test_job_id
):
driver_task_id = event["taskDefinitionEvent"]["taskId"]
Collaborator

I think we should just test it with the field renaming turned off

Contributor Author

Good point. The PR was written before the field renaming. I'll add the test for both, and when we remove the env var, we can remove the field renaming in the test.

Collaborator

I think just testing with the renaming disabled is fine since you have other tests to check the renaming, but up to you.
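
For illustration, one way a test could cover both modes is a parametrized toggle; a minimal sketch, assuming a hypothetical env var name (RAY_EVENT_FIELD_RENAMING_ENABLED below is not the actual flag, it only stands in for whatever setting controls the renaming):

import pytest

RENAMING_ENV_VAR = "RAY_EVENT_FIELD_RENAMING_ENABLED"  # hypothetical name

@pytest.mark.parametrize("renaming_enabled", [True, False])
def test_driver_task_event_fields(renaming_enabled, monkeypatch):
    monkeypatch.setenv(RENAMING_ENV_VAR, "1" if renaming_enabled else "0")
    # With renaming on, exported events use camelCase JSON keys; with it off,
    # the original snake_case proto field names are expected.
    task_key = "taskDefinitionEvent" if renaming_enabled else "task_definition_event"
    # ... start the cluster, run the driver script, collect events, then check
    # that the received event contains `task_key`.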


Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@jjyao jjyao (Collaborator) left a comment

Let's figure out why it takes a long time to run. Ideally CI tests should be fast.

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@MengjinYan
Contributor Author

Let's figure out why it takes a long time to run. Ideally CI tests should be fast.

I dug deeper into the tests and made some fixes to make them more stable. At the same time, I had the following findings from the tests and created follow-up issues for them:


Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
def run_driver_script_and_wait_for_events(script, httpserver, cluster, validation_func):
    httpserver.expect_request("/", method="POST").respond_with_data("", status=200)
    node_ids = [node.node_id for node in cluster.list_all_nodes()]
    assert wait_until_grpc_channel_ready(cluster.gcs_address, node_ids)
Collaborator

I think you can use the existing wait_until_server_available util?

Contributor Author

Yeah, I looked at wait_until_server_available before, but from what I understand it only checks whether a port is open, not whether the gRPC server is actually ready to accept connections.
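
For reference, a gRPC-level readiness check can look roughly like the sketch below. This is only an assumption about what such a helper might do, not the actual wait_until_grpc_channel_ready used in the test (which also takes the cluster's node IDs):

import grpc

def wait_until_grpc_server_ready(address: str, timeout_s: float = 10.0) -> bool:
    # Unlike a plain TCP port probe, channel_ready_future only resolves once
    # the channel reaches the READY state, i.e. the server can accept RPCs.
    channel = grpc.insecure_channel(address)
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout_s)
        return True
    except grpc.FutureTimeoutError:
        return False
    finally:
        channel.close()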

# before start the driver script. A longer term fix is to improve the start up
# sequence of the dashboard agent and the workers.
# Followup issue: https://github.com/ray-project/ray/issues/58007
time.sleep(3)
Collaborator

Since we already wait for the dashboard agent server to be up with wait_until_grpc_channel_ready, why do we need to sleep again?

Contributor Author

My understanding of what happened is:

  1. When the test cluster starts, we call ray.init() right away. This creates the driver process along with a core worker process.
  2. The processes start sending events to the aggregator agent every 100ms, even if there are no events (this is fixed in the latest commit).
  3. When the core worker starts, the aggregator agent might not be ready to receive gRPC connections, so the first sends fail. The gRPC client's backoff retry strategy then kicks in, so in the background the connection is only retried after the retry interval elapses.
  4. So it can happen that the aggregator agent server is ready to receive connections but the retry hasn't happened yet. The wait here was mainly to give the retry time to happen so that the core worker successfully creates the connection.

At the same time, as mentioned in 2, I updated the task buffer's logic to only send gRPC requests to the aggregator agent when there are events to send. This should eliminate the issue, so I removed the sleep in the latest commit.
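
A conceptual sketch of that fix is below; the real change lives in the core worker's task event buffer, so this Python illustration only shows the idea of skipping the RPC when nothing is buffered:

import threading

class EventBuffer:
    def __init__(self, send_rpc, flush_interval_s=0.1):
        self._events = []
        self._lock = threading.Lock()
        self._send_rpc = send_rpc  # sends a batch of events to the aggregator agent
        self._flush_interval_s = flush_interval_s

    def add(self, event):
        with self._lock:
            self._events.append(event)

    def flush_loop(self, stop_event: threading.Event):
        while not stop_event.wait(self._flush_interval_s):
            with self._lock:
                events, self._events = self._events, []
            if not events:
                # Previously an (empty) RPC was issued every interval, which
                # could trigger the gRPC backoff retry before the aggregator
                # agent was up; now the connection is only exercised when
                # there is something to send.
                continue
            self._send_rpc(events)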

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@can-anyscale can-anyscale self-requested a review October 24, 2025 01:38
@can-anyscale
Contributor

Just adding myself so the PR shows up on my GitHub review todo list; not blocking.

@jjyao jjyao (Collaborator) left a comment

LGTM

MengjinYan and others added 4 commits October 24, 2025 12:26
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
@jjyao jjyao merged commit 9cbe131 into ray-project:master Oct 26, 2025
6 checks passed
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ay-project#57636)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#57636)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ay-project#57636)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#57636)

Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
