-
Notifications
You must be signed in to change notification settings - Fork 7k
[Core][TaskEvents] Add Integration Tests for Task Event Generation #57636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
| event["taskDefinitionEvent"]["taskType"] == "DRIVER_TASK" | ||
| and event["taskDefinitionEvent"]["jobId"] != test_job_id | ||
| ): | ||
| driver_task_id = event["taskDefinitionEvent"]["taskId"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just test it with the field renaming turned off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. The PR was written before the field renaming. Will the add the test for both and when we remove the env var, we can remove the field renaming in the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just testing the renaming disabled is fine since you have other tests to check to renaming but up to you.
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
jjyao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's figure out why it takes a long time to run. Ideally CI tests should be fast.
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
I dig deeper into the tests and did some fixes to make the tests more stable. At the same time, I had the following findings from the tests and created followup issues for them:
|
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
| def run_driver_script_and_wait_for_events(script, httpserver, cluster, validation_func): | ||
| httpserver.expect_request("/", method="POST").respond_with_data("", status=200) | ||
| node_ids = [node.node_id for node in cluster.list_all_nodes()] | ||
| assert wait_until_grpc_channel_ready(cluster.gcs_address, node_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use the existing wait_until_server_available util?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I looked at the wait_until_server_available before but from what I understand, it only checks whether a port is ready but not the GRPC server is ready for connection for not.
| # before start the driver script. A longer term fix is to improve the start up | ||
| # sequence of the dashboard agent and the workers. | ||
| # Followup issue: https://github.com/ray-project/ray/issues/58007 | ||
| time.sleep(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we already wait for the dashboard agent server to be up with wait_until_grpc_channel_ready, why do we need to sleep again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding what happened is:
- When the test cluster started, we do ray.init() right away. This creates the driver process along with a core worker process.
- The processes starts to send the events to the aggregator agent every 100ms, even if there is no events (this is fixed in the latest commit.)
- At the time when the core worker starts, the aggregator agent might not ready to receive grpc connection, so the first sends failed. At the same time, the GRPC client's backoff retry strategy kicks in, so in the background, the connection will be retried only after the retry interval completes.
- So it could happen that the aggregator agent server is ready to receive connection but the retry hasn't happened yet. And the wait here was mainly to wait for the retry to happen and the core worker has successfully create the connection.
At the same time, as mentioned in 2, I task buffer's updated to logic to only send grpc requests to aggregator agent when there are events to send. This should eliminate the issue. So I removed the sleep in the latest commit.
|
just add myself so the PR shown up on my github review todo list, not blocking |
jjyao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: xgui <xgui@anyscale.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…ay-project#57636) Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com> Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Why are these changes needed?
This PR add integration tests for task status update events.
Related issue number
N/A
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.