
Unify ThreadPoolExecutor usage in all Dashboard(Agent)Head. #47160

Merged
rynewang merged 10 commits into ray-project:master from tpe-for-all on Aug 27, 2024

Conversation

@rynewang (Contributor) commented Aug 15, 2024

Previously, each Dashboard(Agent)HeadModule could have its own ThreadPoolExecutor. This PR creates a single, unified TPE in Dashboard(Agent)Head and uses it everywhere. It also adds an asyncio yield in DataOrganizer.organize() so that the method does not block the event loop for long stretches.
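In rough terms, the pattern looks like the following sketch. The class and method names here are illustrative only, not the actual Ray code.

import asyncio
from concurrent.futures import ThreadPoolExecutor


class DashboardHeadSketch:  # illustrative stand-in for Dashboard(Agent)Head
    def __init__(self) -> None:
        # One executor owned by the head; modules receive this shared pool
        # instead of each constructing their own ThreadPoolExecutor.
        self._thread_pool_executor = ThreadPoolExecutor(
            max_workers=1, thread_name_prefix="dashboard_head_tpe"
        )

    async def run_blocking(self, fn, *args):
        # run_in_executor keeps the blocking call off the event loop thread,
        # and awaiting the returned future is itself a yield point.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._thread_pool_executor, fn, *args)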

…ntHeads. Adds to DataOrganizer.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@alexeykudinkin (Contributor) left a comment

@rynewang one more change we'd make is to make the data refresh frequency configurable, so that we can mitigate impact by suggesting that users reduce the update frequency

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang (Contributor, Author)

@alexeykudinkin done. Now one can set RAY_ORGANIZE_DATA_INTERVAL_SECONDS, which defaults to 2.
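Roughly, the env_integer-based constant shown in the diff below amounts to the following (the variable name is taken from this comment; the final name may differ after later review suggestions):

import os

# Read the dashboard data refresh interval from the environment, defaulting
# to 2 seconds, e.g. RAY_ORGANIZE_DATA_INTERVAL_SECONDS=10 to refresh less often.
ORGANIZE_DATA_INTERVAL_SECONDS = int(
    os.environ.get("RAY_ORGANIZE_DATA_INTERVAL_SECONDS", "2")
)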

@rynewang added the go (add ONLY when ready to merge, run all tests) label on Aug 15, 2024
@@ -81,7 +90,7 @@ async def organize(cls):
DataSource.core_worker_stats.reset(core_worker_stats)
Collaborator

Is DataSource thread safe, given that it's now accessed from another thread as well?

Contributor Author

not sure...

Contributor

+1

These custom data-structures aren't thread-safe. If we're planning to run this on TPE we need to cover these with locks
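One way to do that is sketched below, with a plain dict standing in for the real DataSource fields; the class and method names here are illustrative only.

import threading


class LockedStats:
    """Illustrative only: guard shared state with a lock so it can be touched
    both from the event loop thread and from ThreadPoolExecutor threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._core_worker_stats = {}

    def reset(self, new_stats: dict) -> None:
        # Replace the whole mapping under the lock.
        with self._lock:
            self._core_worker_stats = dict(new_stats)

    def snapshot(self) -> dict:
        # Return a copy so callers never read a structure being mutated.
        with self._lock:
            return dict(self._core_worker_stats)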

@alexeykudinkin (Contributor)

-    return StateAPIManager(state_api_data_source_client)
+    return StateAPIManager(
+        state_api_data_source_client,
+        thread_pool_executor=ThreadPoolExecutor(
Contributor

Provided that we're planning on offloading CPU-bound tasks that nevertheless still hold the GIL, we should limit the number of threads in the TPE (by default, TPE provisions about # of CPUs + 4 threads)
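For example (the executor name below is illustrative; on CPython 3.8+ the actual default is min(32, os.cpu_count() + 4) workers):

from concurrent.futures import ThreadPoolExecutor

# Cap the pool explicitly rather than taking the default, since the offloaded
# work is CPU-bound and holds the GIL anyway; more threads just add contention.
state_api_executor = ThreadPoolExecutor(
    max_workers=1, thread_name_prefix="state_api_tpe"
)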

Contributor Author

changed to 1

Contributor Author

changed in head.py, I mean

@@ -47,6 +48,9 @@ def __init__(
         # Public attributes are accessible for all agent modules.
         self.ip = node_ip_address
         self.minimal = minimal
+        self.thread_pool_executor = ThreadPoolExecutor(
Contributor

Please check my comment above.

We need to

  • Isolate these TPEs to non critical operations (like refreshing stats/data used in UI)
  • Limit concurrency of these TPEs (to 2-4)

@@ -25,7 +25,7 @@
 RETRY_REDIS_CONNECTION_TIMES = 10
 CONNECT_REDIS_INTERNAL_SECONDS = 2
 PURGE_DATA_INTERVAL_SECONDS = 60 * 10
-ORGANIZE_DATA_INTERVAL_SECONDS = 2
+ORGANIZE_DATA_INTERVAL_SECONDS = env_integer("RAY_ORGANIZE_DATA_INTERVAL_SECONDS", 2)
Contributor

Suggested change
-ORGANIZE_DATA_INTERVAL_SECONDS = env_integer("RAY_ORGANIZE_DATA_INTERVAL_SECONDS", 2)
+ORGANIZE_DATA_INTERVAL_SECONDS = env_integer("RAY_DASHBOARD_STATS_UPDATING_INTERVAL", 2)


@@ -125,6 +125,10 @@ def __init__(
         self._modules_to_load = modules_to_load
         self._modules_loaded = False

+        self._thread_pool_executor = ThreadPoolExecutor(
Contributor

Check comment above

@rynewang (Contributor, Author)

todo:

  1. revert datacenter.py changes
  2. move node_stats change back to async ctx

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang changed the title from "Unify ThreadPoolExecutor usage in all Dashboard(Agent)Head and DataOrganizer." to "Unify ThreadPoolExecutor usage in all Dashboard(Agent)Head." on Aug 24, 2024
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang (Contributor, Author)

@jjyao @alexeykudinkin pls take a look

# Offloads the blocking operation to a thread pool executor. This also
# yields to the event loop.
workers = await get_or_create_event_loop().run_in_executor(
    thread_pool_executor, cls.get_node_workers, node_id
)
Collaborator

get_node_workers uses DataSource which is not thread safe, so we cannot just run it in a different thread.

For this PR, let's just call it on the main event loop thread and yield inside the for loop.

We can optimize get_node_workers by stopping using ImmutableDict in a separate PR
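A sketch of that suggestion, with the DataSource access reduced to a plain callable for illustration (function and parameter names here are made up):

import asyncio


async def organize_node_workers(get_node_workers, node_ids):
    """Sketch: stay on the event loop thread, but yield every iteration so a
    long pass over many nodes cannot starve other coroutines."""
    node_workers = {}
    for node_id in node_ids:
        # Not offloaded to a worker thread: get_node_workers reads structures
        # that are not thread safe.
        node_workers[node_id] = get_node_workers(node_id)
        # Cooperative yield back to the event loop.
        await asyncio.sleep(0)
    return node_workers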

python/ray/dashboard/head.py (review thread outdated, resolved)
python/ray/dashboard/modules/node/node_head.py (review thread outdated, resolved)
@jjyao (Collaborator) commented Aug 25, 2024

After this PR, we should check every call of dashboard_utils.message_to_dict and run it in the background thread.
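A sketch of what that follow-up could look like, assuming the usual import of ray.dashboard.utils as dashboard_utils; the wrapper name is made up:

import asyncio

import ray.dashboard.utils as dashboard_utils


async def message_to_dict_in_executor(executor, message, *args, **kwargs):
    """Sketch: run the CPU-bound protobuf-to-dict conversion on the shared
    thread pool instead of the event loop thread."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        executor,
        lambda: dashboard_utils.message_to_dict(message, *args, **kwargs),
    )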

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@@ -317,7 +313,7 @@ async def upload_package(self, req: Request):
        try:
            data = await req.read()
            await get_or_create_event_loop().run_in_executor(
                self._upload_package_thread_pool_executor,
Collaborator

Is it critical enough to deserve its own thread pool?

Contributor Author

I don't think so because it's not latency critical either.

@rynewang enabled auto-merge (squash) August 26, 2024 23:22
@rynewang merged commit 6c7da02 into ray-project:master Aug 27, 2024
6 checks passed
@rynewang deleted the tpe-for-all branch August 27, 2024 07:02
Labels
go add ONLY when ready to merge, run all tests
3 participants