
[Serve] Avoid looping over all snapshot ids for each long poll request #45881

Conversation


@JoshKarpel JoshKarpel commented Jun 11, 2024

Why are these changes needed?

See #45872 for background - TLDR we are trying to improve the scalability of the Serve Controller and I'm hunting down optimizations in the controller code.

In this case, it looks like there's an expensive operation happening on each long poll client request - see comments for the details.
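To make the shape of the optimization concrete, here is an illustrative standalone sketch (not the actual Ray code): replacing a scan over every snapshot id on each long-poll request with lookups of only the keys that client is watching.

```python
# Illustrative data: a host tracking many snapshot ids.
snapshot_ids = {f"key-{i}": i for i in range(10_000)}

def outdated_keys_scan(client_ids):
    # Before: O(total keys tracked by the host) per request.
    return {k for k, v in snapshot_ids.items()
            if k in client_ids and v != client_ids[k]}

def outdated_keys_lookup(client_ids):
    # After: O(keys this client watches) per request.
    return {k for k, cid in client_ids.items()
            if snapshot_ids.get(k, cid) != cid}

client = {"key-1": 0, "key-2": 2}
assert outdated_keys_scan(client) == outdated_keys_lookup(client) == {"key-1"}
```

With thousands of apps and handles, the per-request cost difference of that loop adds up quickly on the controller.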

Before: 1800 apps with one deployment each, and roughly ~5k handles in the system (likely more):

[flamegraph screenshot: controller before the fix]

After: 1800 apps with one deployment each, averaging ~7.5k active listen_for_change tasks on the controller (with both #45872 and #45878 applied as well):

[flamegraph screenshot: controller-1800-apps-after-snapshot-fix-with-handles]

Related issue number

Discovered while looking at Serve Controller flamegraphs for #45777

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```python
except KeyError:
    # Initial snapshot id must be >= 0, so that the long poll client
    # can send a negative initial snapshot id to get a fast update.
    self.snapshot_ids[object_key] = 0
```
JoshKarpel (Contributor, Author):

This was previously a random number https://github.com/ray-project/ray/pull/45881/files#diff-f138b21f7ddcd7d61c0b2704c8b828b9bbe7eb5021531e2c7fabeb20ec322e1aL195 but I couldn't figure out why - I can of course use a random number here but this seems simpler if that's not necessary.

Contributor:

@shrekris-anyscale do you know why this was a random number / is that necessary?

@shrekris-anyscale (Contributor) commented Jun 19, 2024:

We set it to a random number to handle an edge case when the controller crashes and recovers. Suppose the controller always starts with ID 0, and it's currently at ID 1. Suppose all the clients are also at ID 1. Now suppose:

  1. The ServeController (which runs the LongPollHost) crashes and restarts. The LongPollHost's snapshot IDs are reset to 0.
  2. All clients, except one slow client, reconnect to the LongPollHost on the ServeController. Their snapshot IDs are still 1, so the controller propagates an update, and all the connected clients' snapshot IDs are set back to 0.
  3. One of the connected clients sends an update to the LongPollHost using notify_changed. The controller bumps the snapshot ID to 1 and updates the connected clients.
  4. The slow client finally connects to the controller. Its ID is also 1, so it doesn't receive the update.

To correctly avoid this edge case, we would have to cache the snapshot IDs and restore them when the controller recovers. That's pretty complex, though, so instead we initialize the snapshot IDs to a random number between 0 and 1,000,000. That makes this edge case very unlikely.

Could we switch back to using a defaultdict with the random number generator as the factory function?
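A minimal sketch of the pattern being suggested, assuming the key and variable names are illustrative (this is not the Ray code itself): a defaultdict whose factory draws a random initial snapshot id, so an unknown key is seeded lazily on first access instead of starting at a predictable 0.

```python
from collections import defaultdict
import random

# The factory runs once per missing key, on first access.
snapshot_ids = defaultdict(lambda: random.randint(0, 1_000_000))

initial = snapshot_ids["new_key"]   # seeds a random id for "new_key"
snapshot_ids["new_key"] += 1        # later bumps behave like a plain dict
```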

JoshKarpel (Contributor, Author):

Oh! That totally makes sense, and I will restore that behavior.

> Could we switch back to using a default dict with the random number generator as the factory function?

Let me play around with this; I might be able to make it work.

JoshKarpel (Contributor, Author):

Oh, I just remembered why I changed it to a plain dict: the problem is that snapshot_ids is a defaultdict but object_snapshots isn't, so I was trying to get items from one that didn't exist in the other. Now I'm wondering if those two mappings can be combined, since they have the same keys 🤔
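One hypothetical way to combine the two parallel mappings (snapshot_ids and object_snapshots) into one, so their key sets can never drift apart. All names here are illustrative sketches, not the actual Ray internals:

```python
import random
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class SnapshotEntry:
    snapshot_id: int
    snapshot: Any

entries: Dict[str, SnapshotEntry] = {}

def notify_changed(key: str, new_snapshot: Any) -> None:
    entry = entries.get(key)
    if entry is None:
        # Seed a fresh key with a random id (see the crash-recovery
        # edge case discussed above).
        entries[key] = SnapshotEntry(random.randint(0, 1_000_000), new_snapshot)
    else:
        entry.snapshot_id += 1
        entry.snapshot = new_snapshot
```

Since the id and the snapshot live in one entry, there's no way to have an id without a snapshot or vice versa.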

Comment on lines -401 to +409

```diff
-        self.snapshot_ids[object_key] += 1
+        try:
+            self.snapshot_ids[object_key] += 1
```
JoshKarpel (Contributor, Author):

Not a defaultdict anymore :(

@JoshKarpel JoshKarpel marked this pull request as ready for review June 12, 2024 21:03
@shrekris-anyscale (Contributor) left a comment:

Thanks for contributing! I left some comments.

Comment on lines 256 to 266

```python
try:
    existing_id = self.snapshot_ids[key]
except KeyError:
    # The caller may ask for keys that we don't know about (yet),
    # just ignore them.
    continue

if existing_id != snapshot_id:
    updated_objects[key] = UpdatedObject(
        self.object_snapshots[key], existing_id
    )
```
@shrekris-anyscale (Contributor):

We can avoid raising and catching an error by using a conditional block.

Suggested change:

```diff
-            try:
-                existing_id = self.snapshot_ids[key]
-            except KeyError:
-                # The caller may ask for keys that we don't know about (yet),
-                # just ignore them.
-                continue
-            if existing_id != snapshot_id:
-                updated_objects[key] = UpdatedObject(
-                    self.object_snapshots[key], existing_id
-                )
+            # The caller may ask for keys that we don't know about (yet),
+            # just ignore them.
+            if key in self.snapshot_ids:
+                latest_id = self.snapshot_ids[key]
+                if latest_id != snapshot_id:
+                    updated_objects[key] = UpdatedObject(
+                        self.object_snapshots[key], latest_id
+                    )
```
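For a missing key, both shapes behave identically; whether to use try/except (EAFP) or an explicit membership check is largely a style call. A toy standalone comparison with illustrative data, not the actual Ray code:

```python
snapshot_ids = {"a": 3}
object_snapshots = {"a": "snapshot-a"}

def lookup_eafp(key, snapshot_id):
    try:
        existing_id = snapshot_ids[key]
    except KeyError:
        return None  # unknown key: ignore
    if existing_id != snapshot_id:
        return (object_snapshots[key], existing_id)
    return None

def lookup_membership(key, snapshot_id):
    if key in snapshot_ids:
        latest_id = snapshot_ids[key]
        if latest_id != snapshot_id:
            return (object_snapshots[key], latest_id)
    return None

# Known-and-current, known-but-stale, and unknown keys all agree.
for key, sid in [("a", 3), ("a", 0), ("missing", 1)]:
    assert lookup_eafp(key, sid) == lookup_membership(key, sid)
```

In CPython, raising and catching an exception is typically costlier than a membership check when misses are common, which is relevant here since callers may routinely ask about unknown keys.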


JoshKarpel and others added 3 commits June 21, 2024 14:19
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@shrekris-anyscale shrekris-anyscale added the go add ONLY when ready to merge, run all tests label Jul 1, 2024
Comment on lines 416 to 419

```python
# They should also be randomized to try to avoid situations where,
# if the controller restarts and a client has a now-invalid snapshot id
# that happens to match what the controller restarts with,
# the client wouldn't receive an update.
```
@shrekris-anyscale (Contributor):

Instead of explaining the edge case here, could you link my comment?

JoshKarpel (Contributor, Author):

Done!

```python
# if the controller restarts and a client has a now-invalid snapshot id
# that happens to match what the controller restarts with,
# the client wouldn't receive an update.
self.snapshot_ids[object_key] = random.randint(0, 1_000_000)
```
@shrekris-anyscale (Contributor):

Now that this isn't a defaultdict anymore, could you double-check whether any other place where we access self.snapshot_ids needs this KeyError check?

JoshKarpel (Contributor, Author):

I see four uses:

The first two are in

```python
try:
    self.snapshot_ids[object_key] += 1
except KeyError:
    # Initial snapshot id must be >= 0, so that the long poll client
    # can send a negative initial snapshot id to get a fast update.
    # They should also be randomized to try to avoid situations where,
    # if the controller restarts and a client has a now-invalid snapshot id
    # that happens to match what the controller restarts with,
```

which is the block that's protected by the KeyError check.

The next is

```python
try:
    existing_id = self.snapshot_ids[key]
except KeyError:
    # The caller may ask for keys that we don't know about (yet),
    # just ignore them.
    # This can happen when, for example,
    # a deployment handle is manually created for an app
    # that hasn't been deployed yet (by bypassing the safety checks).
    continue
```

which is also protected by a KeyError check.

The last is

```python
else:
    updated_object_key: str = async_task_to_watched_keys[done.pop()]
    updated_object = {
        updated_object_key: UpdatedObject(
            self.object_snapshots[updated_object_key],
            self.snapshot_ids[updated_object_key],
        )
    }
    self._count_send(updated_object)
    return updated_object
```

That block runs when a key changes, which is triggered by

```python
logger.debug(f"LongPollHost: Notify change for key {object_key}.")
```

which sits just below the first block, the one that ensures the snapshot id is present in the mapping, so I think we're good on that one too.
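The invariant behind that reasoning can be sketched with a toy long-poll host (illustrative only, not the Ray implementation; names like `listen_for_change` mirror the discussion above): `notify_changed` seeds a missing snapshot id before any change event fires, so the wakeup path can read `snapshot_ids` without a KeyError.

```python
import asyncio
import random

object_snapshots = {}
snapshot_ids = {}
change_events = {}

def notify_changed(key, snapshot):
    try:
        snapshot_ids[key] += 1
    except KeyError:
        # Seed a new key with a random id (crash-recovery edge case).
        snapshot_ids[key] = random.randint(0, 1_000_000)
    object_snapshots[key] = snapshot
    change_events.setdefault(key, asyncio.Event()).set()

async def listen_for_change(key, client_snapshot_id):
    # Fast path: the client's id is already outdated (or unknown, e.g. -1).
    if snapshot_ids.get(key, client_snapshot_id) != client_snapshot_id:
        return object_snapshots[key], snapshot_ids[key]
    # Slow path: block until notify_changed fires for this key. Reading
    # snapshot_ids here is safe because notify_changed seeded it first.
    event = change_events.setdefault(key, asyncio.Event())
    event.clear()
    await event.wait()
    return object_snapshots[key], snapshot_ids[key]

async def demo():
    notify_changed("route_table", {"app": "v1"})
    return await listen_for_change("route_table", -1)

snap, sid = asyncio.run(demo())
```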

@shrekris-anyscale (Contributor):

Great, thanks for being thorough here!

JoshKarpel (Contributor, Author):

My pleasure!

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@shrekris-anyscale (Contributor) left a comment:

Premerge is failing. Could you merge master and retry?

@JoshKarpel (Contributor, Author) commented Jul 1, 2024:

> Premerge is failing. Could you merge master and retry?

Hmmm, I'm seeing the same failure on my unrelated PR #45939 - looks like maybe CI is busted?

@shrekris-anyscale (Contributor) commented Jul 1, 2024:

I'm also seeing a similar failure on master (link). I can't merge this PR until it passes premerge. Someone is likely fixing the issue now; let me check in with the team and see if there's an ETA.

@shrekris-anyscale (Contributor):

We're working on a fix here: #46361.

@JoshKarpel (Contributor, Author):

@shrekris-anyscale looks like CI is passing again - mind taking another look at this PR? Thanks in advance!

@shrekris-anyscale shrekris-anyscale merged commit 6beeacb into ray-project:master Jul 8, 2024
6 checks passed
@JoshKarpel (Contributor, Author):

Thanks!

@JoshKarpel JoshKarpel deleted the optimize-outdated-snapshot-id-handling branch July 8, 2024 18:14
Labels
go add ONLY when ready to merge, run all tests