Fix #4730: AgentScheduler doesn't assign the leader on load #4774

curtisman · 2021-01-09T01:39:38Z

AgentScheduler only cleans up values in the consensus collection if there are multiple clients. The last client to leave does not clear the tasks assigned to it in the AgentScheduler; it relies on the next load to clear them.

However, on load we try to assign any unassigned tasks (including leader election), and that happens before the cleanup. As a result, the first client doesn't assign a leader, leaving the session without one. A leader is only assigned once the session disconnects and reconnects, or a second client joins.

The fix is to assign a task on load if no one is assigned to it, or if the client assigned to it is no longer in the quorum (it already left). In that case, we do not try to clear it first.
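
As a rough illustration of the rule described above, here is a minimal TypeScript sketch. It is not the AgentScheduler code from this PR; `tasksToAssignOnLoad`, `taskOwners`, and `quorumMembers` are hypothetical names made up for this example, and the data structures are simplified stand-ins.

```typescript
// Hypothetical, simplified model; not the actual AgentScheduler implementation.
type ClientId = string;

// Returns the ids of tasks that the loading client should try to take:
// tasks with no owner, and tasks whose recorded owner is no longer in the quorum.
function tasksToAssignOnLoad(
    taskOwners: Map<string, ClientId | null>,
    quorumMembers: Set<ClientId>,
): string[] {
    const claimable: string[] = [];
    taskOwners.forEach((owner, taskId) => {
        const unassigned = owner === null;
        // A stale owner is claimed directly instead of being cleared first.
        const ownerLeft = owner !== null && !quorumMembers.has(owner);
        if (unassigned || ownerLeft) {
            claimable.push(taskId);
        }
    });
    return claimable;
}
```

In this model, a "leader" task still recorded against a client that has left the quorum is returned as claimable, while a task held by a client that is still in the quorum is left alone.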

This should fix issue #4730.

Also:

- Added a test for the "leader"/"notleader" events on container.close, for issue "Validate Runtimes notleader not called on all transitions" #4297.
  - For that, LocalDocumentDeltaConnection.close needs to disconnect the client.
- Don't lint in the build task that runs before we launch the debugging session.

msfluid-bot · 2021-01-09T01:50:55Z

■ @fluidframework/base-host: No change

Metric Name	Baseline Size	Compare Size	Size Diff
main.js	167.2 KB	167.2 KB	■ No change
Total Size	167.2 KB	167.2 KB	■ No change

⯅ @fluid-example/bundle-size-tests: +82 Bytes

Metric Name	Baseline Size	Compare Size	Size Diff
container.js	191.48 KB	191.56 KB	⯅ +82 Bytes
map.js	45.84 KB	45.84 KB	■ No change
matrix.js	144.52 KB	144.52 KB	■ No change
odspDriver.js	193.22 KB	193.22 KB	■ No change
sharedString.js	158.46 KB	158.46 KB	■ No change
Total Size	733.52 KB	733.6 KB	⯅ +82 Bytes

Baseline commit: c432b2b

Generated by 🚫 dangerJS against 4c75288

vladsud · 2021-01-09T02:18:44Z

I'd describe the problem slightly differently: the "removeMember" quorum handler has a this.isActive() check to ensure that a client on a read-only connection does not attempt to submit an op. This results in no action, as we process leave ops before we get connected.
So in essence, this is a regression from the introduction of "read" mode.
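
The guard described in this comment can be pictured with a small TypeScript sketch. This is only a model under stated assumptions: `LeaveHandlerModel`, `taskOwners`, and `onRemoveMember` are hypothetical names, and the real handler submits ops rather than mutating a local map.

```typescript
// Hypothetical model of a leave handler guarded by an "is active" check.
type ClientId = string;

class LeaveHandlerModel {
    public readonly taskOwners = new Map<string, ClientId>();

    constructor(private readonly isActive: () => boolean) {}

    // Called when a client leaves the quorum. Returns the task ids that were cleared.
    public onRemoveMember(clientId: ClientId): string[] {
        if (!this.isActive()) {
            // Not connected for writing yet: do nothing, so stale assignments survive.
            return [];
        }
        const cleared: string[] = [];
        this.taskOwners.forEach((owner, taskId) => {
            if (owner === clientId) {
                cleared.push(taskId);
            }
        });
        for (const taskId of cleared) {
            this.taskOwners.delete(taskId);
        }
        return cleared;
    }
}
```

If the leave is processed while the model is inactive, the departed client's tasks stay recorded against it; the same leave processed while active clears them.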


curtisman · 2021-01-09T14:32:58Z

> I'd describe the problem slightly differently: the "removeMember" quorum handler has a this.isActive() check to ensure that a client on a read-only connection does not attempt to submit an op. This results in no action, as we process leave ops before we get connected.
> So in essence, this is a regression from the introduction of "read" mode.

Ah, right, because when we load, we would eventually process the Leave message for that client, and if we were in write mode we would have cleared it and triggered the reassignment process.

curtisman · 2021-01-09T14:53:28Z

Hold on. Is it "read only" mode or "view only" mode that you are talking about? "Read only" mode has no impact, because that client will never write. So the next client that comes along with write permission will still see the "Leave" message, process it to clear the leader, and trigger reassignment (whether it connected initially as view only or not). So the issue isn't about "read" mode?

This may be unique to detached create flow then, where the "fake" client used during detached never gets a "Leave" message. The fix is probably still good.

curtisman · 2021-01-09T14:57:30Z

@vladsud, a separate but related thought: does that mean that if you are the first client, you always immediately go from the default view mode to write mode because of leader election? I assume the server would downgrade the client back to view only mode after some inactivity too, so if you are the only client, you will keep reconnecting from view only to write mode?

curtisman · 2021-01-09T16:43:52Z

Ok. I am completely wrong here. I tried reverting the change in AgentScheduler, and my new unit test still works. So there isn't really a bug here, and so it doesn't really fix 4730 😢

I got confused when I was writing my test case, because with detached create, the resulting container is connected in view only mode, whereas on load, the current default is to connect in "write" mode (see Container.load() in container.ts in the container-loader package).

The fact is that we clear and reassign either when we see the Leave message or when we turn active (on attach or connect), and a view only -> write mode switch gives a fresh "connect" transition, so that is covered. Logically, I can't think of a bug here.
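
To make the "clear, then reassign" sequence concrete, here is a hedged TypeScript sketch of what a client might submit when it turns active. The `SchedulerOp` type and `opsOnBecomingActive` function are hypothetical and only model the ordering discussed in this thread, not the real op format.

```typescript
// Hypothetical model of the ops issued when a client becomes active.
type ClientId = string;
type SchedulerOp = { kind: "clear" | "claim"; taskId: string };

function opsOnBecomingActive(
    taskOwners: Map<string, ClientId | null>,
    quorumMembers: Set<ClientId>,
): SchedulerOp[] {
    const ops: SchedulerOp[] = [];
    taskOwners.forEach((owner, taskId) => {
        const heldByLiveClient = owner !== null && quorumMembers.has(owner);
        if (!heldByLiveClient) {
            if (owner !== null) {
                // Stale owner: clear the old assignment first (one round trip)...
                ops.push({ kind: "clear", taskId });
            }
            // ...then claim the task (a second round trip for stale tasks).
            ops.push({ kind: "claim", taskId });
        }
    });
    return ops;
}
```

In this model a stale task costs two ops (clear, then claim), while an unassigned task costs one.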

curtisman · 2021-01-11T22:15:03Z

Closing this PR for now. It is not strictly necessary, but avoiding the clear-then-assign sequence would save an additional round trip, so it may be done later.

AgentScheduler fix microsoft#4730

4c75288

curtisman requested a review from vladsud January 9, 2021 01:39

github-actions bot requested review from jatgarg, tanviraumi, arinwt, markfields and agarwal-navin January 9, 2021 01:39

curtisman linked an issue Jan 9, 2021 that may be closed by this pull request

Not summarizing on first connection #4730

Closed

vladsud approved these changes Jan 9, 2021


tanviraumi approved these changes Jan 9, 2021


curtisman removed a link to an issue Jan 9, 2021

Not summarizing on first connection #4730

Closed

curtisman closed this Jan 11, 2021

curtisman deleted the agentfix branch June 2, 2022 14:09
