Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #4730: AgentScheduler doesn't assign the leader on load #4774

Closed
wants to merge 1 commit into from

Conversation

curtisman
Copy link
Member

AgentScheduler only clean up values in the consensus collection if there are multiple client. The last client that left would not clear up the task assigned in the AgentScheduler and it rely on the next load of to clear it up.

However on load, we try to assign tasks (including leader election) for non assigned task, and that is done before clear up. It ends up the first client won't assign a leader, and leaving session without a leader. If the session disconnect and reconnect, or there is a second client join, then leader will be assigned then.

The fix is to assign the tasks on load if there is no one assigned to it or the one assigned to it is not in the quorum (already left). In which case, we will not try to clear it.

This should fix issue #4730.

Also

@curtisman curtisman requested a review from vladsud January 9, 2021 01:39
@curtisman curtisman linked an issue Jan 9, 2021 that may be closed by this pull request
@msfluid-bot
Copy link
Collaborator

@fluidframework/base-host: No change
Metric NameBaseline SizeCompare SizeSize Diff
main.js 167.2 KB 167.2 KB No change
Total Size 167.2 KB 167.2 KB No change
@fluid-example/bundle-size-tests: +82 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
container.js 191.48 KB 191.56 KB +82 Bytes
map.js 45.84 KB 45.84 KB No change
matrix.js 144.52 KB 144.52 KB No change
odspDriver.js 193.22 KB 193.22 KB No change
sharedString.js 158.46 KB 158.46 KB No change
Total Size 733.52 KB 733.6 KB +82 Bytes

Baseline commit: c432b2b

Generated by 🚫 dangerJS against 4c75288

@vladsud
Copy link
Contributor

vladsud commented Jan 9, 2021

I'd describe problem slightly differently - "removeMember" quorum handler has this.isActive() check to ensure that client in read-only connection does not attempt to submit an op. This results in no action as we process leave ops before we get connected.
So in essence, this is regression from introduction of "read" mode.

Copy link
Contributor

@tanviraumi tanviraumi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@curtisman
Copy link
Member Author

I'd describe problem slightly differently - "removeMember" quorum handler has this.isActive() check to ensure that client in read-only connection does not attempt to submit an op. This results in no action as we process leave ops before we get connected.
So in essence, this is regression from introduction of "read" mode.

Ah, right, because when we load, we would have process the the Leave message for that client eventually, and if it is in write mode we would have clear it, and trigger the reassign process.

@curtisman
Copy link
Member Author

Hold on. Is it "read only" mode or "view only" mode that you are talkin about? "read only" mode has no impact, because that client will never write. So the next client that comes along with write permission will still see the "Leave" message, process it to clear the leader and trigger reassignment (whether is connected initially as view only or not). So the issue isn't about "read" mode?

This may be unique to detached create flow then, where the "fake" client used during detached never gets a "Leave" message. The fix is probably still good.

@curtisman
Copy link
Member Author

@vladsud, separate related thought: does that mean that if you are the first client you always immediately to from the default view mode to write mode because of leader election? I assume the server would downgrade client back to view only mode after some inactivity too, so if you are the only client, then you will keep on reconnect as view only to write mode?

@curtisman
Copy link
Member Author

curtisman commented Jan 9, 2021

Ok. I am completely wrong here. I tried reverting the change in AgentScheduler, and my new unit test still works. So there isn't really a bug here, and so it doesn't really fix 4730 😢

I got confused when I was writing my test case, because when detach create, the resulting container is connected in view only mode. Where as on load, the current default is connect in "write" mode (see Container.load() in container.ts in container-loader package).

The fact is that you will clear and reassigned when we either we see the Leave message, or when you turn active (on attach or connect) and view only -> write mode will give you a fresh "connect" transition, that is covered. Logically, I couldn't think of a bug here.

@curtisman
Copy link
Member Author

Closing this PR for now, it is not strictly necessary, but avoiding the clear then assign will avoid addition round trip and may be done later

@curtisman curtisman closed this Jan 11, 2021
@curtisman curtisman deleted the agentfix branch June 2, 2022 14:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants