Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset client reset cycle detection after upload/download complete #6196

Merged
merged 11 commits into from
Jan 18, 2023

Conversation

jbreams
Copy link
Contributor

@jbreams jbreams commented Jan 13, 2023

What, How & Why?

When we start a client reset with recovery we insert an object into the realm to track whether we've already started a client reset with recovery so that if the client reset fails (say because we recover local changes that cannot be applied to the server) we don't end up in an endless loop. Previously we cleared this tracking tombstone when we received a download message that advanced the upload progress - the idea being that if the server acknowledged our client reset with a download ack then the recovered changes must have been valid.

It turns out that if client reset with recovery doesn't actually recover anything, then there's nothing for the server to acknowledge and we never clear that tracking tombstone, even though everything succeeded. If there is another client reset, then we'll enter into a cycle where the sync client thinks there's a pending client reset.

This change moves where we remove the tombstone from just whenever we receive a download with an advancing upload cursor to after we've received an upload/download ACK (i.e. a download message with an advancing upload cursor and/or a MARK message indicating the server has nothing else to send us).

I also fixed the connection-level error handling so it respects the retry-timeout info sent from the server rather than a hardcoded 5 minute timeout, and unified the try again backoff logic into a helper class.

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
    * [ ] C-API, if public C++ API changed.

@jbreams jbreams marked this pull request as ready for review January 13, 2023 19:52
@jbreams jbreams marked this pull request as draft January 13, 2023 21:57
@jbreams jbreams marked this pull request as ready for review January 17, 2023 21:55
@jbreams
Copy link
Contributor Author

jbreams commented Jan 17, 2023

@ironage / @michael-wb , I think this is ready for a look now.

@@ -2181,6 +2221,10 @@ std::error_code Session::receive_ident_message(SaltedFileIdent client_file_ident
REALM_ASSERT_EX(m_last_version_selected_for_upload == 0, m_last_version_selected_for_upload);

get_transact_reporter()->report_sync_transact(client_reset_old_version, client_reset_new_version);

if (has_pending_client_reset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is correct to clear the pending reset state here because the reset has just finished locally and it hasn't received any acknowledgement from the server yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't necessarily clear anything, it schedules the async waiting for the server to ack all uploads and tell the client when there are no more downloads, which will then clear the reset state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, the reason we check if there's a pending client reset and then only call this if there is one is so that we don't need to open a separate read transaction after calling get_status() if there are no client resets to do this dance with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it thanks! I hadn't understood that async_wait_for is initiating an upload/download completion handler.

auto ft = m_db->start_frozen();
return _impl::client_reset::has_pending_reset(ft);
}();
REALM_ASSERT(pending_reset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it safer to just return early if there is no pending reset?

This flow feels racy because we check for pending reset state in a read transaction and then clear it on a separate write transaction. Are we allowed to assume this here because of the single threaded nature of the sync client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda think it's equally safe to check pre-conditions vs returning early if pre-conditions aren't met?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ok. Can we be certain that no other writer can clear the pending reset between when it gets written and when this frozen read transaction starts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the places we call this function we can guarantee it because we always call it after checking pre-conditions.

Copy link
Contributor

@ironage ironage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!
Just needs a bugfix entry in the changelog, otherwise LGTM! 👍

m_sess->logger.info("Tracking pending client reset of type \"%1\" from %2", pending_reset->type,
pending_reset->time);
util::bind_ptr<SessionWrapper> self(this);
async_wait_for(true, true, [self = std::move(self), pending_reset = *pending_reset](std::error_code ec) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's a problem if more than one async_wait_for() calls are waiting at a time to clear the client reset? Such as if process_pending_flx_bootstrap() throws an exception and is run again or if a connection drops after receiving the ident message and then Session::receive_ident_message() is called again when the client reconnects.
Probably it won't be a problem since the write transaction will serialize the calls to remove_pending_client_resets() and this function is a noop if there is nothing to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at worst it will be a no-op. We used to always remove the pending client reset whenever we got a download that advanced the upload cursor. so this should be no worse than that? I've added some extra guards and logging so that if we end up with multiple callbacks from async_wait_for, only the one that matches the pending reset info that kicked off the waiting for actually triggers removing the pending resets.

Copy link
Contributor

@michael-wb michael-wb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walked through the flow and didn't see any issues. LGTM

auto cur_pending_reset = _impl::client_reset::has_pending_reset(wt);
if (!cur_pending_reset) {
logger.debug(
"Was going to remove client reset tracker for type \"%1\" from %2, but it was already removed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing values for %1 and %2 - assuming this is supposed to be pending_reset.type, pending_reset.time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@jbreams jbreams merged commit 2f08973 into master Jan 18, 2023
@jbreams jbreams deleted the jbr/client_reset_cycle branch January 18, 2023 23:32
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Client reset cycle detection can get locked in a cycle if recovery does not result in any local commits
3 participants