-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RCORE-2209 Treat completing a client reset as receiving a MARK message #7921
Conversation
Pull Request Test Coverage Report for Build thomas.goyne_494Details
💛 - Coveralls |
4cc2831
to
a4cbf2f
Compare
m_sending_session = sess; | ||
m_sending = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need to be set before calling async_write_binary() to support the completion handler being called synchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good catch, but i can't think of any issues it could have caused - was it causing some failures in your testing or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The newly added test socket provider crashes without this change because it synchronously calls the completion handler.
@@ -53,14 +53,14 @@ bool MigrationStore::load_data(bool read_only) | |||
|
|||
auto tr = m_db->start_read(); | |||
// Start with a reader so it doesn't try to write until we are ready | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(tr); | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(*tr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the changes to migration store, pending bootstrap store, sync metadata schema, pending bootstrap store, and subscriptions are just secondary effects of making PendingResetStore::has_pending_reset() take a Group rather than a TransactionRef.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside from passing around a ref vs const ref to a shared_ptr, is there a reason why this changed from a Transaction to a Group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from passing around a const ref instead of a const ref to a shared_ptr, is there a reason the parameter is a Group now and not a Transaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside from passing around a ref vs ptr, is there a reason why this changed from a Transaction to a Group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root change is enabling PendingResetStore::has_pending_reset(realm->read_group())
, which previously didn't work because the function expected a Transaction even though it didn't do anything which required a Transaction. All of these functions should have been taking a Group the whole time as they don't change the transaction state.
@@ -308,6 +308,9 @@ StatusWith<std::shared_ptr<Realm>> async_open_realm(const Realm::Config& config) | |||
std::shared_ptr<Realm> successfully_async_open_realm(const Realm::Config& config) | |||
{ | |||
auto status = async_open_realm(config); | |||
if (!status.is_ok()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unexpected errors here previously didn't log the error so it was annoying to debug.
@@ -1196,6 +1269,7 @@ TEST_CASE("flx: client reset", "[sync][flx][client reset][baas]") { | |||
auto subs = realm->get_latest_subscription_set(); | |||
auto result = subs.get_state_change_notification(sync::SubscriptionSet::State::Complete).get(); | |||
CHECK(result == sync::SubscriptionSet::State::Complete); | |||
SyncSession::OnlyForTesting::pause_async(*realm->sync_session()).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the server was sufficiently fast it could theoretically send the client reset we trigger below to the current sync session, which isn't what these tests want. Probably never actually happened in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI - in case you weren't aware, the client reset triggered by the command below just invalidates the file ident and the reset is not initiated until the session reconnects.
else | ||
tr->commit_and_continue_writing(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find anywhere that this was being used where this commit would do anything useful and it made the client reset tests which verified that exactly two commits were made more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also assume this is fine; if the caller had opened a write transaction, they will also commit it at some point.
{ | ||
// Write transaction required | ||
REALM_ASSERT(wr_tr->get_transact_stage() == DB::TransactStage::transact_Writing); | ||
auto reset_store = PendingResetStore::load_or_create_schema(wr_tr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loading the schema here was kinda slow and it wasn't actually being used for anything.
9204cfc
to
65330eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great improvements! LGTM.
CHANGELOG.md
Outdated
@@ -6,6 +6,8 @@ | |||
### Fixed | |||
* <How do the end-user experience this issue? what was the impact?> ([#????](https://github.com/realm/realm-core/issues/????), since v?.?.?) | |||
* Sync client may report duplicate compensating write errors ([#7708](https://github.com/realm/realm-core/issues/7708), since v14.8.0). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fix was released yesterday so the following additions will have to be updated to go to the new section
else | ||
tr->commit_and_continue_writing(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also assume this is fine; if the caller had opened a write transaction, they will also commit it at some point.
{ | ||
std::vector<SyncMetadataTable> unified_schema_version_table_def{ | ||
{&m_table, | ||
c_sync_internal_schemas_table, | ||
{&m_schema_group_field, c_meta_schema_schema_group_field, type_String}, | ||
{{&m_version_field, c_meta_schema_version_field, type_Int}}}}; | ||
|
||
// Any type of transaction is allowed, including frozen and write, as long as it supports reading | ||
REALM_ASSERT_EX(tr->get_transact_stage() != DB::transact_Ready, tr->get_transact_stage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a group everywhere makes the intent much more clear. 💯
return this; | ||
} | ||
|
||
TestClientReset* TestClientReset::expect_reset_error(std::optional<SyncError>& err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice simplification 👍
329c1c6
to
7e20b9f
Compare
m_sending_session = sess; | ||
m_sending = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good catch, but i can't think of any issues it could have caused - was it causing some failures in your testing or something?
@@ -53,14 +53,14 @@ bool MigrationStore::load_data(bool read_only) | |||
|
|||
auto tr = m_db->start_read(); | |||
// Start with a reader so it doesn't try to write until we are ready | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(tr); | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(*tr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside from passing around a ref vs const ref to a shared_ptr, is there a reason why this changed from a Transaction to a Group?
@@ -53,14 +53,14 @@ bool MigrationStore::load_data(bool read_only) | |||
|
|||
auto tr = m_db->start_read(); | |||
// Start with a reader so it doesn't try to write until we are ready | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(tr); | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(*tr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from passing around a const ref instead of a const ref to a shared_ptr, is there a reason the parameter is a Group now and not a Transaction?
@@ -53,14 +53,14 @@ bool MigrationStore::load_data(bool read_only) | |||
|
|||
auto tr = m_db->start_read(); | |||
// Start with a reader so it doesn't try to write until we are ready | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(tr); | |||
SyncMetadataSchemaVersionsReader schema_versions_reader(*tr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside from passing around a ref vs ptr, is there a reason why this changed from a Transaction to a Group?
test/object-store/sync/flx_sync.cpp
Outdated
// A socket provider which claims to always work, but when `disconnect = true` | ||
// will actually drop all incoming and outgoing messages. This enables testing | ||
// going offline at very specfic points. | ||
struct DisconnectingSocketProvider : sync::websocket::DefaultSocketProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more of a DiscardAllTrafficSocketProvider or something rather than a DisconnectingSocketProvider since it never actually disconnects you. Is there anything in this PR that would behave differently if there was an actual disconnect where we reset all the protocol state rather than just discarding messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love the name, but it's "disconnecting" in the sense of disconnecting a network cable between you and the server.
Resetting the protocol state would require tying the tests to implementation details of sync connections, while this approach lets us test it via the public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more like disconnecting in the sense that an intermediate hop is dropping packets, but yeah. by "reset all the protocol state" i just meant calling websocket_closed_handler() to signal to the sync client that the connection has been closed. i think without some reworking of how the sync client handles closed connections this could be kinda tough though. I think the answer to my question is that there aren't any changes the depend on actually disconnecting the session since client_reset_if_needed()
doesn't depend on having any of the Session's previous state be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these tests I do just want to be absolutely sure that no synchronization has happened after the download of the fresh realm has completed (until it's time to allow sync to happen again) and the exact details are unimportant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworked this type to call websocket_closed_handler() (or defer the call to DefaultSocketProvider::connect()) as it turned out that dropping packets really didn't work outside of simple cases.
@@ -947,6 +948,19 @@ void ClientHistory::update_sync_progress(const SyncProgress& progress, Downloada | |||
root.set(s_progress_uploaded_bytes_iip, | |||
RefOrTagged::make_tagged(uploaded_bytes)); // Throws | |||
|
|||
if (previous_upload_client_version < progress.upload.client_version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this assume that if we make any upload progress that we'll have fully uploaded all changes and we know for sure we aren't going to get another client reset from any recovered changesets? maybe now that we have compensating writes that doesn't really matter as much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uploading changesets should either result in the server acknowledging the upload or sending a client reset and not both, so once our UPLOAD is acked the window for getting a client reset due to those changesets being invalid has ended.
I think this check is probably wrong and it needs to actually be checking if we've reached the client version at the time of the client reset (or at time of opening).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took a while to figure out how to test it but this is indeed incorrect; it marks the client reset as complete as soon as any changesets are acked rather than when all of the recovered changesets are.
bf89359
to
c78df18
Compare
c78df18
to
edc1174
Compare
I've updated this to track which client version was the last one recovered by a client reset and mark the client reset as complete once that version has been uploaded (and acked). This is sort of a fake bug fix relative to the behavior in practice with PBS and QBSv1. There's some extreme edge cases that now work better (e.g. recovered changesets are successfully uploaded, then device goes offline before receiving the MARK and stays offline until the client file ident expires on the server), but the primary benefit is that the behavior of an async open that triggers a client reset no longer depends on how the server handles a MARK sent while uploading changesets. With PBS/QBSv1 the MARK doesn't wait for the server to have processed those UPLOADs, and with QBSv2 it (sometimes?) does. By not relying on MARK for marking client resets complete, we preserve the existing visible behavior for client resets. |
9e43725
to
9252ccc
Compare
0857d0e
to
80604f8
Compare
Client resets which did not recovery any changes (either because changes were discarded, there was nothing to recover, or the recovered changesets became empty after merging) don't need to wait for a server round-trip to mark the reset as complete, as that round-trip merely consisted of sending a MARK to the server and waiting for a response. This partially reverts #6196 and fixes that bug by immediately removing the client reset tracker as part of the diff commit if there was nothing recovered.
Performing a client reset involves waiting for download completion and bringing the Realm file into the state it would have been in if it had completed downloading, so it should fire download completion handlers. Previously we did everything we would do on download completion except for this. Since we performed a wait for download completion after applying a client reset diff the handlers would eventually get called, but the exact timing depended on server behavior which is changing in QBSv2 (and the wait for download completion is removed by the above change).