Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Network test sync logic #2748

Merged
merged 19 commits into from
Jul 18, 2024

Conversation

nasdf
Copy link
Member

@nasdf nasdf commented Jun 21, 2024

Relevant issue(s)

Resolves #2747

Description

This PR improves the network sync test logic to allow for further improvements to the network implementation.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

make test

Specify the platform(s) on which this was tested:

  • MacOS

@nasdf nasdf added the area/testing Related to any test or testing suite label Jun 21, 2024
@nasdf nasdf added this to the DefraDB v0.12 milestone Jun 21, 2024
@nasdf nasdf self-assigned this Jun 21, 2024
Copy link

codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 95.55556% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.38%. Comparing base (a7fe539) to head (b4fbe86).

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2748      +/-   ##
===========================================
+ Coverage    79.36%   79.38%   +0.02%     
===========================================
  Files          323      323              
  Lines        24661    24691      +30     
===========================================
+ Hits         19572    19600      +28     
+ Misses        3692     3686       -6     
- Partials      1397     1405       +8     
Flag Coverage Δ
all-tests 79.38% <95.56%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
net/peer.go 82.49% <100.00%> (-2.44%) ⬇️
net/server.go 74.76% <95.00%> (+1.25%) ⬆️

... and 18 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7fe539...b4fbe86. Read the comment docs.

Comment on lines 327 to 330
for i := 0; i < len(s.nodes); i++ {
if nodeID.HasValue() && nodeID.Value() != i {
continue // node is not selected
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The documentation says waits for all selected nodes but the function only receives one. You can probably do away with the loop and just get the node directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nodeID has no value it will wait for all nodes.

Comment on lines 343 to 354
// update the expected document heads of connected nodes
for id := range s.nodeConnections[i] {
if _, ok := s.actualDocHeads[id][evt.DocID]; ok {
s.expectedDocHeads[id][evt.DocID] = evt.Cid
}
}
// update the expected document heads of replicator sources
for id := range s.nodeReplicatorTargets[i] {
if _, ok := s.actualDocHeads[id][evt.DocID]; ok {
s.expectedDocHeads[id][evt.DocID] = evt.Cid
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thougth: It's odd that we would be updating the expected heads only if already present in the actual heads. Shouldn't it be updated regardless?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the code is correct based on what we discussed regarding replicator target nodes sending updates of a document back to the replicator source node.

@nasdf nasdf requested a review from a team July 3, 2024 18:16
// eventSubs is a list of all event subscriptions
eventSubs []*event.Subscription
// nodeUpdateSubs is a list of all update event subscriptions
nodeUpdateSubs []*event.Subscription
Copy link
Contributor

@AndrewSisley AndrewSisley Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Can we wrap these up in a dedicated type? They are consuming a disproportionately large amount of attention on the state struct compared to their visibility needs.

something like:

type p2pState struct {
  nodeConnections []map[int]struct{}
  ...
}
type state struct {
  ...
  p2pState p2pState
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1205,6 +1205,10 @@ func createDoc(
s.documents = append(s.documents, make([][]*client.Document, action.CollectionID-len(s.documents)+1)...)
}
s.documents[action.CollectionID] = append(s.documents[action.CollectionID], doc)

if action.ExpectedError == "" {
waitForUpdateEvents(s, action.NodeID, action.CollectionID)
Copy link
Contributor

@AndrewSisley AndrewSisley Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: This is a significant change in the tests - and one that looks like it removes our ability to test concurrent P2P stuff (something that we very much want-to/need-to/did test)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function name is a bit misleading, but Update events are published immediately after a document is saved and shouldn't really change how we test things concurrently.

We could also merge waitForUpdateEvents into waitForSync since the events are buffered if you would prefer that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you mean - it is not waiting for the sync, just the event publishing to the current node's event queue?

Why are you calling it then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you mean - it is not waiting for the sync, just the event publishing to the current node's event queue?

Yes

Why are you calling it then?

Its a convenient way to track which doc heads are expected to be merged on remote nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a convenient way to track which doc heads are expected to be merged on remote nodes.

Can you please document that really well? Our integration tests are already super magic and far more complex than most, and that complexity is not really tested. The P2P space is particularly complicated and prone to annoying, late-to-discover, flaky bugs and it is well worth over-explaining things to save future hassle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more documentation about how the document head are tracked. It should also be much easier to follow since its all moved to an events.go file. Let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look really good, thanks Keenan. Has got me thinking about test Cid substitution too, in the future we might want to tweak state.nodeP2P.DocHeads to allow this.

@@ -1355,6 +1359,10 @@ func deleteDoc(
}

assertExpectedErrorRaised(s.t, s.testCase.Description, action.ExpectedError, expectedErrorRaised)

if action.ExpectedError == "" {
waitForUpdateEvents(s, action.NodeID, action.CollectionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why is this only done using create-save, and not create-gql and the other create? And why not on delete?

thought: This looks quite easy to mess up and forget, and it missing it might not create an obvious and immediate failure (just flaky tests).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 similar thought, for here and other instances of this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding this correctly the createDoc function calls the create-save and create-gql functions so they both should be covered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😁 My bad, the github presentation of the diff helped me miss-read where this line was... Ignore this thread :)

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not finished reviewing, but want to block the merge until my concerns around the apparent loss of concurrency have been resolved. The code looks simpler, but I think we may have lost the ability to test some important things.

@AndrewSisley
Copy link
Contributor

AndrewSisley commented Jul 3, 2024

todo: Please make sure you run the P2P tests a lot (IIRC Fred/I had to run them 100 or so times to reproduce the last solved issues) before merging this, we had flakiness problems with this before and they cost a lot of time.

Maybe involve a few colleagues with different machines, as in the past more constrained machines had different flaky bugs compared to faster ones.

Comment on lines 65 to 70
// nodeReplicatorSources contains all active replicators.
//
// The index of the slice is the source node id. The map key is the target node id.
nodeReplicatorSources []map[int]struct{}

// nodeReplicatorTargets contains all active replicators.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: documentation suggests that nodeReplicatorSources and nodeReplicatorTargets are the same, but they are not. Please make documentation reflect it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -781,6 +766,10 @@ func refreshCollections(
) {
s.collections = make([][]client.Collection, len(s.nodes))

for i := len(s.nodePeerCollections); i < len(s.collectionNames); i++ {
s.nodePeerCollections = append(s.nodePeerCollections, make(map[int]struct{}))
Copy link
Contributor

@AndrewSisley AndrewSisley Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: This looks like it will break when collections are removed/deactivated in a node (using patchCollection)? Users currently can do this, although we likely have no P2P tests for it atm.

There may be other test things that currently break when doing that, but we should avoid adding new ones.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be easy to handle. Just need to find which collections to deactivate from the p2p state.

  • handle patchCollection

switch action := tc.(type) {
case ConnectPeers:
// Give the nodes a chance to connect to each other and learn about each other's subscribed topics.
time.Sleep(100 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: I forgot we were currently relying on a sleep - thank you very much for removing this!

@@ -479,7 +337,61 @@ func getAllP2PCollections(
assert.Equal(s.t, expectedCollections, cols)
}

// waitForSync waits for all given wait channels to receive an item signaling completion.
// waitForUpdateEvents waits for all selected nodes to publish an update event.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Building on the other related documentation request, it might be worth highlighting that you mean publish to the local event queue, not to the pub/sub net or some other P2P related network thing.

@nasdf nasdf requested a review from AndrewSisley July 16, 2024 20:59
reps, err := s.nodes[nodeID].GetAllReplicators(s.ctx)
require.NoError(s.t, err)

p2pTopicEvents := 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: It looks like p2pTopicEvents is only ever 0 or 1, why is it not a boolean? Is this a bug/wip?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to bool

@@ -1205,6 +1205,10 @@ func createDoc(
s.documents = append(s.documents, make([][]*client.Document, action.CollectionID-len(s.documents)+1)...)
}
s.documents[action.CollectionID] = append(s.documents[action.CollectionID], doc)

if action.ExpectedError == "" {
waitForUpdateEvents(s, action.NodeID, action.CollectionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look really good, thanks Keenan. Has got me thinking about test Cid substitution too, in the future we might want to tweak state.nodeP2P.DocHeads to allow this.

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks a really nice improvement, thanks Keenan :)

@@ -1355,6 +1359,10 @@ func deleteDoc(
}

assertExpectedErrorRaised(s.t, s.testCase.Description, action.ExpectedError, expectedErrorRaised)

if action.ExpectedError == "" {
waitForUpdateEvents(s, action.NodeID, action.CollectionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😁 My bad, the github presentation of the diff helped me miss-read where this line was... Ignore this thread :)

@@ -24,6 +25,86 @@ import (
"github.com/sourcenetwork/defradb/tests/clients"
)

// p2pState contains all p2p related testing test.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: related testing state.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@islamaliev islamaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Core looks now much more intuitive. I just a have a question to clarify.

@@ -342,7 +341,7 @@ func performAction(
assertClientIntrospectionResults(s, action)

case WaitForSync:
waitForSync(s, action)
waitForMergeEvents(s, action)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: it's great that now waiting is done on the same thread. Earlier it used to call a goroutine for every ConnectPeers action and had subtle shortcomings.

@@ -1457,6 +1424,10 @@ func updateDoc(
}

assertExpectedErrorRaised(s.t, s.testCase.Description, action.ExpectedError, expectedErrorRaised)

if action.ExpectedError == "" {
waitForUpdateEvents(s, action.NodeID, action.CollectionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

questions: this looks like unsolicited sync. Why is it unconditional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should always be an update event published to the local event bus if a document is updated or created.

DontSync bool
// Skip waiting for an update event. This should only be used for
// tests that do not correctly publish an update event.
SkipUpdateEvent bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: why is it synced by default now? That resembles sequentially consistent memory order instead of relaxed one that we used before.
I think, if we want to sync, we should request it explicitly. Unsynced actions are closer to realistic behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: I think this is a misleading name and comment, IIRC waitForUpdateEvents does not wait for the P2P sync, but only the update event to be published to the event bus. If that is correct, @nasdf can you please rework the property name and documentation to make that clearer.

Copy link
Contributor

@islamaliev islamaliev Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this then if not waiting: <-s.nodeEvents[i].update.Message():?

Copy link
Contributor

@AndrewSisley AndrewSisley Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is just the event bus, it is written to synchronously by the database when mutating documents - the P2P code also subscribes to this bus - so 'waiting' to the bus should never really block, and the P2P stuff hasn't even been sent over the network at that point.

If I understand it correctly. I made the same comment as you did earlier for waitForUpdateEvents - I think Keenan improved the documentation and code location since I commented, but given that it is still tripping up reviewers, that too should probably be improved further.

Copy link
Member

@shahzadlone shahzadlone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Stuff 👍

@nasdf nasdf merged commit 517333c into sourcenetwork:develop Jul 18, 2024
40 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/testing Related to any test or testing suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve network test sync logic
5 participants