
WIP: Implement persistence service #75

Open · wants to merge 11 commits into master

Conversation

@dirkmc (Collaborator) commented Oct 12, 2018

An implementation of a persistence service as discussed in #57

Stores a linked list in IPFS and a pointer to the head of the list in IPNS

snapshot <- delta <- delta <- delta
                                ^
                               HEAD

New deltas are appended to the store as they arrive, and snapshots are taken after a certain interval or after a certain number of deltas.
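
For illustration, here is a minimal sketch of how such records could be appended via the js-ipfs dag API (the record shape and function names are assumptions, not necessarily this PR's format):

async function appendDelta (ipfs, headCid, delta) {
  // each record links back to its parent, forming the list above
  const cid = await ipfs.dag.put({ type: 'delta', delta, parent: headCid })
  return cid // this CID becomes the new HEAD, published via IPNS
}

async function takeSnapshot (ipfs, state) {
  // a snapshot has no parent: readers can stop walking the list here
  return ipfs.dag.put({ type: 'snapshot', state, parent: null })
}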

TODO:

  • implement shared persistence

    • store snapshots/deltas to IPFS
    • share HEAD state with IPNS
      IPNS does not yet use the DHT, so the above only works if the IPFS Repo is shared.
      Therefore we may need to:
    • implement an alternative naming system
  • more tests

    • persistence heuristic
    • various init / failover scenarios
  • implement master election
    IPNS is single-writer, so we need to implement a protocol to

    • elect a master replica
    • failover to a new replica if the master goes down

@dirkmc (Collaborator, Author) commented Oct 15, 2018

@pgte and I had a conversation about strategies for implementing naming across multiple peers that perform writes (multi-writer) using IPNS, which is single-writer.

1. All peers write to IPNS

In this scheme there are two kinds of IPNS keys:

  1. A collaboration IPNS key
    • shared between all members of the collaboration
    • IPNS key can be derived from the collaboration's write key
  2. A peer IPNS key
    • each peer has its own IPNS key for keeping track of the HEAD

When a peer's state changes it

  • writes the delta to IPFS
  • updates its peer IPNS key (the HEAD) to point to the latest delta
    snapshot <- delta <- delta <= <peer key>

When collaboration membership changes, the system

  • updates an IPFS record to include the peer IPNS key of all members of the collaboration
  • updates the collaboration IPNS key (the HEAD) to point to the record
    <collaboration key> => [<peer key>, <peer key>, <peer key>]

When a new peer comes online and is not able to connect to any other peer, it

  • Uses the collaboration IPNS key to retrieve the peer IPNS keys of all members of the collaboration
  • Uses the peer IPNS keys to retrieve the latest state of each member of the collaboration
  • Joins the states together
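
To make the cold-start path concrete, here is a hedged sketch of that recovery flow (the ipns/ipfs call shapes, record fields and the join function are assumptions):

async function recoverState (ipns, ipfs, collaborationKey, join) {
  // collaboration IPNS key -> record listing each member's peer IPNS key
  const recordCid = await ipns.resolve(collaborationKey)
  const { peerKeys } = (await ipfs.dag.get(recordCid)).value

  // each peer IPNS key -> that peer's HEAD -> walk the snapshot/delta list
  const peerStates = await Promise.all(peerKeys.map(async (peerKey) => {
    const records = []
    let cid = await ipns.resolve(peerKey)
    while (cid) {
      const record = (await ipfs.dag.get(cid)).value
      records.unshift(record) // oldest first
      cid = record.type === 'snapshot' ? null : record.parent
    }
    return records
  }))

  // join the states together; CRDT merge is commutative, so the order
  // in which the peers' states are joined does not matter
  return peerStates.reduce((state, records) => join(state, records), null)
}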

Pros:

  • Simple: no leadership election

Cons:

  • Every peer writes snapshots and deltas to IPFS, which means a lot of data transfer
  • Multiple peers may concurrently update the collaboration membership and overwrite each other
  • IPNS publish is slow, so it's not very scalable

2. A single leader peer writes to IPNS

In this scheme a single leader peer writes to IPNS with the shared collaboration IPNS key.

Each time membership changes

  • peers check if the leader was evicted
  • if so, elect a new leader

When a new peer comes online and is not able to connect to any other peer, it

  • uses the collaboration IPNS key to retrieve the latest state

Pros:

  • Single writer: sidesteps issues with concurrent writes
  • Less data transfer: only one node is writing to IPFS / IPNS
  • Scalable: state updates spread through gossip anyway, only one node needs to persist them

Cons:

  • Leader election adds some complexity

Behaviour under network partition:

In both schemes the partition behaviour is essentially the same. The difference is that in multi-writer there are multiple peers in each partition writing to IPNS, whereas in single-writer there is one leader peer in each partition writing to IPNS. But conceptually partition behaviour can be described in terms of a single IPNS writer per partition. Here we describe the single-writer case under two scenarios:

  1. Both collaboration gossip and IPNS are partitioned

    • Peers on each side of the partition will elect their own leader
    • Each leader will update its own copy of the collaboration IPNS record
    • If all peers in a partition go offline and a new peer comes online it will be able to retrieve state for all peers that were in its partition
    • When the network comes back together
      • the copy of the collaboration IPNS record with highest sequence number will win
      • updates in the other side of the partition will not be visible until one of those peers comes online
      • when one peer from each side of the partition comes online, state converges (through gossip)
  2. Collaboration gossip is partitioned but IPNS is not

    • Leaders on both sides of the partition update the same collaboration IPNS record, overwriting each other
    • If all peers in partition A go offline
      • partition B's IPNS writes win
      • when a new peer comes online in partition A it only sees the members of partition B
      • it is able to retrieve the state of partition B but unable to connect to the peers through gossip
    • When the network comes back together the IPNS record will be updated with the correct membership state as peers come online, and global state will converge

@pgte (Contributor) commented Oct 16, 2018

adding @aschmahmann to the conversation

@satazor (Contributor) commented Oct 16, 2018

As we discussed in the bi-weekly meeting, it would be awesome if the persistence strategy that peer-star-app uses could be provided from outside. This would require us to agree on a common interface; different strategies would implement that interface and could be plugged into peer-star-app, perhaps via an option.

The strategies themselves could even live in different packages, like peer-star-persistency-ipns and so on.

The persistence problem is a very complex one, and I think there might be different strategies for different kinds of applications, or even collaborations. Having this interface would allow us to easily test different approaches.

Thoughts?

@dirkmc (Collaborator, Author) commented Oct 16, 2018

@satazor I like this idea 👍

I think it would be good to also refactor the persistence service into a separate repo

@dirkmc (Collaborator, Author) commented Oct 17, 2018

Considerations for a leadership election protocol:

  • We already have
    • a mechanism for gossiping membership changes to all peers in the collaboration
    • a membership summary hash: a hash function over all the peers ids in the collaboration (currently when a gossip message is received, the membership summary hash is used to compare local and remote membership)
  • It would be nice not to require any more communication than this to elect a leader
  • The leader should be known only to the members of the collaboration (for censorship resistance)

I propose that when membership changes, the new leader is chosen by each node like so:

  • find the membership summary hash
  • encrypt the hash with the collaboration key
  • use the encrypted hash as an index into the sorted list of peers
    eg
members: ['alice peer id', 'bob peer id', 'carol peer id']
leader: members[encrypt(hash(members)) % members.length]
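
A minimal runnable sketch of that selection rule, substituting an HMAC for "encrypt the hash with the collaboration key" (all names are illustrative, not the real peer-star-app API):

const crypto = require('crypto')

function electLeader (memberPeerIds, collaborationKey) {
  const members = [...memberPeerIds].sort() // same order on every peer

  // membership summary hash over all the peer ids
  const hash = crypto.createHash('sha256').update(members.join(',')).digest()

  // mix in the collaboration key so outsiders can't predict the leader
  const keyed = crypto.createHmac('sha256', collaborationKey).update(hash).digest()

  // use the keyed hash as an index into the sorted member list
  return members[keyed.readUInt32BE(0) % members.length]
}

Every peer computes the same result locally, so no election messages are needed.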

Note: For optimal censorship resistance, only the leader would know it is leader, and other nodes would know only that they are not the leader. This is a problem they are also trying to solve in Filecoin. If you can figure it out you get $200k :)

@pgte (Contributor) commented Oct 18, 2018

@dirkmc It would be nice not to require interaction to elect a leader, indeed, but if you don't require quorum over membership changes, you can get split brains.
For instance, right now the eviction process is not an agreement between peers; it's simply a function of a peer finding out that a certain peer has been unreachable for a certain number of connection attempts.
But perhaps split brains would not be a big issue here, I'm not sure.
I think the risk is that if there's a collaboration partition but not a DHT partition, persistence is threatened while the partition lasts and for some time after it heals.

@dirkmc (Collaborator, Author) commented Oct 18, 2018

@pgte when Alice finds that Bob is unreachable, Alice will evict Bob from her membership list and gossip that out to other peers in urgent mode, right? I was imagining that the new leader would not start persisting for some minimum time based on the gossip frequency heuristic.

In the scenario where there's a collaboration partition but not a DHT partition, the peers on each side of the partition will form two groups, where the members of each group will elect a leader. The two leaders will overwrite each other's IPNS entries. I don't think there's any way of getting around that unless we have multi-writer IPNS.

Because IPNS publish is slow (up to 60s for a publish to propagate), I feel that it may not be feasible or scalable to have all peers writing to IPNS, which is why I think leadership election may produce a more consistent (but not perfect) outcome. What do you think?

@pgte (Contributor) commented Oct 18, 2018

I agree with your point, but my point was that if the leader got (really) elected interactively, no split brains would occur, and thus no overwriting of IPNS by a minority.

OTOH, I don't want to overcomplicate the protocol.
Again, this may all be solved in UX: the user simply doesn't get the "Saved" indicator when something like this happens to the IPNS record.

@dirkmc (Collaborator, Author) commented Oct 22, 2018

As @pgte pointed out, when there is a collaboration partition but not an IPNS partition, one leader will be elected on each side of the partition and they will overwrite each other's IPNS entries (the "split brain" scenario).

Protocols like Raft have a single leader that takes care of adding and removing members, and leaders must be elected by a majority of members, so there is no possibility of a split brain. However Raft is designed for scenarios in which members come and go in an orderly manner, whereas peer-star will need to support ad-hoc collaborations where groups of members may come and go in a less predictable fashion, and without warning (eg when a user closes their browser).

One solution is to have two collaboration IPNS keys:

  1. HEAD IPNS key:
    Keeps track of the HEAD state, for collaboration persistence (as described in previous comments above)
  2. Signaling IPNS key:
    The leader periodically writes a monotonically increasing sequence number to the signaling key, and also polls the key. If there is a partition, one of the leaders on either side of the partition will detect that someone else is writing to the same key and back off

IPNS currently only supports RSA keys, which are very big, so having to pass around two keys per collaboration is undesirable. Another solution is for the leader to simply poll the HEAD IPNS key to check if it gets overwritten by someone else, indicating a partition. In this scheme the leader should indicate liveness by periodically updating the HEAD IPNS key, even if there are no collaboration state changes (eg by incrementing a sequence number):

snapshot <- delta <- delta <- delta
                                ^
                             Seq: 3
                                ^
                             HEAD (IPNS)
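
A hypothetical sketch of the leader's loop under this scheme (ipns.publish / ipns.resolve and the 30s period stand in for whatever IPNS API and timing are actually used):

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function leaderLoop (ipns, headKey, getHeadCid) {
  let seq = 0
  while (true) {
    // publish HEAD plus an increasing sequence number, even when
    // there are no collaboration state changes
    seq++
    await ipns.publish(headKey, { head: getHeadCid(), seq })
    await delay(30 * 1000)

    // if someone else overwrote the record, we are probably on the
    // losing side of a partition: back off and stop persisting
    const record = await ipns.resolve(headKey)
    if (record.seq !== seq) return 'partition-detected'
  }
}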

@pgte does that sound like the right approach? Do you think we need Raft-style election by majority for this solution or would non-interactive election suffice?

@pgte (Contributor) commented Oct 22, 2018

I think that one IPNS record will do the trick, as long as you are able to keep track of the leader inside the IPNS value.

Indeed, Raft membership changes have to be orderly, which may not happen in a p2p scenario, where a bunch of replicas leaving at the same time may leave the cluster unable to progress in persisting.

So yes, I think that split brains may be inevitable here, and that a best-effort IPNS entry could be the solution. If there are nodes alive on each side of the split, they will have a different world view for a while, and they will be overwriting each other's persistence.

Since the persistence can diverge now, if IPNS has a live update feed we can track the "Saved" state on all peers to give feedback when this happens.

@dirkmc (Collaborator, Author) commented Oct 22, 2018

That makes sense. Hopefully it should be rare that there is a network partition that splits the collaboration but does not split IPNS.

In an ad-hoc network the non-interactive election I suggested doesn't really make sense, because the list of members will change frequently. Instead I think a Raft-style election, where peers can be in follower / candidate / leader states, would work better.

@dirkmc (Collaborator, Author) commented Oct 24, 2018

I wanted to think through how a Raft-style leader election would work alongside peer-star membership gossip in a few different scenarios. The first four scenarios are similar to how Raft already works, and the fifth scenario demonstrates how to deal with a membership change:

[Diagrams: scenario 1 – scenario 5]

In scenario 5, membership changes while voting is already underway. In the example above, the new member (Eve) sees that there is an incomplete vote going on, so she increments the epoch number and votes for another peer. When the other peers see that the epoch number has changed, they vote on the leader for the next epoch.

Similarly, if a peer detects that a member has gone offline, that peer should increment the epoch number and vote for itself as leader.

In the general case, when a quorum cannot be formed, or membership changes, a new epoch of voting should be initiated.
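
To make the epoch rule concrete, here is a hedged sketch of how an incoming vote might be handled (message fields and state shape are assumptions, not the actual implementation):

function onVote (state, vote) {
  if (vote.epoch > state.epoch) {
    // someone started a newer epoch (eg membership changed):
    // discard the current tally and join the new round
    state.epoch = vote.epoch
    state.votes = new Map()
  }
  if (vote.epoch === state.epoch) {
    state.votes.set(vote.from, vote.candidate)
  }

  // a leader is decided once a majority of known members agree
  const tally = new Map()
  for (const candidate of state.votes.values()) {
    tally.set(candidate, (tally.get(candidate) || 0) + 1)
  }
  for (const [candidate, count] of tally) {
    if (count > state.members.length / 2) return candidate
  }
  return null // no quorum yet: keep voting, or a new epoch will start
}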

@dirkmc (Collaborator, Author) commented Nov 13, 2018

I've created a separate WIP PR against this branch with an implementation of leadership, so that this PR doesn't get too big: dirkmc#1

@pgte (Contributor) commented Nov 14, 2018

@dirkmc would you like me to merge this into a new peer-star-app branch?

@dirkmc (Collaborator, Author) commented Nov 15, 2018

@pgte are you asking about creating a feat/persister branch on peer-star-app and merging dirkmc:peer-star-app/feat/persister into it? I'm not sure I see the advantage?

@dirkmc (Collaborator, Author) commented Nov 15, 2018

FYI I have written an implementation of leadership, and today I wrote several test scenarios for it.

Next I need to

  • write a few more test scenarios for leadership
  • write some tests for the persistence + leadership bridge
  • test with an example app across several browsers with a real websocket server to see how this works in the real world

@dirkmc (Collaborator, Author) commented Nov 21, 2018

I have completed the tests for leadership and have been testing out leadership persistence with a real-world example app. Some issues I've come across:

  • It takes several seconds to receive the first pubsub message. I need to look into exactly why.
  • When a peer goes offline, eviction takes resetConnectionIntervalMS (6s) × maxUnreachableBeforeEviction (10) = 60s, so a new election takes at least a minute to start.

@pgte (Contributor) commented Nov 21, 2018

@dirkmc Let's see if I understand the issue:
You're driving the leader election through the membership, and a node is only evicted once a certain time passes.
I think we can safely drop the number of unreachable attempts before eviction to a lower value, since if a node detects that it's not part of the membership, it should issue a correction almost immediately.

@dirkmc (Collaborator, Author) commented Nov 21, 2018

@pgte that's a good point. Do you have a sense for how reliable real-world connections are? If they are very reliable that should work well. If they are very unreliable it could cause a lot of churn if leaders are constantly changing.

@pgte (Contributor) commented Nov 21, 2018

@dirkmc I don't think that should be a big concern. I'm more concerned about peers voluntarily closing the connection without an orderly shutdown (like when the user closes the app tab).
I think that this problem could be solved by:

  • collaboration.stop() should remove the peer id from the collaboration membership and allow some time for that to propagate.
  • the DApp should catch the tab close and do a best effort to stop the collaboration.
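
For example (collaboration.stop() is the existing API; the event wiring is only a sketch, and beforeunload gives no guarantee the message gets out):

window.addEventListener('beforeunload', () => {
  // best effort: announce our departure so other peers can evict us
  // immediately instead of waiting out the unreachability timeout
  collaboration.stop()
})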

@dirkmc (Collaborator, Author) commented Nov 21, 2018

@pgte that makes sense. It occurs to me that if a leader is on an unreliable internet connection it's probably not such a bad thing if it ends up being voted out. We may be able to use the pubsub messages, and possibly heartbeats over peer-to-peer connections, to develop a heuristic to determine how stable a peer's connection is.

I'm going to do some investigation to understand how reliably we can detect scenarios such as

  • peer's network connection goes down
  • peer's browser tab was closed
  • connection to peer was closed

Hopefully the transport layer can take care of some of this for us.

By the way I noticed that it can sometimes take a few minutes for a peer to start receiving messages from pubsub, so I'm going to look into that too

@dirkmc (Collaborator, Author) commented Nov 21, 2018

Some things with pubsub I've noticed that I need to investigate further:

  • When a peer subscribes to pubsub, it takes at least 5 seconds before it receives any messages
  • Any messages sent to the pubsub topic in the interim are not received, eg:
    • Alice subscribes to pubsub at time 1s
    • Bob sends messages to pubsub at time 2s, 3s, 4s, 5s, 6s
    • Alice receives messages at time 5s, 6s (but not 2s, 3s or 4s)
  • Message latency increases 10-fold when the tabs are in the background. I'm not sure yet if this applies to the sending tab, receiving tab or both. Either way it's very strange

@dirkmc (Collaborator, Author) commented Nov 21, 2018

I found that

  • When the user closes their browser tab, we will immediately get an event from libp2p informing us that the socket has been closed 🎉
  • The reason there is a 5 second delay in receiving any messages through pubsub is because of the way discovery works in peer-star-app:
    • when this peer discovers that another peer has joined the rendezvous server
      • we wait 5 seconds
      • we connect to the peer
      • then we poll pubsub to see if the peer is interested in our topic:
        • wait 500ms
        • check pubsub to see if peer is subscribed to topic
        • repeat (for five seconds)
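
A rough sketch of that poll loop (names are assumptions; the real peer-star-app code differs in detail):

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function waitForTopicInterest (floodsub, peerId, topic) {
  const deadline = Date.now() + 5000
  while (Date.now() < deadline) {
    await delay(500)
    // floodsub keeps a per-peer cache of the topics each peer subscribes to
    const peer = floodsub.peers.get(peerId)
    if (peer && peer.topics.has(topic)) return true
  }
  return false
}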

@pgte there's a couple of things I'm not sure about here:

  • why do we wait for 5 seconds before connecting to the peer?
  • I'm not sure I understand how pubsub discovery works. Does pubsub only find out about the peer once we have connected to it?

@pgte (Contributor) commented Nov 22, 2018

@dirkmc
a) It's not necessarily 5 seconds, but I think it needs to be randomised and throttled somehow, so that all the nodes in an app don't bombard a new peer that has just come online. This mechanism is very wonky, and hopefully it will be replaced once we adopt the rendezvous protocol. Also, there's obviously a lot of room for improvement in the current solution.

b) Yes, our current pubsub implementation (floodsub) only knows about the topics a peer is interested in after connecting to it.

@dirkmc (Collaborator, Author) commented Nov 23, 2018

@pgte when the local peer is informed by the rendezvous server that a remote peer has connected to rendezvous, the local peer connects to the remote peer.
What if the local peer sent a message to the remote peer directly, asking for the topics it is interested in?

@pgte (Contributor) commented Nov 23, 2018

@dirkmc when discovering a new node from the underlying transport, peer-star discovery eventually connects to that peer so that we can learn the pubsub topics they're interested in. If the remote peer is not interested in the app topic, it disconnects. If it is interested, it adds that peer to the ring of app peers, and if that peer is part of the Dias Set of peers for the app, it stays connected to it. Otherwise, it disconnects.

If I understand correctly, you propose replacing this connect-and-poll-pubsub-topics strategy with a direct connection over a new protocol to inquire about interest in participating in the app?

@dirkmc (Collaborator, Author) commented Nov 23, 2018

@pgte yes that's what I was thinking - seeing as we're connecting to the peer anyway, maybe we can avoid polling by simply asking it if it's interested in our topic.

@pgte (Contributor) commented Nov 23, 2018

@dirkmc makes sense.

@dirkmc (Collaborator, Author) commented Nov 23, 2018

I should also explain the context: when leadership starts up, it waits for a certain amount of time, and if it doesn't hear from another peer, it assumes it is the only peer and elects itself as leader. So I want to minimize the time it has to wait by minimizing gossip startup time.

I will write a PR and try it out with a real-world example app

@dirkmc (Collaborator, Author) commented Nov 23, 2018

I should also mention that sometimes when two peers start at the same time they seem to miss each other, and instead of taking several seconds to discover each other they take several minutes. I want to make sure we avoid that problem.

@dirkmc (Collaborator, Author) commented Nov 23, 2018

@pgte it turns out that floodsub already does exactly what we need: when floodsub discovers a remote peer (by listening for libp2p peer:connect), the local peer dials the remote peer and sends the list of topics that the local peer is interested in.

When the local peer receives a list of topics from a remote peer, floodsub updates its local cache of topics that the remote peer is interested in. If floodsub were to emit an event as well, we could listen for that event in Discovery instead of polling.

Note also that if the peer is discovered through another mechanism (eg when floodsub dials the list of peers in the peerbook), Discovery only finds out the topics that the new peer is interested in when Discovery (eventually) polls it. If there were an event, Discovery would find out immediately.

Do you think we should submit a PR to js-libp2p-floodsub to emit an event when it receives new subscription information for a peer?

Note that there are a couple of complicating factors:

  • floodsub is abstracted through the pubsub interface in IPFS, so we would either have to bubble the event up through pubsub, or listen to ipfs._libp2pNode._floodSub
  • The only events that floodsub emits have an event name equal to the topic id, eg emit('my topic', message), so the name of the new event might clash with someone's topic name
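
A hypothetical sketch of what this could look like, using a prefixed event name to avoid clashing with topic names (the event name, payload and Discovery handler are assumptions):

// in floodsub, wherever a peer's subscription list is updated:
//   this.emit('floodsub:subscription-change', peerInfo, topics)

// in peer-star-app Discovery, instead of polling:
const floodsub = ipfs._libp2pNode._floodSub
floodsub.on('floodsub:subscription-change', (peerInfo, topics) => {
  if (topics.has(appTopic)) {
    // the peer is interested in our app topic: act immediately
    onPeerInterested(peerInfo) // hypothetical handler
  }
})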

@dirkmc (Collaborator, Author) commented Nov 23, 2018

@jacobheun I have a question about web socket star and rendezvous. It is my understanding that when a peer joins the rendezvous server it is added to a peer list. The rendezvous server broadcasts the list out to all listeners every 10s, is that correct?

Is there any way that a new peer can receive the list of other peers as soon as it joins the rendezvous server, rather than waiting (up to 10s) for the next broadcast?

@pgte (Contributor) commented Nov 24, 2018

@dirkmc correct, it's through floodsub subscriptions that a peer knows if another peer is interested in an app. Replacing polling with an event would make things more efficient. 👍

@dirkmc (Collaborator, Author) commented Nov 27, 2018

With the changes I've made in the above PRs in floodsub, websocket-star and websocket-star-rendezvous, the time between starting up and discovering a peer has gone from around 15 seconds to around 400ms:
https://youtu.be/pjd5lHoZn7s

Note that the latency is near zero because rendezvous is running on my localhost, but even with rendezvous on a remote server I don't expect it to take more than a second.

@pgte (Contributor) commented Nov 27, 2018

@dirkmc that's awesome!

@jacobheun (Contributor) commented:

The websocket-star updates have been released and I put in a ticket with infra to get the rendezvous server deployed, ipfs/infra#458.

@dirkmc (Collaborator, Author) commented Nov 29, 2018

Great! Thanks @jacobheun

@jacobheun (Contributor) commented:

ws-star.discovery.libp2p.io @0.3.0 is up and running 🚀
