Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

two p2p feature enhancement proposal #4053

Closed
Hyung-bharvest opened this issue Oct 14, 2019 · 10 comments
Closed

two p2p feature enhancement proposal #4053

Hyung-bharvest opened this issue Oct 14, 2019 · 10 comments
Labels
C:p2p Component: P2P pkg S:proposal Status: Proposal
Milestone

Comments

@Hyung-bharvest
Copy link
Contributor

Hyung-bharvest commented Oct 14, 2019

I would like to suggest two p2p feature inhancement.

  1. wildcard_peer_ids
  • If id of a peer is in this list, the peer can try peering my node even if max_num_inbound(outbound)_peers is reached.
  • benefit : nodes can allow peering with most trusted list of peers(wildcard peers) even if maximum is reached by other peers.
  1. persistent_peers_dialing_max_period(in seconds)
  • When exponential backoff, the dialing period will not increase more than this period
  • benefit : Tendermint will continuesly try dialing all persistent peers indefinitely so that even though it disconnected for a while, it tries to reconnect.

Please share opinions(agreement, disagreement, additional feature related to these features, possible vulnerabilities, etc). Also if anyone wants to add other improvement on p2p peer management, we can add those to our job list.(such as blacklist_peer_ids)

These two ideas were one of the most pain-in-the-ass problems for real validator operation experience. It will greatly improve the stability of connection among nodes.

Based on feedbacks from community, we start hacking on these two issues. We will ask a grant of 2k atom from community fund by governance proposal "after" we PR this. We expect PR to be ready in October.

@mdyring
Copy link
Contributor

mdyring commented Oct 15, 2019

Good suggestions.

Not sure I understand the rationale for differentiating between persistent peers and "wildcard peers", is it because you don't want to provide a static IP for a peers that should always be allowed an inbound slot?

I think the default behavior should be to allow explicitly configured peers a connection to slot (similar to what is suggested in #2041).

Instead of having both a "wildcard peers" and "persistent peers", perhaps something like this would work instead:

  1. persistent_peers are used as today to establish outbound connections
  2. if an inbound connection arrives with a nodeid listed in persistent_peers, and there is no existing outbound connection to that nodeid, it will be allowed an inbound slot
  3. limits on in/outbound connections should be changed to "on-demand" connections and not apply to explicitly configured peers in the config.

persistent_peers_dialing_max_period sounds sensible!

@Creamers158
Copy link

Agree, think you would get something like:
persistent_peers = "e4d19c423f4b2ft7e40905b320d54155907b6b54@23.34.567.34:26656,7f40e45c3e3fc1d4dc68edc977b8e3eeb4e50724,b2ft7e40905320d541423f4b2ft@75.34.564.67:26656"

This way you can also setup a connection direction without worrying about open peer slots.

Also, when having persistent peers I can imaging that some of them need to kept private. When 'fixing' the persistent_peer make sure to look at #1705 so it wont expose private peers that are part of the persistent_peer.

@Hyung-bharvest
Copy link
Contributor Author

Hyung-bharvest commented Oct 15, 2019

I like both suggestion! Thanks for sharing your thoughts! So in summary,

  1. gaiad will allow any peer having node id which is listed in persistent peers even if when max inbound(outbound) peer num is reached.

  2. we can write id@ip(to allow and also to dial outbound) or id(to allow inbound or maybe outbound dialing from addressbook) in persistent peers to wildcard the peer.

  3. now we dont need additional parameter wildcard peer ids in config.

It sounds neater solution. How do you think? @mdrying @Creamers158

//edit

We found out that putting different structures in one list(id@ip and id) is creating some implementation codebase complexity. So we came back to the original idea to have seperate wildcard_peer_ids, which only allow to write in id-only-format.

Another option raised is to put inbound wildcard in persistent_peers as originally allowed id@0.0.0.0:0 format. But it seems like creating new confusion and unnecessary errors.

To include every persistent_peer as a wildcard_peer is still an open question.

@melekes melekes added C:p2p Component: P2P pkg S:proposal Status: Proposal labels Oct 15, 2019
@Hyung-bharvest
Copy link
Contributor Author

Hyung-bharvest commented Oct 18, 2019

Below is our up-to-date detail specs.
Please share opinions.

wildcard_peer_ids

1) problem

  • When max_num_inbound_peers or max_num_outbound_peers of a node reached, the node cannot spare more slots to any peer. Therefore, in case of disconnection, very important connections can be lost indefinitely because all slots consumed by other peers.

2) objective

  • Allow list of peers connect to my node by inbound or outbound, regardless of max_num_inbound_peers or max_num_outbound_peers reached.
  • This configure is especially helpful if a node operator want to dedicate some peer slots to number of trusted peers.

3) implementation detail

  • add WildcardPeerIDs in config/config.go
    // Wildcard node id list for limit of MaxNumInboundPeers, MaxNumOutboundPeers
    WildCardPeerIDs string `mapstructure:"wildcard_peer_ids"`
  • add WildcardPeerIDs in config/toml.go
    # Wildcard node id list for limit of max_num_inbound_peers, max_num_outbound_peers
    wildcard_peer_ids = "{{ .P2P.WildCardPeerIDs }}"
  • add WildcardPeerIDs error check in node/node.go
    err = sw.AddWildCardPeerIDs(splitAndTrimEmpty(config.P2P.WildCardPeerIDs, ",", " "))
    if err != nil {
    	return nil, errors.Wrap(err, "could not add peer ids from wildcard_peer_ids field")
    }
  • add 1 to numToDial and maxAttempts when the peer is wildcard in p2p/pex/pex_reactor.go
    if r.Switch.IsWildCardPeer(try.ID){
    	numToDial += 1
    	maxAttempts += 1
    }
  • add WildcardPeerIDs in p2p/switch.go
    wildCardPeerIDs      map[ID]struct{}
    
    ...
    
    wildCardPeerIDs:      make(map[ID]struct{}),
  • add exception of WildcardPeerIDs when counting NumPeers in p2p/switch.go
    if !sw.IsWildCardPeer(peer.ID()){
    	if peer.IsOutbound() {
    		outbound++
    	} else {
    		inbound++
    	}
    }
  • add IsWildCardPeer and AddWildCardPeerIDs in p2p/switch.go
    func (sw *Switch) IsWildCardPeer(id ID) bool {
    	_, ok := sw.wildCardPeerIDs[id]
    	return ok
    }
    
    func (sw *Switch) AddWildCardPeerIDs(ids []string) error {
    	sw.Logger.Info("Adding wildcard peer ids", "ids", ids)
    	for _, id := range ids {
    		err := validateID(ID(id))
    		if err != nil {
    			return err
    		}
    		sw.wildCardPeerIDs[ID(id)] = struct{}{}
    	}
    	return nil
    }
  • add exception of WildcardPeerIDs when checking MaxNumInboundPeers in p2p/switch.go
    if !sw.IsWildCardPeer(p.NodeInfo().ID()){
    	// Ignore connection if we already have enough peers, except wildcard peer
    	_, in, _ := sw.NumPeers()
    	if in >= sw.config.MaxNumInboundPeers {
    		sw.Logger.Info(
    			"Ignoring inbound connection: already have enough inbound peers",
    			"address", p.SocketAddr(),
    			"have", in,
    			"max", sw.config.MaxNumInboundPeers,
    		)
    
    		sw.transport.Cleanup(p)
    
    		continue
    	}
    
    }

maximum_dial_period

1) problem

  • exponential backoff of dialing combined with short term disconnection of a persistent peer can cause maximum times of exponential backoff resulting in no-dial to this persistent peer for indefinite future.

2) objective

  • For persistent peers, prevent exponential backoff of dialing backoffs more than given maximum_dial_period.

3) implementation detail

- add `MaximumDialPeriod` in config/config.go

    MaximumDialPeriod time.Duration `mapstructure:"maximum_dial_period"`
    
    ...
    
    MaximumDialPeriod:       60 * time.Second,
  • add MaximumDialPeriod in config/toml.go
    # Maximum dial period seconds when exponential back off for persistent peers, 0 == default as exponential back off
    maximum_dial_period = "{{ .P2P.MaximumDialPeriod }}"
  • add MaximumDialPeriod in node/node.go
    MaximumDialPeriod: config.P2P.MaximumDialPeriod,
  • add isBackoffExpected in p2p/pex/pex_reactor.go
    func (r *PEXReactor) isBackoffExcepted(addr *p2p.NetAddress) bool {
    	return r.config.MaximumDialPeriod != 0 && r.Switch.IsPeerPersistent(addr)
    }
    
    ...
    
    // Maximum dial period seconds when exponential backoff for persistent peers, when value exist
    	MaximumDialPeriod time.Duration
  • add exception on backoffDuration for persistent peers in p2p/pex/pex_reactor.go
    func (r *PEXReactor) dialPeer(addr *p2p.NetAddress) error {
    	attempts, lastDialed := r.dialAttemptsInfo(addr)
    
    	if attempts > maxAttemptsToDial && !r.isBackoffExcepted(addr) {
    		// TODO(melekes): have a blacklist in the addrbook with peers whom we've
    		// failed to connect to. Then we can clean up attemptsToDial, which acts as
    		// a blacklist currently.
    		// https://github.com/tendermint/tendermint/issues/3572
    		r.book.MarkBad(addr)
    		return errMaxAttemptsToDial{}
    	}
    
    	// exponential backoff if it's not our first attempt to dial given address
    	if attempts > 0 {
    		jitterSeconds := time.Duration(cmn.RandFloat64() * float64(time.Second)) // 1s == (1e9 ns)
    		var backoffDuration time.Duration
    		backoffDuration = jitterSeconds + ((1 << uint(attempts)) * time.Second)
    
    		if backoffDuration > r.config.MaximumDialPeriod && r.isBackoffExcepted(addr){
    			backoffDuration = r.config.MaximumDialPeriod
    		}
    		sinceLastDialed := time.Since(lastDialed)
    		if sinceLastDialed < backoffDuration {
    			return errTooEarlyToDial{backoffDuration, lastDialed}
    		}
    	}
    
    ...

@alexanderbez
Copy link
Contributor

Thanks @dlguddus for writing up a concise spec outline. Would you be able to write up the context/background and the proposal (it can be concise) into a quick ADR? This would happen ideally before a detailed spec (a spec may not even be necessary here) and implementation.

That being said, I'm not too privy to the mechanics of Tendermint's p2p layer and config so correct me if I'm wrong here. The goal of wildcard peer IDs seems like it should be accomplished by the already existing persistent peer IDs config? That is, even if all connection slots are filled, persistent peers always have the privilege to connect. Is this not the case, if not, why?

However, I do think the maximum_dial_period proposal makes sense 👍

@Hyung-bharvest
Copy link
Contributor Author

Hyung-bharvest commented Oct 22, 2019

Difference between persistent peers and wildcard:

  1. persistent peers only deal with outbound but wildcard deal with both in/outbound
  2. wildcard does not write ip address but only id

Because of struct difference(id@ip vs id), we think it is too complicated to put wildcard function into persistent peers.

ADR : #4072

ADR waiting for review : @tessr @melekes

@Hyung-bharvest
Copy link
Contributor Author

Agree, think you would get something like:
persistent_peers = "e4d19c423f4b2ft7e40905b320d54155907b6b54@23.34.567.34:26656,7f40e45c3e3fc1d4dc68edc977b8e3eeb4e50724,b2ft7e40905320d541423f4b2ft@75.34.564.67:26656"

This way you can also setup a connection direction without worrying about open peer slots.

Also, when having persistent peers I can imaging that some of them need to kept private. When 'fixing' the persistent_peer make sure to look at #1705 so it wont expose private peers that are part of the persistent_peer.

is #1705 not resolved yet? if it still stays in bug status, we might want to include it to this proposal.
@Creamers158

@tessr tessr changed the title two p2p feature inhancement proposal two p2p feature enhancement proposal Nov 19, 2019
@tessr
Copy link
Contributor

tessr commented Nov 19, 2019

I admit I don't operate a validator, and I'm still coming up to speed with lots of things here. But can you share more context on

These two ideas were one of the most pain-in-the-ass problems for real validator operation experience. It will greatly improve the stability of connection among nodes.

What exactly are the problems that you experienced? It sounds like this is a reasonable proposal but I'd like to better understand the issue. Thank you!

@Hyung-bharvest
Copy link
Contributor Author

I admit I don't operate a validator, and I'm still coming up to speed with lots of things here. But can you share more context on

These two ideas were one of the most pain-in-the-ass problems for real validator operation experience. It will greatly improve the stability of connection among nodes.

What exactly are the problems that you experienced? It sounds like this is a reasonable proposal but I'd like to better understand the issue. Thank you!

In PoS, nodes have to share its ip and id to connect to other nodes. And there exists a significant risk from ddos attacks to even node compromisation.

So what we construct is a sentry architecture, having public nodes(sentry) in front and we have internal nodes to bumper the connection from public internet to the precious validator.

These internal connections should be always connected. But, because of exponential backoff and max_num_peer, a disconnected connection often never come back. This causes instability of internal(or trusted) connection quite frequently.

But I think this is just a tip of iceberg. As a validator operator, we experience many other issues too. This ADR is just a pilot project to have experience with tendermint team to acquire better approach and communication from ourselves. If we think it is efficient and worth well, we would like to dig in dipper to suggest more improvement of p2p feature in tendermint.

@tessr
Copy link
Contributor

tessr commented Nov 19, 2019

OK, thank you for the explanation. I hear you're coming to our dev session on Monday, and it sounds like there will be further discussion then. I'm looking forward to it!

melekes pushed a commit that referenced this issue Nov 20, 2019
Refs #4053

## Commits:

* Create adr-050-improved-trusted-peering.md

* Modify `maximum_dial_period`

Modify `maximum_dial_period` to `persistent_peers_maximum_dial_period`

* Update adr-050-improved-trusted-peering.md

* Update docs/architecture/adr-050-improved-trusted-peering.md

Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com>

* Update docs/architecture/adr-050-improved-trusted-peering.md

Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com>

* Update docs/architecture/adr-050-improved-trusted-peering.md

Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com>

* Update docs/architecture/adr-050-improved-trusted-peering.md

Co-Authored-By: Tess Rinearson <tess.rinearson@gmail.com>

* wildcard -> unconditional

wildcard -> unconditional

* Remove blank lines

* fix spelling

* add quotes
@dongsam dongsam mentioned this issue Nov 21, 2019
5 tasks
@brapse brapse added this to the ABCI Updates milestone Dec 3, 2019
melekes pushed a commit that referenced this issue Dec 4, 2019
…od` (#4176)

implementation spec of Improved Trusted Peering ADR-050 by B-Harvest

- add unconditional_peer_ids and persistent_peers_max_dial_period to config
- add unconditionalPeerIDs map to Switch struct

default config value of persistent_peers_max_dial_period is 0s(disabled)

Refs #4072, #4053
@melekes melekes closed this as completed Dec 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:p2p Component: P2P pkg S:proposal Status: Proposal
Projects
None yet
Development

No branches or pull requests

7 participants