Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

12/WAKU2-FILTER: Handle Client failure #475

Closed
rshiv opened this issue Nov 18, 2021 · 4 comments
Closed

12/WAKU2-FILTER: Handle Client failure #475

rshiv opened this issue Nov 18, 2021 · 4 comments

Comments

@rshiv
Copy link
Contributor

rshiv commented Nov 18, 2021

Currently a filter node attempts to push messages indefinitely even if the peer client node is offline. A node should be able to drop the peer once it’s determined that a client node is unreachable.

The issue was introduced here #469 .

Acceptance criteria

  • Consensus on how to solve the issue amongst the solutions presented.
  • Update RFC in 12/WAKU2-FILTER

Possible Solutions

The solution of the problem needs to be with the existing libp2p connection between server and client, rather than new connection for heartbeat.

As the waku message handler handles the message push during init , we check for dial failure at that point, which is already implemented to maintain a map as, <peer, offtimestart>, where offtimestart is the first failure epoch. We remove it from peer list if failure persist ,given it exceeds a certain time limit and clear the peer from the map as well.
Time to remove can be configured by the filter node to provide the flexibility of adaptive server node.

Notes

There are 2 more approaches to solve the problem however they have their tradeoffs.

  1. The existing libp2p connection has a callback for connection failure and can be used as a fallback point to stop the message push service and drop the peer. The solution will clean the memory organically as the connections drop, however, as adaptive nodes are expected to be resource restrictive, if we forbid the push request after first failure , it will cause repetitive filter calls.

  2. Adding a heartbeat mechanism using libp2p ping protocol between the peers, which tries the connection every x minutes, failure for 5 times cause the server to drop the peer. This approach will rely on continuous polling which will effect the network bandwith.

cc: @oskarth @richard-ramos @jm-clius @staheri14

@oskarth
Copy link
Contributor

oskarth commented Nov 23, 2021

Thanks for the initial issue! I'm not going to comment on specific approach suggested here (though Hanno might have stronger opinions here). What I'll do instead is to give some meta comments on how the problem is formulated, broken down and approached. Hopefully it is useful.

1) Problem description can be clearer

Problem background can be understood from #469 .

This issue intends to handle and reach consensus for the problem of Wakunode attempting to push messages indefinitely even if the peer client node is inactive.
Adaptive peer nodes is assumed to be restricted in resources and are expected to go inactive .

This can be rewritten for clarity to make it more clear what the problem is and then link to additional context. For example:

Problem

Currently a filter node attempts to push messages indefinitely even if the peer client node is offline. A node should be able to drop the peer once it’s determined that a client node is unreachable.

For more background and context, see #469

A reader can thus understand in one sentence what the problem is, and once a PR is issued we can see if it addresses this problem or not. There's also a link for additional context but the issue itself is self-contained.

2) Proposed solution can be more directive and take a specific point of view

As opposed to just outlining many alternatives and forcing reader to think about the problem in detail. It is better to have a specific proposal, mentioning trade-offs and suggested order of operation. If someone disagrees, or if you missed something, it is easier to fill in the gaps for the reader "did you think of this?" or "I think we should do this before this"

Proposed solutions

inserted immediate proposed solution
briefly on trade-offs and alternative solutions

If one solution is simpler and gets us 80% there, can have a section "future enhancements" or so detailing additional measures to take.

3) Implementation issue

Since there's already a spec and initial implementation of filter, it would be good to also have a corresponding issue on nim-waku on immediate first step. There are also some issues already in nim-waku (ctrl-f "filter") so linking those in a parent/specific issue would be useful.

4) Taking a step back - contextualizing problem and approach

This started in #469

It might be a good idea to respond there with a comment in the shape of something like this (obviously only for the things we know about / makes sense, just as a rough guideline):

"I'm going to look into this, I think it makes sense to stat with this specific subproblem because of reasons. After that we can attack this problem. I also think this might be a problem. This specific issue doesn't seem to be a priority right now because of reasons.

@rshiv
Copy link
Contributor Author

rshiv commented Nov 23, 2021

Thanks for the initial issue! I'm not going to comment on specific approach suggested here (though Hanno might have stronger opinions here). What I'll do instead is to give some meta comments on how the problem is formulated, broken down and approached. Hopefully it is useful.

1) Problem description can be clearer

Problem background can be understood from #469 .

This issue intends to handle and reach consensus for the problem of Wakunode attempting to push messages indefinitely even if the peer client node is inactive.
Adaptive peer nodes is assumed to be restricted in resources and are expected to go inactive .

This can be rewritten for clarity to make it more clear what the problem is and then link to additional context. For example:

Problem

Currently a filter node attempts to push messages indefinitely even if the peer client node is offline. A node should be able to drop the peer once it’s determined that a client node is unreachable.
For more background and context, see #469

A reader can thus understand in one sentence what the problem is, and once a PR is issued we can see if it addresses this problem or not. There's also a link for additional context but the issue itself is self-contained.

2) Proposed solution can be more directive and take a specific point of view

As opposed to just outlining many alternatives and forcing reader to think about the problem in detail. It is better to have a specific proposal, mentioning trade-offs and suggested order of operation. If someone disagrees, or if you missed something, it is easier to fill in the gaps for the reader "did you think of this?" or "I think we should do this before this"

Proposed solutions

inserted immediate proposed solution
briefly on trade-offs and alternative solutions

If one solution is simpler and gets us 80% there, can have a section "future enhancements" or so detailing additional measures to take.

3) Implementation issue

Since there's already a spec and initial implementation of filter, it would be good to also have a corresponding issue on nim-waku on immediate first step. There are also some issues already in nim-waku (ctrl-f "filter") so linking those in a parent/specific issue would be useful.

4) Taking a step back - contextualizing problem and approach

This started in #469

It might be a good idea to respond there with a comment in the shape of something like this (obviously only for the things we know about / makes sense, just as a rough guideline):

"I'm going to look into this, I think it makes sense to stat with this specific subproblem because of reasons. After that we can attack this problem. I also think this might be a problem. This specific issue doesn't seem to be a priority right now because of reasons.

addressed the issues raised. @oskarth @jm-clius

@jm-clius
Copy link
Contributor

Thanks, Reeshav.

I'll comment a bit more on the specific Proposed Solutions.

I agree with Oskar here that you can (and should) propose the specific solution among the alternatives that you believe will work best, unless it really is unclear which one is best and you'd need external input to proceed. In general you can assume that between all waku developers you are the one that studied filter the most recently with this issue in mind, so would be best placed to propose a solution. *UPDATE: I see you've now done this, so thanks for that.

Yes, solutions (1) and (2) does not make much sense for us, as we discussed:

  1. We shouldn't expect filter/resource restricted clients to maintain an always-on filter connection, even if it is currently implemented in nim-waku this way. With this perspective, a disconnect event for a specific client on a filter node is expected and should not result in it being removed.
  2. A ping heartbeat might work, and even be required at some point, but for now it is likely over-engineering a solution to a fairly simple problem and adds to the bandwidth burden.

So, in general the proposed solution makes sense to me - that is, to keep track of the first time we failed to contact a peer and, if the failure condition persists after some time, to remove the peer. @richard-ramos may have comments here on how well this method will work in go-waku.

Note that this leaves some questions to be addressed in future, including

  • What about filter clients that we never contact (because they have registered a filter that never receives any messages). These clients could also grow stale, but we would never know because we never try to dial them. One solution here may be to also track the "last dialed time" to remove clients that we've not contacted for a very long time.

@jimstir
Copy link
Contributor

jimstir commented Jun 14, 2024

Issue moved here

@jimstir jimstir closed this as not planned Won't fix, can't repro, duplicate, stale Jun 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

4 participants