Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serval clients have trouble finding each other #101

Open
Rubyfi opened this issue Apr 26, 2016 · 7 comments
Open

Serval clients have trouble finding each other #101

Rubyfi opened this issue Apr 26, 2016 · 7 comments

Comments

@Rubyfi
Copy link

Rubyfi commented Apr 26, 2016

We're trying to run performance tests for serval using different network topologies.
One of them, for example, is a chained topology where 64 nodes are set up in a way that they have two network interfaces, each of them connecting the node to one of its neighbours.

This means we have 63 independent subnets, containing two nodes each. Sometimes when we boot up the nodes, the chain separates into two parts, where the nodes contained in one part don't find the ones contained in the other part. Let's assume the chain breaks between node 40 and 41.

If this happens you have to manually execute

mdp ping <ID of 40>

on node 41 and vice versa for node 40. Usually the ping will fail from one side, but will succeed from the other and the chain will complete automatically from that point.

Even scanning every single link of the chain using

servald scan <Broadcast IP>

doesn't help.

Any ideas what could cause this issue?

We're basically logging everything that serval does. So if you need any further details we're happy to send them to you.

@lakeman
Copy link
Member

lakeman commented Apr 27, 2016

Birthday paradox?

When we send SID's in packets between network peers we abbreviate them to save bandwidth. We make sure to send more than enough bits to be disambiguated, based on the prefixes of every other SID we know, rounded up to the next whole byte. We add an extra byte when encoding our own SID in the packet header.
https://github.com/servalproject/serval-dna/blob/development/overlay_address.c#L281

The client API requires a full SID to be entered from the command line and sent to the local node. Where it is added to the set of known SID's, which would force more bits to be sent for any other similar SID, resolving the ambiguity.

So we need a better method to detect collisions... ?

@gardners
Copy link
Member

I've made a temporary work-around for this on the development branch, that allows a keypair to be generated with a specific 4 - 16 bit prefix. The idea is that this will allow a temporary work-around and also the means to write tests for the SID prefix-collision problem.

14ed84d

@lakeman
Copy link
Member

lakeman commented Apr 29, 2016

How about we don't push that to the development branch, I've added a branch
called "asserts" for temporary changes for issues #101 & #102

On Fri, Apr 29, 2016 at 7:21 PM, gardners notifications@github.com wrote:

I've made a temporary work-around for this on the development branch, that
allows a keypair to be generated with a specific 4 - 16 bit prefix. The
idea is that this will allow a temporary work-around and also the means to
write tests for the SID prefix-collision problem.

14ed84d
14ed84d


You are receiving this because you commented.
Reply to this email directly or view it on GitHub
#101 (comment)

gardners added a commit that referenced this issue Apr 29, 2016
in an attempt to reproduce Issue #101.  Currently the test passes.
@gardners
Copy link
Member

The above commit adds a new test to the tests/routing script that tries to reproduce this problem. However, so far, the test still passes.

Lars: You can now search for multiple_ids_same_prefix in tests/routing to see what I have done, and adjust the topology to suit the one you are having trouble with, possibly just by extending the chain from 4 nodes to more, or having more nodes being neighbours of each other, until we can reproduce the problem. Post a comment if you succeed in getting it to fail.

Paul.

@jonashoechst
Copy link

jonashoechst commented Apr 29, 2016

We tried to build your commit 382694464bd4f6d3722475ae9a78a480d5c407ed with openwrt, which failed with this message:

SERVALD CC overlay_buffer.c
In file included from strbuf.h:87:0,
                 from instance.h:25,
                 from serval.h:124,
                 from overlay_buffer.c:21:
overlay_buffer.c: In function '_ob_append_byte':
overlay_buffer.c:242:13: error: comparison of unsigned expression >= 0 is always true [-Werror=type-limits]
     assert(b->position >= 0);
             ^
cc1: all warnings being treated as errors

@gardners
Copy link
Member

I've just committed the deletion of that line. Please try with that.

@lakeman
Copy link
Member

lakeman commented May 2, 2016

Triggering the birthday paradox with immediate neighbours would require at least 2 bytes of matching prefix.
Waiting for keyring add to generate a one in 64k probability value four times every time the test runs would make for a very annoying test. Though we could pay the price to generate keys once and import them during the test.
A configuration that is easier to reproduce would involve at least 4 devices in a line, with the ends having only a one byte prefix match. Then the middle two nodes might get confused about which peer is which. We'd probably also need to ensure that the two edge links were enabled first, then the link between the middle nodes.

As I see it there are two issues here that we can improve.

Firstly if two neighbours have a prefix collision, we would currently just silently drop the packet. This is how we deal with broadcast packets that are echoed back by the network layer. We would need to check the source address of the packet against our interfaces to detect the collision and swap full SID's.

Secondly, if there is a collision between distant network nodes, the right fix for this would involve providing a signature for each link, for validation and to make forgery more difficult. To save bandwidth, I'd suggest that the receiver of a link signs only the transmitter -> receiver pair with some long lived nonce value.
When you receive a signature you must verify it before using that link in your routing table or forwarding the signature onwards. Preferring links with signatures over paths with nodes running older versions of the software.
If you can't verify it, assume that you and your neighbour are trying to talk about 2 SID's with the same prefix.
So send the full SID of the transmitter and receiver you decoded back to the neighbour that sent you this signature. (We already have a message specifically for this). That should be enough to fix the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants