Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Next Hop-based routing with fallback to flooding #2856

Draft
wants to merge 15 commits into
base: master
Choose a base branch
from
Draft

Conversation

GUVWAF
Copy link
Member

@GUVWAF GUVWAF commented Oct 2, 2023

This adds a NextHopRouter which only relays if it is the next hop for a packet. The next hop is set by the current relayer of a packet, which bases this on information from either the NeighborInfoModule, or a previous successful delivery via flooding. Right now, it is only used for DMs and not used for broadcasts. Using the NeighborInfoModule, it can derive the next hop of neighbors and that of neighbors of neighbors. For others, it has no information in the beginning, which results into falling back to the FloodingRouter. Upon successful delivery via flooding, it updates the next hop of the recipient to the node that last relayed the ACK to us. When the ReliableRouter is doing retransmissions because it didn't get an ACK, for the last retry, it will reset the next hop, in order to fall back to the FloodingRouter.

Some implementation details: the next_hop and current_relayer are added to the unencrypted PacketHeader. Since we only used 14 bytes out of the 16-byte header (it got this size due to memory alignment), I used the 2 unused bytes to save the last byte of the NodeNum of the next hop and current relayer. This means that there is a chance that the last byte of two nodes match, and then they will both relay. However, that’s not a big issue as that would be similar to flooding. The chance that only the intended next hop relays depends on the amount of nodes that can hear the packet. If that are 10 nodes (that would be a lot), the chance that only the next hop relays is 83.7% ((255/256*254/256*253/256 ... 247/256*)*100), for 5 nodes the chance is 96.1% ((255/256*254/256*253/256*252/256)*100).
This means that there is actually no additional overhead, except that per node, the next_hop has to be stored in the NodeDB (only 4 bytes).

With this, you get rid of the unnecessary rebroadcasts caused by flooding like the one from node 0 as shown below. (Note that an arrow to a node means that it received it (so there might be multiple arrows for one packet), but not necessarily that it was addressed.)
image
With the NextHopRouter, node 0 doesn't try to relay:
image

Furthermore, it solves this issue of the current implementation where the wrong node (1, because it has the lowest SNR) is relaying:
image
With the NextHopRouter, because 0 knew 2 is a neighbor of 3, 2 was set as next hop and the DM to 3 correctly arrives, as well as the real ACK.
image

I'm hoping to get some feedback, and maybe new ideas on assigning the next hop from this. Also, I would love to get ideas on how to improve broadcasts as well.

Edit: There are at least two remaining issues, see: #2856 (comment)

@loodydo
Copy link
Contributor

loodydo commented Oct 2, 2023

I have started looking. May take me some to to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

@github-actions
Copy link
Contributor

github-actions bot commented Oct 3, 2023

🤖 Pull request artifacts

file commit
pr2856-firmware-2.2.10.44dc270.zip 44dc270

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 3, 2023
@GUVWAF
Copy link
Member Author

GUVWAF commented Oct 3, 2023

I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non breaking change I would think that the risk of two nodes broadcasting is still worth it.

It depends on the amount of nodes that can hear the packet (so the amount of immediate neighbors). But indeed, the chance that then only the intended next hop relays is a bit different than I presented before. I've updated the description with two calculations.

src/mesh/RadioInterface.h Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
src/mesh/NextHopRouter.cpp Outdated Show resolved Hide resolved
@GUVWAF
Copy link
Member Author

GUVWAF commented Oct 13, 2023

I pushed a new commit to resolve @loodydo’s comments. Also added the hop limit setting of the original transmitter to the header flags (using 3 of the 4 currently unused bits), such that we can determine the amount of hops the packet has already traveled. This comes in handy to set the next hop for immediate neighbors, and I think it’s also useful for displaying in the apps. Currently it will always show the SNR/RSSI of nodes in the list, but this is actually the SNR/RSSI to the last relayer. In cases that a node is not an immediate neighbor, we could display the hops towards that node.

I also changed the way how the next_hop and relay_node are stored in the MeshPacket and NodeDB. Now only the last byte is stored, since only that byte is sent over the air. Besides, it mitigates the case where you might assign the wrong NodeNum if you happen to have multiple nodes with the same last byte in the NodeDB.

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 13, 2023
@GUVWAF
Copy link
Member Author

GUVWAF commented Nov 26, 2023

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message.
We’ll have to wait for 3.0 with this.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 7, 2024

The NodeNum is only 32 bits and based on the MAC address of the device.
Assuming that that number is uniformly distributed I don't get why using modulo of the largest prime number is a better approach.

Where does your 0.4% probability refer to?

@roha-github
Copy link

UINT32 modulo 251 = 0..250
1/251 = 0.4%
Using prim numbers for checksum is a common pattern

@GUVWAF
Copy link
Member Author

GUVWAF commented May 7, 2024

I'm sorry but I don't get why that is better than using the last byte, which is 0..255.

It's not a checksum, it's to uniquely identify a node. If the NodeNum is already uniformly distributed, taking the last byte should not have any bias.

@roha-github
Copy link

roha-github commented May 7, 2024

About the routing itself ...

Each node could regularly publish a broadcast of its direct neighbors and communicate the reception quality of RSSI and SNR.

The list is sorted by reception quality and only the best 32 nodes are published.

1 byte or 8 bits are used per node:
Bit 1 = RSSI quality > -115 dB
Bit 2 = SNR quality > -11 dB
Bit 3..8 = NodeNum mod 64

Together with the node DB and the GPS coordinates, either the node itself or the app (smartphone, web browser, CLI) can then create a network topology.

Only 7 bytes would be required for a route with 7 hops

@caveman99
Copy link
Sponsor Member

That's what the neighborinfo module does.

@roha-github
Copy link

That's what the neighborinfo module does.

Ok, now I understand the benefits of the Neighbor Info Module.

Should the node do the routing itself later? Does the node DB then store all the neighbors of each node? How much memory would this require? I once read somewhere that the node DB would only store up to 80 nodes.

The documentation on the Neighbor Info Module does not really encourage activation. The module would be needed for routing and ever larger mesh networks.

@tve
Copy link

tve commented May 19, 2024

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message.
We’ll have to wait for 3.0 with this.

I do not understand the "wait for 3.0", most likely I don't understand what deploying 3.0 means... I would assume that nodes in a mesh will be upgraded slowly, especially big meshes, like SoCalMesh with hundreds of nodes, some in pretty inaccessible locations. This means that whether it's called 3.0 or not the next hop routing will have to co-exist with 2.x. Unless you count on all the 2.X having been upgraded to a version that sets next_hop to zero by then...

I'm wondering whether it would be possible to use the NeighborInfo to gate next-hop routing. If the NeighborInfo can have a flag that says "I do next-hop routing" then on the outgoing side a relayer knows whether populating next-hop makes sense or not.

On the incoming side it's a bit trickier in that a relayer would look up the NeighborInfo based on relay_node. If relay_node matches a NeighborInfo that says it does next-hop routing then the relayer would obey next_hop, although of course the match may be by chance. As a special case, if hop_start is set and equal to hop_limit then IIUC the message is received from the original sender and thus NeighborInfo can be looked up using the from address.

In any case, it seems worth it to figure out some heuristics to make 3.0 interoperable with pre-2.3, which introduced the explicit next_hop field as far as I can tell and not rely on "it's going to take so long for 3.0 to become available that everyone will have upgraded past 2.3 by then"...

Apologies if I'm completely off-base and don't understand some key fact here.

@tve
Copy link

tve commented May 19, 2024

Shouldn't the NeighborInfo module be enabled by default?

@GUVWAF
Copy link
Member Author

GUVWAF commented May 19, 2024

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

I'm not really sure half-working interoperable solutions are the way to go here, especially if it requires the NeighborInfo module to be enabled.
It's not that this PR has a huge impact anyway, as it only applies to DMs while I think the majority of packets are still broadcasts.

NeighborInfo creates quite a lot of additional traffic and is not necessary for NextHop routing, so I don't think it should be enabled by default.

@KyleMaas
Copy link

Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast) or another sync word to deliberately make nodes incompatible.

@GUVWAF I'd like to chime in here and say that I think this part has some major downsides. This threatens to split deployed infrastructure between old and new and drop coverage levels for both old and new until everything, including all client devices, have been changed over. For anyone who has gone further with deployment than just tinkering, this could be a major undertaking just due to the number of devices. And for anyone with infrastructure on mountaintops or other places which are difficult to access, these devices can be a pain in the neck to try to update. The more people using the same settings, the better it is for everyone's coverage, and the experience for new users would be better if they already have an existing network to connect to without having to build it all out themselves. Changing the default modem preset should not be done lightly.

And I agree with @tve's comments about version interoperability. A thoughtful and gradual upgrade path would make it far more likely for people to actually upgrade. Ideally you want people to upgrade to the new version so that you can continue to innovate moving forward. If the changes are too drastic, you risk having people stuck with "good enough, works for me" instead, or even potentially having someone maintain an old compatible version as an LTS release. Or looking at the project/protocol as so unstable that it's not worth continuing to deploy.

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 19, 2024

Ofcourse the decision will not be taken lightly and people have been saying it will be happing at least a year from now.

I'm not entirely following why you're advocating for gradual changes, as precisely that leads to "good enough, works for me", instead of doing all breaking changes at once, where you have to upgrade to continue to talk to others that have.

@garthvh
Copy link
Member

garthvh commented May 19, 2024

I love the idea of this, but a gradual rollout would make it a much easier pill to swallow.

This pull is not merged because it is a breaking change that needs to wait until 3.0, and there is not enough other stuff to make a breaking change worth it. There is no way to do this gradually. Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

@tve
Copy link

tve commented May 19, 2024

There is no way to do this gradually.

I'm not convinced, that's why I made suggestions above. Yes, there's no way to make the changes without compromises, but I don't know that the compromises have been explored or discussed. Quite possibly just my ignorance.

Meshtastic is just not set up for multiple versions, when version 3 gets promoted version 2 is going to be deprecated.

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare. This kind'a guarantees that no more non-breaking changes can be made going forward which I'd rate as "not good".

At a high level I'm wondering what the plan is for wide-area meshes, which is where the routing improvements have the most impact I believe. I'm part of the southern california mesh with hundreds of nodes spread over hundreds of square km, many mountain ranges dividing the geography, and many nodes in difficult to access/update places. My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool). If the goal is to make this type of mesh usable over longer distances (like sending messages from Santa Barbara to San Diego -- approx 350km) some serious thought needs to go into routing. Maybe the answer is that this should be done via MQTT (internet), maybe the answer is that there should be an overlay backbone network to bridge long distances, maybe the answer lies in just improving the current type of routing. I don't know and it also depends a lot on what the project goals are.

I'm not sure what the best place for this type of discussion is, but I believe it's relevant here because if there's going to be a breaking change then at least make the most out of it. I'm also not trying to piss on this PR: I actually think it's great, I'm just wondering about the bigger picture.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 19, 2024

This is one of the biggest issues I have with the proposal: it uses up all the unused bits for its purposes and doesn't fix the versioning issue which causes the incompatibility nightmare.

But this is really beyond the scope of this PR. Currently these two bytes are useless anyway, because before 2.3 they could take any value depending on what was in that memory location. This is just one of the breaking changes currently proposed for 3.0 (fixing text compression is another one), I believe a change to the header to make it future-proof deserves its own PR. Edit: I now see @KyleMaas already opened a discussion for this: #3932

My overall assessment is that the mesh as a whole is pretty unusable other than to toy around and experiment (which is cool).

What kind of issues are you having and what makes you believe this has to do with routing? For broadcasts the current managed flooding approach is really not too bad. I'm not convinced there even exists a protocol that's guaranteed to be better in all the different use cases for Meshtastic (especially fixed or mobile nodes, or a mix), that doesn't require an internet connection and can run on microprocessors with limited RAM. For "smarter" routing you need periodic control messages, but those eat up the same bandwidth as regular traffic.

Other problems with wireless mesh networks, e.g. the hidden node problem or unstable/changing RF conditions are not solved with better routing.

@KyleMaas
Copy link

@GUVWAF Yeah, starting that discussion to ask about forward compatibility is what led me into joining the discussion on this issue. I have several places in mind to add nodes that would vastly increase coverage in my area, but they are inaccessible enough that once they're up there, I'm probably realistically only going to be able to update the firmware maybe every 12 months or so. So I need to wait for a version of the firmware that has enough interoperability, or have all of the proposed protocol changes interoperable enough with older versions, that I can actually make these nodes available to the community. All of the current nodes I run are in areas which are at least reasonably accessible. But there's no sense in me taking on the time, effort, and expense of building out the expansions if it's going to be completely obsolete in a month - it makes more sense to wait for stability and not recommend Meshtastic to anyone who isn't already in our local coverage zones. So protocol stability and interoperability is extremely important to me.

@garthvh
Copy link
Member

garthvh commented May 20, 2024

3.0 is like 18 months away best case. Honestly not really enough there yet to even consider it.

@roha-github
Copy link

@GUVWAF "Version 3.0 will most likely have a different default modem preset (people have advocated for MediumFast)"

Where are such far-reaching decisions discussed and decided? What exactly is the background? The minimal increase in bandwidth is "bought" with a loss of range and all 2.x routers & repeaters.

Alternatively, 3.x clients switch to Long / Fast or both networks run in parallel and apart.

@GUVWAF
Copy link
Member Author

GUVWAF commented May 20, 2024

Where are such far-reaching decisions discussed and decided?

This was discussed by admins on Discord, but is also just an idea right now. When 3.0 comes closer, there will likely be an RFC for this.
Let's now continue this discussion in #3932.

@GUVWAF
Copy link
Member Author

GUVWAF commented Aug 18, 2024

After looking at this again, I realized I made a big assumption that’s likely often not valid. I set the next_hop of a node to the one that was forwarding the ACK/response of a DM for that node. However, with asymmetric links, it might well be that that node is not the best next_hop, and it might even be you can’t reach it, while it can reach you. In my simulator I don’t simulate asymmetric links, but judging from how often people mention they receive “beacons” very well, but can’t send a message to someone, I think this is quite common.

Another thing I didn’t consider is that only the original transmitter will fall back to flooding when it doesn’t hear anyone rebroadcast, but it might also be that intermediate hops loose their next_hop link. In my fork I’ve fixed this by letting the intermediate hops also retransmit once with fall back to flooding, but I’m unsure about the possible overhead this creates.

At least before the first issue is resolved, I’m not confident on merging this yet. Happy to hear any suggestions on this.

@daramos
Copy link

daramos commented Sep 4, 2024

I think this PR is ready to merge, except that it’s a breaking change unfortunately. I did a test and it seems that the two unused bytes are not set to zero with current firmware. Meaning with this PR it would think someone set the next hop for a specific node and no one will relay the message. We’ll have to wait for 3.0 with this.

@GUVWAF I got curious about this and dug around the git history a bit. I think maybe a way to determine if you can depend on these bytes being initialized is to see if mp->hop_start is non-zero?
Looks like both #3479 and #3321 were both first publicly released in v2.3.2.63.

Since they were both publicly released in the same official release, I think you could use mp->hop_start being set as a condition to trust the last two bytes being zero-initialized. If hop_start is non-zero then the transmitting node is at least running a fw version that also initialized the unused bytes.

I think the only way this doesn't hold true is if someone compiled a firmware themselves between those two commits.

Unfortunately these really are the last free bits in the header. I dug around a bit see if there was a backwards-compatible way to add a trailing struct at the end of the packet for expansion but I don't see a way that would be possible with the way the rest of the packet is immediately memcpy'ed into a encrypted meshpacket variant.

But - I wonder if it would be possible to use the bytes to indicate protocol version? The benefit would be a way to accomplish mostly backwards compatible extensibility in the transport layer.

For instance, the first byte could indicate what version of the packet header is used (packet_version).
The second byte could be used to record the size of an extended header struct (extended_header_size). Firmware versions that are not up to date can use this field to skip bytes that are not in it's own extended header struct and get the payload data.

for instance:
packet_version: 0, extended_header_size: 0 - describes the current situation we're in. Both bytes are initialized to 0.

packet_version: 1, extended_header_size: 1 - We added a new byte field to the header (maybe a flags field), we'll increment the version number as well so that nodes running the latest firmware knows we are compatible with the new features. Older firmware nodes won't have the new feature but it knows to skip a byte after the header to start reading the payload.

packet_version: 2, extended_header_size: 1 - We've added a new flag or changed some functionality but didn't grow the header.

This of course would only be possible for non-broadcast messages currently but eventually if most clients upgrade to a version that support the extended header - we could deprecate the old handling of packet headers and use the extended header broadcast messages as well.

I think this also allows room for V3 without totally breaking interoperability with V2.

I don't know enough about Lora's MTU to know if it's even possible to increase the total packet size that way and this is probably the wrong place to discuss it anyway but I figured I'd throw that out here.
Thank you to all the developers of this project; you guys are awesome!

@GUVWAF
Copy link
Member Author

GUVWAF commented Sep 4, 2024

@daramos Indeed using hopStart to determine whether the bytes are valid is something that was discussed on Discord, and would be doable.

While I’m open to other ideas for those bytes, I’m not sure if a protocol version and extended header is worthwhile.

First, you will still need a breaking change to introduce this the first time, because old firmware wouldn’t know about an extended header.
Second, everything you add cannot be required; it can only be optional. If you keep on working with nodes that do not use a new functionality you introduced, those might break it. Sometimes a breaking change is for the better. Adding optional things can also be done in the encrypted part of the payload, which would still work for all nodes on a common channel.
Third, with this you can only add things to the header, and not change the existing one. It would thus only lead to more and more overhead, which is not insignificant for LoRa. We are already hitting almost the packet size limit for certain types.
Lastly, there are other things beside the header that might break compatibility. For example, a new default LoRa modem preset, new rules for who rebroadcasts a packet first, breaking changes to the protobuf encoding in the encrypted part, etc.

@daramos
Copy link

daramos commented Sep 4, 2024

Makes sense, thanks!

@KodinLanewave
Copy link
Contributor

So I was thinking... Is there any reason NeighborInfo packets must honor the same number of hops as other packet types? For NextHopRouting this could default to zero hops (only transmitted to immediately-connected nodes) as an example, severely reducing overhead on a high-saturation mesh. Any thoughts? I figure if you want telemetry from a node you can manually query NeighborInfo with a DM to get that data "phoned home" if you will, and otherwise neighbors are only sent to their adjacent nodes for supporting NextHop operation.

@CamFlyerCH
Copy link

So I was thinking... Is there any reason NeighborInfo packets must honor the same number of hops as other packet types? For NextHopRouting this could default to zero hops (only transmitted to immediately-connected nodes) as an example, severely reducing overhead on a high-saturation mesh.

I like this a lot! I exactly like to find a solution like this including telemetry. Would be great to have different Hop Settings for all messages sent by intervall with an additional option to only send them via MQTT. (for example for home node telemetry, where i don't want to disable TX on LoRa)

@GUVWAF
Copy link
Member Author

GUVWAF commented Oct 8, 2024

@KodinLanewave By default, if you don't have the NeighborInfo Module enabled, you'll also not rebroadcast those packets, so essentially they are already 0-hop if others are not interested in them.

I agree it would be nice to have different intervals for LoRa, PhoneAPI and MQTT (or disable one of them), but it also complicates things quite a lot. Maybe something for Meshtastic 3.0.

@meshtastic-bot
Copy link

This pull request has been mentioned on Meshtastic. There might be relevant details there:

https://meshtastic.discourse.group/t/swiss-meshtastic-user/9688/153

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.