Forward new external IP addresses to Dendrite #1464
Comments
One other thought here. In #823, we've been tracking some way of knowing whether any given bootstrap agent is on a Scrimlet. In the product, we'll ultimately know this only when (and whether) a Tofino PCIe device comes up. In the meantime, using the presence of … One of the details of this is that we'll need to add an …
Related to that: the …
@Nieuwejaar Thanks, that's helpful. We can block starting the …
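For illustration only, a heavily hedged sketch of the "presence of the device" check discussed above; the device path here is a placeholder, not the Tofino node's real location or the mechanism #823 ultimately lands on.

```rust
use std::path::Path;

/// Placeholder path for the Tofino PCIe device node; the real check would
/// consult the actual device tree location (or whatever #823 decides on).
const TOFINO_DEV: &str = "/devices/pseudo/tofino"; // hypothetical

/// Treat the presence of the device node as "this sled is a Scrimlet".
fn tofino_present() -> bool {
    Path::new(TOFINO_DEV).exists()
}

fn main() {
    if tofino_present() {
        println!("Tofino device found; ok to bring up switch-side services");
    } else {
        println!("no Tofino device; hold off on starting the daemon");
    }
}
```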
I'd like to have Nexus be fully independent of whatever mechanism sled agent chooses - so if sled agent uses …

For example, when Sled Agent makes the call to Nexus (to upsert a row to the … table):

omicron/nexus/src/internal_api/http_entrypoints.rs, lines 74 to 89 at b858ed7
This could have some auxiliary info identifying … Then, if we shift away from …
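To make the "auxiliary info" idea concrete, here is a rough sketch of the kind of payload sled agent might hand Nexus when it registers itself; the type and field names are hypothetical, not the actual types behind that endpoint.

```rust
use std::net::SocketAddrV6;

/// Hypothetical: the sled's own view of whether it is attached to a Sidecar.
#[derive(Debug)]
enum SledRole {
    Gimlet,   // ordinary compute sled
    Scrimlet, // sled attached to a switch
}

/// Hypothetical shape of the upsert payload sled agent sends to Nexus.
#[derive(Debug)]
struct SledAgentInfo {
    /// Address of the sled agent's own API server.
    address: SocketAddrV6,
    /// Auxiliary info: the role this sled believes it has.
    role: SledRole,
}

fn main() {
    let info = SledAgentInfo {
        address: "[fd00:1122:3344:101::1]:12345".parse().unwrap(),
        role: SledRole::Scrimlet,
    };
    println!("registering with Nexus: {:?}", info);
}
```

The appeal of this shape is the one described above: if the detection mechanism on the sled changes later, only the producer of this value changes, and Nexus keeps consuming the same field.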
Yeah, for sure. I was assuming sled-agent would have some …
If the device goes away, then we should prevent …
It's much more helpful to services to have an API server that is up and returns a 500 when it can't process the request, rather than one that is down and tries to cascade that. The service will still be listed in DNS, and trying to design a system to correctly enable/disable it seems like quite a lot of extra work given the SMF maintenance-mode behavior. When we have other entities in a distributed system whose dependents come and go, they just handle it and communicate upstack what's down, rather than relying on something else to have to be there. I realize that this is a bit different - the daemon initialization and state tracking are very different, and we may have to figure out a way to get dpd a callback notification when the instance wants to close so we can clean up refs - but it seems worth considering. Dunno. On the other hand, if it's going to be up to sled agent to insert the device into the zone every time, maybe we'll have an easier place to enable and disable the daemon and that'll be ok. But I think it'll still be weird, since we'll have things that want to create TCP connections for this and make requests, and it seems healthier and easier to diagnose what's going on if we can get a semantic 500 saying "there's no Tofino here" versus TCP connection timeouts.
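A minimal sketch of that "stay up and answer with a semantic error" behavior, using hypothetical types (this is not dpd's actual API):

```rust
use std::net::Ipv4Addr;

/// Hypothetical error type: distinguishes "no ASIC to program" from other
/// internal failures, so callers get a descriptive error instead of a
/// connection timeout against a dead service.
#[derive(Debug)]
enum ApiError {
    NoTofino,
    Internal(String),
}

struct Daemon {
    tofino_attached: bool,
}

impl Daemon {
    /// Sketch of a request handler: the server keeps answering even when
    /// the hardware is absent.
    fn add_nat_entry(&self, external_ip: Ipv4Addr) -> Result<(), ApiError> {
        if !self.tofino_attached {
            return Err(ApiError::NoTofino);
        }
        // ... program the ASIC tables here ...
        let _ = external_ip;
        Ok(())
    }
}

fn main() {
    let d = Daemon { tofino_attached: false };
    match d.add_nat_entry("203.0.113.10".parse().unwrap()) {
        Ok(()) => println!("entry added"),
        Err(e) => println!("request refused, but the service stayed up: {:?}", e),
    }
}
```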
I hadn't thought of that, but it seems obvious now that you've said it. I'm sold.
I was just rereading the issue description and noticed "the rack IPv6 network". In a single-rack environment, we need some way to know which of the two sidecars owns the port with the guest's external IP. In a multi-rack environment, we'll need some way to identify the rack on which its external IP exists.
Based on the proposed API in RFD 267, like this example, a Sidecar port (called …) … Later in that example, an IP pool is created, and an address range within that subnet is provided. It's not really clear to me how to implement this efficiently or store it all in the control plane database, but technically we have all the information we need there to describe the relationship between a guest external IP address, the pool it's derived from, and the Sidecar port(s) through which traffic destined for that guest must arrive. While multi-rack is something I've not thought much about in this context, the rack ID is also part of that API path, so we can link up the above pieces with a rack as well.
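Purely as an illustration of those relationships (and not the actual control plane schema), a sketch of how the pieces could link up - guest external IP, the pool it came from, the Sidecar port(s) serving that range, and the rack:

```rust
use std::net::Ipv4Addr;

// Hypothetical types: just the linkage described above, not real schema.
#[derive(Debug)]
struct SidecarPort {
    rack_id: u32,      // the rack this Sidecar lives in
    port_name: String, // e.g. an uplink port on that Sidecar
}

#[derive(Debug)]
struct IpPool {
    range_first: Ipv4Addr,
    range_last: Ipv4Addr,
    /// The port(s) through which traffic for this range arrives.
    ports: Vec<SidecarPort>,
}

#[derive(Debug)]
struct GuestExternalIp {
    address: Ipv4Addr,
    /// The pool the address came from, which links back to ports and rack.
    pool: IpPool,
}

fn main() {
    let ip = GuestExternalIp {
        address: "203.0.113.10".parse().unwrap(),
        pool: IpPool {
            range_first: "203.0.113.1".parse().unwrap(),
            range_last: "203.0.113.254".parse().unwrap(),
            ports: vec![SidecarPort { rack_id: 0, port_name: "qsfp0".into() }],
        },
    };
    println!("{:?}", ip);
}
```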
The mapping we are talking about here is …
This mapping is realized concretely as the NAT ingress table in the P4 code. We cannot tie the need for this table entry to the assignment of sidecar IP addresses. Consider the case where an entire subnet of external IPs is routed to the rack and the IP addresses used between the sidecar and the upstream router are unrelated to that subnet. In the example below, the customer has designated …
It's also becoming increasingly common to route IPv4 addresses over IPv6 link-local addresses to avoid burning v4 addresses (RFC 5549).
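To make the shape of the mapping under discussion concrete, here is a minimal sketch with hypothetical names and example addresses (not dendrite's actual table types):

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Hypothetical NAT-ingress entry: "traffic arriving for this external
/// address gets encapsulated toward this sled's underlay address".
#[derive(Debug)]
struct NatIngressEntry {
    external_ip: Ipv4Addr,
    sled_underlay_addr: Ipv6Addr,
}

fn main() {
    // Example addresses only.
    let entry = NatIngressEntry {
        external_ip: "203.0.113.10".parse().unwrap(),
        sled_underlay_addr: "fd00:1122:3344:101::5".parse().unwrap(),
    };
    println!(
        "ingress {} -> encap toward {}",
        entry.external_ip, entry.sled_underlay_addr
    );
}
```

Note that nothing in the entry refers to how the sidecar/upstream link itself is numbered, which is the point being made above: the table entry is needed regardless of how sidecar IP addresses are assigned.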
The point I was trying to make is that the table entry has to land in a specific sidecar's Tofino tables. In a multi-rack environment, presumably only a subset of the racks will have external connectivity. Thus, we have to assume that any given guest's NAT traffic may go through a sidecar on a different rack, and something in …
Yeah, totally. In a multirack setup, I think it makes sense for a sidecar to be designated as a "NAT-ing" sidecar via the API, as I can't think of a good bullet-proof way to determine this dynamically from other information. We could use the presence of an egress routing configuration (static, BGP, etc.) as an indicator. However, a sidecar could be used purely for ingress without any egress, and that strategy would fall apart.
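A tiny sketch of the "designate it via the API" option, with hypothetical field names, just to show the explicit flag versus inferring NAT responsibility from routing configuration:

```rust
/// Hypothetical per-sidecar configuration: NAT responsibility is declared
/// explicitly rather than inferred from egress routing configuration.
#[derive(Debug)]
struct SidecarConfig {
    name: String,
    /// Set by the operator/API when this sidecar provides the external
    /// connectivity that guest NAT traffic should use.
    nat_designated: bool,
}

fn main() {
    let sidecars = vec![
        SidecarConfig { name: "sidecar-a".into(), nat_designated: true },
        SidecarConfig { name: "sidecar-b".into(), nat_designated: false },
    ];
    for s in &sidecars {
        println!("{}: NAT-ing = {}", s.name, s.nat_designated);
    }
}
```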
@internet-diglett @FelixMcFelix Has this actually been completed now? I think so, but y'all have done most of the work, so you can answer better!
My understanding is that we're in a good place on the v4 front, but as you've indicated via #5090, we aren't yet there for v6.
Background
All guest instances will have private IP addresses in their VPC Subnet. To communicate with the outside world, they'll also need external addresses. These addresses ultimately come from operators, for example as part of setting up IP Pools, since they are addresses under the customer's control (e.g., public IPv4 addresses they own or addresses within their datacenter). We're currently passing these addresses to OPTE. When the guest makes an outbound network connection, OPTE will rewrite the packet to use the guest's external address and encapsulate it in a rack-specific IPv6 packet destined for the switch.
The P4 program running on the switch decapsulates this, and delivers it to the broader customer network.
On the way back in, the reverse process needs to happen: encapsulating the external packet in a rack-specific IPv6 packet, destined for the right sled. The Dendrite data-plane daemon, dpd, needs to know what the "right" sled is. This issue tracks the initial work communicating the external-IP-to-sled mapping out to dpd.
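As a rough sketch of what "communicating the mapping out to dpd" could amount to from the control plane's side, here is a hypothetical interface (not dpd's actual API), with example addresses:

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Hypothetical client interface for pushing the mapping to dpd.
trait NatMappingClient {
    /// Tell dpd that traffic for `external_ip` should be encapsulated
    /// toward the sled at `sled_underlay_addr`.
    fn set_mapping(&self, external_ip: Ipv4Addr, sled_underlay_addr: Ipv6Addr);
    /// Remove the mapping when the address is released or the guest moves.
    fn clear_mapping(&self, external_ip: Ipv4Addr);
}

/// Stub implementation that only logs, to show the call sites.
struct LoggingClient;

impl NatMappingClient for LoggingClient {
    fn set_mapping(&self, external_ip: Ipv4Addr, sled_underlay_addr: Ipv6Addr) {
        println!("dpd: {} -> encap toward {}", external_ip, sled_underlay_addr);
    }
    fn clear_mapping(&self, external_ip: Ipv4Addr) {
        println!("dpd: remove mapping for {}", external_ip);
    }
}

fn main() {
    let dpd = LoggingClient;
    // When a guest with external IP 203.0.113.10 lands on a sled:
    dpd.set_mapping(
        "203.0.113.10".parse().unwrap(),
        "fd00:1122:3344:101::5".parse().unwrap(),
    );
    // When the guest stops, migrates away, or releases the address:
    dpd.clear_mapping("203.0.113.10".parse().unwrap());
}
```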
Initial thoughts
The control plane needs to communicate the mapping from external IP address to the sled "hosting" that address. This needs to happen in a few places: