Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External IP allocation query will break with huge IP ranges #1371

Open
bnaecker opened this issue Jul 7, 2022 · 3 comments
Open

External IP allocation query will break with huge IP ranges #1371

bnaecker opened this issue Jul 7, 2022 · 3 comments
Assignees
Labels
bug Something that isn't working. database Related to database access

Comments

@bnaecker
Copy link
Collaborator

bnaecker commented Jul 7, 2022

The query for allocating external IPs from an IP Pool range currently generates the sequence of all IPs in that range. That's inefficient, but works at this point. However, there's a latent bug here. That generates the sequence of IP addresses by computing offsets, using generate_series(0, last_address - first_address), and adding those offsets to the first address. That subtraction will not fit in an i64 if the address range is huge. For example:

demo@127.0.0.1:26257/movr> select 'fd00::'::INET - 'fc00::'::INET;
ERROR: result out of range
SQLSTATE: 22003
demo@127.0.0.1:26257/movr>

That's not super likely, but it should be detected and handled correctly, by taking the maximum allowed offset from the first address. Something like:

demo@127.0.0.1:26257/movr> select if('fc00::'::INET + 9223372036854775807 > 'fd00::'::INET, 'fd00::'::INET, 'fc00::'::INET + 9223372036854775807) - 'fc00::'::INET;
       ?column?
-----------------------
  9223372036854775807
(1 row)


Time: 0ms total (execution 0ms / network 0ms)

demo@127.0.0.1:26257/movr>

That constant is i64::MAX, and can be supplied on the Rust side or in the database. The IF is needed because there's no maximum function for INET types.

@bnaecker bnaecker added bug Something that isn't working. database Related to database access labels Jul 7, 2022
@bnaecker bnaecker self-assigned this Jul 7, 2022
@bnaecker
Copy link
Collaborator Author

This brings up a related and maybe larger question. The current query will actually not break, but have horrible performance if the range is huge but doesn't overflow an i64. Because all subqueries are currently cached in their entirety in memory, CRDB will happily generate and store a sequence of i64::MAX - 1 addresses, assuming the system has that much memory. We need to also limit the size of ranges.

However, the size of ranges isn't enough. The existing query searches over all ranges, of all pools, that are not reserved for another project. So even if we limit the size of each range, that's not enough to limit the size of the sequence generated by this query.

One way to do that would be limit the size of each pool, and fail requests to add a new range if the resulting size would be larger than that limit. We'd also have to limit the number of pools, to have a hard cap on the size of that query.

Another option would be to rewrite the query. This has come up in a number of ways, but we would really like a way to write queries that return the next available item without generating the full sequence of available items. As an example, we'd keep track of the allocated IPs in rows that track the first and last free address; allocating the next IP is finding the row with the lowest free address and bumping that by 1. Freeing an address is harder though, since we'll have to handle coalescing ranges in a few different ways, depending on whether there are neighboring ranges.

@bnaecker
Copy link
Collaborator Author

I talked with @davepacheco about this. A short-term solution is to (1) limit the size of IP Pool ranges at this point, and (2) work in batches (both here and other similar queries). For the second part, we'd modify the query to accept a start and batch size, and then issue the same query with those as the first and second argument to the generate_series call which creates the sequence of IP addresses for each pool and range.

This changes the order in which the addresses will be allocated. Previously, it would be sequential over all pools and IP ranges. The addresses would now be striped across the ranges, though still sequential within reach stripe. That's not a problem I don't think, just a difference.

@bnaecker
Copy link
Collaborator Author

It occurs to me that this same problem will affect guest VPC-private addresses in the network_interface table as well. Those use a NextItem query, which internally does generate_series(0, n_addresses_in_subnet - 1). When we actually support IPv6 addresses for those interfaces, that query will not never complete, given CRDB's implementation of materializing all subqueries. This needs to be fixed, e.g., by:

  • limiting the number of addresses we search in one query, ideally inside a retry loop.
  • use the still-hypothetical mechanism described elsewhere, tracking the IP addresses in a free-list-esque scheme
  • change the query, e.g., to use an index on the (subnet, IP address) that would allow quickly finding the next address in the given subnet. That doesn't work well with multiple versions of IP addresses. I believe CRDB sorts all IPv4 addresses before any IPv6 address, so it's not obvious to me that'll support quickly finding the next address of a particular protocol version.

leftwo pushed a commit that referenced this issue Jun 26, 2024
Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts.  These scripts are extracted into
the global zone at /opt/oxide/crucible_dtrace/

Update Crucible to latest includes these updates:
Clean up dependency checking, fixing space leak (#1372)
Make a DTrace package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371)
Remove `WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366)
Move 'do work for one job' into a helper function (#1365)
Remove `DownstairsWork` from map when handling it (#1361)
Using `block_in_place` for IO operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369)
Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350)
Support arbitrary Volumes during replace compare (#1349)
Remove the SQLite backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341)
Move Work into ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333)
Move disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330)
Move cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328)
Move remaining local state into a `struct ConnectionState` (#1327)
Consolidate negotiation + IO operations into one loop (#1322)
Allow replacement of a target in a read_only_parent (#1281)
Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326)
Move negotiation into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317)
Reuse a reqwest client when creating repair client (#1324)
Add % to keep buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314)
Added more DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)
leftwo added a commit that referenced this issue Jun 26, 2024
Update Crucible and Propolis to the latest

Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts. These scripts are extracted into the 
global zone at /opt/oxide/crucible_dtrace/

Crucible latest includes these updates:
Clean up dependency checking, fixing space leak (#1372) Make a DTrace
package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371) Remove
`WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366) Move 'do
work for one job' into a helper function (#1365) Remove `DownstairsWork`
from map when handling it (#1361) Using `block_in_place` for IO
operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer
configuration (#1369) Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350) Support
arbitrary Volumes during replace compare (#1349) Remove the SQLite
backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341) Move Work into
ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333) Move
disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330) Move
cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328) Move remaining
local state into a `struct ConnectionState` (#1327) Consolidate
negotiation + IO operations into one loop (#1322) Allow replacement of a
target in a read_only_parent (#1281) Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326) Move negotiation
into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317) Reuse a
reqwest client when creating repair client (#1324) Add % to keep
buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314) Added more
DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)

Propolis just has this one update:
Allow boot order config in propolis-standalone
---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working. database Related to database access
Projects
None yet
Development

No branches or pull requests

1 participant