Fail if there isn't available space for a disk #1231
Conversation
Nexus' current region allocation algorithm does not check whether dataset usage goes over the zpool's total size. Add this check, plus a test to validate it.

The algorithm currently does an all-or-nothing check: it tries to find three datasets with available space, and if it cannot then (with this commit) it returns an error. In this commit, the test `test_disk_backed_by_multiple_regions` fails; when the algorithm changes to support chunking a disk across multiple dataset triples, that test will pass. The failing test was committed anyway to make this limitation explicit to people searching the tests for usage examples.
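For reference, the all-or-nothing behavior described above amounts to something like the following sketch (illustrative types and names only, not the actual datastore query):

```rust
/// Sketch of the all-or-nothing allocation: pick any three datasets with room
/// for the requested region, or fail the whole allocation.
fn pick_three_datasets(
    candidates: &[(u64 /* free bytes */, usize /* dataset index */)],
    needed: u64,
) -> Result<Vec<usize>, &'static str> {
    let picked: Vec<usize> = candidates
        .iter()
        .filter(|(free, _)| *free >= needed)
        .map(|(_, idx)| *idx)
        .take(3)
        .collect();
    if picked.len() == 3 {
        Ok(picked)
    } else {
        // No chunking across multiple dataset triples yet: all or nothing.
        Err("not enough available space")
    }
}
```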
nexus/src/db/datastore.rs
Outdated
if size_used > zpool_total_size {
    return Err(TxnError::CustomError(
        RegionAllocateError::NotEnoughAvailableSpace,
    ));
}
This only compares the size of the requested datasets relative to the entire zpool, not to other datasets, right?
So if we have a zpool of size "10 GiB", any number of datasets could be allocated to it, as long as each comes in at "under 10 GiB used space"? Like, if we let callers allocate 1,000 datasets of 5 GiB each on that zpool, that's clearly a problem.
I definitely think this change is an improvement, but I'm concerned that it makes this check look "done", when IMO this is a highly related case that doesn't seem solved.
this is true. I feel like the dataset object should have a quota field, and the comparison should be against that instead. thoughts?
The current `size_used` value is a bit of a sham right now:
omicron/nexus/src/db/model/dataset.rs
Lines 52 to 55 in a399b09
let size_used = match kind {
    DatasetKind::Crucible => Some(0),
    _ => None,
};
I agree with you about the notion of a quota. Perhaps we do the following:
- `size_used` is renamed to `quota`
- We apply the quota when initializing the dataset
- We make it non-optional
Without all those steps, it seems possible for one rogue dataset to consume all the space on disk, which kinda defeats the purpose of this accounting. WDYT?
(If we want to do this implementation as a follow-up, we can file an issue and iterate)
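For concreteness, one way to read that proposal is sketched below (field names and plain integer types are illustrative, not the actual omicron model):

```rust
/// Illustrative sketch of a dataset that carries a mandatory quota.
#[derive(Debug)]
struct Dataset {
    /// Hard upper bound on bytes this dataset may consume (non-optional).
    quota: i64,
    /// Bytes currently accounted to regions in this dataset.
    size_used: i64,
}

impl Dataset {
    /// The quota is applied when the dataset is initialized.
    fn new(quota: i64) -> Self {
        Self { quota, size_used: 0 }
    }

    /// A region of `requested` bytes only fits if it stays under this
    /// dataset's quota, so one rogue dataset can't consume the whole disk.
    fn can_fit(&self, requested: i64) -> bool {
        self.size_used.saturating_add(requested) <= self.quota
    }
}
```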
I went with summing up all datasets in a zpool, and comparing to the zpool's total size. See 7b5686f.
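In plain terms, that check amounts to something like the following sketch (illustrative types; the real version runs inside the region-allocation transaction in 7b5686f):

```rust
/// Sum `size_used` across every dataset in the zpool and only allow the
/// allocation if the requested bytes still fit under the zpool's total size.
fn zpool_has_room(
    zpool_total_size: u64,
    dataset_sizes_used: &[u64],
    requested: u64,
) -> bool {
    let size_used: u64 = dataset_sizes_used.iter().sum();
    size_used.saturating_add(requested) <= zpool_total_size
}

fn main() {
    const GIB: u64 = 1024 * 1024 * 1024;
    // Three datasets already using 3 GiB each leave 1 GiB of headroom in a
    // 10 GiB zpool, so a 2 GiB request must be rejected.
    assert!(!zpool_has_room(10 * GIB, &[3 * GIB; 3], 2 * GIB));
}
```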
fix bug in regions_hard_delete: reset size_used for all datasets, not just the first one returned.
nexus/src/db/datastore.rs
Outdated
for (dataset_id, size) in datasets_id_and_size {
    diesel::update(dataset_dsl::dataset)
        .filter(dataset_dsl::id.eq(dataset_id))
        .set(dataset_dsl::size_used.eq(dataset_dsl::size_used - size))
        .execute_async(self.pool())
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
                "error updating dataset space: {:?}",
                e
            ))
        })?;
}
I don't think this sequence of operations is idempotent, which I think makes it problematic in the context of a saga.
If a saga fails partway through, it may be retried. This can happen in the "action" direction (during the "disk delete" saga) or in the "undo" direction (during the "disk create" saga). Admittedly, it looks like this was true before this PR too, but I think it became more obvious to me with the addition of a `for` loop here.
The easiest short-term fix here is probably to move these operations into a transaction - otherwise:
- If we crash between "delete" and "update", the size accounting will be wrong
- If we crash between updating some but not all of the datasets, the accounting will be wrong
Instead of adding and subtracting, `size_used` is now summed up after regions are inserted or deleted, meaning both `region_allocate` and `regions_hard_delete` should be idempotent.
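The recompute-from-scratch idea looks roughly like this sketch (illustrative types; the real code does the equivalent sum over the region table in SQL):

```rust
#[derive(Clone, Copy)]
struct Region {
    block_size: i64,
    blocks_per_extent: i64,
    extent_count: i64,
}

/// Recompute a dataset's `size_used` from the regions that currently belong
/// to it. Nothing is incrementally added or subtracted, so re-running this
/// after a saga retry yields the same answer, i.e. the step is idempotent.
fn recompute_size_used(regions: &[Region]) -> i64 {
    regions
        .iter()
        .map(|r| r.block_size * r.blocks_per_extent * r.extent_count)
        .sum()
}
```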
Remove `impl Default` for IdentityMetadata. Remove checking for saga to be done in test.
Instead of adding and subtracting the dataset's `size_used` field, compute it each time by summing up the sizes of all regions that belong to the dataset. Also sum up all regions belonging to a zpool to validate that the zpool is not running out of space. This involved changing the disk-related tests to create three zpools, each with one dataset, instead of one zpool with three datasets. Code that was iterating over each dataset now has to iterate over zpools, then over the datasets in each zpool. Updated the tests that now correctly check for out-of-space.
This should be good for a re-review now.
One potential fallout of this PR going in is that the virtual hardware script will need to make larger file-based vdevs, because the 10 GiB zpools won't be able to have more than that size allocated to them. We may want a way to use actual disks too, but that would require a machine to have three additional disks.
impl TryFrom<diesel::pg::data_types::PgNumeric> for ByteCount {
    type Error = anyhow::Error;
Is there a reason to do the conversion to/from the `PgNumeric` type specifically? Why not the https://docs.diesel.rs/master/diesel/sql_types/struct.Numeric.html type? (Coping with individual digits one at a time seems like a pain in the butt.)
It is a pain, but I don't see a way to chain `Into` or `From` here to go from either `PgNumeric` or `Numeric` into `i64`.
Taking a step back - I don't just mean this function. I mean, why are we using `PgNumeric` in the db? There are other DB types which could be used instead here, like https://docs.diesel.rs/master/diesel/sql_types/struct.BigInt.html, if you're just going to/from `i64`.
We're not using `PgNumeric` in the db - as far as I understand, `diesel::sql_types::Numeric` is being returned due to the use of the sum. Trying to use any other type, I kept running into errors like
error[E0277]: the trait bound `BigInt: FromSql<diesel::sql_types::Numeric, Pg>` is not satisfied
--> nexus/src/db/datastore/region.rs:214:26
|
214 | .get_result(conn)?;
| ^^^^^^^^^^ the trait `FromSql<diesel::sql_types::Numeric, Pg>` is not implemented for `BigInt`
|
and eventually just went with `PgNumeric` because I could get at the enum variants to actually turn it into `ByteCount`.
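For reference, getting from those enum variants to an integer byte count might look roughly like the following (a hedged sketch assuming the summed value is a non-negative integer with no fractional base-10000 digits; the actual `TryFrom<PgNumeric>` impl in the PR may differ):

```rust
use anyhow::{anyhow, bail};
use diesel::pg::data_types::PgNumeric;

/// Interpret a non-negative, integral PgNumeric as an i64. PgNumeric stores
/// base-10000 digits; `weight` is the power of 10000 of the first digit.
fn pg_numeric_to_i64(value: &PgNumeric) -> anyhow::Result<i64> {
    let (weight, digits) = match value {
        PgNumeric::Positive { weight, digits, .. } => (*weight, digits),
        PgNumeric::Negative { .. } => bail!("negative value for a byte count"),
        PgNumeric::NaN => bail!("NaN value for a byte count"),
    };
    let mut total: i64 = 0;
    for (i, digit) in digits.iter().enumerate() {
        let exponent = i64::from(weight) - i as i64;
        if exponent < 0 {
            // A fractional digit would mean the sum was not an integer.
            bail!("unexpected fractional digits in byte count");
        }
        let magnitude = 10_000_i64
            .checked_pow(exponent as u32)
            .ok_or_else(|| anyhow!("byte count overflows i64"))?;
        total = i64::from(*digit)
            .checked_mul(magnitude)
            .and_then(|v| total.checked_add(v))
            .ok_or_else(|| anyhow!("byte count overflows i64"))?;
    }
    Ok(total)
}
```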
- The usage of `Numeric`, or the usage of `PgNumeric`, to me implies "this is a variable precision number, not an integer"
- The usage of `smallint`/`integer`/`bigint` implies "this is an integer, of a particular size"
"byte count" seems like it should be an integer, not a variable-precision number, right?
Also, AFAICT:
- `sum` acts on types which are `Foldable`
- `Foldable` is implemented for the integer types: https://docs.diesel.rs/master/diesel/sql_types/trait.Foldable.html#impl-Foldable-for-BigInt
Right but if I follow the Foldable link to the source, I see
foldable_impls! {
...
sql_types::BigInt => (sql_types::Numeric, sql_types::Numeric),
...
}
which I think means that the sum widens the type from BigInt to Numeric?
got it. Okay, don't hold up the PR on this point. I think we're in agreement that "it would be nice if bytes could be integer types", but if the implicit SQL conversions make that difficult, we can punt on it.
(But dang, would be nice if we could coerce them to stay as integer-based types!)
disk_source: params::DiskSource::Blank {
    block_size: params::BlockSize::try_from(4096).unwrap(),
},
size: ByteCount::from_gibibytes_u32(10),
If you're relying on this being 10 GiB in your tests, I recommend some way of checking that in the test itself. I'm seeing a lot of comments in `test_disk_backed_by_multiple_region_sets` that look like they expect the disk / zpool size to be exactly 10 gibibytes - maybe we should pull this into a `pub const DISK_SIZE_GiB: usize = 10`?
If someone else writing a disk-based test modifies this (seemingly arbitrary) value tomorrow:
- Best (?) case: your test would still pass, as the size would be "within expected bounds"
- Bad case: your test would suddenly fail
- Really bad case: your test would pass, even if it wasn't supposed to
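Something along these lines, for instance (identifier and helper are illustrative; the actual constant added in c2e713a may differ):

```rust
/// Size of the disk (and backing zpools) used by these tests, in GiB.
pub const DISK_SIZE_GIB: u32 = 10;

/// Deriving both the disk-create request and any capacity assertions from the
/// same constant keeps the tests honest if someone later changes the value.
pub fn disk_size_bytes() -> u64 {
    u64::from(DISK_SIZE_GIB) * 1024 * 1024 * 1024
}
```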
Got it, good suggestion - implemented in c2e713a
nexus/src/db/datastore/region.rs
Outdated
.filter(
    dataset_dsl::kind
        .eq(DatasetKind::Crucible),
)
Should we actually be filtering by this?
- Suppose we're using a 10 GiB zpool
- Suppose 5 GiB of that pool is used by crucible
- Suppose 4 GiB of that pool is used by Cockroach + Clickhouse
- Suppose someone tries requesting a 2 GiB region
According to this query, the allocation should succeed, using 11 GiB of the 10 GiB zpool (!).
I filed #1630 to track this more completely, but TL;DR I don't think we should be filtering by dataset kind here. I think we care about all datasets within the zpool.
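The arithmetic in that scenario, spelled out (illustrative numbers only):

```rust
fn main() {
    let zpool_total_gib = 10;
    let crucible_used_gib = 5;
    let other_datasets_gib = 4; // Cockroach + Clickhouse
    let requested_gib = 2;

    // Filtering to Crucible datasets only, the check sees 5 + 2 <= 10 and
    // allows the allocation...
    assert!(crucible_used_gib + requested_gib <= zpool_total_gib);
    // ...even though the zpool would then hold 5 + 4 + 2 = 11 GiB of 10 GiB.
    assert!(
        crucible_used_gib + other_datasets_gib + requested_gib > zpool_total_gib
    );
}
```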
Agreed, yeah. I reordered the operations in the transaction to first insert all the regions, then check if each zpool's total_size was exceeded (by asking about every dataset, not just crucible). Done in 181c5d0
.select(diesel::dsl::sum(
    region_dsl::block_size
        * region_dsl::blocks_per_extent
        * region_dsl::extent_count,
))
I can see that my prior comment about "do this check for all dataset types, not just crucible" makes this more complicated. Could we perform this calculation using `size_used`?
omicron/common/src/sql/dbinit.sql
Lines 172 to 174 in 48285ce
/* An upper bound on the amount of space that might be in-use */
size_used INT
);
It kinda looks like you're updating this value below anyway - my main point here is that "we should be considering all the non-Crucible datasets too; they also take up space". `size_used` seems like the dataset-type-agnostic way of doing that.
Done in 181c5d0
NexusRequest::new(
    RequestBuilder::new(client, Method::POST, &disks_url)
        .body(Some(&new_disk))
        // TODO: this fails! the current allocation algorithm does not split
Nit: can we attach a bug?
Opened #1644
This PR looks good, but I'd like to make the check validate size across all datasets using a zpool, not just crucible's usage. We can enforce that other datasets accurately report / enforce their sizes as a follow-up, if you want: #1630
which is strange? I'm rerunning it now.
This is also now done, so I'm merging now.
Propolis changes:
- Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
- Attempt to set WCE for raw file backends
- Fix clippy/lint nits for rust 1.77.0

Crucible changes:
- Correctly (and robustly) count bytes (#1237)
- test-replay.sh fix name of DTrace script (#1235)
- BlockReq -> BlockOp (#1234)
- Simplify `BlockReq` (#1218)
- DTrace, cmon, cleanup, retry downstairs connections at 10 seconds. (#1231)
- Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit:
- Return *410 Gone* if volume is inactive (#1232)
- Update Rust crate opentelemetry to 0.22.0 (#1224)
- Update Rust crate base64 to 0.22.0 (#1222)
- Update Rust crate async-recursion to 1.1.0 (#1221)
- Minor cleanups to extent implementations (#1230)
- Update Rust crate http to 0.2.12 (#1220)
- Update Rust crate reedline to 0.30.0 (#1227)
- Update Rust crate rayon to 1.9.0 (#1226)
- Update Rust crate nix to 0.28 (#1223)
- Update Rust crate async-trait to 0.1.78 (#1219)
- Various buffer optimizations (#1211)
- Add low-level test for message encoding (#1214)
- Don't let df failures ruin the buildomat tests (#1213)
- Activate the NBD server's psuedo file (#1209)

Co-authored-by: Alan Hanson <alan@oxide.computer>