Fail if there isn't available space for a disk #1231

Merged: jmpesp merged 14 commits into main from not_enough_available_space on Aug 26, 2022

Conversation

jmpesp commented Jun 20, 2022

Nexus' current region allocation algorithm does not check whether dataset usage goes over the zpool's total size. Add this check, plus a test to validate it.

The algorithm currently does an all-or-nothing check: it tries to find three datasets with available space, and if it cannot then (with this commit) it returns an error. In this commit, the test `test_disk_backed_by_multiple_regions` fails; when the algorithm changes to support chunking a disk across multiple dataset triples, that test will pass. The reason it was committed anyway is to make this behavior explicit to people searching the tests for usage examples.
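
A rough sketch of the all-or-nothing behavior described above, using hypothetical simplified types rather than Nexus' actual datastore code:

// Illustrative only: the real check runs inside a database transaction in Nexus'
// datastore, against the dataset and zpool tables.
struct Dataset {
    id: u64,
    zpool_total_size: u64,
    size_used: u64,
}

// Pick three datasets that can hold the requested region, or fail the whole
// allocation; there is no chunking across multiple dataset triples yet.
fn allocate_three(datasets: &[Dataset], region_size: u64) -> Result<Vec<u64>, &'static str> {
    let picked: Vec<u64> = datasets
        .iter()
        .filter(|d| d.size_used + region_size <= d.zpool_total_size)
        .take(3)
        .map(|d| d.id)
        .collect();
    if picked.len() < 3 {
        return Err("not enough available space");
    }
    Ok(picked)
}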

jmpesp requested a review from smklein June 28, 2022 19:29
smklein self-assigned this Jun 28, 2022
Comment on lines 643 to 647
if size_used > zpool_total_size {
    return Err(TxnError::CustomError(
        RegionAllocateError::NotEnoughAvailableSpace,
    ));
}
Collaborator

This only compares the size of the requested datasets relative to the entire zpool, not to other datasets, right?

So if we have a zpool of size "10 GiB", any number of datasets could be allocated to it, as long as each comes in at "under 10 GiB used space"? Like, if we let callers allocate 1,000 datasets of 5 GiB each on that zpool, that's clearly a problem.

I definitely think this change is an improvement, but I'm concerned that it makes this check look "done" when, IMO, this highly related case doesn't seem solved.

Contributor Author

This is true. I feel like the dataset object should have a quota field, and the comparison should be made against that instead. Thoughts?

Collaborator

The current size_used value is a bit of a sham right now:

let size_used = match kind {
    DatasetKind::Crucible => Some(0),
    _ => None,
};

I agree with you about the notion of a quota. Perhaps we do the following:

  • size_used is renamed to quota
  • We apply the quota when initializing the dataset
  • We make it non-optional

Without all those steps, it seems possible for one rogue dataset to consume all the space on disk, which kinda defeats the purpose of this accounting. WDYT?

(If we want to do this implementation as a follow-up, we can file an issue and iterate)
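
A hypothetical shape for that proposal, for illustration only (this is not the actual Nexus model or schema):

// Illustrative: the quota is mandatory for every dataset kind and is set when the
// dataset is initialized, so no single dataset can quietly consume the whole zpool.
enum DatasetKind {
    Crucible,
    Cockroach,
    Clickhouse,
}

struct Dataset {
    kind: DatasetKind,
    quota: i64, // formerly an optional `size_used`; now required for every kind
}

fn new_dataset(kind: DatasetKind, quota: i64) -> Dataset {
    Dataset { kind, quota }
}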

Contributor Author

I went with summing up the usage of all datasets in a zpool and comparing that to the zpool's total size. See 7b5686f.

Fix bug in regions_hard_delete: reset size_used for all datasets, not just the first one returned.
Comment on lines 710 to 722
for (dataset_id, size) in datasets_id_and_size {
    diesel::update(dataset_dsl::dataset)
        .filter(dataset_dsl::id.eq(dataset_id))
        .set(dataset_dsl::size_used.eq(dataset_dsl::size_used - size))
        .execute_async(self.pool())
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
                "error updating dataset space: {:?}",
                e
            ))
        })?;
}
Collaborator

I don't think this sequence of operations is idempotent, which I think makes it problematic in the context of a saga.

If a saga fails partway through an operation, it may be retried. This can happen in the "action" direction (during the "disk delete" saga) or in the "undo" direction (during the "disk create" saga). Admittedly, it looks like this was true before this PR too, but I think it became more obvious to me with the addition of a for loop here.

The easiest short-term fix here is probably to move these operations into a transaction - otherwise:

  • If we crash between "delete" and "update", the size accounting will be wrong
  • If we crash between updating some but not all of the datasets, the accounting will be wrong

Contributor Author

Instead of adding and subtracting, size_used is now summed up after regions are inserted or deleted, meaning both region_allocate and regions_hard_delete should be idempotent.
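
A minimal illustration of why recomputing is idempotent where incremental updates are not, using hypothetical in-memory types rather than the PR's actual diesel code:

struct Region {
    dataset_id: u64,
    size: u64,
}

// Recomputing from the source of truth converges to the same value no matter how
// many times it is retried after a partial failure...
fn recompute_size_used(dataset_id: u64, regions: &[Region]) -> u64 {
    regions
        .iter()
        .filter(|r| r.dataset_id == dataset_id)
        .map(|r| r.size)
        .sum()
}

// ...whereas an incremental update applied a second time (e.g. by a retried saga
// node) drifts away from the true value.
fn subtract_region(size_used: &mut u64, region: &Region) {
    *size_used -= region.size;
}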

smklein assigned jmpesp and unassigned smklein Jul 5, 2022
Remove `impl Default` for IdentityMetadata.

Remove checking for saga to be done in test.
Instead of adding to and subtracting from the dataset's size_used field, compute it each time by summing the sizes of all the regions that belong to a dataset. Also sum up all regions belonging to a zpool to validate that the zpool is not running out of space.

This involved changing the disk-related tests to create three zpools, each with one dataset, instead of one zpool with three datasets. Code that was iterating over each dataset now has to iterate over zpools, then over the datasets in each zpool.

Updated tests that now correctly check for out-of-space conditions.
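
A rough sketch of the accounting described in this commit, using hypothetical in-memory types (the real code does this with diesel queries inside the allocation transaction):

struct Zpool {
    id: u64,
    total_size: u64,
}

struct Dataset {
    id: u64,
    zpool_id: u64,
}

struct Region {
    dataset_id: u64,
    size: u64,
}

// Recompute usage for every dataset in the zpool by summing its regions, then roll
// the datasets up and compare against the zpool's total size.
fn zpool_has_space(zpool: &Zpool, datasets: &[Dataset], regions: &[Region]) -> bool {
    let size_used: u64 = datasets
        .iter()
        .filter(|d| d.zpool_id == zpool.id)
        .map(|d| {
            regions
                .iter()
                .filter(|r| r.dataset_id == d.id)
                .map(|r| r.size)
                .sum::<u64>()
        })
        .sum();
    size_used <= zpool.total_size
}
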
jmpesp commented Aug 18, 2022

This should be good for a re-review now.

jmpesp commented Aug 18, 2022

One potential fallout of this PR going in is that the virtual hardware script will need to make larger file-based vdevs, because the 10 GiB zpools won't be able to have more than that size allocated to them.

We may want a way to use actual disks too, but that would require a machine to have three additional disks.

Comment on lines +80 to +81
impl TryFrom<diesel::pg::data_types::PgNumeric> for ByteCount {
    type Error = anyhow::Error;
Collaborator

Is there a reason to do the conversion to/from the PgNumeric type specifically? Why not the https://docs.diesel.rs/master/diesel/sql_types/struct.Numeric.html type? (coping with individual digits at a time seems like a pain in the butt)

Contributor Author

It is a pain, but I don't see a way to chain Into or From here to go from either PgNumeric or Numeric into i64.
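
For context, going from PgNumeric to an integer means walking base-10,000 digit groups by hand. A rough sketch of what that looks like, assuming the diesel crate (illustrative only; this is not the PR's actual TryFrom implementation, and it only handles non-negative integral values):

use diesel::pg::data_types::PgNumeric;

// PgNumeric stores the value as base-10,000 digit groups, most significant first;
// `weight` is the power of 10,000 of the first group, and Postgres omits trailing
// zero groups, so the integer part may need to be padded back out.
fn pg_numeric_to_u64(n: &PgNumeric) -> Result<u64, String> {
    let (weight, digits) = match n {
        PgNumeric::Positive { weight, digits, .. } => (*weight, digits),
        PgNumeric::Negative { .. } => return Err("negative byte count".to_string()),
        PgNumeric::NaN => return Err("NaN byte count".to_string()),
    };
    let mut value: u64 = 0;
    for (i, &digit) in digits.iter().enumerate() {
        if (i as i32) <= i32::from(weight) {
            // Digit groups at or above the decimal point form the integer part.
            value = value
                .checked_mul(10_000)
                .and_then(|v| v.checked_add(digit as u64))
                .ok_or("overflow")?;
        } else if digit != 0 {
            return Err("unexpected fractional part".to_string());
        }
    }
    // Pad out any trailing zero digit groups that Postgres stripped.
    for _ in (digits.len() as i32)..=i32::from(weight) {
        value = value.checked_mul(10_000).ok_or("overflow")?;
    }
    Ok(value)
}

The PR's actual conversion (quoted above) targets ByteCount and uses anyhow::Error, but the digit handling is the painful part either way.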

Collaborator

Taking a step back - I don't just mean this function. I mean, why are we using PgNumeric in the db? There are other DB types which could be used instead here, like https://docs.diesel.rs/master/diesel/sql_types/struct.BigInt.html , if you're just going to / from i64.

Contributor Author

We're not using PgNumeric in the db - as far as I understand, diesel::sql_types::Numeric is being returned due to the use of the sum. Trying to use any other type, I kept running into errors like

error[E0277]: the trait bound `BigInt: FromSql<diesel::sql_types::Numeric, Pg>` is not satisfied
    --> nexus/src/db/datastore/region.rs:214:26
     |
214  |                         .get_result(conn)?;
     |                          ^^^^^^^^^^ the trait `FromSql<diesel::sql_types::Numeric, Pg>` is not implemented for `BigInt`
     |

and eventually just went with PgNumeric because I could get at the enum variants to actually turn it into ByteCount.

Collaborator

  • The usage of Numeric, or the usage of PgNumeric, to me implies "this is a variable precision number, not an integer"
  • The usage of smallint / integer / bigint implies "this is an integer, of a particular size"

"byte count" seems like it should be an integer, not a variable-precision number, right?

Collaborator

Also, AFAICT:

Contributor Author

Right, but if I follow the Foldable link to the source, I see

foldable_impls! {
    ...
    sql_types::BigInt => (sql_types::Numeric, sql_types::Numeric),
    ...
}

which I think means that the sum widens the type from BigInt to Numeric?

Collaborator

Got it. Okay, don't hold up the PR on this point. I think we're in agreement that "it would be nice if bytes could be integer types", but if the implicit SQL conversions make that difficult, we can punt on it.

(But dang, would be nice if we could coerce them to stay as integer-based types!)

disk_source: params::DiskSource::Blank {
    block_size: params::BlockSize::try_from(4096).unwrap(),
},
size: ByteCount::from_gibibytes_u32(10),
Collaborator

If you're relying on this being 10 GiB in your tests, I recommend some way of checking that in the test itself. I'm seeing a lot of comments in test_disk_backed_by_multiple_region_sets that look like they expect the disk / zpool size to be exactly 10 gibibytes - maybe we should pull this into a `pub const DISK_SIZE_GiB: usize = 10`?

If someone else writing a disk-based test modifies this (seemingly arbitrary) value tomorrow:

  • Best (?) case: your test would still pass, as the size would be "within expected bounds"
  • Bad case: your test would suddenly fail
  • Really bad case: your test would pass, even if it wasn't supposed to
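
A minimal sketch of that suggestion, assuming the ByteCount type used above (the constant name and module path are illustrative; c2e713a's actual code may differ):

// Module path assumed for illustration.
use omicron_common::api::external::ByteCount;

// Define the test disk/zpool size once so every expectation in the disk tests is
// derived from the same value.
pub const DISK_SIZE_GIB: u32 = 10;

pub fn test_disk_size() -> ByteCount {
    ByteCount::from_gibibytes_u32(DISK_SIZE_GIB)
}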

Contributor Author

Got it, good suggestion - implemented in c2e713a

Comment on lines 220 to 223
.filter(
    dataset_dsl::kind
        .eq(DatasetKind::Crucible),
)
Collaborator

Should we actually be filtering by this?

  • Suppose we're using a 10 GiB zpool
  • Suppose 5 GiB of that pool is used by crucible
  • Suppose 4 GiB of that pool is used by Cockroach + Clickhouse
  • Suppose someone tries requesting a 2 GiB region

According to this query, the allocation should succeed, using 11 GiB of the 10 GiB zpool (!).

I filed #1630 to track this more completely, but TL;DR I don't think we should be filtering by dataset kind here. I think we care about all datasets within the zpool.

Contributor Author

Agreed, yeah. I reordered the operations in the transaction to first insert all the regions, then check whether each zpool's total_size was exceeded (by asking about every dataset, not just Crucible's). Done in 181c5d0

Comment on lines +227 to +231
.select(diesel::dsl::sum(
    region_dsl::block_size
        * region_dsl::blocks_per_extent
        * region_dsl::extent_count,
))
Collaborator

I can see that my prior comment about "do this check for all dataset types, not just crucible" makes this more complicated. Could we perform this calculation using

/* An upper bound on the amount of space that might be in-use */
size_used INT

instead of individual region allocations within a dataset?

Collaborator

It kinda looks like you're updating this value below anyway - my main point here is that "we should be considering all the non-Crucible datasets too; they also take up space". size_used seems like the dataset-type-agnostic way of doing that.

Contributor Author

Done in 181c5d0

NexusRequest::new(
    RequestBuilder::new(client, Method::POST, &disks_url)
        .body(Some(&new_disk))
        // TODO: this fails! the current allocation algorithm does not split
Collaborator

Nit: can we attach a bug?

Contributor Author

Opened #1644

smklein removed their assignment Aug 22, 2022
smklein commented Aug 22, 2022

This PR looks good, but I'd like to make the check validate size across all datasets using a zpool, not just crucible's usage. We can enforce that other datasets accurately report / enforce their sizes as a follow-up, if you want: #1630

jmpesp commented Aug 24, 2022

build-and-test (helios) seems to have failed due to

ld.so.1: nexus_db_model-1ffe2bb0346a9854: fatal: libpq.so.5: open failed: No such file or directory

which is strange? I'm rerunning it now.

jmpesp commented Aug 25, 2022

The libpq.so.5 failure happens reliably, and it's due to the crate shattering that broke off nexus-db-model: that crate didn't link against libpq, which was fixed in 86422ce.

jmpesp commented Aug 26, 2022

> This PR looks good, but I'd like to make the check validate size across all datasets using a zpool, not just crucible's usage.

This is also now done, so I'm merging now.

jmpesp merged commit 76a4201 into oxidecomputer:main Aug 26, 2022
jmpesp deleted the not_enough_available_space branch August 26, 2022 19:08
leftwo added a commit that referenced this pull request Mar 29, 2024
Propolis changes:
Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
Attempt to set WCE for raw file backends
Fix clippy/lint nits for rust 1.77.0

Crucible changes:
Correctly (and robustly) count bytes (#1237)
test-replay.sh fix name of DTrace script (#1235)
BlockReq -> BlockOp (#1234)
Simplify `BlockReq` (#1218)
DTrace, cmon, cleanup, retry downstairs connections at 10 seconds.
(#1231)
Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit.
Return *410 Gone* if volume is inactive (#1232)
Update Rust crate opentelemetry to 0.22.0 (#1224)
Update Rust crate base64 to 0.22.0 (#1222)
Update Rust crate async-recursion to 1.1.0 (#1221)
Minor cleanups to extent implementations (#1230)
Update Rust crate http to 0.2.12 (#1220)
Update Rust crate reedline to 0.30.0 (#1227)
Update Rust crate rayon to 1.9.0 (#1226)
Update Rust crate nix to 0.28 (#1223)
Update Rust crate async-trait to 0.1.78 (#1219)
Various buffer optimizations (#1211)
Add low-level test for message encoding (#1214)
Don't let df failures ruin the buildomat tests (#1213)
Activate the NBD server's pseudo file (#1209)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>