Fail if there isn't available space for a disk #1231

Merged: jmpesp merged 14 commits into main from not_enough_available_space on Aug 26, 2022

Conversation

jmpesp commented Jun 20, 2022

Nexus' current region allocation algorithm does not check whether dataset usage goes over the zpool's total size. Add this check, plus a test to validate it.

The algorithm currently does an all-or-nothing check: it tries to find three datasets with available space, and if it cannot then (with this commit) it returns an error. In this commit, the test `test_disk_backed_by_multiple_regions` fails; when the algorithm changes to support chunking a disk across multiple dataset triples, that test will pass. The reason it was committed anyway is to make this behavior explicit to people searching the tests for usage examples.
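
A rough sketch of the all-or-nothing behavior described above, using hypothetical simplified types rather than Nexus' actual datastore code:

// Illustrative only: the real check runs inside a database transaction in Nexus'
// datastore, against the dataset and zpool tables.
struct Dataset {
    id: u64,
    zpool_total_size: u64,
    size_used: u64,
}

// Pick three datasets that can hold the requested region, or fail the whole
// allocation; there is no chunking across multiple dataset triples yet.
fn allocate_three(datasets: &[Dataset], region_size: u64) -> Result<Vec<u64>, &'static str> {
    let picked: Vec<u64> = datasets
        .iter()
        .filter(|d| d.size_used + region_size <= d.zpool_total_size)
        .take(3)
        .map(|d| d.id)
        .collect();
    if picked.len() < 3 {
        return Err("not enough available space");
    }
    Ok(picked)
}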

jmpesp requested a review from smklein June 28, 2022 19:29
smklein self-assigned this Jun 28, 2022
Comment on lines 643 to 647
if size_used > zpool_total_size {
    return Err(TxnError::CustomError(
        RegionAllocateError::NotEnoughAvailableSpace,
    ));
}
Collaborator

This only compares the size of the requested datasets relative to the entire zpool, not to other datasets, right?

So if we have a zpool of size "10 GiB", any number of datasets could be allocated to it, as long as each comes in at "under 10 GiB used space"? Like, if we let callers allocate 1,000 datasets of 5 GiB each on that zpool, that's clearly a problem.

I definitely think this change is an improvement, but I'm concerned that it makes this check look "done" when, IMO, this highly related case doesn't seem solved.

Contributor Author

This is true. I feel like the dataset object should have a quota field, and the comparison should be made against that instead. Thoughts?

Collaborator

The current size_used value is a bit of a sham right now:

let size_used = match kind {
    DatasetKind::Crucible => Some(0),
    _ => None,
};

I agree with you about the notion of a quota. Perhaps we do the following:

  • size_used is renamed to quota
  • We apply the quota when initializing the dataset
  • We make it non-optional

Without all those steps, it seems possible for one rogue dataset to consume all the space on disk, which kinda defeats the purpose of this accounting. WDYT?

(If we want to do this implementation as a follow-up, we can file an issue and iterate)
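
A hypothetical shape for that proposal, for illustration only (this is not the actual Nexus model or schema):

// Illustrative: the quota is mandatory for every dataset kind and is set when the
// dataset is initialized, so no single dataset can quietly consume the whole zpool.
enum DatasetKind {
    Crucible,
    Cockroach,
    Clickhouse,
}

struct Dataset {
    kind: DatasetKind,
    quota: i64, // formerly an optional `size_used`; now required for every kind
}

fn new_dataset(kind: DatasetKind, quota: i64) -> Dataset {
    Dataset { kind, quota }
}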

Contributor Author

I went with summing up the usage of all datasets in a zpool and comparing that to the zpool's total size. See 7b5686f.

Fix bug in regions_hard_delete: reset size_used for all datasets, not just the first one returned.
Comment on lines 710 to 722
for (dataset_id, size) in datasets_id_and_size {
    diesel::update(dataset_dsl::dataset)
        .filter(dataset_dsl::id.eq(dataset_id))
        .set(dataset_dsl::size_used.eq(dataset_dsl::size_used - size))
        .execute_async(self.pool())
        .await
        .map_err(|e| {
            Error::internal_error(&format!(
                "error updating dataset space: {:?}",
                e
            ))
        })?;
}
Collaborator

I don't think this sequence of operations is idempotent, which I think makes it problematic in the context of a saga.

If a saga fails partway through an operation, it may be retried. This can happen in the "action" direction (during the "disk delete" saga) or in the "undo" direction (during the "disk create" saga). Admittedly, it looks like this was true before this PR too, but I think it became more obvious to me with the addition of a for loop here.

The easiest short-term fix here is probably to move these operations into a transaction - otherwise:

  • If we crash between "delete" and "update", the size accounting will be wrong
  • If we crash between updating some but not all of the datasets, the accounting will be wrong

Contributor Author

Instead of adding and subtracting, size_used is now summed up after regions are inserted or deleted, meaning both region_allocate and regions_hard_delete should be idempotent.
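
A minimal illustration of why recomputing is idempotent where incremental updates are not, using hypothetical in-memory types rather than the PR's actual diesel code:

struct Region {
    dataset_id: u64,
    size: u64,
}

// Recomputing from the source of truth converges to the same value no matter how
// many times it is retried after a partial failure...
fn recompute_size_used(dataset_id: u64, regions: &[Region]) -> u64 {
    regions
        .iter()
        .filter(|r| r.dataset_id == dataset_id)
        .map(|r| r.size)
        .sum()
}

// ...whereas an incremental update applied a second time (e.g. by a retried saga
// node) drifts away from the true value.
fn subtract_region(size_used: &mut u64, region: &Region) {
    *size_used -= region.size;
}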

smklein assigned jmpesp and unassigned smklein Jul 5, 2022
Remove `impl Default` for IdentityMetadata.

Remove checking for saga to be done in test.
Instead of adding to and subtracting from the dataset's size_used field, compute it each time by summing the sizes of all the regions that belong to a dataset. Also sum up all regions belonging to a zpool to validate that the zpool is not running out of space.

This involved changing the disk-related tests to create three zpools, each with one dataset, instead of one zpool with three datasets. Code that was iterating over each dataset now has to iterate over zpools, then over the datasets in each zpool.

Updated tests that now correctly check for out-of-space conditions.
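
A rough sketch of the accounting described in this commit, using hypothetical in-memory types (the real code does this with diesel queries inside the allocation transaction):

struct Zpool {
    id: u64,
    total_size: u64,
}

struct Dataset {
    id: u64,
    zpool_id: u64,
}

struct Region {
    dataset_id: u64,
    size: u64,
}

// Recompute usage for every dataset in the zpool by summing its regions, then roll
// the datasets up and compare against the zpool's total size.
fn zpool_has_space(zpool: &Zpool, datasets: &[Dataset], regions: &[Region]) -> bool {
    let size_used: u64 = datasets
        .iter()
        .filter(|d| d.zpool_id == zpool.id)
        .map(|d| {
            regions
                .iter()
                .filter(|r| r.dataset_id == d.id)
                .map(|r| r.size)
                .sum::<u64>()
        })
        .sum();
    size_used <= zpool.total_size
}
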
jmpesp commented Aug 18, 2022

This should be good for a re-review now.

jmpesp commented Aug 18, 2022

One potential fallout of this PR going in is that the virtual hardware script will need to make larger file-based vdevs, because the 10 GiB zpools won't be able to have more than that size allocated to them.

We may want a way to use actual disks too, but that would require a machine to have three additional disks.

Comment on lines +80 to +81
impl TryFrom<diesel::pg::data_types::PgNumeric> for ByteCount {
    type Error = anyhow::Error;
Collaborator

Is there a reason to do the conversion to/from the PgNumeric type specifically? Why not the https://docs.diesel.rs/master/diesel/sql_types/struct.Numeric.html type? (coping with individual digits at a time seems like a pain in the butt)

Contributor Author

It is a pain, but I don't see a way to chain Into or From here to go from either PgNumeric or Numeric into i64.
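
For context, going from PgNumeric to an integer means walking base-10,000 digit groups by hand. A rough sketch of what that looks like, assuming the diesel crate (illustrative only; this is not the PR's actual TryFrom implementation, and it only handles non-negative integral values):

use diesel::pg::data_types::PgNumeric;

// PgNumeric stores the value as base-10,000 digit groups, most significant first;
// `weight` is the power of 10,000 of the first group, and Postgres omits trailing
// zero groups, so the integer part may need to be padded back out.
fn pg_numeric_to_u64(n: &PgNumeric) -> Result<u64, String> {
    let (weight, digits) = match n {
        PgNumeric::Positive { weight, digits, .. } => (*weight, digits),
        PgNumeric::Negative { .. } => return Err("negative byte count".to_string()),
        PgNumeric::NaN => return Err("NaN byte count".to_string()),
    };
    let mut value: u64 = 0;
    for (i, &digit) in digits.iter().enumerate() {
        if (i as i32) <= i32::from(weight) {
            // Digit groups at or above the decimal point form the integer part.
            value = value
                .checked_mul(10_000)
                .and_then(|v| v.checked_add(digit as u64))
                .ok_or("overflow")?;
        } else if digit != 0 {
            return Err("unexpected fractional part".to_string());
        }
    }
    // Pad out any trailing zero digit groups that Postgres stripped.
    for _ in (digits.len() as i32)..=i32::from(weight) {
        value = value.checked_mul(10_000).ok_or("overflow")?;
    }
    Ok(value)
}

The PR's actual conversion (quoted above) targets ByteCount and uses anyhow::Error, but the digit handling is the painful part either way.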

Collaborator

Taking a step back - I don't just mean this function. I mean, why are we using PgNumeric in the db? There are other DB types which could be used instead here, like https://docs.diesel.rs/master/diesel/sql_types/struct.BigInt.html , if you're just going to / from i64.

Contributor Author

We're not using PgNumeric in the db - as far as I understand, diesel::sql_types::Numeric is being returned due to the use of the sum. Trying to use any other type, I kept running into errors like

error[E0277]: the trait bound `BigInt: FromSql<diesel::sql_types::Numeric, Pg>` is not satisfied
    --> nexus/src/db/datastore/region.rs:214:26
     |
214  |                         .get_result(conn)?;
     |                          ^^^^^^^^^^ the trait `FromSql<diesel::sql_types::Numeric, Pg>` is not implemented for `BigInt`
     |

and eventually just went with PgNumeric because I could get at the enum variants to actually turn it into ByteCount.

Collaborator

  • The usage of Numeric, or the usage of PgNumeric, to me implies "this is a variable precision number, not an integer"
  • The usage of smallint / integer / bigint implies "this is an integer, of a particular size"

"byte count" seems like it should be an integer, not a variable-precision number, right?

Collaborator

Also, AFAICT:

Contributor Author

Right, but if I follow the Foldable link to the source, I see

foldable_impls! {
    ...
    sql_types::BigInt => (sql_types::Numeric, sql_types::Numeric),
    ...
}

which I think means that the sum widens the type from BigInt to Numeric?

Collaborator

Got it. Okay, don't hold up the PR on this point. I think we're in agreement that "it would be nice if bytes could be integer types", but if the implicit SQL conversions make that difficult, we can punt on it.

(But dang, would be nice if we could coerce them to stay as integer-based types!)

disk_source: params::DiskSource::Blank {
    block_size: params::BlockSize::try_from(4096).unwrap(),
},
size: ByteCount::from_gibibytes_u32(10),
Collaborator

If you're relying on this being 10 GiB in your tests, I recommend some way of checking that in the test itself. I'm seeing a lot of comments in test_disk_backed_by_multiple_region_sets that look like they expect the disk / zpool size to be exactly 10 gibibytes - maybe we should pull this into a `pub const DISK_SIZE_GiB: usize = 10`?

If someone else writing a disk-based test modifies this (seemingly arbitrary) value tomorrow:

  • Best (?) case: your test would still pass, as the size would be "within expected bounds"
  • Bad case: your test would suddenly fail
  • Really bad case: your test would pass, even if it wasn't supposed to
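
A minimal sketch of that suggestion, assuming the ByteCount type used above (the constant name and module path are illustrative; c2e713a's actual code may differ):

// Module path assumed for illustration.
use omicron_common::api::external::ByteCount;

// Define the test disk/zpool size once so every expectation in the disk tests is
// derived from the same value.
pub const DISK_SIZE_GIB: u32 = 10;

pub fn test_disk_size() -> ByteCount {
    ByteCount::from_gibibytes_u32(DISK_SIZE_GIB)
}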

Contributor Author

Got it, good suggestion - implemented in c2e713a

Comment on lines 220 to 223
.filter(
    dataset_dsl::kind
        .eq(DatasetKind::Crucible),
)
Collaborator

Should we actually be filtering by this?

  • Suppose we're using a 10 GiB zpool
  • Suppose 5 GiB of that pool is used by crucible
  • Suppose 4 GiB of that pool is used by Cockroach + Clickhouse
  • Suppose someone tries requesting a 2 GiB region

According to this query, the allocation should succeed, using 11 GiB of the 10 GiB zpool (!).

I filed #1630 to track this more completely, but TL;DR I don't think we should be filtering by dataset kind here. I think we care about all datasets within the zpool.

Contributor Author

Agreed, yeah. I reordered the operations in the transaction to first insert all the regions, then check whether each zpool's total_size was exceeded (by asking about every dataset, not just Crucible's). Done in 181c5d0

Comment on lines +227 to +231
.select(diesel::dsl::sum(
    region_dsl::block_size
        * region_dsl::blocks_per_extent
        * region_dsl::extent_count,
))
Collaborator

I can see that my prior comment about "do this check for all dataset types, not just crucible" makes this more complicated. Could we perform this calculation using

/* An upper bound on the amount of space that might be in-use */
size_used INT

instead of individual region allocations within a dataset?

Collaborator

It kinda looks like you're updating this value below anyway - my main point here is that "we should be considering all the non-Crucible datasets too; they also take up space". size_used seems like the dataset-type-agnostic way of doing that.

Contributor Author

Done in 181c5d0

NexusRequest::new(
    RequestBuilder::new(client, Method::POST, &disks_url)
        .body(Some(&new_disk))
        // TODO: this fails! the current allocation algorithm does not split
Collaborator

Nit: can we attach a bug?

Contributor Author

Opened #1644

smklein removed their assignment Aug 22, 2022
smklein commented Aug 22, 2022

This PR looks good, but I'd like to make the check validate size across all datasets using a zpool, not just crucible's usage. We can enforce that other datasets accurately report / enforce their sizes as a follow-up, if you want: #1630

jmpesp commented Aug 24, 2022

build-and-test (helios) seems to have failed due to

ld.so.1: nexus_db_model-1ffe2bb0346a9854: fatal: libpq.so.5: open failed: No such file or directory

which is strange? I'm rerunning it now.

jmpesp commented Aug 25, 2022

The libpq.so.5 failure happens reliably, and it's due to the crate shattering that broke off nexus-db-model: that crate didn't link against libpq, which was fixed in 86422ce.

jmpesp commented Aug 26, 2022

> This PR looks good, but I'd like to make the check validate size across all datasets using a zpool, not just crucible's usage.

This is also now done, so I'm merging now.

jmpesp merged commit 76a4201 into oxidecomputer:main Aug 26, 2022
jmpesp deleted the not_enough_available_space branch August 26, 2022 19:08
leftwo added a commit that referenced this pull request Mar 29, 2024
Propolis changes:
Add `IntrPin::import_state` and migrate LPC UART pin states (#669)
Attempt to set WCE for raw file backends
Fix clippy/lint nits for rust 1.77.0

Crucible changes:
Correctly (and robustly) count bytes (#1237)
test-replay.sh fix name of DTrace script (#1235)
BlockReq -> BlockOp (#1234)
Simplify `BlockReq` (#1218)
DTrace, cmon, cleanup, retry downstairs connections at 10 seconds.
(#1231)
Remove `MAX_ACTIVE_COUNT` flow control system (#1217)

Crucible changes that were in Omicron but not in Propolis before this commit.
Return *410 Gone* if volume is inactive (#1232)
Update Rust crate opentelemetry to 0.22.0 (#1224)
Update Rust crate base64 to 0.22.0 (#1222)
Update Rust crate async-recursion to 1.1.0 (#1221)
Minor cleanups to extent implementations (#1230)
Update Rust crate http to 0.2.12 (#1220)
Update Rust crate reedline to 0.30.0 (#1227)
Update Rust crate rayon to 1.9.0 (#1226)
Update Rust crate nix to 0.28 (#1223)
Update Rust crate async-trait to 0.1.78 (#1219)
Various buffer optimizations (#1211)
Add low-level test for message encoding (#1214)
Don't let df failures ruin the buildomat tests (#1213)
Activate the NBD server's pseudo file (#1209)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>