[reconfigurator] Reject clickhouse configurations from old generations #7347

Open · karencfv wants to merge 41 commits into main

Conversation

@karencfv (Contributor) commented Jan 15, 2025

Overview

This commit adds functionality to clickhouse-admin to keep track of the blueprint generation number. There is also a new validation check: if reconfigurator attempts to generate a configuration file from a previous generation, clickhouse-admin will refuse to generate that configuration file and will return an error.

Additionally, there's been a small cleanup of the clickhouse-admin code.

Manual testing

In a local omicron deployment, first tell reconfigurator to deploy a clickhouse policy with the default number of both replicas and keepers.

root@oxz_switch:~# omdb nexus blueprints diff target d6a6c153-76aa-4933-98bd-1009d95f03d2
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::c]:12221
from: blueprint fb9d6881-3c8a-44e2-b9f3-b8222ebdae99
to:   blueprint d6a6c153-76aa-4933-98bd-1009d95f03d2

<...>

 CLICKHOUSE CLUSTER CONFIG:
+   generation:::::::::::::::::::::::::::::::::::::   2
+   max used server id:::::::::::::::::::::::::::::   3
+   max used keeper id:::::::::::::::::::::::::::::   5
+   cluster name:::::::::::::::::::::::::::::::::::   oximeter_cluster
+   cluster secret:::::::::::::::::::::::::::::::::   750a492f-1c3d-430c-8d18-c74596fd2ec8
+   highest seen keeper leader committed log index:   0

    clickhouse keepers at generation 2:
    ------------------------------------------------
    zone id                                keeper id
    ------------------------------------------------
+   13c665e9-d7bd-43a5-b780-47acf8326feb   1        
+   325e3ac5-6cc8-4aec-9ac0-ea8d9a60c40f   2        
+   37e41e42-3b0c-49a6-8403-99fe66e84897   3        
+   4e65bf56-c7d6-485d-9b7f-8513a55838f9   4        
+   8a5df7fa-8633-4bf9-a7fa-567d5e62ffbf   5        

    clickhouse servers at generation 2:
    ------------------------------------------------
    zone id                                server id
    ------------------------------------------------
+   45af8162-253a-494c-992e-137d2bd5f350   1        
+   676772e0-d0c4-425b-a0d1-f6df46e4d10c   2        
+   84d249d1-9c13-460a-9c7c-08a979471246   3    

We can see keepers and servers are at generation 2.

Now we zlogin into a keeper zone to check we have recorded that information and that the node has joined the quorum.

root@oxz_clickhouse_keeper_37e41e42:~# curl http://[fd00:1122:3344:101::23]:8888/generation                   
2
root@oxz_clickhouse_keeper_37e41e42:~# head -n 1 /opt/oxide/clickhouse_keeper/keeper_config.xml 
<!-- generation:2 -->
root@oxz_clickhouse_keeper_37e41e42:~# curl http://[fd00:1122:3344:101::23]:8888/4lw-lgif  
{"first_log_idx":1,"first_log_term":1,"last_log_idx":7123,"last_log_term":1,"last_committed_log_idx":7123,"leader_committed_log_idx":7123,"target_committed_log_idx":7123,"last_snapshot_idx":0}

We zlogin into a replica zone and check that we have recorded that information, and that the database contains the expected oximeter tables and fields.

root@oxz_clickhouse_server_676772e0:~# curl http://[fd00:1122:3344:101::28]:8888/generation 
2
root@oxz_clickhouse_server_676772e0:~# head -n 1 /opt/oxide/clickhouse_server/config.d/replica-server-config.xml 
<!-- generation:2 -->
root@oxz_clickhouse_server_676772e0:~# /opt/oxide/clickhouse_server/clickhouse client --host fd00:1122:3344:101::28
ClickHouse client version 23.8.7.1.
Connecting to fd00:1122:3344:101::28:9000 as user default.
Connected to ClickHouse server version 23.8.7 revision 54465.

oximeter_cluster_2 :) show tables in oximeter

SHOW TABLES FROM oximeter

Query id: 1baa160b-3332-4fa4-a91d-0032fd917a96

┌─name─────────────────────────────┐
│ fields_bool                      │
│ fields_bool_local                │
│ fields_i16                       │
│ fields_i16_local                 │
│ <...>                            │
│ version                          │
└──────────────────────────────────┘

81 rows in set. Elapsed: 0.009 sec. 

Now we want to force a new generation number, so we set a clickhouse policy with an additional server and keeper.

root@oxz_switch:~# omdb nexus blueprints diff target a598ce1b-1413-47d6-bc8c-7b63b6d09158
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::c]:12221
from: blueprint d6a6c153-76aa-4933-98bd-1009d95f03d2
to:   blueprint a598ce1b-1413-47d6-bc8c-7b63b6d09158

<...>

 CLICKHOUSE CLUSTER CONFIG:
*   generation:::::::::::::::::::::::::::::::::::::   2 -> 3
*   max used server id:::::::::::::::::::::::::::::   3 -> 4
*   max used keeper id:::::::::::::::::::::::::::::   5 -> 6
    cluster name:::::::::::::::::::::::::::::::::::   oximeter_cluster (unchanged)
    cluster secret:::::::::::::::::::::::::::::::::   750a492f-1c3d-430c-8d18-c74596fd2ec8 (unchanged)
*   highest seen keeper leader committed log index:   0 -> 13409

    clickhouse keepers generation 2 -> 3:
    ------------------------------------------------
    zone id                                keeper id
    ------------------------------------------------
    13c665e9-d7bd-43a5-b780-47acf8326feb   1        
    325e3ac5-6cc8-4aec-9ac0-ea8d9a60c40f   2        
    37e41e42-3b0c-49a6-8403-99fe66e84897   3        
    4e65bf56-c7d6-485d-9b7f-8513a55838f9   4        
    8a5df7fa-8633-4bf9-a7fa-567d5e62ffbf   5        
+   ccb1b5cf-7ca8-4c78-b9bc-970d156e6109   6        

    clickhouse servers generation 2 -> 3:
    ------------------------------------------------
    zone id                                server id
    ------------------------------------------------
    45af8162-253a-494c-992e-137d2bd5f350   1        
+   497f4829-f3fe-4c94-86b2-dbd4e814cc90   4        
    676772e0-d0c4-425b-a0d1-f6df46e4d10c   2        
    84d249d1-9c13-460a-9c7c-08a979471246   3        

We deploy it and run the same checks on the zones we checked previously, as well as on the new zones.

Old keeper zone:

root@oxz_clickhouse_keeper_37e41e42:~# curl http://[fd00:1122:3344:101::23]:8888/generation
3
root@oxz_clickhouse_keeper_37e41e42:~# head -n 1 /opt/oxide/clickhouse_keeper/keeper_config.xml 
<!-- generation:3 -->
root@oxz_clickhouse_keeper_37e41e42:~# curl http://[fd00:1122:3344:101::23]:8888/4lw-lgif
{"first_log_idx":1,"first_log_term":1,"last_log_idx":25198,"last_log_term":1,"last_committed_log_idx":25198,"leader_committed_log_idx":25198,"target_committed_log_idx":25198,"last_snapshot_idx":0}

New keeper zone:

root@oxz_clickhouse_keeper_ccb1b5cf:~# curl http://[fd00:1122:3344:101::29]:8888/generation
3
root@oxz_clickhouse_keeper_ccb1b5cf:~# head -n 1 /opt/oxide/clickhouse_keeper/keeper_config.xml 
<!-- generation:3 -->
root@oxz_clickhouse_keeper_ccb1b5cf:~# curl http://[fd00:1122:3344:101::29]:8888/4lw-lgif   
{"first_log_idx":1,"first_log_term":1,"last_log_idx":35857,"last_log_term":1,"last_committed_log_idx":35853,"leader_committed_log_idx":35853,"target_committed_log_idx":35853,"last_snapshot_idx":0}

Old replica zone:

root@oxz_clickhouse_server_676772e0:~# curl http://[fd00:1122:3344:101::28]:8888/generation
3
root@oxz_clickhouse_server_676772e0:~# head -n 1 /opt/oxide/clickhouse_server/config.d/replica-server-config.xml 
<!-- generation:3 -->
root@oxz_clickhouse_server_676772e0:~# /opt/oxide/clickhouse_server/clickhouse client --host fd00:1122:3344:101::28
ClickHouse client version 23.8.7.1.
Connecting to fd00:1122:3344:101::28:9000 as user default.
Connected to ClickHouse server version 23.8.7 revision 54465.

oximeter_cluster_2 :) show tables in oximeter

SHOW TABLES FROM oximeter

Query id: d4500915-d5b5-452f-a404-35e1e172b8f8

┌─name─────────────────────────────┐
│ fields_bool                      │
│ fields_bool_local                │
│ fields_i16                       │
│ fields_i16_local                 │
│ <...>                            │
│ version                          │
└──────────────────────────────────┘

81 rows in set. Elapsed: 0.002 sec. 

New replica zone:

root@oxz_clickhouse_server_497f4829:~# curl http://[fd00:1122:3344:101::2a]:8888/generation
3
root@oxz_clickhouse_server_497f4829:~# head -n 1 /opt/oxide/clickhouse_server/config.d/replica-server-config.xml 
<!-- generation:3 -->
root@oxz_clickhouse_server_497f4829:~# /opt/oxide/clickhouse_server/clickhouse client --host fd00:1122:3344:101::2a
ClickHouse client version 23.8.7.1.
Connecting to fd00:1122:3344:101::2a:9000 as user default.
Connected to ClickHouse server version 23.8.7 revision 54465.

oximeter_cluster_4 :) show tables in oximeter

SHOW TABLES FROM oximeter

Query id: 9e02b839-e938-44ef-8b2e-a61d0b8c25af

┌─name─────────────────────────────┐
│ fields_bool                      │
│ fields_bool_local                │
│ fields_i16                       │
│ fields_i16_local                 │
│ <...>                            │
│ version                          │
└──────────────────────────────────┘

81 rows in set. Elapsed: 0.014 sec. 

To verify that clickhouse-admin returns an error if the incoming generation number is lower than the current one, I tested by running clickhouse-admin against a local clickward deployment:

# clickhouse-admin-server

karcar@ixchel:~/src/omicron$ curl http://[::1]:8888/generation
34
karcar@ixchel:~/src/omicron$ curl --header "Content-Type: application/json" --request PUT "http://[::1]:8888/config" -d '
> {
>     "generation": 3,
>     "settings": {
>         "config_dir": "/tmp/ch-dir/",
>         "id": 1,
>         "datastore_path": "/tmp/ch-dir/",
>         "listen_addr": "::1",
>         "keepers": [{"ipv6": "::1"}],
>         "remote_servers": [{"ipv6": "::1"}]
>     }
> }'
{
  "request_id": "01809997-b9da-4e9c-837f-11413a6254b7",
  "error_code": "Internal",
  "message": "Internal Server Error"
}

# From the logs

{"msg":"request completed","v":0,"name":"clickhouse-admin-server","level":30,"time":"2025-01-21T01:08:24.946465Z","hostname":"ixchel","pid":58943,"uri":"/config","method":"PUT","req_id":"01809997-b9da-4e9c-837f-11413a6254b7","remote_addr":"[::1]:54628","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:851","error_message_external":"Internal Server Error","error_message_internal":"current generation is greater than incoming generation","latency_us":227,"response_code":"500"}

# clickhouse-admin-keeper

karcar@ixchel:~/src/omicron$ curl http://[::1]:8888/generation
23
karcar@ixchel:~/src/omicron$ curl --header "Content-Type: application/json" --request PUT "http://[::1]:8888/config" -d '
{
    "generation": 2,
    "settings": {
        "config_dir": "/tmp/ch-dir/",
        "id": 1,
        "datastore_path": "/tmp/ch-dir/",
        "listen_addr": "::1",
        "raft_servers": [
            {
                "id": 1,
                "host": {"ipv6": "::1"}
            }
        ]
    }
}'
{
  "request_id": "e6b66ca9-10fa-421b-ac46-0e470d8e5512",
  "error_code": "Internal",
  "message": "Internal Server Error"

# From the logs

{"msg":"request completed","v":0,"name":"clickhouse-admin-keeper","level":30,"time":"2025-01-21T02:28:12.925343Z","hostname":"ixchel","pid":59371,"uri":"/config","method":"PUT","req_id":"e6b66ca9-10fa-421b-ac46-0e470d8e5512","remote_addr":"[::1]:64494","local_addr":"[::1]:8888","component":"dropshot","file":"/Users/karcar/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:851","error_message_external":"Internal Server Error","error_message_internal":"current generation is greater than incoming generation","latency_us":180,"response_code":"500"}

Closes: #7137

Comment on lines +96 to +98
log: &Logger,
) -> Self {
let log = log.new(slog::o!("component" => "ClickhouseCli"));
Contributor Author (karencfv):

This is part of the refactoring; the logs were a bit of a mess.

Comment on lines +1680 to +1686
let clickhouse_server_config =
PropertyGroupBuilder::new("config")
.add_property(
"config_path",
"astring",
format!("{CLICKHOUSE_SERVER_CONFIG_DIR}/{CLICKHOUSE_SERVER_CONFIG_FILE}"),
);
Contributor Author (karencfv):

Also part of the refactoring. Let's use the constants we are using for the configuration files in the SMF service as well, so we don't have to hardcode things into an SMF method script.

Comment on lines +32 to +38
pub fn new(
log: &Logger,
binary_path: Utf8PathBuf,
listen_address: SocketAddrV6,
) -> Result<Self> {
let clickhouse_cli =
ClickhouseCli::new(binary_path, listen_address, log);
Contributor Author (karencfv):

Part of the refactor as well: there was no need to pass clickhouse_cli as a parameter but not clickward, etc.

@karencfv marked this pull request as ready for review January 21, 2025 02:44
@karencfv requested a review from andrewjstone January 21, 2025 02:44
@andrewjstone (Contributor) left a comment:

Great stuff @karencfv!

// If there is already a configuration file with a generation number we'll
// use that. Otherwise, we set the generation number to None.
let gen = read_generation_from_file(config_path)?;
let generation = Mutex::new(gen);
Contributor:

It's become practice at Oxide to avoid tokio mutexes wherever possible as they have significant problems when cancelled and generally just don't do what we want. I realize there's already some usage here with regards to initialization. We don't have to fix that in this PR, but we should avoid adding new uses. We should instead use a std::sync::Mutex. I left a comment below about this as well.

See the following for more details:
https://rfd.shared.oxide.computer/rfd/0400#no_mutex
https://rfd.shared.oxide.computer/rfd/0397#_example_with_mutexes

Contributor Author (karencfv):

lol, I was definitely on the fence on that one; I went for consistency in the end: be1afc7#diff-c816600501b7aaa7de4a2eb9dc86498662030cea6390fa23e11a22c990efb510L28-L29

Thanks for the links! Hadn't seen those RFDs, will read them both

@@ -36,6 +60,10 @@ impl KeeperServerContext {
pub fn log(&self) -> &Logger {
&self.log
}

pub async fn generation(&self) -> Option<Generation> {
*self.generation.lock().await
Contributor:

We only need read access here, and so we can easily avoid an async mutex here. Generation is also Copy, so this is cheap. I'd suggest making this a synchronous function and calling *self.generation.lock() instead.
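
For illustration, a minimal sketch of the synchronous accessor being suggested here, using a plain integer as a stand-in for Generation (the write path still needs coordination, as the rest of this thread discusses):

use std::sync::Mutex;

// Hypothetical, trimmed-down stand-in for the admin server context.
struct Ctx {
    generation: Mutex<Option<u64>>, // stand-in for Option<Generation>
}

impl Ctx {
    // Synchronous read: the std mutex is held only long enough to copy the
    // value out, and never across an .await point.
    fn generation(&self) -> Option<u64> {
        *self.generation.lock().unwrap()
    }
}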

Contributor:

Yeah, I was wrong here. I wasn't considering the usage of the generation with regards to concurrent requests.

}

pub fn initialization_lock(&self) -> Arc<Mutex<()>> {
self.initialization_lock.clone()
Contributor:

I'm not sure if this usage of a tokio lock is safe or not due to cancellation. It looks like it aligns with the exact usage we have in our ServerContext. I also don't have an easy workaround for this right now, and so I guess I'm fine leaving this in to keep moving.

@sunshowers @jgallagher Do you have any ideas here?

@jgallagher (Contributor) commented Jan 22, 2025:

Various thoughts; sorry if some of this is obvious, but I don't have much context here so am just hopping in:

  • Cloning an Arc<tokio::Mutex<_>> is fine (the clone is fully at the Arc layer)
  • ... that said I don't think we need to clone here? Returning &Mutex<()> looks like it'd be okay.
  • Mutex<()> is kinda fishy and probably worthy of a comment, since typically the mutex is protecting some data. (Maybe there is one somewhere that I'm not seeing!)
  • It looks like the use of this is to prevent the /init_db endpoint from running concurrently? That is definitely not cancel safe. If dropshot were configured to cancel handlers on client disconnect, a client could start an /init_db, drop the request (unlocking the mutex), then start it again while the first one was still running.

On the last point: I think this is "fine" as long as dropshot is configured correctly (i.e., to not cancel handlers). If we wanted this to be correct even under cancellation, I'd probably move the init process into a separate tokio task and manage that either with channels or a sync mutex. Happy to expand on those ideas if it'd be helpful.

Contributor Author (karencfv):

Thanks for the input!

Mutex<()> is kinda fishy and probably worthy of a comment, since typically the mutex is protecting some data. (Maybe there is one somewhere that I'm not seeing!)

Tbh, I'm just moving code around that was already here. I'm not really sure what the intention was initially.

On the last point: I think this is "fine" as long as dropshot is configured correctly (i.e., to not cancel handlers). If we wanted this to be correct even under cancellation, I'd probably move the init process into a separate tokio task and manage that either with channels or a sync mutex.

That sounds like a good idea regardless of what the initial intention was. Do you mind expanding a little on those ideas? It'd definitely be helpful

@jgallagher (Contributor) commented Jan 22, 2025:

Sure thing! One pattern we've used in a bunch of places is to spawn a long-lived tokio task and then communicate with it via channels. This looks something like (untested and lots of details omitted):

// kinds of things we can ask the task to do
enum Request {
    DoSomeThing {
        // any inputs from us the task needs
        data: DataNeededToDoSomeThing,
        // a oneshot channel the task uses to send us the result of our request
        response: oneshot::Sender<ResultOfSomeThing>,
    },
}

// the long-lived task: loop over incoming requests and handle them
async fn long_running_task(mut incoming: mpsc::Receiver<Request>) {
    // run until the sending half of `incoming` is dropped
    while let Some(request) = incoming.recv().await {
        match request {
            Request::DoSomeThing { data, response } => {
                let result = do_some_thing(data);
                // send the result back; ignore the error if the requester went away
                let _ = response.send(result);
            }
        }
    }
}

// our main code: one time up front, create the channel we use to talk to the inner task and spawn that task
let (inner_tx, inner_rx) = mpsc::channel(N); // picking N here can be hard
let join_handle = tokio::spawn(long_running_task(inner_rx));

// ... somewhere else, when we want the task to do something for us ...
let (response_tx, response_rx) = oneshot::channel();
inner_tx.send(Request::DoSomeThing { data, response: response_tx }).await;
let result = response_rx.await;

A real example of this pattern (albeit more complex; I'm not finding any super simple ones at the moment) is in the bootstrap agent: here's where we spawn the inner task. It has a couple different channels for incoming requests, so its run loop is a tokio::select over those channels but is otherwise pretty similar to the outline above.

This pattern is nice because regardless of how many concurrent callers try to send messages to the inner task, it itself can do things serially. In my pseudocode above, if the ... somewhere else bit is an HTTP handler, even if we get a dozen concurrent requests, the inner task will process them one at a time because it's forcing serialization via the channel it's receiving on.

I really like this pattern. But it has some problems:

  • Picking the channel depth is hard. Whatever N we pick, that means up to that many callers can be waiting in line. Sometimes we don't want that at all, but tokio's mpsc channels don't allow N=0. (There are other channel implementations that do if we decide we need this.)
  • If we just use inner_tx.send(_) as in my pseudocode, even if the channel is full, that will just block until there's room, so we actually have an infinite line. This can be avoided via try_send instead, which allows us to bubble out some kind of "we're too busy for more requests" backpressure to our caller.
  • If do_some_thing() is slow, this can all compound and make everybody slow.
  • If do_some_thing() hangs, then everybody trying to send requests to the inner task hangs too. (This recently happened to us in sled-agent!)

Contributor:

A "build your own" variant of the above in the case where you want at most one instance of some operation is to use a sync::Mutex around a tokio task join handle. This would look something like (again untested, details omitted):

// one time up front, create a sync mutex around an optional tokio task join handle
let task_lock = sync::Mutex::new(None);

// ... somewhere else, where we want to do work ...

// acquire the lock
let mut task_lock = task_lock.lock().unwrap();

// if there's a previous task running, is it still running?
let still_running = match task_lock.as_ref() {
    Some(joinhandle) => !joinhandle.is_finished(),
    None => false,
};
if still_running {
    // return a "we're busy" error
}

// any previous task is done; start a new one
*task_lock = Some(tokio::spawn(do_some_work()));

This has its own problems; the biggest one is that we can't wait for the result of do_some_work() while holding the lock, so this really only works for background stuff that either doesn't need to return results at all, or the caller is in a position to poll us for completion at some point in the future. (In the joinhandle.is_finished() case, we can .await it to get the result of do_some_work().)

We don't use this pattern as much. One example is in installinator, where we do want to get the result of previously-completed tasks.

Contributor:

Thanks for the write-up, John. I think, overall, it's probably simpler to have a long-running task and issue requests that way. As you mentioned, this has its own problems; however, we know what those problems are, and we use this pattern all over sled agent.

In this case we can constrain the problem such that we only want to handle one in-flight request at a time, since reconfigurator execution will retry again later anyway. I'd suggest using a flume bounded channel with a size of 0 to act as a rendezvous channel. That should give the behavior we want. We could have separate tasks for performing initialization and config writing so we don't have one block out the other.
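
A minimal sketch of the rendezvous-channel idea (illustrative names only, not the PR's actual types): flume::bounded(0) has no buffer, so try_send succeeds only when the worker is already parked in recv(), and otherwise returns Full, which can be surfaced to the caller as a "busy" error.

fn main() {
    // Zero-capacity flume channel: acts as a rendezvous point.
    let (tx, rx) = flume::bounded::<u64>(0);

    std::thread::spawn(move || {
        // Long-running worker: handles one request at a time.
        while let Ok(generation) = rx.recv() {
            println!("writing config for generation {generation}");
        }
    });

    // Give the worker a moment to reach recv() so the hand-off can succeed.
    std::thread::sleep(std::time::Duration::from_millis(50));

    match tx.try_send(3) {
        Ok(()) => println!("request accepted"),
        Err(flume::TrySendError::Full(_)) => println!("busy: a request is already in flight"),
        Err(flume::TrySendError::Disconnected(_)) => println!("worker is gone"),
    }

    // Let the worker print before the process exits.
    std::thread::sleep(std::time::Duration::from_millis(50));
}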

Contributor Author (karencfv):

excellent! Thanks a bunch for the write up!

Contributor Author (karencfv):

We could have separate tasks for performing initialization and config writing so we don't have one block out the other.

@andrewjstone, do we really not want them to block each other out? It'd be problematic to have the db init job trying to run when the generate-config one hasn't finished, and vice versa, no?

// file generation.
if let Some(current) = current_generation {
if current > incoming_generation {
return Err(HttpError::for_internal_error(
Contributor:

This doesn't feel like an internal error to me. This is an expected race condition, and so I think we should return a 400 level error instead of a 500 level error. I think 412 is an appropriate error code, even though we are not using etags for a precondition. @davepacheco does that make sense to you?

Collaborator:

Definitely agreed it's not a 500. It looks like Sled Agent uses 409 (Conflict) for this and I'd suggest using that for consistency.

// Absolutely refuse to downgrade the configuration.
if ledger_zone_config.omicron_generation > request.generation {
return Err(Error::RequestedConfigOutdated {
requested: request.generation,
current: ledger_zone_config.omicron_generation,
});
}

Error::RequestedConfigOutdated { .. } => {
omicron_common::api::external::Error::conflict(&err.to_string())
}

// file generation.
if let Some(current) = current_generation {
if current > incoming_generation {
return Err(HttpError::for_internal_error(
Contributor:

Same thing as above. I think this should be a 400-level error.


// We want to update the generation number only if the config file has been
// generated successfully.
*ctx.generation.lock().await = Some(incoming_generation);
@jgallagher (Contributor) commented Jan 22, 2025:

Is there a TOCTOU problem here, in that ctx.generation could have changed between when we checked it above and when we reacquire the lock here to set it?

Contributor Author (karencfv):

Hm, I guess that depends on how reconfigurator works? How often is the generation changing?

I decided to update the generation number once the config file had been successfully generated, because if it hadn't, then the zone wouldn't be fully in that generation. Do you think it makes more sense to update the generation immediately?

Contributor:

I think I'd consider this outside the context of reconfigurator. If this endpoint is called multiple times concurrently with different incoming generations, does it behave correctly? That way we don't have an implicit dependency between the correctness of this endpoint and the behavior or timing of reconfigurator.

Sorry for the dumb questions, but - is it safe for two instances of generate_server_config() to be running concurrently? I think that has implications on what we need to do with the lock on generation.

Contributor Author (karencfv):

I think I'd consider this outside the context of reconfigurator. If this endpoint is called multiple times concurrently with different incoming generations, does it behave correctly?

I guess there could be an error if two generate_server_config()s with different generation numbers are running and they both read an initial value for generation, but the one with the lower number manages to write after the one with the higher number.

Thanks for the input! I guess that settles it: I'll update the number immediately after reading. I was on the fence about this one anyway. Even if the config is borked, it'll be borked in that generation.

Contributor:

Hm, I'm not sure that's enough. We may need to write the config file while holding the lock too, I think?

Imagine we're currently on gen 1 and we get two concurrent requests, one that gives us gen 2 and one that gives us gen 3. If our code is something like:

{
    let mut gen = acquire_generation_lock().await;

    if *gen > incoming_generation {
        return an error;
    }

    *gen = incoming_generation;
} // release `gen` lock

write_new_config_file();

then one possible ordering is:

  • The request for gen 2 acquires the lock. We're currently on gen 1, so this is fine. We update to gen=2 and release the lock. Then we get parked for some reason.
  • The request for gen 3 acquires the lock. We're currently on gen 2, so this is fine. We update to gen=3 and release the lock. We write our config file.
  • The gen 2 request gets unparked. It writes its config file.

Then at this point we think we're on gen=3 but the config file on disk is the one from gen=2.
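
A minimal sketch of the fix being discussed, with hypothetical names and a plain integer standing in for Generation: the check, the config write, and the generation update all happen while the lock is held, so the gen=2 request above can no longer overwrite the gen=3 config. (The PR later also serializes these requests through a long-running task, as discussed further down.)

use std::sync::Mutex;

// Hypothetical stand-in for the real config-writing machinery.
fn write_new_config_file(generation: u64) -> Result<(), String> {
    println!("writing config for generation {generation}");
    Ok(())
}

fn generate_config(current: &Mutex<Option<u64>>, incoming: u64) -> Result<(), String> {
    // Hold the lock for the whole check-write-update sequence.
    let mut guard = current.lock().unwrap();
    if let Some(cur) = *guard {
        if cur > incoming {
            return Err(format!(
                "current generation '{cur}' is greater than incoming generation '{incoming}'"
            ));
        }
    }
    write_new_config_file(incoming)?;
    *guard = Some(incoming);
    Ok(())
}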

Contributor Author (karencfv):

Thanks for the detailed answer!

Hm, I'm not sure that's enough. We may need to write the config file while holding the lock too, I think?

Yep, that makes total sense

Contributor:

Yeah, you are right @jgallagher. These requests all need to be serialized. (I know you are currently writing up some options, just wanted to drop a note).

@karencfv (Contributor Author):

Thanks for the reviews everyone! I'm not finished here, but leaving it for today.
I've updated a couple of endpoints just to try out the new pattern and it seems to be working fine. I just need to move the init_db() functionality to the task and do a bit of clean up, but generally this is the direction I'm taking.

@karencfv (Contributor Author) left a comment:

I think I've addressed all of the comments, let me know if there's something I'm missing!

I've run all the manual tests I did before and received the same results as before.

Comment on lines 124 to 194
pub fn generate_config_and_enable_svc(
&self,
replica_settings: ServerConfigurableSettings,
) -> Result<ReplicaConfig, HttpError> {
let mut current_generation = self.generation.lock().unwrap();
let incoming_generation = replica_settings.generation();

// If the incoming generation number is lower, then we have a problem.
// We should return an error instead of silently skipping the configuration
// file generation.
if let Some(current) = *current_generation {
if current > incoming_generation {
return Err(HttpError::for_client_error(
Some(String::from("Conflict")),
StatusCode::CONFLICT,
format!(
"current generation '{}' is greater than incoming generation '{}'",
current,
incoming_generation,
)
));
}
};

let output =
self.clickward().generate_server_config(replica_settings)?;

// We want to update the generation number only if the config file has been
// generated successfully.
*current_generation = Some(incoming_generation);

// Once we have generated the client we can safely enable the clickhouse_server service
let fmri = "svc:/oxide/clickhouse_server:default".to_string();
Svcadm::enable_service(fmri)?;

Ok(output)
}

pub async fn init_db(&self) -> Result<(), HttpError> {
let log = self.log();
// Initialize the database only if it was not previously initialized.
// TODO: Migrate schema to newer version without wiping data.
let client = self.oximeter_client();
let version = client.read_latest_version().await.map_err(|e| {
HttpError::for_internal_error(format!(
"can't read ClickHouse version: {e}",
))
})?;
if version == 0 {
info!(
log,
"initializing replicated ClickHouse cluster to version {OXIMETER_VERSION}"
);
let replicated = true;
self.oximeter_client()
.initialize_db_with_version(replicated, OXIMETER_VERSION)
.await
.map_err(|e| {
HttpError::for_internal_error(format!(
"can't initialize replicated ClickHouse cluster \
to version {OXIMETER_VERSION}: {e}",
))
})?;
} else {
info!(
log,
"skipping initialization of replicated ClickHouse cluster at version {version}"
);
}

Ok(())
Contributor Author (karencfv):

This is a mechanical change, moving most of the functionality from context.rs to here so we can call these from long_running_ch_server_task.

@karencfv (Contributor Author) left a comment:

Thanks for taking the time to do a live review of this PR @jgallagher @andrewjstone 🙇‍♀️

I think I've addressed all of the changes we discussed. Please let me know if I missed anything!

} => {
let result =
init_db(clickhouse_address, log.clone(), replicated).await;
if let Err(e) = response.send(result) {
Contributor Author (karencfv):

@jgallagher Didn't change this to try_send because it's a oneshot channel, so it doesn't have that method.

Contributor:

Right, that makes sense: oneshot channels are single-use only, so it's not possible for them to block due to the channel being full.

@karencfv (Contributor Author):

Ran all the previous manual tests and additionally grabbed a bit of the logs to show they're happily moving along:

08:03:53.840Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:1023
    local_addr = [fd00:1122:3344:101::26]:8888
    remote_addr = [fd00:1122:3344:101::c]:33133
08:03:53.900Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:863
    latency_us = 32935
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::c]:33133
    req_id = d11525f4-c81c-492c-ae42-2f29d698e178
    response_code = 201
    uri = /config
08:03:53.904Z INFO clickhouse-admin-server (ServerContext): skipping initialization of oximeter database at version 13
    file = clickhouse-admin/src/context.rs:296
08:03:53.904Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:863
    latency_us = 3324
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::c]:33133
    req_id = 66a91f3e-af45-43ef-a570-175c04e2436e
    response_code = 204
    uri = /init
08:04:37.917Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:1023
    local_addr = [fd00:1122:3344:101::26]:8888
    remote_addr = [fd00:1122:3344:101::a]:61669
08:04:37.982Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:863
    latency_us = 30612
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::a]:61669
    req_id = a07d07b6-0c66-4df8-af55-ff74202d9822
    response_code = 201
    uri = /config
08:04:37.989Z INFO clickhouse-admin-server (ServerContext): skipping initialization of oximeter database at version 13
    file = clickhouse-admin/src/context.rs:296
08:04:37.989Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.13.0/src/server.rs:863
    latency_us = 3178
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::a]:61669
    req_id = 99cbf23a-7995-4847-87d8-b7939d214ec1
    response_code = 204
    uri = /init

@jgallagher (Contributor) left a comment:

Thanks for all the back and forth on this! I think there are a few structural things to address on the async / inner task / channel side of things.

// If the incoming generation number is lower, then we have a problem.
// We should return an error instead of silently skipping the configuration
// file generation.
if let Some(current) = current_generation {
Contributor:

I think this method has lost its concurrency protection - it has neither a mutex nor a task that enforces serialization. I think KeeperServerContext needs to spawn a long_running_generate_config_task just like ServerContext, and then we need to communicate with that task in this endpoint.

Contributor Author (karencfv):

whoops forgot that one 😅

// If there is already a configuration file with a generation number we'll
// use that. Otherwise, we set the generation number to None.
let gen = read_generation_from_file(config_path)?;
let (generation_tx, _rx) = watch::channel(gen);
Contributor:

I think this is probably correct, but it isn't really making use of the watch channel (it's just treating it as a fancy mutex). I think what should happen here is:

  • When the watch channel is created, the tx side is given to long_running_generate_config_task.
  • ServerContext should only hold the rx side.
  • The watch channel should not be involved in the message to generate a config at all; long_running_generate_config_task already has the sending side, so it can update it as needed. This means the generate_config endpoint no longer needs to know anything about the generation channel at all.
  • The generation endpoint can .borrow() the receiving end of the channel held by ServerContext to read the latest value set by long_running_generate_config_task.
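
A minimal sketch of that wiring with hypothetical names (the PR itself uses a flume channel for the requests; a tokio mpsc channel is used here just to keep the sketch short). The long-running task owns the watch sender and is the only thing that updates the generation; the server context only holds the receiving ends.

use tokio::sync::{mpsc, watch};

// Hypothetical request type carrying the incoming generation number.
struct GenerateConfig {
    generation: u64,
}

async fn long_running_generate_config_task(
    mut incoming: mpsc::Receiver<GenerateConfig>,
    generation_tx: watch::Sender<Option<u64>>,
) {
    while let Some(req) = incoming.recv().await {
        // ... write the config file here ...
        // Only this task updates the generation; send_replace never fails.
        generation_tx.send_replace(Some(req.generation));
    }
}

struct ServerContext {
    // Endpoints only see the request channel and the read side of the watch.
    generate_config_tx: mpsc::Sender<GenerateConfig>,
    generation_rx: watch::Receiver<Option<u64>>,
}

// Assumes it is called from within a tokio runtime.
fn new_context() -> ServerContext {
    let (generate_config_tx, request_rx) = mpsc::channel(1);
    let (generation_tx, generation_rx) = watch::channel(None);
    tokio::spawn(long_running_generate_config_task(request_rx, generation_tx));
    ServerContext { generate_config_tx, generation_rx }
}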

Svcadm::enable_service(fmri)?;
let (response_tx, response_rx) = oneshot::channel();
ctx.generate_config_tx
.send_async(GenerateConfigRequest::GenerateConfig {
Contributor:

A couple comments here, one substantive and one stylistic:

  • I think based on our conversation, all the .send_asyncs in these endpoints should be .try_send, right? So they don't block if the channel is full and instead return some kind of HTTP busy error?
  • The fact that we're using an inner task to serialize requests can be an implementation detail of ServerContext; the HTTP endpoints shouldn't need to know that there's a channel underneath IMO. Could we move these into methods on the context types? Untested and probably has typos, but something like:
// inside impl ServerContext
pub async fn generate_config(&self, replica_settings: ReplicaSettings) -> Result<ReplicaConfig, SomeErrorType> {
    let (response_tx, response_rx) = oneshot::channel();
    self.generate_config_tx.try_send(GenerateConfigRequest::GenerateConfig {
        clickward: self.clickward,
        log: self.log.clone(),
        replica_settings,
        response: response_tx,
    }).map_err(/* error handling */)?;
    response_rx.await.map_err(/* error handling */)
}

Then this endpoint can probably be reduced to something like

        let ctx = rqctx.context();
        let replica_settings = body.into_inner();
        let result = ctx.generate_config(replica_settings).await.map_err(/* error handling */)?;
        Ok(HttpResponseCreated(result))

and not need to use all the exposed details from how the server contexts are implemented.

@karencfv (Contributor Author) commented Feb 3, 2025:

Thanks for taking the time to review and leave such detailed comments, @jgallagher! I think I've addressed all of them. Please let me know if there's something missing :)

Ran all my manual tests again, and received the same results.

@karencfv requested a review from jgallagher February 3, 2025 09:35
@jgallagher (Contributor) left a comment:

Thanks, this is looking great! I left a bunch of nitpicky comments, mostly around error handling. I'll defer to @andrewjstone on all the clickhouse bits, but the async / concurrency stuff looks like it's in good shape. 👍

rqctx: RequestContext<Self::Context>,
) -> Result<HttpResponseOk<Generation>, HttpError> {
let ctx = rqctx.context();
let gen = match *ctx.generation_rx.borrow() {
Contributor:

This might be a little cleaner if ctx exposed a generation(&self) -> Option<Generation> method? That way (a) generation_rx could be private and (b) HTTP handlers wouldn't need to know the details that the context is managing generations via a watch channel.

I think this is also better hygiene for watch channels. The docs on borrow() note:

Outstanding borrows hold a read lock on the inner value. This means that long-lived borrows could cause the producer half to block. It is recommended to keep the borrow as short-lived as possible.

If context exposes the channel directly, then every user of it is responsible for keeping their borrows short; if instead it only provides a helper method for reading the current value, that helper method guarantees all borrows are short. (This is easy to do in this case because Generation is Copy; it's harder to do with watch channels over types that aren't cheap to clone.)
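
Continuing the hypothetical ServerContext from the sketch earlier in the thread, the helper could be as small as this; the borrow is dropped as soon as the Copy value is read, so no caller can hold the read lock for long.

impl ServerContext {
    // Short-lived borrow of the watch channel; callers never see the channel itself.
    pub fn generation(&self) -> Option<u64> {
        *self.generation_rx.borrow()
    }
}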

Svcadm::enable_service(fmri)?;
let replica_settings = body.into_inner();
let result =
ctx.send_generate_config_and_enable_svc(replica_settings).await?;
Contributor:

Naming nit - I think I'd remove the send_ prefix on all of the generate config / init db methods. It makes sense when looking at the implementation of the method (it's send-ing messages on channels), but I think it's kinda confusing at the callsite here: where are we sending the config generation?

Comment on lines +35 to +36
pub generate_config_tx: Sender<GenerateConfigRequest>,
pub generation_rx: watch::Receiver<Option<Generation>>,
Contributor:

Can we drop the pub from these? (In conjunction with the earlier comment about a helper method for reading the current generation)

let clickward = Clickward::new();
Self { clickward, clickhouse_cli, log }
let config_path = Utf8PathBuf::from_str(CLICKHOUSE_KEEPER_CONFIG_DIR)
Contributor:

Nit - I think all the uses of Utf8PathBuf::from_str(..) could instead be Utf8PathBuf::from(..) and then not need to be .unwrap()'d.

})
.map_err(|e| {
HttpError::for_internal_error(format!(
"failure to send request: {e}"
Contributor:

I'm not sure including the {e} here will be useful. Maybe instead we should match on it and return different kinds of HTTP errors for the two cases? Something like

            .map_err(|e| match e {
                TrySendError::Full(_) => HttpError::for_unavail(
                    None,
                    "channel full: another config request is still running"
                        .to_string(),
                ),
                TrySendError::Disconnected(_) => {
                    HttpError::for_internal_error(
                        "long-running generate-config task died".to_string(),
                    )
                }
            })?;

Comment on lines +384 to +386
generation_tx.send(Some(incoming_generation)).map_err(|e| {
HttpError::for_internal_error(format!("failure to send request: {e}"))
})?;
Contributor:

This is fine but kind of awkward: send can only fail if there are no subscribers, which in our case would mean the context object is gone, which presumably means there isn't anyone around to receive the HTTP error we're creating.

watch::Sender has a few methods that let you update the value without failing even if there are no receivers (send_replace, send_if_modified, send_modify). Since Generation is basically just an integer, send_replace might be the cleanest here?

Suggested change
generation_tx.send(Some(incoming_generation)).map_err(|e| {
HttpError::for_internal_error(format!("failure to send request: {e}"))
})?;
generation_tx.send_replace(Some(incoming_generation));

// TODO: Migrate schema to newer version without wiping data.
let version = client.read_latest_version().await.map_err(|e| {
HttpError::for_internal_error(format!(
"can't read ClickHouse version: {e}",
Contributor:

Including the error here is good, but as written {e} will only include the top-most error and not the full error chain. Can we add a dependency on slog-error-chain and use it to get the full chain of errors?

Suggested change
"can't read ClickHouse version: {e}",
"can't read ClickHouse version: {}", InlineErrorChain::new(e),

.map_err(|e| {
HttpError::for_internal_error(format!(
"can't initialize oximeter database \
to version {OXIMETER_VERSION}: {e}",
Contributor:

(Same note as above - suggest InlineErrorChain::new(&e) here)

return Ok(None);
}

let file = File::open(&path)?;
Contributor:

Can we add context to this error? Something like

Suggested change
let file = File::open(&path)?;
let file = File::open(&path).with_context(|| format!("failed to open {path}"))?;


let line_parts: Vec<&str> = first_line.rsplit(':').collect();
if line_parts.len() != 2 {
bail!("first line of configuration file is malformed: {}", first_line);
Contributor:

Can we add path to this error (and the other bails / anyhows below)?
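
For illustration, a sketch (not the PR's exact code) of what the parsing helper might look like with the path folded into each error, assuming the <!-- generation:N --> header shown in the manual tests above and a plain integer in place of Generation:

use anyhow::{anyhow, bail, Context, Result};
use camino::Utf8Path;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_generation_from_file(path: &Utf8Path) -> Result<Option<u64>> {
    // A missing file just means no configuration has been generated yet.
    if !path.exists() {
        return Ok(None);
    }
    let file = File::open(path).with_context(|| format!("failed to open {path}"))?;
    let first_line = BufReader::new(file)
        .lines()
        .next()
        .ok_or_else(|| anyhow!("{path} is empty"))?
        .with_context(|| format!("failed to read first line of {path}"))?;

    // Expecting a header like `<!-- generation:2 -->`.
    let line_parts: Vec<&str> = first_line.rsplit(':').collect();
    if line_parts.len() != 2 {
        bail!("first line of {path} is malformed: {first_line}");
    }
    line_parts[0]
        .trim_end_matches(" -->")
        .parse::<u64>()
        .map(Some)
        .with_context(|| format!("failed to parse generation number from {path}"))
}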
