Release 2023-04-26 #4086

koivunej · 2023-04-26T12:15:41Z

Release 2023-04-26

NB: this PR must be merged only by 'Create a merge commit'!

Checklist when preparing for release

Read or refresh the release flow guide
Announce in the cloud Slack channel that you are going to roll out the release. Any blockers?
Does this release contain any db migrations? Destructive ones? What is the rollback plan?

Checklist after release

Make sure instructions from PRs included in this release and labeled manual_release_instructions are executed (either by you or by people who wrote them).
- Rebase #3398 #3954
  - post-release: fix "broken" => "Broken" in prod alert
Based on the merged commits write release notes and open a PR into website repo (example)
Check #dev-production-stream Slack channel
Check stuck projects page
Check recent operation failures
Check cloud SLO dashboard
Check compute startup metrics dashboard

Shutting down the OTEL tracing provider may hang for quite some time; see, for example:
- open-telemetry/opentelemetry-rust#868
- our problems with staging: neondatabase/cloud#3707 (comment)

Yet we want computes to shut down fast enough, as we may need a new one for the same timeline ASAP. So wait no longer than 2s for the shutdown to complete, then just error out and exit the main thread.

Related to neondatabase/cloud#3707
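The bounded wait described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code from the commit: `shutdown_with_timeout` is a hypothetical helper, and the closure passed to it stands in for the real tracing provider shutdown call.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `shutdown` on a helper thread and waits at most `timeout` for it.
/// On timeout the helper thread is left detached; the caller is expected
/// to log the error and exit the process.
fn shutdown_with_timeout<F>(shutdown: F, timeout: Duration) -> Result<(), String>
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        shutdown();
        // The receiver may already be gone after a timeout; ignore that.
        let _ = done_tx.send(());
    });
    done_rx
        .recv_timeout(timeout)
        .map_err(|e| format!("shutdown failed or timed out after {timeout:?}: {e}"))
}
```

A caller would pass the 2s budget mentioned above, for example `shutdown_with_timeout(stop_tracing, Duration::from_secs(2))`, where `stop_tracing` is whatever function performs the actual shutdown.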

Aarch64 doesn't implement some old syscalls like open and select. Use openat instead of open to check if seccomp is supported. Leave both select and pselect6 in the allowlist, since we don't call the select syscall directly and may hope that libc will call pselect6 on aarch64.

To check whether a syscall is supported, one can use `scmp_sys_resolver` from the seccomp package:

```
> apt install seccomp
> scmp_sys_resolver -a x86_64 select
23
> scmp_sys_resolver -a aarch64 select
-10101
> scmp_sys_resolver -a aarch64 pselect6
72
```

A negative value means that the syscall is not supported.

Another cross-check is to look up the actual syscall table in `unistd.h`. To resolve all the macros one can use `gcc -E`, as is done in the `dump_sys_aarch64()` function in libseccomp/src/arch-syscall-validate.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

Sometimes, it contained real values, sometimes just defaults if the spec was not received yet. Make the state more clear by making it an Option instead. One consequence is that if some of the required settings like neon.tenant_id are missing from the spec file sent to the /configure endpoint, it is spotted earlier and you get an immediate HTTP error response. Not that it matters very much, but it's nicer nevertheless.
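A minimal sketch of the idea, assuming hypothetical `ParsedSpec` and `ComputeState` types and a flat list of settings (the real spec format is richer, and `neon.timeline_id` is used here only as a second illustrative key): the state holds `None` until a valid spec arrives, and a missing required setting is rejected at configure time.

```rust
/// Hypothetical, trimmed-down spec: only the settings used in this sketch.
#[derive(Debug, Clone, PartialEq)]
struct ParsedSpec {
    tenant_id: String,
    timeline_id: String,
}

/// Hypothetical compute state: `None` until a spec has been received.
#[derive(Debug, Default)]
struct ComputeState {
    pspec: Option<ParsedSpec>,
}

/// Validates the settings up front, so a missing required setting is an
/// immediate error instead of a silently used default.
fn parse_spec(settings: &[(&str, &str)]) -> Result<ParsedSpec, String> {
    let get = |key: &str| {
        settings
            .iter()
            .find(|(k, _)| *k == key)
            .map(|(_, v)| v.to_string())
            .ok_or_else(|| format!("missing required setting: {key}"))
    };
    Ok(ParsedSpec {
        tenant_id: get("neon.tenant_id")?,
        timeline_id: get("neon.timeline_id")?,
    })
}

impl ComputeState {
    /// What a `/configure` handler could call; the state is only replaced
    /// when the whole spec parses.
    fn configure(&mut self, settings: &[(&str, &str)]) -> Result<(), String> {
        self.pspec = Some(parse_spec(settings)?);
        Ok(())
    }
}
```

An HTTP handler built on this would map the `Err` case to an error response right away, which is the earlier failure the commit message refers to.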

Stronger types are generally nicer.
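As an illustration of what a stronger type buys, here is a hedged sketch of an LSN newtype. It is not the project's actual `Lsn` type; it only assumes the PostgreSQL text form of an LSN, two hexadecimal halves separated by `/`.

```rust
use std::fmt;
use std::str::FromStr;

/// A log sequence number, kept distinct from plain integers and strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

impl FromStr for Lsn {
    type Err = String;

    /// Parses the text form `HI/LO`, where both halves are hexadecimal.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (hi, lo) = s
            .split_once('/')
            .ok_or_else(|| format!("invalid LSN {s:?}: missing '/'"))?;
        let hi = u32::from_str_radix(hi, 16).map_err(|e| format!("invalid LSN {s:?}: {e}"))?;
        let lo = u32::from_str_radix(lo, 16).map_err(|e| format!("invalid LSN {s:?}: {e}"))?;
        Ok(Lsn((u64::from(hi) << 32) | u64::from(lo)))
    }
}

impl fmt::Display for Lsn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:X}/{:X}", self.0 >> 32, self.0 & 0xffff_ffff)
    }
}
```

With such a type, a malformed value fails once at the parsing boundary, and the compiler prevents passing, say, a tenant ID where an LSN is expected.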

'compute_ctl' doesn't use the operation_uuid for anything, it just prints it to the log.

The PR replaces the module-level, function-based walreceiver interface with a `WalReceiver` struct that now exposes a few public methods: `new`, `start` and `stop`. Later, the same struct is planned to be used for getting walreceiver stats (and, maybe, other extra data) to display during missing WAL errors for #2106.

For now, though, the change required extra logic changes:
* with the `WalReceiver` struct added, it became easier to pass `ctx` and later do a `detached_child` instead of https://github.com/neondatabase/neon/blob/bfee4127014022a43bd85bccb562ed4bc62dc075/pageserver/src/tenant/timeline.rs#L1379-L1381
* `WalReceiver::start`, which is now the public API to start the walreceiver, could return an `Err`, which now may turn a tenant into `Broken`, same as the timeline that it tries to load during startup.
* `WalReceiverConf` was added to group walreceiver parameters from pageserver's tenant config

All non-trivial updates were extracted into separate commits; `cargo hakari` data and its manifest format were also updated.

3 sets of crates remain not updated:
* `base64`: touches proxy in a lot of places, and its API changed quite strongly (by version 0.21) since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates

  ```
  error[E0308]: mismatched types
    --> libs/tracing-utils/src/http.rs:65:21
     |
  65 |     span.set_parent(parent_ctx);
     |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
     |          |
     |          arguments to this method are incorrect
     |
     = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
  note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
    --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
     |
  77 | pub struct Context {
     | ^^^^^^^^^^^^^^^^^^
  note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
    --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
     |
  77 | pub struct Context {
     | ^^^^^^^^^^^^^^^^^^
     = note: perhaps two different versions of crate `opentelemetry_api` are being used?
  note: associated function defined here
    --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
     |
  43 |     fn set_parent(&self, cx: Context);
     |        ^^^^^^^^^^
  For more information about this error, try `rustc --explain E0308`.
  error: could not compile `tracing-utils` due to previous error
  warning: build failed, waiting for other jobs to finish...
  error: could not compile `tracing-utils` due to previous error
  ```

  `tracing-opentelemetry` version `0.19`, which is supposed to have the update we need, is not yet released.
* similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have the same kind of issue:

  ```
  error[E0308]: mismatched types
     --> libs/postgres_backend/tests/simple_select.rs:112:78
      |
  112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
      |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
      |                                |
      |                                arguments to this function are incorrect
      |
      = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
  note: struct `ClientConfig` is defined in crate `rustls`
     --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
      |
  125 | pub struct ClientConfig {
      | ^^^^^^^^^^^^^^^^^^^^^^^
  note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
     --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
      |
  91  | pub struct ClientConfig {
      | ^^^^^^^^^^^^^^^^^^^^^^^
      = note: perhaps two different versions of crate `rustls` are being used?
  note: associated function defined here
     --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
      |
  23  |     pub fn new(config: ClientConfig) -> Self {
      |            ^^^
  For more information about this error, try `rustc --explain E0308`.
  error: could not compile `postgres_backend` due to previous error
  warning: build failed, waiting for other jobs to finish...
  ```
* aws crates: I could not make the new API work with the bucket endpoint overload, and console e2e tests failed. Our other tests passed; further investigation is worth doing in #4008


When no SNI is provided use the default certificate, otherwise we can't get to the options parameter which can be used to set endpoint name too. That means that non-SNI flow will not work for CNAME domains in verify-full mode.

## Describe your changes

## Issue ticket number and link

#3673

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

Reason and backtrace are added to the Broken state. The backtrace is collected automatically when a tenant enters the broken state.

The format for API, CLI and metrics is changed and unified to return the tenant state name in camel case. Previously, snake case was used for metrics and camel case was used for everything else.

The tenant state field in the TenantInfo swagger spec is now changed to contain the state name in the "slug" field and other fields (currently only reason and backtrace for the Broken variant) in the "data" field. To allow for this breaking change, state was removed from the TenantInfo swagger spec because it was not used anywhere.

Please note that the tenant's broken reason is not persisted on disk, so the reason is lost when the pageserver is restarted.

Requires changes to the grafana dashboard that monitors tenant states.

Closes #3001

---------

Co-authored-by: theirix <theirix@gmail.com>

## Describe your changes

Do not forget to process required manual stuff after release

## Issue ticket number and link

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above checklist

---------

Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>

We now use the term "endpoint" for compute Postgres nodes in the web UI and user-facing documentation. Adjust the nomenclature in the code.

This changes the name of the "neon_local pg" command to "neon_local endpoint". Also adjust names of classes, variables etc. in the python tests accordingly.

This also changes the directory structure so that endpoints are now stored in:

    .neon/endpoints/<endpoint id>

instead of:

    .neon/pgdatadirs/tenants/<tenant_id>/<endpoint (node) name>

The tenant ID is no longer part of the path. That means that you cannot have two endpoints with the same name/ID in two different tenants anymore. That's consistent with how we treat endpoints in the real control plane and proxy: the endpoint ID must be globally unique.
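The layout change can be illustrated with two small path helpers. Both function names are hypothetical and exist only for this sketch; the directory names are the ones given above.

```rust
use std::path::{Path, PathBuf};

/// New layout: the endpoint ID alone determines the directory.
fn endpoint_dir(repo_dir: &Path, endpoint_id: &str) -> PathBuf {
    repo_dir.join("endpoints").join(endpoint_id)
}

/// Old layout: the tenant ID was part of the path, so the same node name
/// could exist once per tenant.
fn legacy_endpoint_dir(repo_dir: &Path, tenant_id: &str, name: &str) -> PathBuf {
    repo_dir
        .join("pgdatadirs")
        .join("tenants")
        .join(tenant_id)
        .join(name)
}
```

Because `endpoint_dir` ignores the tenant, two tenants asking for the same endpoint ID map to the same directory, which is why the ID has to be globally unique.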

Looking at the git history of this test, I think "size == 0" used to have a special meaning earlier, but now it should never happen.

To not be taken by surprise by an upstream git re-tag or by malicious activity, let's verify the checksum for the extensions we download.

Also, unify the installation of `pg_graphql` and `pg_tiktoken` with other extensions.

With this commit one can request compute reconfiguration from the running `compute_ctl` with compute in `Running` state by sending a new spec:

```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```

Internally, we start a separate configurator thread that is waiting on `Condvar` for `ConfigurationPending` compute state in a loop. Then it does reconfiguration, sets compute back to `Running` state and notifies other waiters.

It will need some follow-ups, e.g. for retry logic for control-plane requests, but should be useful for testing in the current state. This shouldn't affect any existing environment, since computes are configured in a different way there.

Resolves neondatabase/cloud#4433
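The configurator loop described above can be sketched with `Mutex` and `Condvar` from the standard library. The `Compute` struct, both function names and the `rounds` parameter are assumptions made for this illustration; the real thread loops indefinitely and performs the actual reconfiguration where the comment indicates.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ComputeStatus {
    Running,
    ConfigurationPending,
}

/// Hypothetical shared state: the status plus a condvar signalling changes.
struct Compute {
    status: Mutex<ComputeStatus>,
    changed: Condvar,
}

/// Spawns the configurator thread. It sleeps on the condvar until the status
/// becomes `ConfigurationPending`, then sets it back to `Running` and wakes
/// the other waiters. `rounds` bounds the loop so the sketch can terminate.
fn spawn_configurator(compute: Arc<Compute>, rounds: usize) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..rounds {
            let guard = compute.status.lock().unwrap();
            let mut guard = compute
                .changed
                .wait_while(guard, |s| *s != ComputeStatus::ConfigurationPending)
                .unwrap();
            // ... the actual reconfiguration would happen here ...
            *guard = ComputeStatus::Running;
            drop(guard);
            compute.changed.notify_all();
        }
    })
}

/// What a `/configure` handler could do: mark the configuration as pending,
/// wake the configurator, and wait until the status is `Running` again.
fn request_reconfiguration(compute: &Compute) {
    let mut guard = compute.status.lock().unwrap();
    *guard = ComputeStatus::ConfigurationPending;
    compute.changed.notify_all();
    let _guard = compute
        .changed
        .wait_while(guard, |s| *s != ComputeStatus::Running)
        .unwrap();
}
```

Using `wait_while` rather than a bare `wait` re-checks the predicate after every wakeup, so spurious wakeups and notifications meant for the other side are harmless.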

## Describe your changes

Makes Proxy start draining connections on SIGTERM.

## Issue ticket number and link

#3333

Before this patch, if a tenant overrode its eviction_policy setting to use a lower LayerAccessThreshold::threshold than the `evictions_low_residence_duration_metric_threshold`, the evictions done for that tenant would count towards the `evictions_with_low_residence_duration` metric. That metric is used to identify premature evictions, commonly triggered by disk-usage-based eviction under disk pressure. We don't want that to happen for the legitimate evictions of a tenant that overrides its eviction_policy.

So, this patch
- moves the setting into TenantConf
- adds test coverage
- updates the staging & prod yamls

Forward Compatibility: Software before this patch will ignore the new tenant conf field and use the global one instead. So we can roll back safely.

Backward Compatibility: Parsing old configs with software as of this patch will fail in `PageServerConf::parse_and_validate` with error `unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'` if the option is still present in the global section. We deal with this by updating the configs in Ansible.

fixes #3940
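A hedged sketch of the per-tenant lookup this enables. The struct and function names are hypothetical; only the setting name comes from the text above. A tenant-level value, when present, takes precedence over the default, so evictions for a tenant with a deliberately low threshold stop counting as premature.

```rust
use std::time::Duration;

/// Hypothetical per-tenant overrides; `None` means "use the default".
#[derive(Default)]
struct TenantConfOpt {
    evictions_low_residence_duration_metric_threshold: Option<Duration>,
}

/// Hypothetical defaults applied when a tenant does not override a setting.
struct TenantConfDefaults {
    evictions_low_residence_duration_metric_threshold: Duration,
}

/// Picks the tenant's own threshold if set, otherwise the default.
fn effective_threshold(tenant: &TenantConfOpt, defaults: &TenantConfDefaults) -> Duration {
    tenant
        .evictions_low_residence_duration_metric_threshold
        .unwrap_or(defaults.evictions_low_residence_duration_metric_threshold)
}

/// An eviction counts towards the low-residence metric only if the layer was
/// resident for less than the threshold that applies to this tenant.
fn counts_as_low_residence(
    resident_for: Duration,
    tenant: &TenantConfOpt,
    defaults: &TenantConfDefaults,
) -> bool {
    resident_for < effective_threshold(tenant, defaults)
}
```

For example, with a 24-hour default and a tenant override of 1 hour, an eviction after 2 hours of residence counts as low-residence for a tenant without the override but not for the tenant with it.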

…4019) Initially, the idea was to ensure that when we come and check data availability, the special service table already contains one row. So if we lose it for some reason, we will error out. Yet, to do the availability check we start compute first anyway! So it doesn't really add any value, but it affects each compute start, as we update at least one row in the database. This also writes some WAL, so if the timeline is close to `neon.max_cluster_size` it could prevent compute from starting up. That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the `/check_writability` handler.

I didn't get through #3775 fast enough so we wanted to remove this metric. Fixes #3705.

this should give us complete deployments while a new one is being brought up.

Adds just a counter counting up from the creation of the tenant, logged after activation. Might help guide us with the investigation of #4025.

It had a couple of inherent races:

1) Even if compute is killed before the call, some more data might still arrive at the safekeepers after commit_lsn on them is polled, advancing it. Then the checkpoint on the pageserver might not include this tail, and so the upload of the expected LSN won't happen until one more checkpoint.

2) commit_lsn is updated asynchronously: compute can commit a transaction before communicating commit_lsn to even a single safekeeper (sync-safekeepers can be used to force the advancement). This makes the semantics of wait_for_sk_commit_lsn_to_reach_remote_storage quite complicated.

Replace it with last_flush_lsn_upload, which
1) Learns the last flush LSN on compute;
2) Waits for it to arrive at the pageserver;
3) Checkpoints it;
4) Waits for the upload.

In some tests this keeps compute alive longer than before, but this doesn't seem to be important.

There is a chance this fixes #3209

github-actions · 2023-04-26T12:33:37Z

Test results for `11df2ee`:

debug build: 219 tests run: 209 passed, 0 failed, 10 skipped (full report)

release build: 219 tests run: 209 passed, 0 failed, 10 skipped (full report)

dismissing the approval as into "request changes"

koivunej · 2023-04-26T15:10:21Z

TODO:

staging still has stuck projects (16)
severity of test_pageserver_chaos: duplicate L1 in test #4088 needs to be understood

koivunej · 2023-04-28T13:32:29Z

Outdated, new one will be built by @vadim2404 with #4104.

kelvich and others added 30 commits April 11, 2023 14:54

Add more proxy cnames

de99ee2

Use Lsn, TenantId, TimelineId types in compute_ctl.

ef68321


Remove unused 'timestamp' field from ComputeSpec struct.

8ace7a7

Tolerate missing 'operation_uuid' field in spec file.

06ce83c


Add support for ip4r extension

8d29578

GitHub Workflows: use ref_name instead of ref

218062c

GitHub Workflows: print error messages to stderr

c94b899

GitHub Workflows: use '!cancelled' instead of 'success or failure'

13e53e5

Revert "Update most of the dependencies to their latest versions (#3991)" (#4013)

f7995b3

This reverts commit a64044a. See https://neondb.slack.com/archives/C03H1K0PGKH/p1681306682795559

Add support for non-SNI case in multi-cert proxy

5d0ecad


Compile pg_hint_plan extension

c237a2f

Tenant size should never be zero. Simplify test.

89b5589


Verify extensions checksums (#4014)

36c2094


Make proxy shutdown when all connections are closed (#3764)

fd31faf


Bump h2 from 0.3.15 to 0.3.17 (#4020)

b6c7c32

test_runner: add Timeline Inspector to Grafana links (#4021)

0c82ff3

Compile postgres with lz4 and zstd support

017d3a3

Add procps into compute containers

75ea810

SergeyMelnikov and others added 6 commits April 26, 2023 09:36

Configure extra domain for us-east-1 (#4078)

f19b70b

refactor: drop pageserver_ondisk_layers (#4071)

850f6b1


build: remove busted sk-1.us-east-2 from staging hosts (#4082)

4625da3


feat: log how long tenant activation takes (#4080)

381c8fc


Add safekeeper-3.us-east-2.aws.neon.build (#4085)

11df2ee

koivunej requested review from a team as code owners April 26, 2023 12:15

koivunej requested review from petuhovskiy, Daniel-ef, save-buffer and shanyp and removed request for a team April 26, 2023 12:15

koivunej changed the title ~~Release 2023 04 26~~ Release 2023-04-26 Apr 26, 2023

koivunej mentioned this pull request Apr 26, 2023

Release 2023-04-25 #4066

Closed

koivunej requested review from vadim2404 and removed request for shanyp April 26, 2023 13:19

vadim2404 previously approved these changes Apr 26, 2023


koivunej mentioned this pull request Apr 26, 2023

test_pageserver_chaos: duplicate L1 in test #4088

Closed

koivunej closed this Apr 28, 2023

koivunej deleted the release-2023-04-26 branch April 28, 2023 13:32
