Release 2023 04 28 #4110

vadim2404 · 2023-04-28T14:25:00Z

Release 2023-04-26

NB: this PR must be merged only by 'Create a merge commit'!

Checklist when preparing for release

Read or refresh the release flow guide
Ask in the cloud Slack channel that you are going to rollout the release. Any blockers?
Does this release contain any db migrations? Destructive ones? What is the rollback plan?

Checklist after release

Make sure instructions from PRs included in this release and labeled manual_release_instructions are executed (either by you or by people who wrote them).
- Rebase #3398 #3954
  - post-release: fix "broken" => "Broken" in prod alert
Based on the merged commits write release notes and open a PR into website repo (example)
Check #dev-production-stream Slack channel
Check stuck projects page
Check recent operation failures
Check cloud SLO dashboard
Check compute startup metrics dashboard

) Shutting down OTEL tracing provider may hang for quite some time, see, for example: - open-telemetry/opentelemetry-rust#868 - and our problems with staging neondatabase/cloud#3707 (comment) Yet, we want computes to shut down fast enough, as we may need a new one for the same timeline ASAP. So wait no longer than 2s for the shutdown to complete, then just error out and exit the main thread. Related to neondatabase/cloud#3707

Aarch64 doesn't implement some old syscalls like open and select. Use openat instead of open to check if seccomp is supported. Leave both select and pselect6 in the allowlist since we don't call select syscall directly and may hope that libc will call pselect6 on aarch64. To check whether some syscall is supported it is possible to use `scmp_sys_resolver` from seccopm package: ``` > apt install seccopm > scmp_sys_resolver -a x86_64 select 23 > scmp_sys_resolver -a aarch64 select -10101 > scmp_sys_resolver -a aarch64 pselect6 72 ``` Negative value means that syscall is not supported. Another cross-check is to look up for the actuall syscall table in `unistd.h`. To resolve all the macroses one can use `gcc -E` as it is done in `dump_sys_aarch64()` function in libseccomp/src/arch-syscall-validate. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

Sometimes, it contained real values, sometimes just defaults if the spec was not received yet. Make the state more clear by making it an Option instead. One consequence is that if some of the required settings like neon.tenant_id are missing from the spec file sent to the /configure endpoint, it is spotted earlier and you get an immediate HTTP error response. Not that it matters very much, but it's nicer nevertheless.

Stronger types are generally nicer.

'compute_ctl' doesn't use the operation_uuid for anything, it just prints it to the log.

The PR changes module function-based walreceiver interface with a `WalReceiver` struct that exposes a few public methods, `new`, `start` and `stop` now. Later, the same struct is planned to be used for getting walreceiver stats (and, maybe, other extra data) to display during missing wal errors for #2106 Now though, the change required extra logic changes: * due to the `WalReceiver` struct added, it became easier to pass `ctx` and later do a `detached_child` instead of https://github.com/neondatabase/neon/blob/bfee4127014022a43bd85bccb562ed4bc62dc075/pageserver/src/tenant/timeline.rs#L1379-L1381 * `WalReceiver::start` which is now the public API to start the walreceiver, could return an `Err` which now may turn a tenant into `Broken`, same as the timeline that it tries to load during startup. * `WalReceiverConf` was added to group walreceiver parameters from pageserver's tenant config

All non-trivial updates extracted into separate commits, also `carho hakari` data and its manifest format were updated. 3 sets of crates remain unupdated: * `base64` — touches proxy in a lot of places and changed its api (by 0.21 version) quite strongly since our version (0.13). * `opentelemetry` and `opentelemetry-*` crates ``` error[E0308]: mismatched types --> libs/tracing-utils/src/http.rs:65:21 | 65 | span.set_parent(parent_ctx); | ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context` | | | arguments to this method are incorrect | = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api` --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1 | 77 | pub struct Context { | ^^^^^^^^^^^^^^^^^^ note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api` --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1 | 77 | pub struct Context { | ^^^^^^^^^^^^^^^^^^ = note: perhaps two different versions of crate `opentelemetry_api` are being used? note: associated function defined here --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8 | 43 | fn set_parent(&self, cx: Context); | ^^^^^^^^^^ For more information about this error, try `rustc --explain E0308`. error: could not compile `tracing-utils` due to previous error warning: build failed, waiting for other jobs to finish... error: could not compile `tracing-utils` due to previous error ``` `tracing-opentelemetry` of version `0.19` is not yet released, that is supposed to have the update we need. * similarly, `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have similar issue: ``` error[E0308]: mismatched types --> libs/postgres_backend/tests/simple_select.rs:112:78 | 112 | let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg); | --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig` | | | arguments to this function are incorrect | = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types note: struct `ClientConfig` is defined in crate `rustls` --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1 | 125 | pub struct ClientConfig { | ^^^^^^^^^^^^^^^^^^^^^^^ note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls` --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1 | 91 | pub struct ClientConfig { | ^^^^^^^^^^^^^^^^^^^^^^^ = note: perhaps two different versions of crate `rustls` are being used? note: associated function defined here --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12 | 23 | pub fn new(config: ClientConfig) -> Self { | ^^^ For more information about this error, try `rustc --explain E0308`. error: could not compile `postgres_backend` due to previous error warning: build failed, waiting for other jobs to finish... ``` * aws crates: I could not make new API to work with bucket endpoint overload, and console e2e tests failed. Other our tests passed, further investigation is worth to be done in #4008

…)" (#4013) This reverts commit a64044a. See https://neondb.slack.com/archives/C03H1K0PGKH/p1681306682795559

When no SNI is provided use the default certificate, otherwise we can't get to the options parameter which can be used to set endpoint name too. That means that non-SNI flow will not work for CNAME domains in verify-full mode.

## Describe your changes ## Issue ticket number and link #3673 ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

Reason and backtrace are added to the Broken state. Backtrace is automatically collected when tenant entered the broken state. The format for API, CLI and metrics is changed and unified to return tenant state name in camel case. Previously snake case was used for metrics and camel case was used for everything else. Now tenant state field in TenantInfo swagger spec is changed to contain state name in "slug" field and other fields (currently only reason and backtrace for Broken variant in "data" field). To allow for this breaking change state was removed from TenantInfo swagger spec because it was not used anywhere. Please note that the tenant's broken reason is not persisted on disk so the reason is lost when pageserver is restarted. Requires changes to grafana dashboard that monitors tenant states. Closes #3001 --------- Co-authored-by: theirix <theirix@gmail.com>

## Describe your changes Do not forget to process required manual stuff after release ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>

We use the term "endpoint" in for compute Postgres nodes in the web UI and user-facing documentation now. Adjust the nomenclature in the code. This changes the name of the "neon_local pg" command to "neon_local endpoint". Also adjust names of classes, variables etc. in the python tests accordingly. This also changes the directory structure so that endpoints are now stored in: .neon/endpoints/<endpoint id> instead of: .neon/pgdatadirs/tenants/<tenant_id>/<endpoint (node) name> The tenant ID is no longer part of the path. That means that you cannot have two endpoints with the same name/ID in two different tenants anymore. That's consistent with how we treat endpoints in the real control plane and proxy: the endpoint ID must be globally unique.

Looking at the git history of this test, I think "size == 0" used to have a special meaning earlier, but now it should never happen.

To not be taken by surprise by upstream git re-tag or by malicious activity, let's verify the checksum for extensions we download Also, unify the installation of `pg_graphql` and `pg_tiktoken` with other extensions.

With this commit one can request compute reconfiguration from the running `compute_ctl` with compute in `Running` state by sending a new spec: ```shell curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure ``` Internally, we start a separate configurator thread that is waiting on `Condvar` for `ConfigurationPending` compute state in a loop. Then it does reconfiguration, sets compute back to `Running` state and notifies other waiters. It will need some follow-ups, e.g. for retry logic for control-plane requests, but should be useful for testing in the current state. This shouldn't affect any existing environment, since computes are configured in a different way there. Resolves neondatabase/cloud#4433

## Describe your changes Makes Proxy start draining connections on SIGTERM. ## Issue ticket number and link #3333

) Before this patch, if a tenant would override its eviction_policy setting to use a lower LayerAccessThreshold::threshold than the `evictions_low_residence_duration_metric_threshold`, the evictions done for that tenant would count towards the `evictions_with_low_residence_duration` metric. That metric is used to identify pre-mature evictions, commonly triggered by disk-usage-based eviction under disk pressure. We don't want that to happen for the legitimate evictions of the tenant that overrides its eviction_policy. So, this patch - moves the setting into TenantConf - adds test coverage - updates the staging & prod yamls Forward Compatibility: Software before this patch will ignore the new tenant conf field and use the global one instead. So we can roll back safely. Backward Compatibility: Parsing old configs with software as of this patch will fail in `PageServerConf::parse_and_validate` with error `unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'` if the option is still present in the global section. We deal with this by updating the configs in Ansible. fixes #3940

…4019) Initially, idea was to ensure that when we come and check data availability, special service table already contains one row. So if we loose it for some reason, we will error out. Yet, to do availability check we anyway start compute first! So it doesn't really add some value, but we affect each compute start as we update at least one row in the database. Also this writes some WAL, so if timeline is close to `neon.max_cluster_size` it could prevent compute from starting up. That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the `/check_writability` handler.

Commit e6ec240 introduced some trivial whitespace issues.

Refactoring part of #4093. Numerious `Send + Sync` bounds were a distraction, that were not needed at all. The proper `Bytes` usage and one `"error_message".to_string()` are just drive-by fixes. Not using the `PostgresBackendTCP` allows us to start setting read timeouts (and more). `PostgresBackendTCP` is still used from proxy, so it cannot be removed.

- Increase `connect_timeout` to 30s, which should be enough for most of the cases - If the script cannot connect to the DB (or any other `psycopg2.OperationalError` occur) — do not fail the script, log the error and proceed. Problems with fetching flaky tests shouldn't block the PR

Introduce read timeouts to our `page_service` connections. Without read timeouts, we essentially leak connections. This is a port of #3995. Split the refactorings to the other PR: #4097. Fixes #4028.

Refactors walsenders out of timeline.rs to makes it less convoluted into separate WalSenders with its own lock, but otherwise having the same structure. Tracking of in-memory remote_consistent_lsn is also moved there as it is mainly received from pageserver. State of walsender (feedback) is also restructured to be cleaner; now it is either PageserverFeedback or StandbyFeedback(StandbyReply, HotStandbyFeedback), but not both.

It allows to replace u64 with proper Lsn and pretty print PageserverFeedback with serde(_json). Now walsenders on safekeepers queried with debug_dump look like "walsenders": [ { "ttid": "fafe0cf39a99c608c872706149de9d2a/b4fb3be6f576935e7f0fcb84bdb909a1", "addr": "127.0.0.1:48774", "conn_id": 3, "appname": "pageserver", "feedback": { "Pageserver": { "current_timeline_size": 32096256, "last_received_lsn": "0/2415298", "disk_consistent_lsn": "0/1696628", "remote_consistent_lsn": "0/0", "replytime": "2023-04-12T13:54:53.958856+00:00" } } } ],

We used `display_serialize` previously, but it works only for Serialize. `DisplayFromStr` does the same, but also works for Deserialize.

And log postgres to stdout. Probably fixes #3778

This reverts commit 732acc5. Reverted PR: #3869 As noted in PR #4094, we do in fact try to insert duplicates to the layer map, if L0->L1 compaction is interrupted. We do not have a proper fix for that right now, and we are in a hurry to make a release to production, so revert the changes related to this to the state that we have in production currently. We know that we have a bug here, but better to live with the bug that we've had in production for a long time, than rush a fix to production without testing it in staging first. Cc: #4094, #4088

petuhovskiy

LGTM. Note that there is still 1 stuck project on staging.

github-actions · 2023-04-28T14:58:59Z

Test results for `ec9dcb1`:

debug build: 220 tests run: 210 passed, 0 failed, 10 (full report)

release build: 220 tests run: 210 passed, 0 failed, 10 (full report)

koivunej · 2023-04-28T15:01:16Z

#4056 was left out from the release compared to current main.

kelvich and others added 30 commits April 11, 2023 14:54

Add more proxy cnames

de99ee2

Use Lsn, TenantId, TimelineId types in compute_ctl.

ef68321

Stronger types are generally nicer.

Remove unused 'timestamp' field from ComputeSpec struct.

8ace7a7

Tolerate missing 'operation_uuid' field in spec file.

06ce83c

'compute_ctl' doesn't use the operation_uuid for anything, it just prints it to the log.

Add support for ip4r extension

8d29578

GitHub Workflows: use ref_name instead of ref

218062c

GitHub Workflows: print error messages to stderr

c94b899

GitHub Workflows: use '!cancelled' instead of 'success or failure'

13e53e5

Revert "Update most of the dependencies to their latest versions (#3991…

f7995b3

…)" (#4013) This reverts commit a64044a. See https://neondb.slack.com/archives/C03H1K0PGKH/p1681306682795559

Add support for non-SNI case in multi-cert proxy

5d0ecad

When no SNI is provided use the default certificate, otherwise we can't get to the options parameter which can be used to set endpoint name too. That means that non-SNI flow will not work for CNAME domains in verify-full mode.

Compile pg_hint_plan extension

c237a2f

Tenant size should never be zero. Simplify test.

89b5589

Looking at the git history of this test, I think "size == 0" used to have a special meaning earlier, but now it should never happen.

Verify extensions checksums (#4014)

36c2094

To not be taken by surprise by upstream git re-tag or by malicious activity, let's verify the checksum for extensions we download Also, unify the installation of `pg_graphql` and `pg_tiktoken` with other extensions.

Make proxy shutdown when all connections are closed (#3764)

fd31faf

## Describe your changes Makes Proxy start draining connections on SIGTERM. ## Issue ticket number and link #3333

Bump h2 from 0.3.15 to 0.3.17 (#4020)

b6c7c32

test_runner: add Timeline Inspector to Grafana links (#4021)

0c82ff3

Compile postgres with lz4 and zstd support

017d3a3

Add procps into compute containers

75ea810

hlinnaka and others added 8 commits April 27, 2023 18:45

Run rustfmt to fix whitespace.

d1e86d6

Commit e6ec240 introduced some trivial whitespace issues.

feat(page_service): read timeouts (#4093)

fe0b616

Introduce read timeouts to our `page_service` connections. Without read timeouts, we essentially leak connections. This is a port of #3995. Split the refactorings to the other PR: #4097. Fixes #4028.

Use serde_as DisplayFromStr everywhere (#4103)

b03143d

We used `display_serialize` previously, but it works only for Serialize. `DisplayFromStr` does the same, but also works for Deserialize.

Increase connection_timeout to PG in find end of WAL test.

42798e6

And log postgres to stdout. Probably fixes #3778

vadim2404 requested a review from koivunej April 28, 2023 14:25

vadim2404 requested review from a team as code owners April 28, 2023 14:25

vadim2404 requested review from knizhnik, petuhovskiy and tychoish and removed request for a team April 28, 2023 14:25

vadim2404 changed the title ~~Vk/release 2023 04 28~~ Release 2023 04 28 Apr 28, 2023

vadim2404 force-pushed the vk/release_2023-04-28 branch from e123f2a to d11d781 Compare April 28, 2023 14:32

Merge branch 'release' into vk/release_2023-04-28

ec9dcb1

vadim2404 requested a review from kelvich April 28, 2023 14:34

petuhovskiy approved these changes Apr 28, 2023

View reviewed changes

koivunej approved these changes Apr 28, 2023

View reviewed changes

vadim2404 merged commit 1407174 into release Apr 28, 2023

vadim2404 deleted the vk/release_2023-04-28 branch April 28, 2023 15:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 2023 04 28 #4110

Release 2023 04 28 #4110

vadim2404 commented Apr 28, 2023

petuhovskiy left a comment

github-actions bot commented Apr 28, 2023

koivunej commented Apr 28, 2023 •

edited

Loading

Release 2023 04 28 #4110

Release 2023 04 28 #4110

Conversation

vadim2404 commented Apr 28, 2023

Release 2023-04-26

Checklist when preparing for release

Checklist after release

petuhovskiy left a comment

Choose a reason for hiding this comment

github-actions bot commented Apr 28, 2023

Test results for ec9dcb1:

debug build: 220 tests run: 210 passed, 0 failed, 10 (full report)

release build: 220 tests run: 210 passed, 0 failed, 10 (full report)

koivunej commented Apr 28, 2023 • edited Loading

Test results for `ec9dcb1`:

koivunej commented Apr 28, 2023 •

edited

Loading