Release 2023-04-28 #4110

Merged

Conversation
Shutting down the OTEL tracing provider may hang for quite some time; see, for example, open-telemetry/opentelemetry-rust#868 and our problems with staging in neondatabase/cloud#3707 (comment). Yet we want computes to shut down quickly, as we may need a new one for the same timeline ASAP. So wait no longer than 2s for the shutdown to complete, then just error out and exit the main thread. Related to neondatabase/cloud#3707
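The bounded wait can be sketched with a channel timeout. This is a minimal illustration, not the actual compute_ctl code; `shutdown_tracing_provider` is a hypothetical stand-in for the opentelemetry shutdown call:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the OTEL provider shutdown that may hang;
// the real code calls the opentelemetry shutdown API here.
fn shutdown_tracing_provider() {}

/// Run the shutdown on a helper thread and wait at most 2s for it to finish,
/// then give up so the main thread can exit promptly.
fn shutdown_with_timeout() -> bool {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        shutdown_tracing_provider();
        let _ = tx.send(()); // receiver may already have given up; ignore error
    });
    rx.recv_timeout(Duration::from_secs(2)).is_ok()
}
```

If the helper thread is still stuck after 2s, `recv_timeout` returns an error and the process simply exits without it.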
AArch64 doesn't implement some old syscalls like open and select. Use openat instead of open to check whether seccomp is supported. Leave both select and pselect6 in the allowlist, since we don't call the select syscall directly and can hope that libc will call pselect6 on aarch64.

To check whether a syscall is supported, you can use `scmp_sys_resolver` from the seccomp package:

```
> apt install seccomp
> scmp_sys_resolver -a x86_64 select
23
> scmp_sys_resolver -a aarch64 select
-10101
> scmp_sys_resolver -a aarch64 pselect6
72
```

A negative value means the syscall is not supported. Another cross-check is to look up the actual syscall table in `unistd.h`. To resolve all the macros, one can use `gcc -E`, as is done in the `dump_sys_aarch64()` function in libseccomp/src/arch-syscall-validate.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Sometimes the state contained real values, sometimes just defaults if the spec had not been received yet. Make the state clearer by making it an Option instead. One consequence is that if a required setting like neon.tenant_id is missing from the spec file sent to the /configure endpoint, it is spotted earlier and you get an immediate HTTP error response. Not that it matters very much, but it's nicer nevertheless.
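The shape of the change can be sketched as follows; the type and field names here are illustrative only, not the actual compute_ctl types:

```rust
// Illustrative types; the real compute spec carries many more fields.
struct ComputeSpec {
    tenant_id: Option<String>,
}

struct ComputeState {
    // `None` until a spec has been received, instead of a default-filled struct.
    spec: Option<ComputeSpec>,
}

/// Reject a spec with missing required settings as soon as it arrives,
/// so the /configure endpoint can return an immediate error.
fn apply_spec(state: &mut ComputeState, spec: ComputeSpec) -> Result<(), String> {
    if spec.tenant_id.is_none() {
        return Err("missing neon.tenant_id in spec".to_string());
    }
    state.spec = Some(spec);
    Ok(())
}
```

With the Option, code that needs the spec must handle the "not configured yet" case explicitly instead of silently reading defaults.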
Stronger types are generally nicer.
'compute_ctl' doesn't use the operation_uuid for anything, it just prints it to the log.
The PR replaces the module-level, function-based walreceiver interface with a `WalReceiver` struct that exposes a few public methods: `new`, `start` and `stop`. Later, the same struct is planned to be used for getting walreceiver stats (and maybe other extra data) to display during missing-WAL errors for #2106. The change required some extra logic changes:

* due to the `WalReceiver` struct added, it became easier to pass `ctx` and later do a `detached_child` instead of https://github.com/neondatabase/neon/blob/bfee4127014022a43bd85bccb562ed4bc62dc075/pageserver/src/tenant/timeline.rs#L1379-L1381
* `WalReceiver::start`, which is now the public API to start the walreceiver, can return an `Err`, which may turn a tenant into `Broken`, same as the timeline it tries to load during startup.
* `WalReceiverConf` was added to group walreceiver parameters from the pageserver's tenant config
All non-trivial updates are extracted into separate commits; `cargo hakari` data and its manifest format were also updated. Three sets of crates remain un-updated:

* `base64` — touches proxy in a lot of places and has changed its API quite strongly (by version 0.21) since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates:

```
error[E0308]: mismatched types
  --> libs/tracing-utils/src/http.rs:65:21
   |
65 |     span.set_parent(parent_ctx);
   |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
   |          |
   |          arguments to this method are incorrect
   |
   = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
   |
43 |     fn set_parent(&self, cx: Context);
   |        ^^^^^^^^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```

  `tracing-opentelemetry` version `0.19`, which is supposed to have the update we need, is not yet released.
* similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have a similar issue:

```
error[E0308]: mismatched types
   --> libs/postgres_backend/tests/simple_select.rs:112:78
    |
112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
    |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
    |                                |
    |                                arguments to this function are incorrect
    |
    = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
    |
125 | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
    |
91  | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
    |
23  |     pub fn new(config: ClientConfig) -> Self {
    |            ^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```

* aws crates: I could not make the new API work with the bucket endpoint overload, and console e2e tests failed. Our other tests passed; further investigation is worth doing in #4008
When no SNI is provided, use the default certificate; otherwise we can't get to the options parameter, which can also be used to set the endpoint name. This means the non-SNI flow will not work for CNAME domains in verify-full mode.
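The fallback logic can be sketched like this; `Cert` and `choose_cert` are illustrative names, not the proxy's actual types (the real code works with rustls configs):

```rust
use std::collections::HashMap;

// Illustrative certificate type; the real proxy holds rustls server configs.
struct Cert {
    common_name: String,
}

/// Pick a certificate by SNI name. With no SNI, fall back to the default
/// certificate so the connection can proceed and the client can still pass
/// the endpoint name via the `options` startup parameter.
fn choose_cert<'a>(
    sni: Option<&str>,
    by_name: &'a HashMap<String, Cert>,
    default: &'a Cert,
) -> Option<&'a Cert> {
    match sni {
        Some(name) => by_name.get(name),
        None => Some(default),
    }
}
```

The trade-off in the commit message follows directly: a client connecting via a CNAME without SNI gets the default certificate, which fails verify-full hostname checks.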
## Describe your changes

## Issue ticket number and link

#3673

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Reason and backtrace are added to the Broken state. The backtrace is automatically collected when a tenant enters the broken state. The format for the API, CLI and metrics is changed and unified to return the tenant state name in camel case. Previously, snake case was used for metrics and camel case for everything else. The tenant state field in the TenantInfo swagger spec is changed to contain the state name in a "slug" field and other fields (currently only reason and backtrace for the Broken variant) in a "data" field. To allow for this breaking change, the state was removed from the TenantInfo swagger spec because it was not used anywhere. Please note that the tenant's broken reason is not persisted on disk, so the reason is lost when the pageserver is restarted. Requires changes to the grafana dashboard that monitors tenant states. Closes #3001

---------

Co-authored-by: theirix <theirix@gmail.com>
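The slug/data split can be sketched with a simplified stand-in for the real TenantState (the actual enum has more variants and derives its serialization via serde):

```rust
// Simplified stand-in for the pageserver's tenant state enum.
enum TenantState {
    Active,
    Broken { reason: String, backtrace: String },
}

/// Camel-case state name (the "slug"), now used uniformly by the API,
/// CLI and metrics; variant payloads go into a separate "data" field.
fn slug(state: &TenantState) -> &'static str {
    match state {
        TenantState::Active => "Active",
        TenantState::Broken { .. } => "Broken",
    }
}
```

Keeping the slug payload-free is what lets metrics label values and API responses agree on one spelling per state.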
## Describe your changes

Do not forget to process required manual stuff after release

## Issue ticket number and link

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist

---------

Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
We now use the term "endpoint" for compute Postgres nodes in the web UI and user-facing documentation. Adjust the nomenclature in the code. This changes the name of the "neon_local pg" command to "neon_local endpoint". Also adjust the names of classes, variables etc. in the python tests accordingly.

This also changes the directory structure so that endpoints are now stored in:

    .neon/endpoints/<endpoint id>

instead of:

    .neon/pgdatadirs/tenants/<tenant_id>/<endpoint (node) name>

The tenant ID is no longer part of the path. That means that you cannot have two endpoints with the same name/ID in two different tenants anymore. That's consistent with how we treat endpoints in the real control plane and proxy: the endpoint ID must be globally unique.
Looking at the git history of this test, I think "size == 0" used to have a special meaning earlier, but now it should never happen.
To not be taken by surprise by an upstream git re-tag or by malicious activity, let's verify the checksum for extensions we download. Also, unify the installation of `pg_graphql` and `pg_tiktoken` with the other extensions.
With this commit one can request compute reconfiguration from the running `compute_ctl` with the compute in `Running` state by sending a new spec:

```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```

Internally, we start a separate configurator thread that waits on a `Condvar` for the `ConfigurationPending` compute state in a loop. It then does the reconfiguration, sets the compute back to the `Running` state and notifies other waiters. It will need some follow-ups, e.g. retry logic for control-plane requests, but should be useful for testing in its current state. This shouldn't affect any existing environment, since computes are configured in a different way there. Resolves neondatabase/cloud#4433
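The condvar handshake can be sketched with std primitives. This is a simplified illustration (the names are stand-ins, and it handles a single reconfiguration rather than looping forever like the real configurator thread):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Simplified stand-in for compute_ctl's compute status.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ComputeStatus {
    Running,
    ConfigurationPending,
}

/// Spawn a configurator thread that sleeps on the condvar until the status
/// becomes ConfigurationPending, "reconfigures", flips the status back to
/// Running, and notifies all waiters. The real thread does this in a loop.
fn spawn_configurator(state: Arc<(Mutex<ComputeStatus>, Condvar)>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let (lock, cvar) = &*state;
        let mut status = lock.lock().unwrap();
        // Guard against spurious wakeups by re-checking the condition.
        while *status != ComputeStatus::ConfigurationPending {
            status = cvar.wait(status).unwrap();
        }
        // ... apply the new spec here ...
        *status = ComputeStatus::Running;
        cvar.notify_all();
    })
}
```

The HTTP handler plays the other side: it sets the status to `ConfigurationPending`, notifies the condvar, then waits for the status to return to `Running` before replying.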
## Describe your changes

Makes Proxy start draining connections on SIGTERM.

## Issue ticket number and link

#3333
Before this patch, if a tenant overrode its eviction_policy setting to use a lower LayerAccessThreshold::threshold than `evictions_low_residence_duration_metric_threshold`, the evictions done for that tenant would count towards the `evictions_with_low_residence_duration` metric. That metric is used to identify premature evictions, commonly triggered by disk-usage-based eviction under disk pressure. We don't want that to happen for the legitimate evictions of the tenant that overrides its eviction_policy. So, this patch

- moves the setting into TenantConf
- adds test coverage
- updates the staging & prod yamls

Forward compatibility: software before this patch will ignore the new tenant conf field and use the global one instead, so we can roll back safely.

Backward compatibility: parsing old configs with software as of this patch will fail in `PageServerConf::parse_and_validate` with error `unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'` if the option is still present in the global section. We deal with this by updating the configs in Ansible.

fixes #3940
…4019) Initially, the idea was to ensure that when we come and check data availability, the special service table already contains one row, so if we lose it for some reason, we will error out. Yet, to do the availability check, we start the compute first anyway! So it doesn't really add any value, but we affect each compute start, as we update at least one row in the database. This also writes some WAL, so if a timeline is close to `neon.max_cluster_size`, it could prevent the compute from starting up. So instead, do CREATE TABLE IF NOT EXISTS + UPSERT right in the `/check_writability` handler.
Commit e6ec240 introduced some trivial whitespace issues.
Refactoring part of #4093. Numerous `Send + Sync` bounds were a distraction and were not needed at all. The proper `Bytes` usage and one `"error_message".to_string()` are just drive-by fixes. Not using `PostgresBackendTCP` allows us to start setting read timeouts (and more). `PostgresBackendTCP` is still used from proxy, so it cannot be removed.
- Increase `connect_timeout` to 30s, which should be enough for most cases
- If the script cannot connect to the DB (or any other `psycopg2.OperationalError` occurs), do not fail the script; log the error and proceed. Problems with fetching flaky tests shouldn't block the PR
Refactors walsenders out of timeline.rs, to make it less convoluted, into a separate WalSenders struct with its own lock, but otherwise having the same structure. Tracking of the in-memory remote_consistent_lsn is also moved there, as it is mainly received from the pageserver. The walsender (feedback) state is also restructured to be cleaner; now it is either PageserverFeedback or StandbyFeedback(StandbyReply, HotStandbyFeedback), but not both.
This allows replacing u64 with a proper Lsn and pretty-printing PageserverFeedback with serde(_json). Now walsenders on safekeepers queried with debug_dump look like:

```
"walsenders": [
  {
    "ttid": "fafe0cf39a99c608c872706149de9d2a/b4fb3be6f576935e7f0fcb84bdb909a1",
    "addr": "127.0.0.1:48774",
    "conn_id": 3,
    "appname": "pageserver",
    "feedback": {
      "Pageserver": {
        "current_timeline_size": 32096256,
        "last_received_lsn": "0/2415298",
        "disk_consistent_lsn": "0/1696628",
        "remote_consistent_lsn": "0/0",
        "replytime": "2023-04-12T13:54:53.958856+00:00"
      }
    }
  }
],
```
We used `display_serialize` previously, but it works only for Serialize. `DisplayFromStr` does the same, but also works for Deserialize.
And log postgres to stdout. Probably fixes #3778
vadim2404 requested review from knizhnik, petuhovskiy and tychoish and removed request for a team on April 28, 2023 14:25
This reverts commit 732acc5. Reverted PR: #3869 As noted in PR #4094, we do in fact try to insert duplicates to the layer map, if L0->L1 compaction is interrupted. We do not have a proper fix for that right now, and we are in a hurry to make a release to production, so revert the changes related to this to the state that we have in production currently. We know that we have a bug here, but better to live with the bug that we've had in production for a long time, than rush a fix to production without testing it in staging first. Cc: #4094, #4088
vadim2404 force-pushed the vk/release_2023-04-28 branch from e123f2a to d11d781 on April 28, 2023 14:32
petuhovskiy approved these changes on Apr 28, 2023
LGTM. Note that there is still 1 stuck project on staging.
koivunej approved these changes on Apr 28, 2023
Test results for ec9dcb1: debug build: 220 tests run, 210 passed, 0 failed, 10 skipped (full report); release build: 220 tests run, 210 passed, 0 failed, 10 skipped (full report)
#4056 was left out from the release compared to current