Release 2023-04-26 #4086
Commits on Apr 11, 2023
(commit de99ee2)
[compute_ctl] Add timeout for `tracing_utils::shutdown_tracing()` (#3982)
Shutting down the OTEL tracing provider may hang for quite some time; see, for example:
- open-telemetry/opentelemetry-rust#868
- our problems with staging in neondatabase/cloud#3707 (comment)
Yet we want computes to shut down fast enough, as we may need a new one for the same timeline ASAP. So wait no longer than 2s for the shutdown to complete, then just error out and exit the main thread.
Related to neondatabase/cloud#3707
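One way to bound a potentially hanging shutdown call is sketched below. It is only an illustration using the standard library: `shutdown_with_timeout` is a hypothetical helper, and the actual `compute_ctl` change may be implemented differently.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `shutdown` on a helper thread and waits at most
/// `timeout` for it. On timeout the helper thread is left detached, which is
/// acceptable when the process is about to exit anyway.
fn shutdown_with_timeout<F>(shutdown: F, timeout: Duration) -> Result<(), String>
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        shutdown();
        // The receiver may already be gone after a timeout; ignore that.
        let _ = tx.send(());
    });
    // An error here means either a timeout or a panic in the helper thread.
    rx.recv_timeout(timeout)
        .map_err(|_| format!("shutdown did not finish within {:?}", timeout))
}
```

A caller would pass the shutdown routine and a 2s limit, for example `shutdown_with_timeout(|| { /* shut down tracing */ }, Duration::from_secs(2))`, and exit with an error if it returns `Err`.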
(commit 40a68e9)
Support aarch64 in walredo seccomp code (#3996)
Aarch64 doesn't implement some old syscalls like open and select. Use openat instead of open to check if seccomp is supported. Leave both select and pselect6 in the allowlist, since we don't call the select syscall directly and may hope that libc will call pselect6 on aarch64.
To check whether a syscall is supported, use `scmp_sys_resolver` from the seccomp package:
```
> apt install seccomp
> scmp_sys_resolver -a x86_64 select
23
> scmp_sys_resolver -a aarch64 select
-10101
> scmp_sys_resolver -a aarch64 pselect6
72
```
A negative value means that the syscall is not supported. Another cross-check is to look up the actual syscall table in `unistd.h`. To resolve all the macros, one can use `gcc -E`, as is done in the `dump_sys_aarch64()` function in libseccomp/src/arch-syscall-validate.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
(commit 3c9f42a)
Refactor 'spec' in ComputeState.
Sometimes, it contained real values, sometimes just defaults if the spec was not received yet. Make the state more clear by making it an Option instead. One consequence is that if some of the required settings like neon.tenant_id are missing from the spec file sent to the /configure endpoint, it is spotted earlier and you get an immediate HTTP error response. Not that it matters very much, but it's nicer nevertheless.
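The difference between "defaults until a spec arrives" and an explicit `Option` can be sketched as follows. The types are trimmed-down, hypothetical stand-ins (only `neon.tenant_id` is taken from the commit message), not the real `compute_ctl` structs.

```rust
/// Hypothetical, trimmed-down spec with only the field named above.
#[derive(Debug, Clone, PartialEq)]
struct ComputeSpec {
    tenant_id: String,
}

#[derive(Debug, Default)]
struct ComputeState {
    /// `None` until a spec has been received, instead of a struct of defaults.
    spec: Option<ComputeSpec>,
}

/// Validates the incoming settings up front, so a missing `neon.tenant_id`
/// is rejected immediately instead of surfacing later as an empty default.
fn configure(state: &mut ComputeState, tenant_id: Option<&str>) -> Result<(), String> {
    let tenant_id = tenant_id.ok_or_else(|| "missing neon.tenant_id in spec".to_string())?;
    state.spec = Some(ComputeSpec {
        tenant_id: tenant_id.to_string(),
    });
    Ok(())
}
```

An HTTP handler could map the `Err` case directly to an error response, which is the "spotted earlier" behaviour the message describes.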
(commit 6064a26)
Commits on Apr 12, 2023
Use Lsn, TenantId, TimelineId types in compute_ctl.
Stronger types are generally nicer.
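As an illustration of what a stronger type buys, here is a minimal, self-contained `Lsn` newtype. It is a sketch that assumes the usual PostgreSQL `HI/LO` hexadecimal text form; it is not the definition used in the repository.

```rust
use std::fmt;
use std::str::FromStr;

/// A log sequence number, wrapped so it cannot be mixed up with other `u64`s.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

impl FromStr for Lsn {
    type Err = String;

    /// Parses the text form `HI/LO`, where both halves are hexadecimal.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let (hi, lo) = s
            .split_once('/')
            .ok_or_else(|| format!("invalid LSN: {}", s))?;
        let hi = u32::from_str_radix(hi, 16).map_err(|e| e.to_string())?;
        let lo = u32::from_str_radix(lo, 16).map_err(|e| e.to_string())?;
        Ok(Lsn((u64::from(hi) << 32) | u64::from(lo)))
    }
}

impl fmt::Display for Lsn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:X}/{:X}", self.0 >> 32, self.0 & 0xFFFF_FFFF)
    }
}
```

With such a type, a function that expects an `Lsn` cannot accidentally be handed a raw integer or an unparsed string.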
(commit ef68321)
(commit 8ace7a7)
Tolerate missing 'operation_uuid' field in spec file.
'compute_ctl' doesn't use the operation_uuid for anything, it just prints it to the log.
(commit 06ce83c)
Move walreceiver start and stop behind a struct (#3973)
The PR replaces the module-level, function-based walreceiver interface with a `WalReceiver` struct that exposes a few public methods: `new`, `start` and `stop`. Later, the same struct is planned to be used for getting walreceiver stats (and maybe other extra data) to display during missing WAL errors for #2106.
For now, though, the change required extra logic changes:
* with the `WalReceiver` struct added, it became easier to pass `ctx` and later do a `detached_child` instead of https://github.com/neondatabase/neon/blob/bfee4127014022a43bd85bccb562ed4bc62dc075/pageserver/src/tenant/timeline.rs#L1379-L1381
* `WalReceiver::start`, which is now the public API to start the walreceiver, can return an `Err`, which may now turn a tenant into `Broken`, same as the timeline that it tries to load during startup.
* `WalReceiverConf` was added to group walreceiver parameters from pageserver's tenant config
Kirill Bulatov authored Apr 12, 2023
(commit d8939d4)
Update most of the dependencies to their latest versions (#3991)
All non-trivial updates were extracted into separate commits; `cargo hakari` data and its manifest format were also updated.
3 sets of crates remain unupdated:
* `base64`: touches proxy in a lot of places and changed its API (by version 0.21) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates
```
error[E0308]: mismatched types
  --> libs/tracing-utils/src/http.rs:65:21
   |
65 |     span.set_parent(parent_ctx);
   |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
   |          |
   |          arguments to this method are incorrect
   |
   = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
   |
43 |     fn set_parent(&self, cx: Context);
   |        ^^^^^^^^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```
`tracing-opentelemetry` version `0.19`, which is supposed to have the update we need, is not yet released.
* similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have a similar issue:
```
error[E0308]: mismatched types
   --> libs/postgres_backend/tests/simple_select.rs:112:78
    |
112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
    |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
    |                                |
    |                                arguments to this function are incorrect
    |
    = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
    |
125 | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
    |
91  | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
    |
23  |     pub fn new(config: ClientConfig) -> Self {
    |            ^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```
* aws crates: I could not make the new API work with the bucket endpoint overload, and the console e2e tests failed. Our other tests passed; further investigation is worth doing in #4008
Kirill Bulatov authored Apr 12, 2023
(commit a64044a)
(commit 8d29578)
(commit 218062c)
(commit c94b899)
(commit 13e53e5)
Revert "Update most of the dependencies to their latest versions (#3991)" (#4013)
This reverts commit a64044a. See https://neondb.slack.com/archives/C03H1K0PGKH/p1681306682795559
Kirill Bulatov authored Apr 12, 2023
(commit f7995b3)
Add support for non-SNI case in multi-cert proxy
When no SNI is provided, use the default certificate; otherwise we can't get to the options parameter, which can also be used to set the endpoint name. That means that the non-SNI flow will not work for CNAME domains in verify-full mode.
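The selection rule can be pictured with a small lookup. This is a sketch only: `Cert` and `CertResolver` are hypothetical types, and returning `None` for an unknown server name is an assumption, not something the commit message states.

```rust
use std::collections::HashMap;

/// Hypothetical certificate handle; a real proxy would hold key material.
#[derive(Debug, Clone, PartialEq)]
struct Cert(&'static str);

struct CertResolver {
    by_domain: HashMap<String, Cert>,
    default: Cert,
}

impl CertResolver {
    /// With SNI, pick the certificate registered for that server name.
    /// Without SNI, fall back to the default certificate so the handshake can
    /// proceed and the endpoint can still be taken from the options parameter.
    fn resolve(&self, sni: Option<&str>) -> Option<&Cert> {
        match sni {
            Some(name) => self.by_domain.get(name),
            None => Some(&self.default),
        }
    }
}
```

Because the default certificate is issued for the default domain, a client that verifies the host name against a CNAME domain cannot succeed without SNI, which matches the limitation noted above.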
(commit 5d0ecad)
Commits on Apr 13, 2023
Add check for duplicates of generated image layers (#3869)
## Describe your changes
## Issue ticket number and link
#3673
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
(commit 732acc5)
Add reason to TenantState::Broken (#3954)
Reason and backtrace are added to the Broken state. The backtrace is collected automatically when a tenant enters the broken state. The format for API, CLI and metrics is changed and unified to return the tenant state name in camel case. Previously, snake case was used for metrics and camel case for everything else. The tenant state field in the TenantInfo swagger spec is changed to contain the state name in a "slug" field and other fields (currently only reason and backtrace for the Broken variant) in a "data" field. To allow for this breaking change, state was removed from the TenantInfo swagger spec because it was not used anywhere. Please note that the tenant's broken reason is not persisted on disk, so the reason is lost when the pageserver is restarted. Requires changes to the grafana dashboard that monitors tenant states. Closes #3001
---------
Co-authored-by: theirix <theirix@gmail.com>
(commit 15d1f85)
(commit c237a2f)
Add note about `manual_release_instructions` label (#4015)
## Describe your changes
Do not forget to process required manual stuff after release
## Issue ticket number and link
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
---------
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
(commit 356439a)
Rename "Postgres nodes" in control_plane to endpoints.
We now use the term "endpoint" for compute Postgres nodes in the web UI and user-facing documentation. Adjust the nomenclature in the code. This changes the name of the "neon_local pg" command to "neon_local endpoint". Also adjust names of classes, variables etc. in the python tests accordingly.
This also changes the directory structure so that endpoints are now stored in:
.neon/endpoints/<endpoint id>
instead of:
.neon/pgdatadirs/tenants/<tenant_id>/<endpoint (node) name>
The tenant ID is no longer part of the path. That means that you cannot have two endpoints with the same name/ID in two different tenants anymore. That's consistent with how we treat endpoints in the real control plane and proxy: the endpoint ID must be globally unique.
(commit 53f438a)
Tenant size should never be zero. Simplify test.
Looking at the git history of this test, I think "size == 0" used to have a special meaning earlier, but now it should never happen.
(commit 89b5589)
Verify extensions checksums (#4014)
To not be taken by surprise by an upstream git re-tag or by malicious activity, let's verify the checksum for extensions we download. Also, unify the installation of `pg_graphql` and `pg_tiktoken` with other extensions.
(commit 36c2094)
[compute_ctl] Implement live reconfiguration (#3980)
With this commit one can request compute reconfiguration from the running `compute_ctl` with compute in `Running` state by sending a new spec:
```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```
Internally, we start a separate configurator thread that is waiting on `Condvar` for `ConfigurationPending` compute state in a loop. Then it does reconfiguration, sets compute back to `Running` state and notifies other waiters. It will need some follow-ups, e.g. for retry logic for control-plane requests, but should be useful for testing in the current state. This shouldn't affect any existing environment, since computes are configured in a different way there.
Resolves neondatabase/cloud#4433
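The configurator-thread handshake described above can be sketched with a `Mutex` and `Condvar` from the standard library. The state names come from the commit message; the function names and the single-request simplification are assumptions made for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ComputeStatus {
    Running,
    ConfigurationPending,
}

type Shared = Arc<(Mutex<ComputeStatus>, Condvar)>;

/// Spawns the configurator: it sleeps on the condvar until the state becomes
/// `ConfigurationPending`, applies the new configuration, flips the state back
/// to `Running` and wakes all waiters. This sketch serves a single request.
fn spawn_configurator(shared: Shared) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let (lock, cvar) = &*shared;
        let mut status = lock.lock().unwrap();
        while *status != ComputeStatus::ConfigurationPending {
            status = cvar.wait(status).unwrap();
        }
        // ... a real implementation would apply the new spec here ...
        *status = ComputeStatus::Running;
        cvar.notify_all();
    })
}

/// What a request handler might do: mark the state pending, wake the
/// configurator, then block until the state is `Running` again.
fn request_reconfiguration(shared: &Shared) {
    let (lock, cvar) = &**shared;
    let mut status = lock.lock().unwrap();
    *status = ComputeStatus::ConfigurationPending;
    cvar.notify_all();
    while *status != ComputeStatus::Running {
        status = cvar.wait(status).unwrap();
    }
}
```

Both sides re-check the state in a loop after every wake-up, so the sketch works regardless of which thread takes the lock first and tolerates spurious wake-ups.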
(commit db8dd6f)
Make proxy shutdown when all connections are closed (#3764)
## Describe your changes
Makes Proxy start draining connections on SIGTERM.
## Issue ticket number and link
#3333
Sasha Krassovsky authored Apr 13, 2023
(commit fd31faf)
(commit b6c7c32)
Commits on Apr 14, 2023
make evictions_low_residence_duration_metric_threshold per-tenant (#3949)
Before this patch, if a tenant overrode its eviction_policy setting to use a lower LayerAccessThreshold::threshold than the `evictions_low_residence_duration_metric_threshold`, the evictions done for that tenant would count towards the `evictions_with_low_residence_duration` metric. That metric is used to identify premature evictions, commonly triggered by disk-usage-based eviction under disk pressure. We don't want that to happen for the legitimate evictions of the tenant that overrides its eviction_policy.
So, this patch
- moves the setting into TenantConf
- adds test coverage
- updates the staging & prod yamls
Forward Compatibility: Software before this patch will ignore the new tenant conf field and use the global one instead. So we can roll back safely.
Backward Compatibility: Parsing old configs with software as of this patch will fail in `PageServerConf::parse_and_validate` with error `unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'` if the option is still present in the global section. We deal with this by updating the configs in Ansible.
fixes #3940
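A rough sketch of the per-tenant lookup follows. Only the setting name is taken from the commit message; the struct shape, the `default` parameter and both helper functions are assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical, trimmed-down tenant config.
struct TenantConf {
    /// Per-tenant override; `None` means "use the default value".
    evictions_low_residence_duration_metric_threshold: Option<Duration>,
}

/// Resolves the threshold that applies to one tenant.
fn effective_threshold(tenant: &TenantConf, default: Duration) -> Duration {
    tenant
        .evictions_low_residence_duration_metric_threshold
        .unwrap_or(default)
}

/// An eviction counts towards the low-residence metric only if the layer was
/// resident for less than the tenant's own effective threshold.
fn is_low_residence_eviction(residence: Duration, tenant: &TenantConf, default: Duration) -> bool {
    residence < effective_threshold(tenant, default)
}
```

A tenant that deliberately evicts after, say, one minute can set its own threshold to that value, so its routine evictions no longer inflate the metric.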
(commit 8895f28)
(commit 0c82ff3)
[compute_ctl] Do not create availability checker data on each start (#4019)
Initially, the idea was to ensure that when we come and check data availability, the special service table already contains one row, so that if we lose it for some reason, we error out. Yet to do the availability check we start compute first anyway! So it doesn't really add any value, but it affects each compute start, as we update at least one row in the database. This also writes some WAL, so if the timeline is close to `neon.max_cluster_size` it could prevent compute from starting up. That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the `/check_writability` handler.
(commit 589cf1e)
(commit 017d3a3)
(commit 75ea810)
(commit 5ffa20d)
Update most of the dependencies to their latest versions (#4026)
See #3991. Brings the changes back with the right way to use the new `toml_edit` to deserialize values, and a unit test for this.
All non-trivial updates were extracted into separate commits; `cargo hakari` data and its manifest format were also updated.
3 sets of crates remain unupdated:
* `base64`: touches proxy in a lot of places and changed its API (by version 0.21) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates
```
error[E0308]: mismatched types
  --> libs/tracing-utils/src/http.rs:65:21
   |
65 |     span.set_parent(parent_ctx);
   |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
   |          |
   |          arguments to this method are incorrect
   |
   = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
   |
43 |     fn set_parent(&self, cx: Context);
   |        ^^^^^^^^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```
`tracing-opentelemetry` version `0.19`, which is supposed to have the update we need, is not yet released.
* similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have a similar issue:
```
error[E0308]: mismatched types
   --> libs/postgres_backend/tests/simple_select.rs:112:78
    |
112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
    |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
    |                                |
    |                                arguments to this function are incorrect
    |
    = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
    |
125 | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
    |
91  | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
    |
23  |     pub fn new(config: ClientConfig) -> Self {
    |            ^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```
* aws crates: I could not make the new API work with the bucket endpoint overload, and the console e2e tests failed. Our other tests passed; further investigation is worth doing in #4008
Kirill Bulatov authored Apr 14, 2023
(commit ebea298)
Commits on Apr 16, 2023
(commit c2496c7)
Commits on Apr 17, 2023
Send AppendResponse keepalive once per second (#4036)
Walproposer sends AppendRequest at least once per second. This patch adds a response to these requests once per second. Fixes #4017
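The once-per-second rule can be expressed as a small timer check. This is an illustrative sketch: `KeepaliveTimer` is a hypothetical type, and the safekeeper code may track the interval differently.

```rust
use std::time::{Duration, Instant};

const KEEPALIVE_INTERVAL: Duration = Duration::from_secs(1);

/// Hypothetical tracker for when the last AppendResponse was sent.
struct KeepaliveTimer {
    last_sent: Instant,
}

impl KeepaliveTimer {
    fn new(now: Instant) -> Self {
        KeepaliveTimer { last_sent: now }
    }

    /// Returns true, and restarts the interval, when at least one second has
    /// passed since the last reply. `now` is a parameter to keep this testable.
    fn should_send(&mut self, now: Instant) -> bool {
        if now.duration_since(self.last_sent) >= KEEPALIVE_INTERVAL {
            self.last_sent = now;
            true
        } else {
            false
        }
    }
}
```

A request handler would call `should_send(Instant::now())` for each incoming AppendRequest and reply only when it returns true, unless a reply is needed for other reasons.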
(commit 73f34ea)
(commit d8dd60d)
Add us-east-1 hosts file and update regions (#4042)
## Describe your changes
## Issue ticket number and link
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
(commit 0c08356)
Commits on Apr 18, 2023
(commit e2a5177)
(commit f1b7dc4)
(commit 0bfbae2)
fix vm-informant dbname: "neondb" -> "postgres" (#4046)
Changes the vm-informant's postgres connection string's dbname from "neondb" (which sometimes doesn't exist) to "postgres" (which _hopefully_ should exist more often?). Currently there are a handful of VMs in prod that aren't working with autoscaling because they don't have the "neondb" database. The vm-informant doesn't require any database in particular; it's just connecting as `cloud_admin` to be able to adjust the file cache settings.
(commit 02b28ae)
Commits on Apr 21, 2023
[compute_ctl] Improve 'empty' compute startup sequence (#4034)
Make several attempts to get the spec from the control-plane, retrying network errors and all reasonable HTTP response codes. Do not hang waiting for the spec without confirmation from the control-plane that the compute is known and is in the `Empty` state. Adjust the way we track the `total_startup_ms` metric: it should be calculated from the moment we received the spec, not from the moment `compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric to track the time spent sleeping and waiting for the spec to be delivered from the control-plane. Part of neondatabase/cloud#3533
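The retry behaviour can be sketched as a bounded loop over a fetch function. Everything here is hypothetical: the `SpecResponse` variants, the function name and the fixed delay are assumptions chosen to illustrate the idea, not the actual `compute_ctl` logic.

```rust
use std::thread;
use std::time::Duration;

/// Outcome of one hypothetical request to the control plane.
enum SpecResponse {
    /// The spec is available.
    Spec(String),
    /// The compute is known and `Empty`: the spec will be delivered later.
    Empty,
    /// Network error or a retryable HTTP status.
    Retryable(String),
    /// Anything else: give up immediately.
    Fatal(String),
}

/// Tries `fetch` up to `max_attempts` times, sleeping `delay` between
/// retryable failures. `Ok(None)` means "confirmed empty, wait for the spec".
fn get_spec_with_retries<F>(
    mut fetch: F,
    max_attempts: u32,
    delay: Duration,
) -> Result<Option<String>, String>
where
    F: FnMut() -> SpecResponse,
{
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match fetch() {
            SpecResponse::Spec(spec) => return Ok(Some(spec)),
            SpecResponse::Empty => return Ok(None),
            SpecResponse::Fatal(e) => return Err(e),
            SpecResponse::Retryable(e) => {
                last_err = e;
                if attempt < max_attempts {
                    thread::sleep(delay);
                }
            }
        }
    }
    Err(last_err)
}
```

Only the `Ok(None)` outcome justifies going on to wait for a spec, which mirrors the rule that the compute must be confirmed as `Empty` first.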
(commit 7ba5c28)
Commits on Apr 24, 2023
Adding synthetic size to pageserver swagger (#4049)
## Describe your changes
I added synthetic size response to the console swagger. Now I am syncing it back to neon
(commit afbbc61)
Commits on Apr 25, 2023
add libmetric metric for each logged log message (#4055)
This patch extends the libmetrics logging setup functionality with a `tracing` layer that increments a Prometheus counter each time we log a log message. We have the counter per tracing event level. This allows for monitoring WARN and ERR log volume without parsing the log. Also, it would allow cross-checking whether logs got dropped on the way into Loki. It would be nicer if we could hook deeper into the tracing logging layer, to avoid evaluating the filter twice. But I don't know how to do it.
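Conceptually, the layer keeps one counter per event level and bumps it for every logged event. The sketch below uses plain atomics instead of the `tracing` and Prometheus APIs, so all type and method names are hypothetical stand-ins.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// The five event levels, used here as array indices.
#[derive(Clone, Copy)]
enum Level {
    Error = 0,
    Warn = 1,
    Info = 2,
    Debug = 3,
    Trace = 4,
}

/// One monotonically increasing counter per level.
#[derive(Default)]
struct LogCounters {
    per_level: [AtomicU64; 5],
}

impl LogCounters {
    /// Called once for every event that is actually logged.
    fn on_event(&self, level: Level) {
        self.per_level[level as usize].fetch_add(1, Ordering::Relaxed);
    }

    /// Current value of the counter for one level.
    fn count(&self, level: Level) -> u64 {
        self.per_level[level as usize].load(Ordering::Relaxed)
    }
}
```

Exported as a labelled counter, the WARN and ERROR series can be alerted on directly, and their totals can be compared against the number of lines that reached the log store.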
(commit e83684b)
feat: warn when requests get cancelled (#4064)
Add a simple disarmable dropguard to log if a request is cancelled before it is completed. We currently don't have this, and it makes it difficult to know when the request was dropped.
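A disarmable drop guard can be sketched as follows. `CancellationGuard` is a hypothetical name, and the shared `Rc<Cell<bool>>` flag exists only so the effect can be observed; the change described above only logs.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical guard: reports a cancellation if it is dropped while armed.
struct CancellationGuard {
    armed: bool,
    cancelled: Rc<Cell<bool>>,
}

impl CancellationGuard {
    fn new(cancelled: Rc<Cell<bool>>) -> Self {
        CancellationGuard { armed: true, cancelled }
    }

    /// Call when the request has completed normally. The guard is consumed
    /// and its `Drop` then sees `armed == false`.
    fn disarm(mut self) {
        self.armed = false;
    }
}

impl Drop for CancellationGuard {
    fn drop(&mut self) {
        if self.armed {
            eprintln!("request was cancelled before it completed");
            self.cancelled.set(true);
        }
    }
}
```

The guard is created at the start of a request handler and disarmed on the last line; if the handler's future is dropped anywhere in between, the message is logged.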
(commit 4911d7c)
add gauge for in-flight layer uploads (#3951)
For the "worst-case /storage usage panel", we need to compute
```
remote size + local-only size
```
We currently don't have a metric for local-only layers. The number of in-flight layers in the upload queue is just that, so let Prometheus scrape it. The metric is two counters (started and finished). The delta is the amount of in-flight uploads in the queue. The metrics are incremented in the respective `call_unfinished_metric_*` functions. These track ongoing operations by file_kind and op_kind. We only need this metric for layer uploads, so there's the new RemoteTimelineClientMetricsCallTrackSize type that forces all call sites to decide whether they want the size tracked or not. If we find that other file_kinds or op_kinds (metadata uploads, layer downloads, layer deletes) are interesting, we can just enable them, and they'll be just another label combination within the metrics that this PR adds.
fixes #3922
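The "two counters, read the delta" idea can be sketched with atomics. The struct and method names are hypothetical; in the PR the counters are Prometheus metrics and the subtraction happens at query time.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Two monotonically increasing byte counters; their difference is the
/// number of bytes currently in flight in the upload queue.
#[derive(Default)]
struct UploadCounters {
    started_bytes: AtomicU64,
    finished_bytes: AtomicU64,
}

impl UploadCounters {
    fn upload_started(&self, size: u64) {
        self.started_bytes.fetch_add(size, Ordering::Relaxed);
    }

    fn upload_finished(&self, size: u64) {
        self.finished_bytes.fetch_add(size, Ordering::Relaxed);
    }

    /// Reads `finished` before `started` so a concurrent update cannot make
    /// the difference negative; saturating_sub guards the remaining cases.
    fn in_flight_bytes(&self) -> u64 {
        let finished = self.finished_bytes.load(Ordering::Relaxed);
        let started = self.started_bytes.load(Ordering::Relaxed);
        started.saturating_sub(finished)
    }
}
```

Keeping two counters rather than one gauge means each series only ever grows, which fits the usual Prometheus counter semantics.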
(commit fa20e37)
feat: add rough timings for basebackup (#4062)
just record the time needed for waiting for the lsn and then for the basebackup, in millis, in a log message. this is related to ongoing investigations into cold start performance. this could also be a counter. it cannot be added next to smgr histograms, because we don't want another histogram per timeline. the aim is to allow drilling deeper into which timelines were slow, and to understand why some need two basebackups.
(commit cb94739)
neon_local: fix `tenant create -c eviction_policy:...` (#4004)
And add a corresponding unit test. The fix is to use `.remove()` instead of `.get()` when processing the arguments hash map. The code uses emptiness of the hash map to determine whether all arguments have been processed. This was likely a copy-paste error.
refs #3942
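The bug pattern is easy to reproduce in isolation. In this sketch, `parse_tenant_settings` and the error text are hypothetical; only the `remove` versus `get` distinction and the emptiness check come from the commit message.

```rust
use std::collections::HashMap;

/// Consumes the known arguments and rejects whatever is left over.
fn parse_tenant_settings(mut args: HashMap<String, String>) -> Result<Option<String>, String> {
    // `remove` takes the entry out of the map. `get` would leave it behind,
    // and the leftover check below would then reject a valid argument.
    let eviction_policy = args.remove("eviction_policy");

    if !args.is_empty() {
        let mut unknown: Vec<String> = args.keys().cloned().collect();
        unknown.sort();
        return Err(format!("unrecognized arguments: {}", unknown.join(", ")));
    }
    Ok(eviction_policy)
}
```

Because every recognised key is removed as it is read, an empty map at the end reliably means that all arguments were processed.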
(commit dbbe032)
Deploy proxies for preview environments (#4052)
## Describe your changes
Deploy `main` proxies to the preview environments. We don't deploy storage there yet, as it's tricky.
## Issue ticket number and link
neondatabase/cloud#4737
(commit 78bbbcc)
fix: stop dead_code rustc lint (#4070)
only happens without `--all-features` which is what `./run_clippy.sh` uses.
(commit 7f80230)
(commit bfd45dd)
Login to ECR and Docker Hub at once (#4067)
- Update kaniko to 1.9.2 (from 1.7.0); the problem with reproducible builds is fixed
- Log in to ECR and Docker Hub at once, so we can push to several registries; this makes the `push-docker-hub` job unneeded
- `push-docker-hub` is replaced with `promote-images` in the `needs:` clause; pushing images to production ECR moved to the `promote-images` job
(commit 05ac0e2)
Enable OpenTelemetry tracing in proxy in staging. (#4065)
Depends on neondatabase/helm-charts#32
Co-authored-by: Lassi Pölönen <lassi.polonen@iki.fi>
(commit 8945fbd)
GitHub Workflows: Fix crane for several registries (#4076)
Follow-up fix after #4067
```
+ crane tag neondatabase/vm-compute-node-v14:3064 latest
Error: fetching "neondatabase/vm-compute-node-v14:3064": GET https://index.docker.io/v2/neondatabase/vm-compute-node-v14/manifests/3064: MANIFEST_UNKNOWN: manifest unknown; unknown tag=3064
```
I reverted to the previous approach for promoting images (log in to one registry, save images to the local fs, log out and log in to another registry, and push images from the local fs). It turns out that what works for one Google project (kaniko) doesn't work for another (crane) [sigh]
(commit 2d6fd72)
Commits on Apr 26, 2023
(commit 9d0cf08)
(commit f19b70b)
(commit 850f6b1)
build: remove busted sk-1.us-east-2 from staging hosts (#4082)
this should give us complete deployments while a new one is being brought up.
(commit 4625da3)
feat: log how long tenant activation takes (#4080)
Adds just a counter counting up from the creation of the tenant, logged after activation. Might help guide us with the investigation of #4025.
(commit 381c8fc)
Remove wait_for_sk_commit_lsn_to_reach_remote_storage.
It had a couple of inherent races:
1) Even if compute is killed before the call, some more data might still arrive at the safekeepers after commit_lsn on them is polled, advancing it. Then the checkpoint on the pageserver might not include this tail, and so the upload of the expected LSN won't happen until one more checkpoint.
2) commit_lsn is updated asynchronously: compute can commit a transaction before communicating commit_lsn to even a single safekeeper (sync-safekeepers can be used to force the advancement). This makes the semantics of wait_for_sk_commit_lsn_to_reach_remote_storage quite complicated.
Replace it with last_flush_lsn_upload, which
1) Learns the last flush LSN on compute;
2) Waits for it to arrive at the pageserver;
3) Checkpoints it;
4) Waits for the upload.
In some tests this keeps compute alive longer than before, but this doesn't seem to be important.
There is a chance this fixes #3209
(commit 31a3910)
(commit 11df2ee)