Release 2023-04-26 #4086

Closed
wants to merge 60 commits into from

Commits on Apr 11, 2023

  1. Add more proxy cnames

    kelvich committed Apr 11, 2023
    de99ee2
  2. [compute_ctl] Add timeout for tracing_utils::shutdown_tracing() (#3982)
    
    Shutting down OTEL tracing provider may hang for quite some time, see,
    for example:
    - open-telemetry/opentelemetry-rust#868
    - and our problems with staging
    neondatabase/cloud#3707 (comment)
    
    Yet, we want computes to shut down fast enough, as we may need a new one
    for the same timeline ASAP. So wait no longer than 2s for the shutdown
    to complete, then just error out and exit the main thread.
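    
    The bounded wait described above can be sketched with the standard library
    alone. This is only an illustration: `run_with_timeout` is a hypothetical
    helper, and the closure stands in for the real tracing shutdown call, which
    may be wired up differently in `compute_ctl`.
    
    ```rust
    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;
    
    /// Runs `shutdown` on a helper thread and waits at most `timeout` for it.
    /// Returns `true` if it finished in time, `false` if we gave up waiting.
    fn run_with_timeout<F>(shutdown: F, timeout: Duration) -> bool
    where
        F: FnOnce() + Send + 'static,
    {
        let (done_tx, done_rx) = mpsc::channel::<()>();
        thread::spawn(move || {
            shutdown();
            // The receiver may already be gone if the caller timed out; ignore that.
            let _ = done_tx.send(());
        });
        // `recv_timeout` returns an error on timeout (or if the helper panicked).
        done_rx.recv_timeout(timeout).is_ok()
    }
    ```
    
    A caller would pass `Duration::from_secs(2)` and treat a `false` result as
    the signal to error out and exit instead of waiting further.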
    
    Related to neondatabase/cloud#3707
    ololobus authored Apr 11, 2023
    40a68e9
  3. Support aarch64 in walredo seccomp code (#3996)

    Aarch64 doesn't implement some old syscalls like open and select. Use
    openat instead of open to check if seccomp is supported. Leave both
    select and pselect6 in the allowlist since we don't call select syscall
    directly and may hope that libc will call pselect6 on aarch64.
    
    To check whether a syscall is supported, one can use
    `scmp_sys_resolver` from the seccomp package:
    
    ```
    > apt install seccomp
    > scmp_sys_resolver -a x86_64 select
    23
    > scmp_sys_resolver -a aarch64 select
    -10101
    > scmp_sys_resolver -a aarch64 pselect6
    72
    ```
    
    Negative value means that syscall is not supported.
    
    Another cross-check is to look up the actual syscall table in
    `unistd.h`. To resolve all the macros, one can use `gcc -E`, as is
    done in the `dump_sys_aarch64()` function in
    libseccomp/src/arch-syscall-validate.
    
    ---------
    
    Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
    kelvich and hlinnaka authored Apr 11, 2023
    3c9f42a
  4. Refactor 'spec' in ComputeState.

    Sometimes, it contained real values, sometimes just defaults if the
    spec was not received yet. Make the state more clear by making it an
    Option instead.
    
    One consequence is that if some of the required settings like
    neon.tenant_id are missing from the spec file sent to the /configure
    endpoint, it is spotted earlier and you get an immediate HTTP error
    response. Not that it matters very much, but it's nicer nevertheless.
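    
    A minimal sketch of the idea, assuming heavily simplified types: the
    `ComputeSpec` and `ComputeState` shown here are hypothetical stand-ins with a
    single field, and `configure` is an illustrative function, not the actual
    `/configure` handler.
    
    ```rust
    /// Hypothetical, simplified stand-in for the compute spec.
    #[derive(Debug, Clone, PartialEq)]
    struct ComputeSpec {
        tenant_id: String,
    }
    
    /// `None` means "no spec received yet" instead of a struct full of defaults.
    #[derive(Debug, Default)]
    struct ComputeState {
        spec: Option<ComputeSpec>,
    }
    
    /// Validates the incoming spec up front, so a missing required setting is
    /// reported immediately instead of surfacing later during startup.
    fn configure(state: &mut ComputeState, tenant_id: Option<&str>) -> Result<(), String> {
        let tenant_id =
            tenant_id.ok_or_else(|| "neon.tenant_id is missing from the spec".to_string())?;
        state.spec = Some(ComputeSpec {
            tenant_id: tenant_id.to_string(),
        });
        Ok(())
    }
    ```
    
    With the `Option`, code that needs the spec has to handle the "not received
    yet" case explicitly rather than silently reading default values.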
    hlinnaka committed Apr 11, 2023
    6064a26

Commits on Apr 12, 2023

  1. Use Lsn, TenantId, TimelineId types in compute_ctl.

    Stronger types are generally nicer.
    hlinnaka committed Apr 12, 2023
    ef68321
  2. 8ace7a7
  3. Tolerate missing 'operation_uuid' field in spec file.

    'compute_ctl' doesn't use the operation_uuid for anything, it just prints
    it to the log.
    hlinnaka committed Apr 12, 2023
    06ce83c
  4. Move walreceiver start and stop behind a struct (#3973)

    This PR replaces the module-level, function-based walreceiver interface with a
    `WalReceiver` struct that exposes a few public methods: `new`, `start`
    and `stop`.
    
    Later, the same struct is planned to be used for getting walreceiver
    stats (and, maybe, other extra data) to display during missing wal
    errors for #2106
    
    Now though, the change required extra logic changes:
    
    * due to the `WalReceiver` struct added, it became easier to pass `ctx`
    and later do a `detached_child` instead of
    
    https://github.com/neondatabase/neon/blob/bfee4127014022a43bd85bccb562ed4bc62dc075/pageserver/src/tenant/timeline.rs#L1379-L1381
    
    * `WalReceiver::start` which is now the public API to start the
    walreceiver, could return an `Err` which now may turn a tenant into
    `Broken`, same as the timeline that it tries to load during startup.
    
    * `WalReceiverConf` was added to group walreceiver parameters from
    pageserver's tenant config
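    
    The shape of such an interface might look roughly like the sketch below.
    Everything in it is an assumption made for illustration: the
    `connect_timeout` field, the synchronous methods and the error type are
    hypothetical and do not reproduce the pageserver's real signatures.
    
    ```rust
    use std::time::Duration;
    
    /// Hypothetical parameter group; the single field is illustrative only.
    #[derive(Debug, Clone)]
    struct WalReceiverConf {
        connect_timeout: Duration,
    }
    
    struct WalReceiver {
        conf: WalReceiverConf,
        started: bool,
    }
    
    impl WalReceiver {
        fn new(conf: WalReceiverConf) -> Self {
            WalReceiver { conf, started: false }
        }
    
        /// Starting can fail; the caller decides what an `Err` means
        /// (per the commit message, it may turn the tenant `Broken`).
        fn start(&mut self) -> Result<(), String> {
            if self.started {
                return Err("walreceiver is already started".to_string());
            }
            if self.conf.connect_timeout.is_zero() {
                return Err("connect_timeout must be non-zero".to_string());
            }
            self.started = true;
            Ok(())
        }
    
        fn stop(&mut self) {
            self.started = false;
        }
    }
    ```
    
    Grouping the parameters in one conf struct keeps `new` stable when more
    settings are added later.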
    Kirill Bulatov authored Apr 12, 2023
    d8939d4
  5. Update most of the dependencies to their latest versions (#3991)

    All non-trivial updates are extracted into separate commits; the `cargo
    hakari` data and its manifest format were also updated.
    
    3 sets of crates remain unupdated:
    
    * `base64`: touches proxy in a lot of places and changed its API
    substantially between our version (0.13) and 0.21.
    * `opentelemetry` and `opentelemetry-*` crates
    
    ```
    error[E0308]: mismatched types
      --> libs/tracing-utils/src/http.rs:65:21
       |
    65 |     span.set_parent(parent_ctx);
       |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
       |          |
       |          arguments to this method are incorrect
       |
       = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
    note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
      --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
       |
    77 | pub struct Context {
       | ^^^^^^^^^^^^^^^^^^
    note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
      --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
       |
    77 | pub struct Context {
       | ^^^^^^^^^^^^^^^^^^
       = note: perhaps two different versions of crate `opentelemetry_api` are being used?
    note: associated function defined here
      --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
       |
    43 |     fn set_parent(&self, cx: Context);
       |        ^^^^^^^^^^
    
    For more information about this error, try `rustc --explain E0308`.
    error: could not compile `tracing-utils` due to previous error
    warning: build failed, waiting for other jobs to finish...
    error: could not compile `tracing-utils` due to previous error
    ```
    
    `tracing-opentelemetry` version `0.19`, which is supposed to have the
    update we need, is not yet released.
    
    * similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener`
    crates have the same kind of issue:
    
    ```
    error[E0308]: mismatched types
       --> libs/postgres_backend/tests/simple_select.rs:112:78
        |
    112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
        |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
        |                                |
        |                                arguments to this function are incorrect
        |
        = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
    note: struct `ClientConfig` is defined in crate `rustls`
       --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
        |
    125 | pub struct ClientConfig {
        | ^^^^^^^^^^^^^^^^^^^^^^^
    note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
       --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
        |
    91  | pub struct ClientConfig {
        | ^^^^^^^^^^^^^^^^^^^^^^^
        = note: perhaps two different versions of crate `rustls` are being used?
    note: associated function defined here
       --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
        |
    23  |     pub fn new(config: ClientConfig) -> Self {
        |            ^^^
    
    For more information about this error, try `rustc --explain E0308`.
    error: could not compile `postgres_backend` due to previous error
    warning: build failed, waiting for other jobs to finish...
    ```
    
    * aws crates: I could not make the new API work with bucket endpoint
    overload, and console e2e tests failed.
    Our other tests passed; further investigation is worth doing in
    #4008
    Kirill Bulatov authored Apr 12, 2023
    a64044a
  6. Add support for ip4r extension

    samgaw authored and vadim2404 committed Apr 12, 2023
    8d29578
  7. 218062c
  8. c94b899
  9. 13e53e5
  10. f7995b3
  11. Add support for non-SNI case in multi-cert proxy

    When no SNI is provided use the default certificate, otherwise we can't
    get to the options parameter which can be used to set endpoint name too.
    That means that non-SNI flow will not work for CNAME domains in verify-full
    mode.
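    
    The selection rule can be sketched as follows. `CertStore` and its fields
    are hypothetical names, certificates are represented as plain strings, and
    returning `None` for an unknown SNI hostname is an assumption of this sketch
    rather than something the commit message states.
    
    ```rust
    use std::collections::HashMap;
    
    /// Hypothetical certificate store: one default plus per-hostname entries.
    struct CertStore {
        default_cert: String,
        by_hostname: HashMap<String, String>,
    }
    
    impl CertStore {
        /// Picks the certificate for a TLS handshake.
        /// Without SNI the hostname is unknown, so fall back to the default.
        fn select(&self, sni: Option<&str>) -> Option<&str> {
            match sni {
                None => Some(self.default_cert.as_str()),
                Some(name) => self.by_hostname.get(name).map(String::as_str),
            }
        }
    }
    ```
    
    Because the default certificate cannot match a CNAME domain, a client using
    verify-full without SNI would reject it, which is the limitation noted above.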
    kelvich committed Apr 12, 2023
    5d0ecad

Commits on Apr 13, 2023

  1. Add check for duplicates of generated image layers (#3869)

    ## Describe your changes
    
    ## Issue ticket number and link
    
    #3673
    
    ## Checklist before requesting a review
    - [ ] I have performed a self-review of my code.
    - [ ] If it is a core feature, I have added thorough tests.
    - [ ] Do we need to implement analytics? if so did you add the relevant
    metrics to the dashboard?
    - [ ] If this PR requires public announcement, mark it with
    /release-notes label and add several sentences in this section.
    
    ---------
    
    Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
    knizhnik and hlinnaka authored Apr 13, 2023
    732acc5
  2. Add reason to TenantState::Broken (#3954)

    The Broken state now carries a reason and a backtrace; the backtrace is collected automatically when a tenant enters the broken state.
    
    The format for API, CLI and metrics is changed and unified to return the tenant state name in camel case. Previously, snake case was used for metrics and camel case for everything else. The tenant state field in the TenantInfo swagger spec now contains the state name in a "slug" field and any other fields (currently only the reason and backtrace of the Broken variant) in a "data" field. To allow for this breaking change, state was removed from the TenantInfo swagger spec because it was not used anywhere.
    
    Please note that the tenant's broken reason is not persisted on disk so the reason is lost when pageserver is restarted.
    
    Requires changes to grafana dashboard that monitors tenant states.
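    
    A rough sketch of a state enum with this shape, using only the standard
    library. The two-variant enum and the `broken`, `slug` and `data` helpers
    are hypothetical; the real `TenantState` has more variants and is
    serialized differently.
    
    ```rust
    use std::backtrace::Backtrace;
    
    /// Simplified, hypothetical mirror of the tenant state described above.
    #[derive(Debug)]
    enum TenantState {
        Active,
        Broken { reason: String, backtrace: String },
    }
    
    impl TenantState {
        /// Enters the broken state, capturing a backtrace at the call site.
        fn broken(reason: impl Into<String>) -> Self {
            TenantState::Broken {
                reason: reason.into(),
                backtrace: Backtrace::force_capture().to_string(),
            }
        }
    
        /// State name, as used for the "slug" field.
        fn slug(&self) -> &'static str {
            match self {
                TenantState::Active => "Active",
                TenantState::Broken { .. } => "Broken",
            }
        }
    
        /// Extra payload for the "data" field; only `Broken` carries any.
        fn data(&self) -> Option<(&str, &str)> {
            match self {
                TenantState::Active => None,
                TenantState::Broken { reason, backtrace } => {
                    Some((reason.as_str(), backtrace.as_str()))
                }
            }
        }
    }
    ```
    
    Since the reason lives only in this in-memory value, it disappears on
    restart, matching the note above.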
    
    Closes #3001
    
    ---------
    
    Co-authored-by: theirix <theirix@gmail.com>
    LizardWizzard and theirix authored Apr 13, 2023
    15d1f85
  3. c237a2f
  4. Add note about manual_release_instructions label (#4015)

    ## Describe your changes
    Do not forget to process required manual stuff after release
    
    ---------
    
    Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
    vadim2404 and LizardWizzard authored Apr 13, 2023
    356439a
  5. Rename "Postgres nodes" in control_plane to endpoints.

    We use the term "endpoint" for compute Postgres nodes in the web UI
    and user-facing documentation now. Adjust the nomenclature in the code.
    
    This changes the name of the "neon_local pg" command to "neon_local
    endpoint". Also adjust names of classes, variables etc. in the python
    tests accordingly.
    
    This also changes the directory structure so that endpoints are now
    stored in:
    
        .neon/endpoints/<endpoint id>
    
    instead of:
    
        .neon/pgdatadirs/tenants/<tenant_id>/<endpoint (node) name>
    
    The tenant ID is no longer part of the path. That means that you
    cannot have two endpoints with the same name/ID in two different
    tenants anymore. That's consistent with how we treat endpoints in the
    real control plane and proxy: the endpoint ID must be globally unique.
    hlinnaka committed Apr 13, 2023
    53f438a
  6. Tenant size should never be zero. Simplify test.

    Looking at the git history of this test, I think "size == 0" used to
    have a special meaning earlier, but now it should never happen.
    hlinnaka committed Apr 13, 2023
    89b5589
  7. Verify extensions checksums (#4014)

    To not be taken by surprise by upstream git re-tag or by malicious activity,
    let's verify the checksum for extensions we download
    
    Also, unify the installation of `pg_graphql` and `pg_tiktoken` 
    with other extensions.
    bayandin authored Apr 13, 2023
    36c2094
  8. [compute_ctl] Implement live reconfiguration (#3980)

    With this commit one can request compute reconfiguration
    from the running `compute_ctl` with compute in `Running` state
    by sending a new spec:
    ```shell
    curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
    ```
    
    Internally, we start a separate configurator thread that is waiting on
    `Condvar` for `ConfigurationPending` compute state in a loop. Then it does
    reconfiguration, sets compute back to `Running` state and notifies other
    waiters.
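    
    The handshake between the requester and the configurator thread can be
    sketched with `std::sync::Condvar`. This is a simplified illustration: the
    two-variant `ComputeStatus`, the `Shared` alias and both functions are
    hypothetical, and the configurator here handles a single request and exits
    instead of looping forever.
    
    ```rust
    use std::sync::{Arc, Condvar, Mutex};
    use std::thread;
    
    #[derive(Debug, Clone, Copy, PartialEq)]
    enum ComputeStatus {
        Running,
        ConfigurationPending,
    }
    
    type Shared = Arc<(Mutex<ComputeStatus>, Condvar)>;
    
    /// Spawns a configurator thread that serves one pending reconfiguration.
    fn spawn_configurator(shared: Shared) -> thread::JoinHandle<()> {
        thread::spawn(move || {
            let (lock, cvar) = &*shared;
            let mut status = lock.lock().unwrap();
            // Sleep until someone requests a reconfiguration.
            while *status != ComputeStatus::ConfigurationPending {
                status = cvar.wait(status).unwrap();
            }
            // (the actual reconfiguration work would happen here)
            *status = ComputeStatus::Running;
            cvar.notify_all();
        })
    }
    
    /// Requests a reconfiguration and blocks until the state is `Running` again.
    fn request_reconfigure(shared: &Shared) {
        let (lock, cvar) = &**shared;
        let mut status = lock.lock().unwrap();
        *status = ComputeStatus::ConfigurationPending;
        cvar.notify_all();
        while *status != ComputeStatus::Running {
            status = cvar.wait(status).unwrap();
        }
    }
    ```
    
    Both sides re-check the state in a `while` loop around `wait`, which guards
    against spurious wakeups and against notifications meant for other waiters.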
    
    It will need some follow-ups, e.g. for retry logic for control-plane
    requests, but should be useful for testing in the current state. This
    shouldn't affect any existing environment, since computes are configured
    in a different way there.
    
    Resolves neondatabase/cloud#4433
    ololobus authored Apr 13, 2023
    db8dd6f
  9. Make proxy shutdown when all connections are closed (#3764)

    ## Describe your changes
    Makes Proxy start draining connections on SIGTERM.
    ## Issue ticket number and link
    #3333
    Sasha Krassovsky authored Apr 13, 2023
    fd31faf
  10. b6c7c32

Commits on Apr 14, 2023

  1. make evictions_low_residence_duration_metric_threshold per-tenant (#3949)
    
    Before this patch, if a tenant would override its eviction_policy
    setting to use a lower LayerAccessThreshold::threshold than the
    `evictions_low_residence_duration_metric_threshold`, the evictions done
    for that tenant would count towards the
    `evictions_with_low_residence_duration` metric.
    
    That metric is used to identify premature evictions, commonly triggered
    by disk-usage-based eviction under disk pressure.
    
    We don't want that to happen for the legitimate evictions of the tenant
    that overrides its eviction_policy.
    
    So, this patch
    - moves the setting into TenantConf
    - adds test coverage
    - updates the staging & prod yamls
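    
    The effect of making the threshold per-tenant can be sketched like this.
    The `TenantConfOpt` struct below is a hypothetical one-field stand-in, and
    both helper functions are illustrative names, not the pageserver's code.
    
    ```rust
    use std::time::Duration;
    
    /// Hypothetical per-tenant settings; `None` means "not overridden".
    #[derive(Default)]
    struct TenantConfOpt {
        evictions_low_residence_duration_metric_threshold: Option<Duration>,
    }
    
    /// Resolves the threshold for one tenant: its override, else the default.
    fn low_residence_threshold(tenant: &TenantConfOpt, default: Duration) -> Duration {
        tenant
            .evictions_low_residence_duration_metric_threshold
            .unwrap_or(default)
    }
    
    /// An eviction counts towards the metric only if the layer was resident
    /// for less than the tenant's effective threshold.
    fn counts_as_low_residence(residence: Duration, threshold: Duration) -> bool {
        residence < threshold
    }
    ```
    
    A tenant that deliberately evicts after a short residence can lower its own
    threshold, so its legitimate evictions stop inflating the metric.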
    
    Forward Compatibility:
    Software before this patch will ignore the new tenant conf field and use
    the global one instead.
    So we can roll back safely.
    
    Backward Compatibility:
    Parsing old configs with software as of this patch will fail in
    `PageServerConf::parse_and_validate` with error 
    `unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'`
    if the option is still present in the global section.
    We deal with this by updating the configs in Ansible.
    
    fixes #3940
    problame authored Apr 14, 2023
    8895f28
  2. 0c82ff3
  3. [compute_ctl] Do not create availability checker data on each start (#4019)
    
    Initially, the idea was to ensure that when we come to check data
    availability, the special service table already contains one row. So if we
    lose it for some reason, we will error out.
    
    Yet, to do the availability check we have to start compute first anyway! So it
    doesn't really add any value, but it affects each compute start, as we
    update at least one row in the database. This also writes some WAL, so
    if the timeline is close to `neon.max_cluster_size` it could prevent compute
    from starting up.
    
    That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the
    `/check_writability` handler.
    ololobus authored Apr 14, 2023
    589cf1e
  4. 017d3a3
  5. 75ea810
  6. 5ffa20d
  7. Update most of the dependencies to their latest versions (#4026)

    See #3991
    
    Brings the changes back with the right way to use new `toml_edit` to
    deserialize values and a unit test for this.
    
    Kirill Bulatov authored Apr 14, 2023
    ebea298

Commits on Apr 16, 2023

  1. c2496c7

Commits on Apr 17, 2023

  1. Send AppendResponse keepalive once per second (#4036)

    Walproposer sends AppendRequest at least once per second. This patch
    adds a response to these requests once per second.
    
    Fixes #4017
    petuhovskiy authored Apr 17, 2023
    73f34ea
  2. Add helm values for us-east-1

    fcdm committed Apr 17, 2023
    d8dd60d
  3. Add us-east-1 hosts file and update regions (#4042)

    fcdm authored Apr 17, 2023
    0c08356

Commits on Apr 18, 2023

  1. e2a5177
  2. f1b7dc4
  3. 0bfbae2
  4. fix vm-informant dbname: "neondb" -> "postgres" (#4046)

    Changes the vm-informant's postgres connection string's dbname from
    "neondb" (which sometimes doesn't exist) to "postgres" (which
    _hopefully_ should exist more often?).
    
    Currently there are a handful of VMs in prod that aren't working with
    autoscaling because they don't have the "neondb" database.
    
    The vm-informant doesn't require any database in particular; it's just
    connecting as `cloud_admin` to be able to adjust the file cache
    settings.
    sharnoff authored Apr 18, 2023
    02b28ae

Commits on Apr 21, 2023

  1. [compute_ctl] Improve 'empty' compute startup sequence (#4034)

    Make several attempts to get the spec from the control plane, retrying on network
    errors and all reasonable HTTP response codes. Do not hang waiting for
    the spec without confirmation from the control plane that the compute is known
    and is in the `Empty` state.
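    
    A bounded retry loop of this kind might look like the sketch below. The
    `FetchError` split into retryable and fatal errors, the helper name and the
    fixed delay are all assumptions for illustration; the commit message does
    not say which response codes are retried or how long the pauses are.
    
    ```rust
    use std::thread;
    use std::time::Duration;
    
    #[derive(Debug, PartialEq)]
    enum FetchError {
        /// Network errors and HTTP statuses worth retrying.
        Retryable(String),
        /// Errors where another attempt cannot help.
        Fatal(String),
    }
    
    /// Calls `fetch` up to `max_attempts` times, sleeping `delay` between tries.
    fn fetch_with_retries<T, F>(
        mut fetch: F,
        max_attempts: u32,
        delay: Duration,
    ) -> Result<T, FetchError>
    where
        F: FnMut() -> Result<T, FetchError>,
    {
        let mut last_err = FetchError::Fatal("no attempts were made".to_string());
        for attempt in 1..=max_attempts {
            match fetch() {
                Ok(value) => return Ok(value),
                // Give up immediately on errors that retrying cannot fix.
                Err(FetchError::Fatal(msg)) => return Err(FetchError::Fatal(msg)),
                Err(err) => {
                    last_err = err;
                    if attempt < max_attempts {
                        thread::sleep(delay);
                    }
                }
            }
        }
        Err(last_err)
    }
    ```
    
    Capping the number of attempts is what keeps the caller from hanging
    indefinitely when the control plane never confirms the compute.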
    
    Adjust the way we track `total_startup_ms` metric, it should be
    calculated since the moment we received spec, not from the moment
    `compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric
    to track the time spent sleeping and waiting for spec to be delivered
    from control-plane.
    
    Part of neondatabase/cloud#3533
    ololobus authored Apr 21, 2023
    7ba5c28

Commits on Apr 24, 2023

  1. Adding synthetic size to pageserver swagger (#4049)

    ## Describe your changes
    
    I added the synthetic size response to the console swagger. Now I am syncing
    it back to neon.
    duskpoet authored Apr 24, 2023
    afbbc61

Commits on Apr 25, 2023

  1. add libmetric metric for each logged log message (#4055)

    This patch extends the libmetrics logging setup functionality with a
    `tracing` layer that increments a Prometheus counter each time we log a
    log message. We have the counter per tracing event level. This allows
    for monitoring WARN and ERR log volume without parsing the log. Also, it
    would allow cross-checking whether logs got dropped on the way into
    Loki.
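    
    The per-level counting can be illustrated without any logging crate. Note
    that this sketch deliberately does not use the `tracing` or Prometheus
    APIs: `Level`, `LogCounters` and `on_event` are hypothetical stand-ins
    built on plain atomics, showing only the "one counter per level" idea.
    
    ```rust
    use std::sync::atomic::{AtomicU64, Ordering};
    
    /// Stand-in for the event level of a log message.
    #[allow(dead_code)]
    #[derive(Clone, Copy)]
    enum Level {
        Error = 0,
        Warn,
        Info,
        Debug,
        Trace,
    }
    
    /// One monotonically increasing counter per level.
    struct LogCounters {
        per_level: [AtomicU64; 5],
    }
    
    impl LogCounters {
        fn new() -> Self {
            LogCounters {
                per_level: std::array::from_fn(|_| AtomicU64::new(0)),
            }
        }
    
        /// Called once for every logged message.
        fn on_event(&self, level: Level) {
            self.per_level[level as usize].fetch_add(1, Ordering::Relaxed);
        }
    
        fn get(&self, level: Level) -> u64 {
            self.per_level[level as usize].load(Ordering::Relaxed)
        }
    }
    ```
    
    Comparing these counters with the number of lines that arrive in the log
    store is what would reveal dropped messages.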
    
    It would be nicer if we could hook deeper into the tracing logging
    layer, to avoid evaluating the filter twice.
    But I don't know how to do it.
    problame authored Apr 25, 2023
    e83684b
  2. feat: warn when requests get cancelled (#4064)

    Add a simple disarmable dropguard to log if a request is cancelled before
    it is completed. We currently don't have this, which makes it
    difficult to know when a request was dropped.
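    
    A disarmable drop guard can be sketched as below. The type name and the
    shared counter are hypothetical additions (the counter exists only so the
    behaviour is observable); the actual guard in the PR may log differently.
    
    ```rust
    use std::sync::atomic::{AtomicUsize, Ordering};
    use std::sync::Arc;
    
    /// Reports on drop unless it was disarmed first.
    struct RequestCancelGuard {
        armed: bool,
        cancelled_requests: Arc<AtomicUsize>,
    }
    
    impl RequestCancelGuard {
        fn new(cancelled_requests: Arc<AtomicUsize>) -> Self {
            RequestCancelGuard {
                armed: true,
                cancelled_requests,
            }
        }
    
        /// Call once the request has completed normally.
        fn disarm(mut self) {
            self.armed = false;
            // `self` is dropped here with `armed == false`, so nothing is logged.
        }
    }
    
    impl Drop for RequestCancelGuard {
        fn drop(&mut self) {
            if self.armed {
                eprintln!("request was dropped before completion");
                self.cancelled_requests.fetch_add(1, Ordering::Relaxed);
            }
        }
    }
    ```
    
    A handler creates the guard at the start and calls `disarm()` as its last
    step; if the handler's future is dropped midway, the guard is dropped while
    still armed and the warning is emitted.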
    koivunej authored Apr 25, 2023
    4911d7c
  3. add gauge for in-flight layer uploads (#3951)

    For the "worst-case /storage usage panel", we need to compute
    ```
    remote size + local-only size
    ```
    
    We currently don't have a metric for local-only layers.
    
    The number of in-flight layers in the upload queue is just that, so, let
    Prometheus scrape it.
    
    The metric is two counters (started and finished).
    The delta is the amount of in-flight uploads in the queue.
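    
    The "two counters, read the delta" idea can be sketched with plain atomics.
    `UploadCounters` and its methods are hypothetical names; the real metrics
    are Prometheus counters labelled by file_kind and op_kind.
    
    ```rust
    use std::sync::atomic::{AtomicU64, Ordering};
    
    /// Two monotonically increasing counters; neither is ever decremented.
    #[derive(Default)]
    struct UploadCounters {
        started: AtomicU64,
        finished: AtomicU64,
    }
    
    impl UploadCounters {
        fn upload_started(&self) {
            self.started.fetch_add(1, Ordering::Relaxed);
        }
    
        fn upload_finished(&self) {
            self.finished.fetch_add(1, Ordering::Relaxed);
        }
    
        /// In-flight uploads are the difference between the two counters.
        fn in_flight(&self) -> u64 {
            let started = self.started.load(Ordering::Relaxed);
            let finished = self.finished.load(Ordering::Relaxed);
            // Saturate so a racy read can never underflow.
            started.saturating_sub(finished)
        }
    }
    ```
    
    Exposing two counters rather than one gauge lets the monitoring side
    compute the delta itself and also derive rates from each counter.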
    
    The metrics are incremented in the respective `call_unfinished_metric_*`
    functions.
    These track ongoing operations by file_kind and op_kind.
    We only need this metric for layer uploads, so, there's the new
    RemoteTimelineClientMetricsCallTrackSize type that forces all call sites
    to decide whether they want the size tracked or not.
    If we find that other file_kinds or op_kinds are interesting (metadata
    uploads, layer downloads, layer deletes), we can just
    enable them, and they'll be just another label combination within the
    metrics that this PR adds.
    
    fixes #3922
    problame authored Apr 25, 2023
    fa20e37
  4. feat: add rough timings for basebackup (#4062)

    just record the time needed for waiting for the lsn and then the basebackup
    in a log message in millis. this is related to ongoing investigations into
    cold start performance.
    
    this could also be a counter. it cannot be added next to smgr
    histograms, because we don't want another histogram per timeline.
    
    the aim is to allow drilling deeper into which timelines were slow, and
    to understand why some need two basebackups.
    koivunej authored Apr 25, 2023
    cb94739
  5. neon_local: fix tenant create -c eviction_policy:... (#4004)

    And add corresponding unit test.
    
    The fix is to use `.remove()` instead of `.get()` when processing the
    arguments hash map.
    The code uses emptiness of the hash map to determine whether all
    arguments have been processed.
    This was likely a copy-paste error.
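
    The pattern can be sketched as follows. `TenantSettings`,
    `parse_settings`, and the two keys are illustrative assumptions, not the
    actual neon_local code. Each recognized key is taken out with
    `.remove()`, so anything left in the map afterwards is an unknown
    argument. With `.get()` the recognized key would stay in the map and the
    final emptiness check would reject a valid argument.

    ```rust
    use std::collections::HashMap;

    /// Hypothetical subset of per-tenant settings.
    #[derive(Debug, Default, PartialEq)]
    struct TenantSettings {
        eviction_policy: Option<String>,
        gc_period: Option<String>,
    }

    fn parse_settings(mut args: HashMap<&str, &str>) -> Result<TenantSettings, String> {
        // `.remove()` consumes each recognized key as it is processed.
        let settings = TenantSettings {
            eviction_policy: args.remove("eviction_policy").map(str::to_owned),
            gc_period: args.remove("gc_period").map(str::to_owned),
        };

        // Whatever is left was not recognized by any of the lines above.
        if !args.is_empty() {
            let mut unknown: Vec<&str> = args.keys().copied().collect();
            unknown.sort();
            return Err(format!("unrecognized arguments: {}", unknown.join(", ")));
        }
        Ok(settings)
    }
    ```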
    
        
    refs #3942
    problame authored Apr 25, 2023
    dbbe032
  6. Deploy proxies for preview environments (#4052)

    ## Describe your changes
    Deploy `main` proxies to the preview environments
    We don't deploy storage there yet, as it's tricky.
    
    ## Issue ticket number and link
    neondatabase/cloud#4737
    SergeyMelnikov authored Apr 25, 2023
    78bbbcc
  7. fix: stop dead_code rustc lint (#4070)

    only happens without `--all-features` which is what `./run_clippy.sh`
    uses.
    koivunej authored Apr 25, 2023
    7f80230
  8. bfd45dd
  9. Login to ECR and Docker Hub at once (#4067)

    - Update kaniko to 1.9.2 (from 1.7.0); the problem with reproducible builds is fixed
    - Log in to ECR and Docker Hub at once so we can push to several
    registries, which makes the `push-docker-hub` job unneeded
    - Replace `push-docker-hub` with `promote-images` in `needs:` clauses;
    pushing images to production ECR moved to the `promote-images` job
    bayandin authored Apr 25, 2023
    05ac0e2
  10. Enable OpenTelemetry tracing in proxy in staging. (#4065)

    Depends on neondatabase/helm-charts#32
    
    Co-authored-by: Lassi Pölönen <lassi.polonen@iki.fi>
    hlinnaka and lassizci authored Apr 25, 2023
    8945fbd
  11. GitHub Workflows: Fix crane for several registries (#4076)

    Follow-up fix after #4067
    
    ```
    + crane tag neondatabase/vm-compute-node-v14:3064 latest
    Error: fetching "neondatabase/vm-compute-node-v14:3064": GET https://index.docker.io/v2/neondatabase/vm-compute-node-v14/manifests/3064: MANIFEST_UNKNOWN: manifest unknown; unknown tag=3064
    ```
    
    I reverted to the previous approach for promoting images
    (log in to one registry, save images to the local fs, log out and log in
    to another registry, and push images from the local fs). It turns out
    that what works for one Google project (kaniko) doesn't work for another
    (crane) [sigh]
    bayandin authored Apr 25, 2023
    2d6fd72

Commits on Apr 26, 2023

  1. 9d0cf08
  2. f19b70b
  3. refactor: drop pageserver_ondisk_layers (#4071)

    I didn't get through #3775 fast enough so we wanted to remove this
    metric.
    
    Fixes #3705.
    koivunej authored Apr 26, 2023
    850f6b1
  4. build: remove busted sk-1.us-east-2 from staging hosts (#4082)

    this should give us complete deployments while a new one is being
    brought up.
    koivunej authored Apr 26, 2023
    4625da3
  5. feat: log how long tenant activation takes (#4080)

    Adds just a counter counting up from the creation of the tenant, logged
    after activation. Might help guide us with the investigation of #4025.
    koivunej authored Apr 26, 2023
    381c8fc
  6. Remove wait_for_sk_commit_lsn_to_reach_remote_storage.

    It had a couple of inherent races:
    
    1) Even if compute is killed before the call, some more data might still arrive
    to safekeepers after commit_lsn on them is polled, advancing it. Then checkpoint
    on pageserver might not include this tail, and so upload of expected LSN won't
    happen until one more checkpoint.
    
    2) commit_lsn is updated asynchronously -- compute can commit a transaction before
    communicating commit_lsn to even a single safekeeper (sync-safekeepers can be used
    to force the advancement). This makes the semantics of
    wait_for_sk_commit_lsn_to_reach_remote_storage quite complicated.
    
    Replace it with last_flush_lsn_upload which
    1) Learns last flush LSN on compute;
    2) Waits for it to arrive to pageserver;
    3) Checkpoints it;
    4) Waits for the upload.
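
    The four steps can be sketched as a single function over an abstract
    cluster. The `TestCluster` trait and every method name on it are
    hypothetical and exist only to show the ordering; they are not taken
    from the test suite.

    ```rust
    /// Hypothetical interface over the test cluster; names are illustrative only.
    trait TestCluster {
        /// Step 1: ask the compute for its last flush LSN.
        fn last_flush_lsn(&mut self) -> u64;
        /// Step 2: block until the pageserver has received WAL up to `lsn`.
        fn wait_for_last_record_lsn(&mut self, lsn: u64);
        /// Step 3: force a checkpoint on the pageserver.
        fn checkpoint(&mut self);
        /// Step 4: block until remote storage has caught up to `lsn`.
        fn wait_for_upload(&mut self, lsn: u64);
    }

    /// Runs the four steps in order and returns the LSN that was uploaded.
    fn last_flush_lsn_upload<C: TestCluster>(cluster: &mut C) -> u64 {
        let lsn = cluster.last_flush_lsn();
        cluster.wait_for_last_record_lsn(lsn);
        cluster.checkpoint();
        cluster.wait_for_upload(lsn);
        lsn
    }
    ```

    Because the LSN is read from the compute itself rather than polled from
    safekeepers, both races described above are avoided in this ordering.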
    
    In some tests this keeps compute alive longer than before, but this doesn't seem
    to be important.
    
    There is a chance this fixes #3209
    arssher committed Apr 26, 2023
    31a3910
  7. 11df2ee