Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(compute): Minor compute_ctl startup refactoring #11003

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ololobus
Copy link
Member

Problem

While debugging neondatabase/cloud#24882 I noticed that there is a significant amount of startup time (hundreds of ms) that is not attributed to any particular phase.

The current code structure makes it harder to add additional per-startup phase metrics.

Summary of changes

Refactor compute startup:

  1. Between getting spec from control plane and calling start_compute() we do not have any meaningful code except launching some aux threads.
  2. Move swap and disk quota configuration to a separate configure_vm() compute helper.
  3. Do not try to launch vm-monitor if Postgres startup failed and we are exiting anyway.

PR with more metrics will follow up, I do not want to pack too many changes in a single one.

@ololobus ololobus requested a review from sharnoff February 26, 2025 20:11
@ololobus ololobus requested a review from a team as a code owner February 26, 2025 20:11
@ololobus ololobus requested review from hlinnaka and MMeent February 26, 2025 20:11
@ololobus ololobus force-pushed the alexk/compute-ctl-refactor-start branch from af4745d to 04a934a Compare February 26, 2025 20:26
@ololobus ololobus force-pushed the alexk/compute-ctl-refactor-start branch from 04a934a to 6cfaa77 Compare February 26, 2025 20:55
@hlinnaka
Copy link
Contributor

@ololobus only noticed this now. I was working on similar refactorings, see PR #11007. I went much further, but it's the same general direction I think. I propose that we proceed with #11007. But if some issue with that arises in review or CI, we could fall back to this one.

@hlinnaka
Copy link
Contributor

PR with more metrics will follow up, I do not want to pack too many changes in a single one.

I'm curious to see what you have in mind. I've been playing with the OpenTelemetry traces again. It's great because it gives you a very clear visualization of where exactly the time is spent, but requires a bit of setup to collect the traces.

Copy link

7744 tests run: 7364 passed, 0 failed, 380 skipped (full report)


Flaky tests (3)

Postgres 17

Code coverage* (full report)

  • functions: 32.8% (8636 of 26362 functions)
  • lines: 48.6% (73085 of 150452 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
6cfaa77 at 2025-02-26T21:57:39.647Z :recycle:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants