
Couldn't start compute #3405

Closed
vadim2404 opened this issue Jan 23, 2023 · 4 comments
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@vadim2404
Contributor

Steps to reproduce

Try to connect to our largest tenant on staging

Expected result

The compute is started

Actual result

      "id": "a94cfb85-d6f5-45ec-b74f-e5965270dfa9",
      "project_id": "calm-unit-272325",
      "branch_id": "br-little-math-565362",
      "endpoint_id": "ep-fragrant-glade-840764",
      "action": "start_compute",
      "status": "failed",
      "error": "could not execute operation: could not start the compute node: failed to get basebackup@47/27F4E7D8 from pageserver host=pageserver-2.us-east-2.aws.neon.build port=6400\n\nCaused by:\n    0: db error: ERROR: Timed out while waiting for WAL record at LSN 47/27F4E7D8 to arrive, last_record_lsn 47/208936D8 disk consistent LSN=47/19086BD8\n    1: ERROR: Timed out while waiting for WAL record at LSN 47/27F4E7D8 to arrive, last_record_lsn 47/208936D8 disk consistent LSN=47/19086BD8",
      "failures_count": 6,
      "retry_at": "2023-01-23T12:36:29Z",
      "created_at": "2023-01-23T12:25:10Z",
      "updated_at": "2023-01-23T12:33:59Z"
    }
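
For reference, the three LSNs in that error can be decoded to see how far behind the pageserver is. This is only a minimal illustrative sketch (not Neon code), assuming the usual Postgres HI/LO hex LSN notation:

// Minimal sketch (not Neon's code): decode the LSNs from the error above.
fn parse_lsn(s: &str) -> u64 {
    let (hi, lo) = s.split_once('/').expect("expected HI/LO hex format");
    (u64::from_str_radix(hi, 16).unwrap() << 32) | u64::from_str_radix(lo, 16).unwrap()
}

fn main() {
    let requested = parse_lsn("47/27F4E7D8");        // LSN the compute wants a basebackup at
    let last_record = parse_lsn("47/208936D8");      // last WAL record the pageserver has ingested
    let disk_consistent = parse_lsn("47/19086BD8");  // last LSN durably flushed to disk

    // The requested LSN is ahead of what the pageserver has received, so the
    // basebackup request waits for WAL to arrive and eventually times out.
    println!("ingest lag: {:.1} MiB", (requested - last_record) as f64 / (1 << 20) as f64);
    println!("flush lag:  {:.1} MiB", (requested - disk_consistent) as f64 / (1 << 20) as f64);
}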

Environment

staging

Logs, links

I got the same error while vacuuming this database.
This task is related to #3404.

@vadim2404 vadim2404 added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Jan 23, 2023
@vadim2404 vadim2404 changed the title Could start compute Couldn't start compute Jan 23, 2023
@problame
Contributor

Searching logs for this tenant:

2023-01-23T12:41:55.704913Z ERROR walreceiver_connection{id=8c9520708d8cce74f072a867f141c1b9/f15ae0cf21cce2ba27e4d80c6709a6cd node_id=60}: writeback of buffer EphemeralPage { file_id: 12, blkno: 12574 } failed: failed to write back to ephemeral file at /storage/pageserver/data/tenants/8c9520708d8cce74f072a867f141c1b9/timelines/f15ae0cf21cce2ba27e4d80c6709a6cd/ephemeral-12 error: No space left on device (os error 28)

So, the pageserver's /storage is full again.

It grew from 91.7% to 100% disk usage at 9:28 UTC today, see https://neonprod.grafana.net/d/JKnWhTO7z/zenith-pageserver?orgId=1&var-cluster=victoria[…]-east-2.aws.neon.build&from=1674458272063&to=1674479872063

Do you want me to root-cause why the disk usage grew, or is that enough explanation for the error?

@vadim2404
Contributor Author

I ran the following queries, which can affect the disk size:

CREATE TABLE x (x INT);
INSERT INTO x SELECT * FROM generate_series(1, 100000);
SELECT SUM(x) FROM x;
DROP TABLE x;
VACUUM;

And I received the error while vacuuming the database.

@problame
Contributor

Yeah, that'd be expected. Just in case it's not obvious: even though the SQL frees up logical space, this doesn't free up any space on the pageserver; in fact, it consumes more space until the PITR window has passed and compaction comes around to generate image layers. And even compaction needs more space before it can delete the delta layers.
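
To make that asymmetry concrete, here is a toy space-accounting model (purely illustrative, not the pageserver's actual data structures): a delta layer only becomes reclaimable once it is outside the PITR window and covered by an image layer, and writing that image layer itself costs additional space first.

// Toy model, not pageserver internals: when can a delta layer be deleted?
struct DeltaLayer {
    size_bytes: u64,
    within_pitr_window: bool, // still needed for point-in-time recovery
    covered_by_image: bool,   // an image layer has been materialized for its range
}

fn reclaimable_bytes(layers: &[DeltaLayer]) -> u64 {
    layers
        .iter()
        .filter(|l| !l.within_pitr_window && l.covered_by_image)
        .map(|l| l.size_bytes)
        .sum()
}

fn main() {
    // The WAL produced by DROP TABLE + VACUUM lands in fresh delta layers:
    let layers = vec![
        DeltaLayer { size_bytes: 256 << 20, within_pitr_window: false, covered_by_image: true },  // old history
        DeltaLayer { size_bytes: 512 << 20, within_pitr_window: true,  covered_by_image: false }, // fresh vacuum WAL
    ];
    // Only the old layer is deletable; the new one must be retained (and an
    // image layer written first, consuming even more space) before GC helps.
    println!("reclaimable: {} MiB", reclaimable_bytes(&layers) / (1 << 20));
}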

@shanyp
Contributor

shanyp commented Mar 23, 2023

This is by design and known behavior.

@shanyp shanyp closed this as completed Mar 23, 2023