
Couldn't start compute #3405

Closed
vadim2404 opened this issue Jan 23, 2023 · 4 comments
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@vadim2404
Contributor

Steps to reproduce

Try to connect to our largest tenant on staging

Expected result

The compute is started

Actual result

      "id": "a94cfb85-d6f5-45ec-b74f-e5965270dfa9",
      "project_id": "calm-unit-272325",
      "branch_id": "br-little-math-565362",
      "endpoint_id": "ep-fragrant-glade-840764",
      "action": "start_compute",
      "status": "failed",
      "error": "could not execute operation: could not start the compute node: failed to get basebackup@47/27F4E7D8 from pageserver host=pageserver-2.us-east-2.aws.neon.build port=6400\n\nCaused by:\n    0: db error: ERROR: Timed out while waiting for WAL record at LSN 47/27F4E7D8 to arrive, last_record_lsn 47/208936D8 disk consistent LSN=47/19086BD8\n    1: ERROR: Timed out while waiting for WAL record at LSN 47/27F4E7D8 to arrive, last_record_lsn 47/208936D8 disk consistent LSN=47/19086BD8",
      "failures_count": 6,
      "retry_at": "2023-01-23T12:36:29Z",
      "created_at": "2023-01-23T12:25:10Z",
      "updated_at": "2023-01-23T12:33:59Z"
    }
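
For reference, the three LSNs in that error can be decoded to see how far behind the pageserver is. This is only a minimal illustrative sketch (not Neon code), assuming the usual Postgres HI/LO hex LSN notation:

// Minimal sketch (not Neon's code): decode the LSNs from the error above.
fn parse_lsn(s: &str) -> u64 {
    let (hi, lo) = s.split_once('/').expect("expected HI/LO hex format");
    (u64::from_str_radix(hi, 16).unwrap() << 32) | u64::from_str_radix(lo, 16).unwrap()
}

fn main() {
    let requested = parse_lsn("47/27F4E7D8");        // LSN the compute wants a basebackup at
    let last_record = parse_lsn("47/208936D8");      // last WAL record the pageserver has ingested
    let disk_consistent = parse_lsn("47/19086BD8");  // last LSN durably flushed to disk

    // The requested LSN is ahead of what the pageserver has received, so the
    // basebackup request waits for WAL to arrive and eventually times out.
    println!("ingest lag: {:.1} MiB", (requested - last_record) as f64 / (1 << 20) as f64);
    println!("flush lag:  {:.1} MiB", (requested - disk_consistent) as f64 / (1 << 20) as f64);
}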

Environment

staging

Logs, links

I got the same error while vacuuming this database.
This task is related to #3404.

@vadim2404 vadim2404 added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Jan 23, 2023
@vadim2404 vadim2404 changed the title Could start compute Couldn't start compute Jan 23, 2023
@problame
Contributor

Searching logs for this tenant:

2023-01-23T12:41:55.704913Z ERROR walreceiver_connection{id=8c9520708d8cce74f072a867f141c1b9/f15ae0cf21cce2ba27e4d80c6709a6cd node_id=60}: writeback of buffer EphemeralPage { file_id: 12, blkno: 12574 } failed: failed to write back to ephemeral file at /storage/pageserver/data/tenants/8c9520708d8cce74f072a867f141c1b9/timelines/f15ae0cf21cce2ba27e4d80c6709a6cd/ephemeral-12 error: No space left on device (os error 28)

So, the pageserver's /storage is full again.

It grew from 91.7% to 100% disk usage at 9:28 UTC today, see https://neonprod.grafana.net/d/JKnWhTO7z/zenith-pageserver?orgId=1&var-cluster=victoria[…]-east-2.aws.neon.build&from=1674458272063&to=1674479872063

Do you want me to root-cause why the disk usage grew, or is that enough explanation for the error?

@vadim2404
Contributor Author

I ran the following queries, which can affect the disk size:

CREATE TABLE x (x INT);
INSERT INTO x SELECT * FROM generate_series(1, 100000);
SELECT SUM(x) FROM x;
DROP TABLE x;
VACUUM;

And I received the error while vacuuming the database.

@problame
Contributor

Yeah, that'd be expected. Just in case it's not obvious: even though the SQL frees up logical space, this doesn't free up any space on the pageserver; in fact, it consumes more space until the PITR window has passed and compaction comes around to generate image layers. And even compaction needs more space before it can delete the delta layers.
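
To make that asymmetry concrete, here is a toy space-accounting model (purely illustrative, not the pageserver's actual data structures): a delta layer only becomes reclaimable once it is outside the PITR window and covered by an image layer, and writing that image layer itself costs additional space first.

// Toy model, not pageserver internals: when can a delta layer be deleted?
struct DeltaLayer {
    size_bytes: u64,
    within_pitr_window: bool, // still needed for point-in-time recovery
    covered_by_image: bool,   // an image layer has been materialized for its range
}

fn reclaimable_bytes(layers: &[DeltaLayer]) -> u64 {
    layers
        .iter()
        .filter(|l| !l.within_pitr_window && l.covered_by_image)
        .map(|l| l.size_bytes)
        .sum()
}

fn main() {
    // The WAL produced by DROP TABLE + VACUUM lands in fresh delta layers:
    let layers = vec![
        DeltaLayer { size_bytes: 256 << 20, within_pitr_window: false, covered_by_image: true },  // old history
        DeltaLayer { size_bytes: 512 << 20, within_pitr_window: true,  covered_by_image: false }, // fresh vacuum WAL
    ];
    // Only the old layer is deletable; the new one must be retained (and an
    // image layer written first, consuming even more space) before GC helps.
    println!("reclaimable: {} MiB", reclaimable_bytes(&layers) / (1 << 20));
}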

@shanyp
Contributor

shanyp commented Mar 23, 2023

This is by design and known behavior.

@shanyp shanyp closed this as completed Mar 23, 2023