Excessive tenant initial load times on large pageserver after restart #4025

Closed

koivunej opened this issue Apr 14, 2023 · 6 comments

Labels: c/storage/pageserver (Component: storage: pageserver), t/investigation (Needs further investigation)

Comments

@koivunej (Member) commented Apr 14, 2023

Internal slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1681409913535909 by @arssher, who brought this up.

Tenant load is generally assumed to be fast, but recently two tenants took multiple times longer than page_service.rs's get_active_tenant_with_timeout, called from:

let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?;

Quick log analysis revealed no obvious reason. The working assumption is that we were busy, perhaps even with blocking work, since the initial logical size calculations had most likely already started at this point (re: #2975).

Post-analysis ideas so we'd learn about cases like this early (a sketch of the first idea follows this list):

  • @koivunej thought of a "load takes more than a hardcoded number of seconds => warn!"
  • @problame suggested an alert on long load times
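
A minimal sketch of the first idea, assuming a hypothetical threshold and wrapper function (neither is from the actual pageserver code):

```rust
use std::time::{Duration, Instant};
use tracing::warn;

/// Hypothetical threshold; the comment above only says "hardcoded seconds".
const SLOW_LOAD_WARN_THRESHOLD: Duration = Duration::from_secs(30);

/// Wrap a tenant-load future and warn if it ran longer than the threshold.
async fn load_with_slow_warning<F, T>(tenant_id: &str, load: F) -> T
where
    F: std::future::Future<Output = T>,
{
    let started = Instant::now();
    let result = load.await;
    let elapsed = started.elapsed();
    if elapsed > SLOW_LOAD_WARN_THRESHOLD {
        warn!(%tenant_id, ?elapsed, "tenant load took longer than expected");
    }
    result
}
```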
@koivunej koivunej added the c/storage/pageserver Component: storage: pageserver label Apr 14, 2023
@koivunej (Member Author)

Aside: we log so much during startup and shutdown that it's probably worth looking into as well.

@koivunej koivunej added the t/investigation Needs further investigation label Apr 14, 2023
koivunej added a commit that referenced this issue Apr 26, 2023
koivunej added a commit that referenced this issue Apr 26, 2023
Adds just a counter of the time from the creation of the tenant, logged
after activation. Might help guide the investigation of #4025.
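
A rough sketch of what such a counter could look like; the struct and field names are illustrative, not the actual pageserver types:

```rust
use std::time::Instant;
use tracing::info;

struct Tenant {
    /// Captured when the Tenant object is constructed during load.
    created_at: Instant,
}

impl Tenant {
    fn new() -> Self {
        Tenant { created_at: Instant::now() }
    }

    fn activate(&self) {
        // ... activation work elided ...

        // Log how long it took from creation to activation.
        info!(
            elapsed_ms = self.created_at.elapsed().as_millis() as u64,
            "tenant activated"
        );
    }
}
```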
@koivunej (Member Author)

There seems to be a lot of fluctuation; a tenant with 10 timelines took over 100s to activate, this time only 70s.

@problame (Contributor)

I know you believe it's not the concurrency limiter semaphore, but we could rule it out by introducing metrics that measure it.

I think you believe the root cause is the synchronous I/O that we do. My #4215 might fix this.
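
For reference, a sketch of how the semaphore wait time could be measured; the metric name and helper are hypothetical, using the `prometheus` and `once_cell` crates:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};
use tokio::sync::{Semaphore, SemaphorePermit};

// Hypothetical metric; not an existing pageserver metric.
static CONCURRENCY_LIMITER_WAIT_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "concurrency_limiter_wait_seconds",
        "Time spent waiting for a concurrency-limiter permit"
    )
    .expect("failed to register histogram")
});

/// Acquire a permit while recording how long the acquisition waited.
async fn acquire_measured(limiter: &Semaphore) -> SemaphorePermit<'_> {
    let started = std::time::Instant::now();
    let permit = limiter.acquire().await.expect("semaphore closed");
    CONCURRENCY_LIMITER_WAIT_SECONDS.observe(started.elapsed().as_secs_f64());
    permit
}
```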

@koivunej (Member Author)

At least over the past weeks I have been thinking it could be the S3 semaphore, which is why I had hoped to continue on that for one or two weeks. I did not deprioritize it.

I still don't agree that we can just add permits, because it's trivial to go over the RPS limits, and recently every case of exceeding AWS limits has resulted in unclear failures.

I would very much like to add the task metrics, but they will require a different set of buckets if we get the spawn_blocking change in.

koivunej added a commit that referenced this issue May 29, 2023
Startup can take a long time. We suspect it's the initial logical size
calculations. The long-term solution is to not block the tokio executors but
to do most of the I/O in spawn_blocking.

See: #4025, #4183

Short-term solution to the above:

- Delay global background tasks until initial tenant loads complete
- Just limit how many initial logical size calculations we can have at the
same time to `cores / 2` (see the sketch after this commit message)

This PR is for trying in staging.
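
A minimal sketch of the `cores / 2` limit described in the commit message above, assuming illustrative names rather than the actual pageserver code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Build a limiter allowing at most `cores / 2` concurrent initial logical
/// size calculations (but never fewer than one permit).
fn initial_logical_size_limiter() -> Arc<Semaphore> {
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    Arc::new(Semaphore::new(std::cmp::max(1, cores / 2)))
}

async fn initial_logical_size_calculation(limiter: Arc<Semaphore>) {
    // Queue here until one of the `cores / 2` slots frees up.
    let _permit = limiter.acquire_owned().await.expect("semaphore closed");

    // ... walk the timeline's layers and compute the logical size ...
}
```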
@koivunej (Member Author) commented Aug 4, 2023

There's a fresher duplicate, #4183, and maybe an epic as well.

Fixes are in place (#4399) and are working, at best giving us 1 ms per timeline, but at worst much more. Added #4892 so we get an understanding of how long things take.

Closing this to focus on #4183.

@koivunej koivunej closed this as completed Aug 4, 2023