Excessive tenant initial load times on large pageserver after restart #4025
Aside note: we log so much during startup and shutdown that it's probably worth looking into as well.
Adds a counter that measures the time from tenant creation to activation, logged after activation. Might help guide the investigation of #4025.
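For illustration, a minimal sketch of what such a counter boils down to (the struct and field names here are hypothetical, not the pageserver's actual types): remember the creation instant and log the elapsed time once the tenant activates.

```rust
use std::time::Instant;

/// Hypothetical helper: records when the tenant object was created.
struct TenantTimings {
    created_at: Instant,
}

impl TenantTimings {
    fn new() -> Self {
        Self { created_at: Instant::now() }
    }

    /// Call this once the tenant transitions to the Active state.
    fn log_activation(&self, tenant_id: &str) {
        let elapsed = self.created_at.elapsed();
        tracing::info!(
            %tenant_id,
            elapsed_ms = elapsed.as_millis() as u64,
            "tenant activated"
        );
    }
}
```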
There seems to be a lot of fluctuation; a tenant with 10 timelines took over 100s to activate, this time only 70s.
I know you believe it's not the concurrency limiter semaphore, but we could rule it out by introducing metrics that measure it. I think you believe the root cause is the synchronous IO that we do; my #4215 might fix this.
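For illustration, a minimal sketch of such a measurement, assuming the limiter is a tokio `Semaphore` and using a prometheus `Histogram` as a stand-in for whatever metric type we would actually register: if the semaphore is the bottleneck, the observed wait times would show it.

```rust
use std::sync::Arc;
use std::time::Instant;

use tokio::sync::Semaphore;

/// Hypothetical wrapper: run `work` under the concurrency limiter and record
/// how long we waited for a permit.
async fn with_limited_concurrency<F, T>(
    limiter: Arc<Semaphore>,
    wait_seconds: prometheus::Histogram,
    work: F,
) -> T
where
    F: std::future::Future<Output = T>,
{
    let started = Instant::now();
    // If the semaphore is the bottleneck, this is where the time goes.
    let _permit = limiter.acquire().await.expect("semaphore closed");
    wait_seconds.observe(started.elapsed().as_secs_f64());
    work.await
}
```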
Over the past weeks I have been thinking it could be the S3 semaphore, which is why I had hoped to continue on that for another week or two; I did not deprioritize it. I still don't agree that we can just add permits, because it's trivial to go over the RPS limits, and recently every case of exceeding AWS limits has resulted in unclear failures. I would very much like to add the task metrics, but they will require a different set of buckets if we get the spawn_blocking change.
Startup can take a long time. We suspect it's the initial logical size calculations. The long-term solution is to not block the tokio executors but to do most of the I/O in spawn_blocking. See: #4025, #4183

Short-term solutions to the above (see the sketch below):
- Delay global background tasks until the initial tenant loads complete
- Limit how many initial logical size calculations we can run at the same time to `cores / 2`

This PR is for trying this out in staging.
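A minimal sketch of the short-term idea described above (the names are hypothetical, not the actual pageserver code): a semaphore sized to roughly half the available cores caps the concurrent calculations, and the synchronous I/O moves into spawn_blocking so it does not stall the tokio executor threads.

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;

/// Cap concurrent initial logical size calculations at `cores / 2` (at least 1).
fn logical_size_limiter() -> Arc<Semaphore> {
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(2);
    Arc::new(Semaphore::new(std::cmp::max(1, cores / 2)))
}

/// Hypothetical calculation entry point: acquire a permit, then do the
/// filesystem-heavy work off the executor threads.
async fn initial_logical_size(limiter: Arc<Semaphore>) -> anyhow::Result<u64> {
    let _permit = limiter.acquire_owned().await?;
    let size = tokio::task::spawn_blocking(|| -> anyhow::Result<u64> {
        // ... read layer files and sum up the logical size (placeholder) ...
        Ok(0)
    })
    .await??;
    Ok(size)
}
```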
Internal Slack thread (brought up by @arssher): https://neondb.slack.com/archives/C033RQ5SPDH/p1681409913535909
In general, load is assumed to be fast, but recently two tenants exceeded the timeout in page_service.rs's `get_active_tenant_with_timeout` (neon/pageserver/src/page_service.rs, line 347 at b6c7c32) multiple times.
A quick log analysis revealed no obvious reason. The working assumption is that we were busy, perhaps even with blocking work, as the earlier initial logical size calculations had most likely started by this point (re: #2975).
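For context, a minimal sketch of the pattern this refers to (not the actual `get_active_tenant_with_timeout` implementation; it assumes the tenant state is observable through a tokio watch channel): the page service gives the tenant a bounded amount of time to become Active before failing the request, so slow startup surfaces as these timeouts.

```rust
use std::time::Duration;

use tokio::time::timeout;

#[derive(Debug, Clone, Copy, PartialEq)]
enum TenantState {
    Loading,
    Active,
    Broken,
}

/// Wait until the tenant reports Active, or give up after `limit`.
async fn wait_until_active(
    mut state_rx: tokio::sync::watch::Receiver<TenantState>,
    limit: Duration,
) -> anyhow::Result<()> {
    timeout(limit, async {
        loop {
            if *state_rx.borrow_and_update() == TenantState::Active {
                return Ok::<(), anyhow::Error>(());
            }
            // Wait for the tenant's state to change before checking again.
            state_rx.changed().await?;
        }
    })
    .await
    .map_err(|_elapsed| anyhow::anyhow!("timed out waiting for the tenant to become active"))?
}
```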
Post-analysis ideas in general, so that we'd learn about these early: