Excessive tenant initial load times on large pageserver after restart #4025

Closed

koivunej opened this issue Apr 14, 2023 · 6 comments

Labels: c/storage/pageserver (Component: storage: pageserver), t/investigation (Needs further investigation)

Comments

@koivunej (Member) commented Apr 14, 2023

Internal slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1681409913535909 by @arssher, who brought this up.

Tenant load is generally assumed to be fast, but recently two tenants took multiple times longer than page_service.rs's get_active_tenant_with_timeout, called from:

let tenant = get_active_tenant_with_timeout(tenant_id, &ctx).await?;

Quick log analysis revealed no obvious reason. The working assumption is that we were busy, perhaps even with blocking work, since the initial logical size calculations had most likely already started at this point (re: #2975).

Post-analysis ideas so we'd learn about cases like this early (a sketch of the first idea follows this list):

  • @koivunej thought of a "load takes more than a hardcoded number of seconds => warn!"
  • @problame suggested an alert on long load times
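
A minimal sketch of the first idea, assuming a hypothetical threshold and wrapper function (neither is from the actual pageserver code):

```rust
use std::time::{Duration, Instant};
use tracing::warn;

/// Hypothetical threshold; the comment above only says "hardcoded seconds".
const SLOW_LOAD_WARN_THRESHOLD: Duration = Duration::from_secs(30);

/// Wrap a tenant-load future and warn if it ran longer than the threshold.
async fn load_with_slow_warning<F, T>(tenant_id: &str, load: F) -> T
where
    F: std::future::Future<Output = T>,
{
    let started = Instant::now();
    let result = load.await;
    let elapsed = started.elapsed();
    if elapsed > SLOW_LOAD_WARN_THRESHOLD {
        warn!(%tenant_id, ?elapsed, "tenant load took longer than expected");
    }
    result
}
```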
@koivunej koivunej added the c/storage/pageserver Component: storage: pageserver label Apr 14, 2023
@koivunej (Member Author)

Aside: we log so much during startup and shutdown that it's probably worth looking into as well.

@koivunej koivunej added the t/investigation Needs further investigation label Apr 14, 2023
koivunej added a commit that referenced this issue Apr 26, 2023
koivunej added a commit that referenced this issue Apr 26, 2023
Adds just a counter of the time from the creation of the tenant, logged
after activation. Might help guide the investigation of #4025.
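
A rough sketch of what such a counter could look like; the struct and field names are illustrative, not the actual pageserver types:

```rust
use std::time::Instant;
use tracing::info;

struct Tenant {
    /// Captured when the Tenant object is constructed during load.
    created_at: Instant,
}

impl Tenant {
    fn new() -> Self {
        Tenant { created_at: Instant::now() }
    }

    fn activate(&self) {
        // ... activation work elided ...

        // Log how long it took from creation to activation.
        info!(
            elapsed_ms = self.created_at.elapsed().as_millis() as u64,
            "tenant activated"
        );
    }
}
```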
@koivunej (Member Author)

There seems to be a lot of fluctuation; a tenant with 10 timelines took over 100s to activate, this time only 70s.

@problame (Contributor)

I know you believe it's not the concurrency limiter semaphore, but we could rule it out by introducing metrics that measure it.

I think you believe the root cause is the synchronous I/O that we do. My #4215 might fix this.
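
For reference, a sketch of how the semaphore wait time could be measured; the metric name and helper are hypothetical, using the `prometheus` and `once_cell` crates:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};
use tokio::sync::{Semaphore, SemaphorePermit};

// Hypothetical metric; not an existing pageserver metric.
static CONCURRENCY_LIMITER_WAIT_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!(
        "concurrency_limiter_wait_seconds",
        "Time spent waiting for a concurrency-limiter permit"
    )
    .expect("failed to register histogram")
});

/// Acquire a permit while recording how long the acquisition waited.
async fn acquire_measured(limiter: &Semaphore) -> SemaphorePermit<'_> {
    let started = std::time::Instant::now();
    let permit = limiter.acquire().await.expect("semaphore closed");
    CONCURRENCY_LIMITER_WAIT_SECONDS.observe(started.elapsed().as_secs_f64());
    permit
}
```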

@koivunej (Member Author)

At least over the past weeks I have been thinking it could be the S3 semaphore, which is why I had hoped to continue on that for one or two weeks. I did not deprioritize it.

I still don't agree that we can just add permits, because it's trivial to go over the RPS limits, and recently every case of exceeding AWS limits has resulted in unclear failures.

I would very much like to add the task metrics, but they will require a different set of buckets if we get the spawn_blocking change in.

koivunej added a commit that referenced this issue May 29, 2023
Startup can take a long time. We suspect it's the initial logical size
calculations. The long-term solution is to not block the tokio executors but
to do most of the I/O in spawn_blocking.

See: #4025, #4183

Short-term solution to the above:

- Delay global background tasks until initial tenant loads complete
- Just limit how many initial logical size calculations we can have at the
same time to `cores / 2` (see the sketch after this commit message)

This PR is for trying in staging.
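
A minimal sketch of the `cores / 2` limit described in the commit message above, assuming illustrative names rather than the actual pageserver code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Build a limiter allowing at most `cores / 2` concurrent initial logical
/// size calculations (but never fewer than one permit).
fn initial_logical_size_limiter() -> Arc<Semaphore> {
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    Arc::new(Semaphore::new(std::cmp::max(1, cores / 2)))
}

async fn initial_logical_size_calculation(limiter: Arc<Semaphore>) {
    // Queue here until one of the `cores / 2` slots frees up.
    let _permit = limiter.acquire_owned().await.expect("semaphore closed");

    // ... walk the timeline's layers and compute the logical size ...
}
```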
@koivunej (Member Author) commented Aug 4, 2023

There's a fresher duplicate, #4183, and maybe an epic as well.

Fixes are in place (#4399) and are working, at best giving us 1 ms per timeline, but at worst much more. Added #4892 so we get an understanding of how long things take.

Closing this to focus on #4183.

@koivunej koivunej closed this as completed Aug 4, 2023