Proper handling for download failing to replace layer #3533

Closed
Tracked by #2476
koivunej opened this issue Feb 3, 2023 · 6 comments
Labels
c/storage/pageserver Component: storage: pageserver

Comments

@koivunej
Member

koivunej commented Feb 3, 2023

First discussed in #3387 (comment) and made more robust in #3513, but the following still remains to be done:

  • reproducing test case
  • proper handling for metrics
  • downloaded file on disk
@koivunej
Member Author

koivunej commented Feb 13, 2023

Because compaction never downloads a remote layer and gc depends on the presence of image layers, I haven't yet figured out a way to reproduce this in a test. We certainly have the conditions for it.

If we supported compaction doing on-demand downloads while not holding layer_removal_cs, as #3589 suggests, we could run into a replacement failure via the inserts and removals caused by compaction. I don't think that is valid either, because the two threads would download the same instance of RemoteLayer, so they would wait on the same semaphore; one of them would win, do the replacement, and then close the semaphore. If fixing compaction to work with on-demand downloads were done only partially, then later trying to replace the already downloaded RemoteLayers in the LayerMap with the actual reshuffled layers would fail, but I don't think that is a race, just a logic issue.

(The "both wait on same semaphore" issue has been overlooked on a number of on-demand related guesses for replacement failures.)

@koivunej
Member Author

A suspected failure to replace a layer has now happened in production while migrating a tenant, which caused a retry loop that flooded the logs with:

2023-02-15T12:39:28.295686Z  INFO synthetic_size_worker:calculate_synthetic_size{tenant_id=X}:gather_size_inputs{tenant_id=X}:download_remote_layer{tenant_id=X timeline_id=Y layer=000000067F00004002000004EB0000000005-030000000000000000000000000000000002__00000000D2B03F89-00000000D2E79FE9}: download of layer has already finished

This comes from the semaphore being closed; the message is logged forever once one or more threads discover this RemoteLayer still in the LayerMap (commit unrelated):

info!("download of layer has already finished");

I suspect it is similar to #3387 in that it happened via refresh_gc_info, because we have a span on the only log message (related fix: #3431).

This was on a version before the #3558 fixes (we don't have git commits recorded for production right now; I thought the closest would be 248563c).

@LizardWizzard could you link this issue if you write something up about the production issue?

@koivunej
Member Author

Looking at the scrapes for the tenant and timeline, this problematic layer must have been the last remote layer. This was not discovered in #3387, at least.

koivunej added a commit that referenced this issue Feb 17, 2023
Add an AtomicBool per RemoteLayer; use it together with the closed semaphore to mark that the RemoteLayer is unusable until restart or ignore+load.

#3533 (comment)
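
A hedged sketch of the idea in that commit (field and method names are illustrative, not the actual pageserver code): an AtomicBool on the RemoteLayer, set alongside closing the semaphore, lets callers distinguish "download finished" from "this layer is unusable until restart or ignore+load" instead of retrying forever.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::sync::Semaphore;

// Illustrative names only.
struct RemoteLayer {
    download_gate: Semaphore,
    /// Set when the downloaded file could not replace this layer in the
    /// LayerMap; only a restart or an ignore+load cycle gets rid of it.
    replacement_failed: AtomicBool,
}

impl RemoteLayer {
    fn mark_unusable(&self) {
        self.replacement_failed.store(true, Ordering::Relaxed);
        self.download_gate.close();
    }

    /// Callers can bail out instead of looping on the closed semaphore.
    fn is_unusable(&self) -> bool {
        self.replacement_failed.load(Ordering::Relaxed)
    }
}
```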
@koivunej koivunej added the c/storage/pageserver Component: storage: pageserver label Feb 20, 2023
@koivunej
Member Author

Discussed today with @problame: this shouldn't be an issue now that #3613 is merged. If a replacement fails:

  • the get_reconstruct_data calls will be restarted
  • after the next restart or ignore+load cycle, the layer will be added to the LayerMap and uploaded
    • later on, gc should remove it (if the preconditions to remove it were met before, they will continue to be met)

@problame
Contributor

So, I think we're good here, for a low-acceptance-bar version of 'proper'?

IMO, since we're not deleting the layer file on replace failure, it's correct not to decrement the resident size.

@koivunej
Member Author

I think so as well, on both points; I assume we are not decrementing the size after a successful download right now either. Closing; feel free to reopen in case I've misunderstood.
