Do not return Error in case of duplicate layer detection #4094
Conversation
Test results for 5f39328: debug build: 221 tests run: 211 passed, 0 failed, 10 skipped (full report); release build: 221 tests run: 211 passed, 0 failed, 10 skipped (full report)
This is related to #4088.
Another scenario to consider:
- Compaction writes an L1 layer to local disk.
- Crash before any L0 layers are deleted
- Restart
- The L1 layer is uploaded to S3
- The L1 layer is evicted from local disk
- Compaction starts again. It creates a new L1 layer with the same filename
- insert_historic_noflush() finds the duplicate
Now what? If we just log the error, you'll still have a RemoteLayer entry in the layer map, but the newly created file exists on local disk. I'm not sure what exactly happens then, but this is a very confused state.
The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is as follows (a rough sketch in code follows the list):
- create the layer file with a temporary filename
- get write-lock on the layer map
- check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
- Rename the file to the final filename
- Release the lock.
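A minimal Rust sketch of that proposed sequence, using a hypothetical `LayerMap` stand-in (just a set of layer names) rather than the real pageserver types; error handling for a failed rename is elided:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;
use std::sync::RwLock;

// Hypothetical stand-in for the real layer map: just a set of layer file names.
struct LayerMap {
    names: HashSet<String>,
}

/// Sketch of the proposed sequence: the temp file is only renamed to its
/// final name while holding the layer map write lock, after checking that
/// no layer with that name is already registered. Returns Ok(false) if a
/// duplicate was found and the temp file was discarded.
fn finish_layer(
    layer_map: &RwLock<LayerMap>,
    temp_path: &Path,
    final_path: &Path,
    layer_name: &str,
) -> io::Result<bool> {
    // Step 2: take the write lock on the layer map.
    let mut map = layer_map.write().unwrap();
    // Step 3: if a layer with the same name already exists, delete the
    // temporary file and leave the existing layer map entry alone.
    if map.names.contains(layer_name) {
        fs::remove_file(temp_path)?;
        return Ok(false);
    }
    // Insert the entry, then rename the file to its final name (step 4).
    map.names.insert(layer_name.to_owned());
    fs::rename(temp_path, final_path)?;
    // Step 5: the lock is released when `map` goes out of scope.
    Ok(true)
}
```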
Or maybe there's some simpler way if we know there can be only one compaction running at a time.
pageserver/src/tenant/layer_map.rs (outdated)

```rust
error!(
    "Attempt to insert duplicate layer {} in layer map",
    layer.short_id()
);
```
This doesn't feel right. The caller presumably needs to do something differently if the layer already existed. This needs to return some kind of an "AlreadyExistsError", and let the caller decide if that's OK or not.
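For illustration, a hedged sketch of what such an error type could look like; the name `LayerAlreadyExistsError` and the commented-out signature are assumptions, not the actual pageserver API (the PR ended up using `Result<bool>` instead, as the next comment notes):

```rust
use std::fmt;

// Hypothetical error type for the duplicate case; illustrative only.
#[derive(Debug)]
struct LayerAlreadyExistsError {
    layer_id: String,
}

impl fmt::Display for LayerAlreadyExistsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "layer {} already exists in the layer map", self.layer_id)
    }
}

impl std::error::Error for LayerAlreadyExistsError {}

// The insert would then surface the duplicate to the caller instead of
// only logging it, e.g.:
// fn insert_historic_noflush(&mut self, layer: Arc<dyn Layer>)
//     -> Result<(), LayerAlreadyExistsError>
```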
OK, I replaced Result<()> with Result<bool> to inform the caller that a duplicate was detected. But right now the result is not really handled. In compaction, the only thing we can do when we detect a duplicate layer is to not send it to S3. But even that is not correct, because the presence of a layer in the local directory doesn't mean that it was successfully uploaded to S3.
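A sketch of the caller-side handling this implies, with illustrative stand-ins (a HashSet for the layer map, a Vec for the upload queue) rather than the real pageserver types:

```rust
use std::collections::HashSet;

// Hypothetical sketch of the caller acting on the duplicate signal:
// inserting a genuinely new layer schedules the upload, a duplicate skips it.
fn on_compacted_layer(
    layer_map: &mut HashSet<String>,  // stand-in for the real layer map
    upload_queue: &mut Vec<String>,   // stand-in for the S3 upload queue
    layer_name: String,
) {
    // HashSet::insert returns false if the name was already present.
    if layer_map.insert(layer_name.clone()) {
        // Genuinely new layer: schedule the S3 upload.
        upload_queue.push(layer_name);
    } else {
        // Duplicate: skip the upload. As noted above, even this is not
        // fully correct, because the old file may never have reached S3.
        eprintln!("duplicate layer {layer_name}, skipping upload");
    }
}
```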
Did the first pass, this is looking good.
Just to reiterate my earlier points and what I understand from the logs of the test run:
The test shows that in the end the duplicate layer will be re-created, do I understand correctly? But the overwrite happens via rename, so hopefully we will be overwriting the file with equal contents. It feels like we should assert that the contents are the same, but I am not sure we can afford that. And should the file contents differ, what would we do then? There seems to be no clear idea of what we would do in that case.
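If one wanted that assert, a chunked byte-for-byte comparison keeps the cost to one sequential read per file and a constant amount of memory; a minimal self-contained sketch, not part of the PR:

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Compare two files byte-for-byte in fixed-size chunks.
fn files_identical(a: &Path, b: &Path) -> io::Result<bool> {
    let (fa, fb) = (File::open(a)?, File::open(b)?);
    if fa.metadata()?.len() != fb.metadata()?.len() {
        return Ok(false); // different sizes can never match
    }
    let (mut ra, mut rb) = (BufReader::new(fa), BufReader::new(fb));
    let (mut buf_a, mut buf_b) = ([0u8; 8192], [0u8; 8192]);
    loop {
        let n = ra.read(&mut buf_a)?;
        if n == 0 {
            return Ok(true); // EOF on both files (lengths are equal)
        }
        // Read the same number of bytes from the second file and compare.
        rb.read_exact(&mut buf_b[..n])?;
        if buf_a[..n] != buf_b[..n] {
            return Ok(false);
        }
    }
}
```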
Theoretical worst case I see is that we would keep holding an unnecessary layer, but as we discussed, it is very unlikely for the partitioning to change over the restart and for the pageserver to decide to create different L1 layers than it decided previously. The L1_from_previous_round and L1_after_new_repartitioning could even overlap. Would that be a problem, should it be possible to end up in this situation?
So we assume that whatever we rewrote this time around is equal to or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to S3.
I think one compaction at a time is a "physical constant" of our system.
It seems to be quite expensive to compare layer content.
It should not be a problem: overlapped layers are expected.
I do not understand why we need a lock here: rename is atomic, and the check for duplicates is done under the layer map lock.
We should upload the new one to S3 because there is no guarantee that the old one was uploaded.
There is one compaction task; how is it possible that multiple compactions are performed concurrently?
I missed a critical step here: insert the entry into the layer map. I meant that to be part of step 2, or between steps 2 and 3.
Hmm, so the sequence would be:
Is that what you're thinking? There's a race condition: a concurrent getPage request could try to access the file between steps 3 and 4, and get ENOENT.
I don't like this "overwrite old file" business. Let's prevent the overwriting.
Right, it's not possible currently. Ideally the code would be safe even if we start doing parallel compactions in the future, or run GC, compaction or image layer creation in parallel with each other. What I meant is that if that's difficult, then maybe we rely on the fact that there can be only one compaction running at a time, after all.
Correctness of handling overlapped layers is based on the assumption that the intersecting parts of the layer files contain the same data. This is not actually checked (such a check would be too expensive), but if this precondition is violated, page reconstruction may produce incorrect results. Duplicate layers are just an extreme case of overlapped layers, so they should also contain the same data. That is why it is equally correct to overwrite the old duplicate layer with the new one, or to keep the old layer and skip the new one.

Right now compaction is implemented so that new L1 layers are first flushed to disk, and only afterwards added to the layer map, which is when the check for duplicates is performed. I do not think it is critical that new layers are inserted into the layer map only after they are flushed to disk, but I also do not see any problem with overwriting a duplicate layer file on disk while leaving the layer map unchanged. If the file content is the same, it doesn't matter.
Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.
I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.
Yes, but if the two files are identical, what is the problem with overwriting the old file with the new one?
So what is your suggestion?
Holding the lock while you rename the file isn't ideal from a concurrency point of view, but I think it's good enough for now.
Two reasons:
You could probably find answers to all of those questions and make it safe. But that's a lot of new assumptions and restrictions for future code that you'd need to document and make sure we uphold now and in the future. It is much simpler to just not overwrite the layer in the first place.
I agree with your arguments. Right now we are writing new layers to disk (using a temp file for atomic rename); what you have suggested requires rewriting all of this code, and I am not sure that the sequence you proposed is absolutely correct.
(Made and quickly removed a bogus comment about the
I guess that would work better, but it feels silly to overwrite a perfectly good file. I would still worry if the contents of the old and new file are not physically identical.
Hmm, OK, I see that the sequence for writing the files during compaction is a bit more complicated than I remembered. It's currently like this:
If we rely on the fact that there can be only one compaction running at a time, we can do this:
So this would restore the previous functionality but replace the L0 but still error out? I don't understand that part.
EDIT: I think this would mean that we would have one L0 instance in the historic part of the layer map and a different instance in LayerMap::l0_layers. Which I think would be bad, because we have eviction-related assumptions that in-memory layers map 1:1 to disk layers.

Other than that, it does seem OK as far as checking for duplicates and warning on adding goes. I have to admit it took me a while to see the negated !Self::compare_arced_layers(...), which I still find problematic. Tests for insert would probably give me confidence.
See #4106 for how that would look. I ran the new test case that this PR adds with it, but haven't done any other testing.
@hlinnaka why can't we do the rename thing while we hold the layer map write lock? Because it requires more code changes?
Right, it would require more code changes. Maybe we should try it and see how bad it is, though. Since we're not rushing this to production right now, we can breathe a little and do this properly.
Yeah, I think the "rename thing while we hold the layer map write lock" won't be too bad in terms of implementation effort. Do we want to commit the revert (#4104) in the meantime so the release is unblocked?
Summarizing some of the discussion spread around everywhere here:
This reverts commit 732acc5. Reverted PR: #3869 As noted in PR #4094, we do in fact try to insert duplicates to the layer map, if L0->L1 compaction is interrupted. We do not have a proper fix for that right now, and we are in a hurry to make a release to production, so revert the changes related to this to the state that we have in production currently. We know that we have a bug here, but better to live with the bug that we've had in production for a long time, than rush a fix to production without testing it in staging first. Cc: #4094, #4088
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Before finishing the delta file, and possibly overwriting an old, perfectly valid file, check whether an identical file already exists in the layer map. This is an alternative to #4094. The test case is copied from that PR.
## Problem

Compactions might generate files with exactly the same names as before the compaction, due to our naming of layer files. This may already have caused some mess in the system, and is known to cause issues like #4088. Therefore, we now consider duplicated layers in the post-compaction process, to avoid violating the layer map duplicate checks.

Related previous work: closes #4094. Error reported in: #4690, #4088.

## Summary of changes

If a file already exists in the layer map before the compaction, do not modify the layer map and do not delete the file. The file on disk at that time should be the new one, overwritten by the compaction process. This PR also adds a test case with a fail point that produces exactly the same set of files.

This bypassing behavior is safe because the produced layer files have the same content / are the same representation of the original file. An alternative might be to remove the duplicate check in the layer map entirely, but I feel it would be good if we can prevent duplicates in the first place.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
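A simplified sketch of that post-compaction filtering step, with a set of pre-compaction layer names standing in for the real layer map; all names here are illustrative, not the real pageserver structures:

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Sketch: partition the compaction output into layers that are genuinely
/// new (to be inserted into the layer map and uploaded) and duplicates of
/// layers that already existed before compaction (layer map left unmodified,
/// file on disk already overwritten with identical content, so kept as-is).
fn split_new_and_duplicate(
    existing: &HashSet<String>,       // layer names present before compaction
    produced: Vec<(String, PathBuf)>, // (name, path) pairs compaction wrote
) -> (Vec<(String, PathBuf)>, Vec<(String, PathBuf)>) {
    produced
        .into_iter()
        .partition(|(name, _)| !existing.contains(name))
}
```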
Describe your changes
compact_level0_phase1 flushes newly created L1 layers to disk, and only after that can compact_level0 remove the old layers. This guarantees that we do not lose data if compaction is interrupted.
If the pageserver is interrupted at this moment, then after a pageserver restart the following states of the layers are possible:
In any case, a subsequent compaction may produce duplicates of the L1 layers, as the sketch below illustrates.
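An illustrative sketch of that ordering and the crash window it leaves; the function bodies are placeholders, not the real pageserver code:

```rust
// Placeholder illustrating the two-phase ordering described above.
fn compact_level0_phase1() -> Vec<String> {
    // Phase 1: write the new L1 layer files to disk and return their names.
    vec!["l1-layer-a".into(), "l1-layer-b".into()]
}

fn delete_old_l0_layers() {
    // Phase 2: remove the old L0 layers, only after the L1s are on disk.
}

fn compact_level0() {
    let _new_l1_layers = compact_level0_phase1();
    // A crash here leaves both the old L0s and the new L1s on disk. After
    // restart, the next compaction run can regenerate L1 layers with the
    // same file names, which is exactly the duplicate scenario above.
    delete_old_l0_layers();
}
```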
Issue ticket number and link
Fixes changes made in this PR: #3869
Discussion: https://neondb.slack.com/archives/C033QLM5P7D/p1682519017949719