
Do not return Error in case of duplicate layer detection #4094

Closed
wants to merge 4 commits

Conversation

knizhnik
Contributor

Describe your changes

compact_level0_phase1 flushes the newly created L1 layers to disk, and only after that does compact_level0 remove the old layers. This guarantees that we do not lose data if compaction is interrupted.
If the pageserver is interrupted at that moment, then after the pageserver restarts the following layer states are possible:

  • Some subset of the new L1 layers is present on disk, plus all of the old L0 layers
  • All of the new L1 layers are present on disk, plus some subset of the old L0 layers

In either case, a subsequent compaction may produce duplicates of the L1 layers.
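
A simplified sketch of that ordering, showing the two crash windows; the names and bodies below are placeholders for illustration, not the actual pageserver code:

```rust
use std::io;

struct Layer {
    name: String,
}

// Simplified two-phase shape of L0 -> L1 compaction as described above.
fn compact_level0(old_l0_layers: Vec<Layer>) -> io::Result<()> {
    // Phase 1: write and fsync all new L1 layers before touching the L0s.
    let new_l1_layers = write_new_l1_layers(&old_l0_layers)?;
    // A crash here leaves all old L0 layers plus some subset of the new L1s.
    for l0 in &old_l0_layers {
        remove_layer_file(l0)?;
        // A crash here leaves all new L1 layers plus some subset of the old L0s.
    }
    // In either crash case, re-running compaction regenerates L1 layers with
    // the same file names, i.e. duplicates from the layer map's point of view.
    drop(new_l1_layers);
    Ok(())
}

fn write_new_l1_layers(_old_l0: &[Layer]) -> io::Result<Vec<Layer>> {
    Ok(Vec::new()) // placeholder for compact_level0_phase1
}

fn remove_layer_file(layer: &Layer) -> io::Result<()> {
    let _ = &layer.name;
    Ok(()) // placeholder
}
```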

Issue ticket number and link

Fixes a problem with the changes made in #3869
Discussion: https://neondb.slack.com/archives/C033QLM5P7D/p1682519017949719

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners April 27, 2023 12:46
@knizhnik knizhnik requested review from MMeent and hlinnaka and removed request for a team April 27, 2023 12:46
@github-actions

github-actions bot commented Apr 27, 2023

Test results for 5f39328:


debug build: 221 tests run: 211 passed, 0 failed, 10 skipped (full report)


release build: 221 tests run: 211 passed, 0 failed, 10 skipped (full report)


@koivunej
Member

This is related to #4088.

@koivunej koivunej self-requested a review April 27, 2023 14:05
Contributor

@hlinnaka hlinnaka left a comment


Another scenario to consider:

  1. Compaction writes an L1 layer to local disk.
  2. Crash before any L0 layers are deleted
  3. Restart
  4. The L1 layer is uploaded to S3
  5. The L1 layer is evicted from local disk
  6. Compaction starts again. It creates a new L1 layer with the same filename
  7. insert_historic_noflush() finds the duplicate

Now what? If we just log the error, you'll still have a RemoteLayer entry in the layer map, but the newly created file exists on local disk. I'm not sure what exactly happens then, but this is a very confused state.

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

  1. create the layer file with a temporary filename
  2. get write-lock on the layer map
  3. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
  4. Rename the file to the final filename
  5. Release the lock.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

Comment on lines 280 to 283
error!(
    "Attempt to insert duplicate layer {} in layer map",
    layer.short_id()
);
Contributor


This doesn't feel right. The caller presumably needs to do something differently, if the layer already existed. This needs to return some kind of an "AlreadyExistsError", and let the caller decide if that's OK or not.

Contributor Author


OK, I replaced Result<()> with Result<bool> to inform the caller that a duplicate was detected.
But right now it is not handled in any way.
In compaction, the only thing we can do when a duplicate layer is detected is not to send it to S3.
But that is also not correct, because the presence of a layer in the local directory doesn't mean that it was successfully uploaded to S3.
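
A rough sketch of what that signature change could look like; the LayerMap here is a string-keyed stand-in (and anyhow::Result is assumed as the error type), not the real pageserver implementation:

```rust
use std::collections::HashSet;

struct LayerMap {
    historic: HashSet<String>,
}

impl LayerMap {
    /// Returns Ok(true) if the layer was inserted, Ok(false) if a layer with
    /// the same name was already present (a duplicate was detected). The
    /// caller decides how to handle a duplicate.
    fn insert_historic_noflush(&mut self, layer_name: &str) -> anyhow::Result<bool> {
        if self.historic.contains(layer_name) {
            return Ok(false);
        }
        self.historic.insert(layer_name.to_owned());
        Ok(true)
    }
}
```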

Member

@koivunej koivunej left a comment


Did the first pass, this is looking good.

Just to reiterate my earlier points and what I understand from the logs of the test run:

The test shows that in the end the duplicate layer will be re-created, do I understand correctly? But the overwrite happens with a rename, so hopefully we will be overwriting equal file contents. It feels like we should assert that the contents are the same, but I am not sure we can afford that. Should the file contents differ, what do we do then? There seems to be no clear idea of what we would do in that case.

The theoretical worst case I see is that we would keep hold of an unnecessary layer, but like we discussed, it is very unlikely for the partitioning to change across the restart and for the pageserver to decide to create different L1 layers than it did previously. The L1_from_previous_round and L1_after_new_repartitioning could even overlap. Would that be a problem, should it be possible to end up in this situation?

@koivunej
Member

koivunej commented Apr 27, 2023

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

0. create the layer file with a temporary filename
1. get write-lock on the layer map
2. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
3. Rename the file to the final filename
4. Release the lock.

So we assume that whatever we rewrote this time around is equal or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to s3.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

I think one compaction at a time is a "physical constant" of our system. Timeline::compact_inner grabs Timeline::layer_removal_cs. Multiple compactions could race layer download attempts however, and that should be fine.

@koivunej
Member

The check-codestyle-rust failure should be gone on the next push or workflow run, as #4095 got merged.

@knizhnik
Contributor Author

Did the first pass, this is looking good.

Just to reiterate my earlier points and what I understand from the logs of the test run:

The test shows that in the end the duplicate layer will be re-created, do I understand correctly? But the overwrite happens with a rename, so hopefully we will be overwriting equal file contents. It feels like we should assert that the contents are the same, but I am not sure we can afford that. Should the file contents differ, what do we do then? There seems to be no clear idea of what we would do in that case.

It seems quite expensive to compare layer content.
Yes, it is intended to be the same.
But I do not think it should be checked.
And you are right - it is not clear what to do if they are different.
Right now we just rewrite the layer file.

The theoretical worst case I see is that we would keep hold of an unnecessary layer, but like we discussed, it is very unlikely for the partitioning to change across the restart and for the pageserver to decide to create different L1 layers than it did previously. The L1_from_previous_round and L1_after_new_repartitioning could even overlap. Would that be a problem, should it be possible to end up in this situation?

It should not be a problem: overlapping layers are expected.
This is why it is not so obvious why we spent so much effort on detecting duplicates - they are just a specific case of overlapping layers.

@knizhnik
Contributor Author

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

0. create the layer file with a temporary filename
1. get write-lock on the layer map
2. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
3. Rename the file to the final filename
4. Release the lock.

I do not understand why we need the lock here: rename is atomic, and the check for duplicates is done under the layer map lock.

So we assume that whatever we rewrote this time around is equal or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to s3.

We should upload the new one to S3 because there is no guarantee that the old one was uploaded.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

There is one compaction task, so how is it possible that multiple compactions are performed concurrently?

@hlinnaka
Contributor

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

0. create the layer file with a temporary filename
1. get write-lock on the layer map
2. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
3. Rename the file to the final filename
4. Release the lock.

I missed a critical step here: inserting the entry into the layer map. I meant that to be part of step 2, or between steps 2 and 3.

I do not understand why we need the lock here: rename is atomic, and the check for duplicates is done under the layer map lock.

Hmm, so the sequence would be:

  1. create the layer file with a temporary filename
  2. Acquire write-lock on the layer map
  3. Atomically check if the layer map already contains a layer with the same name. If it does not, insert it
  4. Release lock
  5. Rename the file to the final filename

Is that what you're thinking? There's a race condition: a concurrent getPage request could try to access the file between steps 3 and 4, and get ENOENT.
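
A sketch of that ordering with the problematic window marked; the RwLock<HashSet<String>> layer map and the path parameters are illustrative stand-ins, not the real pageserver types:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::sync::RwLock;

fn publish_layer_racy(
    layer_map: &RwLock<HashSet<String>>,
    temp_path: &str,
    final_path: &str,
    layer_name: &str,
) -> io::Result<()> {
    {
        // Check-and-insert under the layer map write-lock.
        let mut map = layer_map.write().unwrap();
        if map.contains(layer_name) {
            // Duplicate: drop the temp file, leave the layer map alone.
            fs::remove_file(temp_path)?;
            return Ok(());
        }
        map.insert(layer_name.to_owned());
        // Lock is released here when `map` goes out of scope.
    }
    // RACE WINDOW: the layer map already advertises the layer, but the file
    // still lives at `temp_path`; a concurrent getPage request that resolves
    // the entry now would look for `final_path` and hit ENOENT.
    fs::rename(temp_path, final_path)?;
    Ok(())
}
```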

So we assume that whatever we rewrote this time around is equal or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to s3.

We should upload the new one to S3 because there is no guarantee that the old one was uploaded.

I don't like this "overwrite old file" business. Let's prevent the overwriting.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

There is one compaction task, so how is it possible that multiple compactions are performed concurrently?

Right, it's not possible currently. Ideally the code would be safe, even if we start doing parallel compactions in the future, or run GC, compaction or image layer creation in parallel with each other. What I meant is that if that's difficult, then maybe we rely on the fact that there can be only one compaction running at a time, after all.

@knizhnik
Contributor Author

Correctness of handling overlapping layers is based on the assumption that the intersecting parts of the layer files contain the same data. This is not checked in any way (because such a check would be too expensive), but if this precondition is violated, then page reconstruction may be performed incorrectly.

Duplicate layers are just an extreme case of overlapping layers, so they should also contain the same data. This is why it is equally correct to overwrite the old duplicate layer with the new one, or to keep the old layer and skip the new one.

Right now compaction is implemented in such a way that new L1 layers are first flushed to the disk, and only after that are they added to the layer map, which is when the check for duplicates is performed. I do not think it is critical that new layers are inserted into the layer map only after they are flushed to the disk, but I also do not see any problem with rewriting the duplicate layer file on the disk and not changing the layer map. If the file content is the same, it doesn't matter.

@hlinnaka
Contributor

Correctness of handling overlapping layers is based on the assumption that the intersecting parts of the layer files contain the same data. This is not checked in any way (because such a check would be too expensive), but if this precondition is violated, then page reconstruction may be performed incorrectly.

Duplicate layers are just an extreme case of overlapping layers, so they should also contain the same data. This is why it is equally correct to overwrite the old duplicate layer with the new one, or to keep the old layer and skip the new one.

Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.

Right now compaction is implemented in such a way that new L1 layers are first flushed to the disk, and only after that are they added to the layer map, which is when the check for duplicates is performed. I do not think it is critical that new layers are inserted into the layer map only after they are flushed to the disk, but I also do not see any problem with rewriting the duplicate layer file on the disk and not changing the layer map. If the file content is the same, it doesn't matter.

I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.

@knizhnik
Contributor Author

Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.

Yes, but if the two files are identical, what is the problem with overwriting the old file with the new one?
Both locally and in S3?
If it is done atomically, and there is no danger that because of a failure both files are lost, then what is wrong with overwriting the file? Why do we need to care about it?

I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.

So what is your suggestion?
If the content of the intersecting section of overlapping layers is different, then we can get different page reconstruction results depending on which trajectory of layer map traversal we choose. And the sequence of traversed layers is determined by the starting point - the target LSN. So assume that some page was last updated at lsn=LSN0. If get_page(LSN1) and get_page(LSN2) return different results for some LSN1>LSN0 and LSN2>LSN0, then we are in trouble, aren't we?
And as you mentioned, duplicate layers are an extreme case of overlapping layers.
And despite the fact that we are not able to store both of them, if their content is different then we can get wrong behavior no matter whether we keep the old version or replace it with the new one.

@hlinnaka
Contributor

Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.

Yes, but if the two files are identical, what is the problem with overwriting the old file with the new one? Both locally and in S3? If it is done atomically, and there is no danger that because of a failure both files are lost, then what is wrong with overwriting the file? Why do we need to care about it?

I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.

So what is your suggestion?

  1. Write file with temporary filename
  2. Lock layer map
  3. Try to insert entry in layer map. If a duplicate layer exists, delete the temporary file and leave the layer map unmodified
  4. Rename the temporary file to have the correct filename
  5. Release lock

Holding the lock while you rename the file isn't ideal from a concurrency point of view, but I think it's good enough for now.
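
A minimal sketch of that sequence, holding the layer map write-lock across the rename; the lock, map type, and path parameters are stand-ins assumed for illustration, not the real pageserver structures:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::sync::RwLock;

fn publish_layer(
    layer_map: &RwLock<HashSet<String>>,
    temp_path: &str,
    final_path: &str,
    layer_name: &str,
) -> io::Result<()> {
    // 1. The layer file was already written to `temp_path` by the caller.
    // 2. Lock the layer map.
    let mut map = layer_map.write().unwrap();
    // 3. Try to insert; on a duplicate, delete the temp file and leave the
    //    layer map unmodified.
    if map.contains(layer_name) {
        fs::remove_file(temp_path)?;
        return Ok(());
    }
    map.insert(layer_name.to_owned());
    // 4. Rename to the final filename while still holding the lock, so no
    //    reader can observe the map entry before the file is in place.
    fs::rename(temp_path, final_path)?;
    // 5. Lock released when `map` goes out of scope.
    Ok(())
}
```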

If the content of the intersecting section of overlapping layers is different, then we can get different page reconstruction results depending on which trajectory of layer map traversal we choose. And the sequence of traversed layers is determined by the starting point - the target LSN. So assume that some page was last updated at lsn=LSN0. If get_page(LSN1) and get_page(LSN2) return different results for some LSN1>LSN0 and LSN2>LSN0, then we are in trouble, aren't we? And as you mentioned, duplicate layers are an extreme case of overlapping layers. And despite the fact that we are not able to store both of them, if their content is different then we can get wrong behavior no matter whether we keep the old version or replace it with the new one.

Two reasons:

  1. There is a difference between logically identical and physically identical files. The duplicates should indeed be logically identical, but they might not be physically identical. In other words, you can use either one to reconstruct the page, and you should end up with the same result, but you cannot read block 0 from one file, and block 1 from the other one. We might decide to materialize some pages differently, or we might have some non-deterministic behaviour in how we lay the bytes on the disk. The layer-file construction is currently deterministic AFAIK, but I do not want to rely on that in the future. We also store the file length in the index_part.json file, so if the old and the new files have different length, you'd get errors about that on restart. We have discussed adding a checksum there too.

  2. You can end up in a confused state in the layer map. Take the example I gave earlier: the old file had already been uploaded to S3 and evicted from local disk. Now you create a new layer with the same filename. You successfully create the layer, but skip the insertion into the layer map because there was already a RemoteLayer entry for the same filename. Now you have a file present locally, but the layer map claims it's a RemoteLayer. What does our on-demand download code do in that case, if you try to download the remote layer later? And because the layer map doesn't know that the file exists locally, it won't be taken into account in physical size calculations and layer eviction.

You could probably find answers to all of those questions and make it safe. But that's a lot of new assumptions and restrictions for future code that you'd need to document and make sure we uphold now and in the future. It is much simpler to just not overwrite the layer in the first place.

@knizhnik
Contributor Author

It is much simpler to just not overwrite the layer in the first place.

I agree with your arguments.
It seems to me that the simplest solution is not to avoid overwriting the layer, but quite the opposite - to overwrite it both on disk and in the layer map.
In this case we will have the new layer in both places, and they will be identical.

Right now we write new layers to the disk (using a temp file for an atomic rename), then fsync them to the disk, and then add them to the layer map.

What you have suggested requires rewriting all of this code. And I am not sure that the sequence you proposed is absolutely correct.
If you add a layer to the layer map before flushing it to the disk, then some other task (GC) can perform some action based on the presence of this layer, and after a subsequent crash this file may be lost. So we can lose data.

@koivunej
Member

(Made and quickly removed a bogus comment about insert_historic_noflush needing to use the buffer which is committed/flushed on drop by BufferedUpdates -- I just had a brief misunderstanding before checking the other methods at this level. You probably got email spam.)

@hlinnaka
Contributor

It is much simpler to just not overwrite the layer in the first place.

I agree with your arguments. It seems to me that the simplest solution is not to avoid overwriting the layer, but quite the opposite - to overwrite it both on disk and in the layer map. In this case we will have the new layer in both places, and they will be identical.

I guess that would work better, but it feels silly to overwrite a perfectly good file. I would still worry if the contents of the old and new file are not physically identical.

Right now we write new layers to the disk (using a temp file for an atomic rename), then fsync them to the disk, and then add them to the layer map.

What you have suggested requires rewriting all of this code. And I am not sure that the sequence you proposed is absolutely correct. If you add a layer to the layer map before flushing it to the disk, then some other task (GC) can perform some action based on the presence of this layer, and after a subsequent crash this file may be lost. So we can lose data.

hmm, ok I see that the sequence for writing the files during compaction is a bit more complicated than I remembered. It's currently like this:

  1. Create all new L1 delta files. Rename them in place one by one
  2. Fsync them all
  3. Acquire write-lock on layer map
  4. Insert all new L1 files to the layer map, and schedule uploads for them

If we rely on the fact that there can be only one compaction running at a time, we can do this:

  1. Create all new L1 delta files. For each file, before renaming it in place, check if there is a duplicate in the layer map. If there is, skip it.
  2. Fsync them all
  3. Acquire write-lock on layer map
  4. Insert all new L1 files to the layer map, and schedule uploads for them
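
A rough sketch of that adjusted sequence, relying on there being only one compaction running at a time; the names, types, and the string-keyed layer map are assumptions made for illustration only:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::sync::RwLock;

struct NewDelta {
    temp_path: String,
    final_path: String,
    layer_name: String,
}

fn install_new_l1_layers(
    layer_map: &RwLock<HashSet<String>>,
    new_deltas: Vec<NewDelta>,
) -> io::Result<()> {
    let mut installed = Vec::new();
    // 1. Before renaming each new L1 file in place, check the layer map for
    //    a duplicate; if one exists, keep the existing layer and discard the
    //    freshly written file.
    for delta in new_deltas {
        if layer_map.read().unwrap().contains(&delta.layer_name) {
            fs::remove_file(&delta.temp_path)?;
            continue;
        }
        fs::rename(&delta.temp_path, &delta.final_path)?;
        installed.push(delta.layer_name);
    }
    // 2. Fsyncing the files and their directory is elided in this sketch.
    // 3./4. Take the write-lock once and insert the survivors; this is also
    //       where uploads to remote storage would be scheduled.
    let mut map = layer_map.write().unwrap();
    for name in installed {
        map.insert(name);
    }
    Ok(())
}
```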

Member

@koivunej koivunej left a comment


So this would restore the previous functionality but replace the L0, yet still error out? I don't understand that part.

EDIT: I think this would mean that we would have one L0 instance in the historic part of the layer map and a different instance in LayerMap::l0_layers, which I think would be bad, because we have eviction-related assumptions that in-memory layers map 1:1 to disk layers.

Other than that, it does seem OK as far as checking for duplicates and warning on adding. I have to admit it took me a while to see the negated !Self::compare_arced_layers(...), which I still find problematic. Tests for insert would probably give me confidence.

@hlinnaka
Contributor

hlinnaka commented Apr 28, 2023

If we rely on the fact that there can be only one compaction running at a time, we can do this:

  1. Create all new L1 delta files. For each file, before renaming it in place, check if there is a duplicate in the layer map. If there is, skip it.

  2. Fsync them all

  3. Acquire write-lock on layer map

  4. Insert all new L1 files to the layer map, and schedule uploads for them

See #4106 for how that would look. I ran the new test case that this PR adds with it, but haven't done any other testing.

@problame
Contributor

@hlinnaka why can't we do the rename thing while we hold the layer map write lock? Because it requires more code changes?

@hlinnaka
Contributor

@hlinnaka why can't we do the rename thing while we hold the layer map write lock? Because it requires more code changes?

Right, it would require more code changes. Maybe we should try it and see how bad it is, though. Since we're not rushing this to production right now, we can breathe a little and do this properly.

@problame
Contributor

Yeah, I think the "rename thing while we hold the layer map write lock" won't be too bad in terms of implementation effort.

Do we want to commit the revert ( #4104 ) in the meantime so the release is unblocked?

@koivunej
Member

Summarizing some of the discussion spread around everywhere here:

koivunej added a commit that referenced this pull request Apr 28, 2023
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.

Cc: #4094, #4088
vadim2404 pushed a commit that referenced this pull request Apr 28, 2023
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.

Cc: #4094, #4088
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
knizhnik pushed a commit that referenced this pull request May 15, 2023
Before finishing the delta file, and possibly overwriting an old
perfectly valid file, check if an identical file already exists in the
layer map.

This is an alternative for #4094.
Test case is copied from that PR.
@skyzh skyzh closed this in #4696 Jul 17, 2023
skyzh added a commit that referenced this pull request Jul 17, 2023
## Problem

Compactions might generate files of exactly the same name as before
compaction due to our naming of layer files. This could have already
caused some mess in the system, and is known to cause some issues like
#4088. Therefore, we now
consider duplicated layers in the post-compaction process to avoid
violating the layer map duplicate checks.

related previous works: closes #4094; error reported in: #4690, #4088

## Summary of changes

If a file already exists in the layer map before the compaction, do not
modify the layer map and do not delete the file. The file on disk at
that time should be the new one overwritten by the compaction process.

This PR also adds a test case with a fail point that produces exactly
the same set of files.

This bypassing behavior is safe because the produced layer files have
the same content / are the same representation of the original file.

An alternative might be directly removing the duplicate check in the
layer map, but I feel it would be good if we can prevent that in the
first place.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>