
Do not return Error in case of duplicate layer detection #4094

Closed
wants to merge 4 commits

Conversation

knizhnik
Contributor

Describe your changes

compact_level0_phase1 flushes the newly created L1 layers to disk, and only after that does compact_level0 remove the old layers. This guarantees that we do not lose data if compaction is interrupted.
If the pageserver is interrupted at that moment, then after the pageserver restarts the following layer states are possible:

  • Some subset of the new L1 layers is present on disk, plus all of the old L0 layers
  • All of the new L1 layers are present on disk, plus some subset of the old L0 layers

In either case, a subsequent compaction may produce duplicates of the L1 layers.
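
A simplified sketch of that ordering, showing the two crash windows; the names and bodies below are placeholders for illustration, not the actual pageserver code:

```rust
use std::io;

struct Layer {
    name: String,
}

// Simplified two-phase shape of L0 -> L1 compaction as described above.
fn compact_level0(old_l0_layers: Vec<Layer>) -> io::Result<()> {
    // Phase 1: write and fsync all new L1 layers before touching the L0s.
    let new_l1_layers = write_new_l1_layers(&old_l0_layers)?;
    // A crash here leaves all old L0 layers plus some subset of the new L1s.
    for l0 in &old_l0_layers {
        remove_layer_file(l0)?;
        // A crash here leaves all new L1 layers plus some subset of the old L0s.
    }
    // In either crash case, re-running compaction regenerates L1 layers with
    // the same file names, i.e. duplicates from the layer map's point of view.
    drop(new_l1_layers);
    Ok(())
}

fn write_new_l1_layers(_old_l0: &[Layer]) -> io::Result<Vec<Layer>> {
    Ok(Vec::new()) // placeholder for compact_level0_phase1
}

fn remove_layer_file(layer: &Layer) -> io::Result<()> {
    let _ = &layer.name;
    Ok(()) // placeholder
}
```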

Issue ticket number and link

Fixes a problem with the changes made in #3869
Discussion: https://neondb.slack.com/archives/C033QLM5P7D/p1682519017949719

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested review from a team as code owners April 27, 2023 12:46
@knizhnik knizhnik requested review from MMeent and hlinnaka and removed request for a team April 27, 2023 12:46
@github-actions

github-actions bot commented Apr 27, 2023

Test results for 5f39328:


debug build: 221 tests run: 211 passed, 0 failed, 10 skipped (full report)


release build: 221 tests run: 211 passed, 0 failed, 10 skipped (full report)


@koivunej
Member

This is related to #4088.

@koivunej koivunej self-requested a review April 27, 2023 14:05
Contributor

@hlinnaka hlinnaka left a comment


Another scenario to consider:

  1. Compaction writes an L1 layer to local disk.
  2. Crash before any L0 layers are deleted
  3. Restart
  4. The L1 layer is uploaded to S3
  5. The L1 layer is evicted from local disk
  6. Compaction starts again. It creates a new L1 layer with the same filename
  7. insert_historic_noflush() finds the duplicate

Now what? If we just log the error, you'll still have a RemoteLayer entry in the layer map, but the newly created file exists on local disk. I'm not sure what exactly happens then, but this is a very confused state.

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

  1. create the layer file with a temporary filename
  2. get write-lock on the layer map
  3. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
  4. Rename the file to the final filename
  5. Release the lock.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

Comment on lines 280 to 283
error!(
    "Attempt to insert duplicate layer {} in layer map",
    layer.short_id()
);
Contributor


This doesn't feel right. The caller presumably needs to do something differently, if the layer already existed. This needs to return some kind of an "AlreadyExistsError", and let the caller decide if that's OK or not.

Contributor Author


OK, I replaced Result<()> with Result<bool> to inform the caller that a duplicate was detected.
But right now it is not handled in any way.
In compaction, the only thing we can do when a duplicate layer is detected is not to send it to S3.
But that is also not correct, because the presence of a layer in the local directory doesn't mean that it was successfully uploaded to S3.
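
A rough sketch of what that signature change could look like; the LayerMap here is a string-keyed stand-in (and anyhow::Result is assumed as the error type), not the real pageserver implementation:

```rust
use std::collections::HashSet;

struct LayerMap {
    historic: HashSet<String>,
}

impl LayerMap {
    /// Returns Ok(true) if the layer was inserted, Ok(false) if a layer with
    /// the same name was already present (a duplicate was detected). The
    /// caller decides how to handle a duplicate.
    fn insert_historic_noflush(&mut self, layer_name: &str) -> anyhow::Result<bool> {
        if self.historic.contains(layer_name) {
            return Ok(false);
        }
        self.historic.insert(layer_name.to_owned());
        Ok(true)
    }
}
```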

Member

@koivunej koivunej left a comment


Did the first pass, this is looking good.

Just to reiterate my earlier points and what I understand from the logs of the test run:

The test shows that in the end the duplicate layer will be re-created, do I understand correctly? But the overwrite happens with a rename, so hopefully we will be overwriting equal file contents. It feels like we should assert that the contents are the same, but I am not sure we can afford that. Should the file contents differ, what do we do then? There seems to be no clear idea of what we would do in that case.

The theoretical worst case I see is that we would keep hold of an unnecessary layer, but like we discussed, it is very unlikely for the partitioning to change across the restart and for the pageserver to decide to create different L1 layers than it did previously. The L1_from_previous_round and L1_after_new_repartitioning could even overlap. Would that be a problem, should it be possible to end up in this situation?

@koivunej
Member

koivunej commented Apr 27, 2023

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

0. create the layer file with a temporary filename
1. get write-lock on the layer map
2. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
3. Rename the file to the final filename
4. Release the lock.

So we assume that whatever we rewrote this time around is equal or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to s3.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

I think one compaction at a time is a "physical constant" of our system. Timeline::compact_inner grabs Timeline::layer_removal_cs. Multiple compactions could race layer download attempts however, and that should be fine.

@koivunej
Member

The check-codestyle-rust failure should be gone on the next push or workflow run, as #4095 got merged.

@knizhnik
Contributor Author

Did the first pass, this is looking good.

Just to reiterate my earlier points and what I understand from the logs of the test run:

The test shows that in the end the duplicate layer will be re-created, do I understand correctly? But the overwrite happens with a rename, so hopefully we will be overwriting equal file contents. It feels like we should assert that the contents are the same, but I am not sure we can afford that. Should the file contents differ, what do we do then? There seems to be no clear idea of what we would do in that case.

It seems quite expensive to compare layer content.
Yes, it is intended to be the same.
But I do not think it should be checked.
And you are right - it is not clear what to do if they are different.
Right now we just rewrite the layer file.

The theoretical worst case I see is that we would keep hold of an unnecessary layer, but like we discussed, it is very unlikely for the partitioning to change across the restart and for the pageserver to decide to create different L1 layers than it did previously. The L1_from_previous_round and L1_after_new_repartitioning could even overlap. Would that be a problem, should it be possible to end up in this situation?

It should not be a problem: overlapping layers are expected.
This is why it is not so obvious why we spent so much effort on detecting duplicates - they are just a specific case of overlapping layers.

@knizhnik
Contributor Author

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

0. create the layer file with a temporary filename
1. get write-lock on the layer map
2. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
3. Rename the file to the final filename
4. Release the lock.

I do not understand why we need the lock here: rename is atomic, and the check for duplicates is done under the layer map lock.

So we assume that whatever we rewrote this time around is equal or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to s3.

We should upload the new one to S3 because there is no guarantee that the old one was uploaded.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

There is one compaction task, so how is it possible that multiple compactions are performed concurrently?

@hlinnaka
Contributor

The insert_historic_noflush() function needs to return an error or None or something to the caller, and the caller needs to decide what to do. I think the sequence is:

0. create the layer file with a temporary filename
1. get write-lock on the layer map
2. check if the layer map already contains a layer with the same name. If it does, delete the temporary file and leave the entry in the layer map alone.
3. Rename the file to the final filename
4. Release the lock.

I missed a critical step here: inserting the entry into the layer map. I meant that to be part of step 2, or between steps 2 and 3.

I do not understand why we need the lock here: rename is atomic, and the check for duplicates is done under the layer map lock.

Hmm, so the sequence would be:

  1. create the layer file with a temporary filename
  2. Acquire write-lock on the layer map
  3. Atomically check if the layer map already contains a layer with the same name. If it does not, insert it
  4. Release lock
  5. Rename the file to the final filename

Is that what you're thinking? There's a race condition: a concurrent getPage request could try to access the file between steps 3 and 4, and get ENOENT.
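
A sketch of that ordering with the problematic window marked; the RwLock<HashSet<String>> layer map and the path parameters are illustrative stand-ins, not the real pageserver types:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::sync::RwLock;

fn publish_layer_racy(
    layer_map: &RwLock<HashSet<String>>,
    temp_path: &str,
    final_path: &str,
    layer_name: &str,
) -> io::Result<()> {
    {
        // Check-and-insert under the layer map write-lock.
        let mut map = layer_map.write().unwrap();
        if map.contains(layer_name) {
            // Duplicate: drop the temp file, leave the layer map alone.
            fs::remove_file(temp_path)?;
            return Ok(());
        }
        map.insert(layer_name.to_owned());
        // Lock is released here when `map` goes out of scope.
    }
    // RACE WINDOW: the layer map already advertises the layer, but the file
    // still lives at `temp_path`; a concurrent getPage request that resolves
    // the entry now would look for `final_path` and hit ENOENT.
    fs::rename(temp_path, final_path)?;
    Ok(())
}
```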

So we assume that whatever we rewrote this time around is equal or better than what we wrote last time? No, this assumes we rewrote the exact same thing, because we wouldn't be uploading this new one to s3.

We should upload the new one to S3 because there is no guarantee that the old one was uploaded.

I don't like this "overwrite old file" business. Let's prevent the overwriting.

Or maybe there's some simpler way if we know there can be only one compaction running at a time.

There is one compaction task, so how is it possible that multiple compactions are performed concurrently?

Right, it's not possible currently. Ideally the code would be safe, even if we start doing parallel compactions in the future, or run GC, compaction or image layer creation in parallel with each other. What I meant is that if that's difficult, then maybe we rely on the fact that there can be only one compaction running at a time, after all.

@knizhnik
Contributor Author

Correctness of handling overlapping layers is based on the assumption that the intersecting parts of the layer files contain the same data. This is not checked in any way (because such a check would be too expensive), but if this precondition is violated, then page reconstruction may be performed incorrectly.

Duplicate layers are just an extreme case of overlapping layers, so they should also contain the same data. This is why it is equally correct to overwrite the old duplicate layer with the new one, or to keep the old layer and skip the new one.

Right now compaction is implemented in such a way that new L1 layers are first flushed to the disk, and only after that are they added to the layer map, which is when the check for duplicates is performed. I do not think it is critical that new layers are inserted into the layer map only after they are flushed to the disk, but I also do not see any problem with rewriting the duplicate layer file on the disk and not changing the layer map. If the file content is the same, it doesn't matter.

@hlinnaka
Contributor

Correctness of handling overlapping layers is based on the assumption that the intersecting parts of the layer files contain the same data. This is not checked in any way (because such a check would be too expensive), but if this precondition is violated, then page reconstruction may be performed incorrectly.

Duplicate layers are just an extreme case of overlapping layers, so they should also contain the same data. This is why it is equally correct to overwrite the old duplicate layer with the new one, or to keep the old layer and skip the new one.

Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.

Right now compaction is implemented in such a way that new L1 layers are first flushed to the disk, and only after that are they added to the layer map, which is when the check for duplicates is performed. I do not think it is critical that new layers are inserted into the layer map only after they are flushed to the disk, but I also do not see any problem with rewriting the duplicate layer file on the disk and not changing the layer map. If the file content is the same, it doesn't matter.

I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.

@knizhnik
Contributor Author

Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.

Yes, but if the two files are identical, what is the problem with overwriting the old file with the new one?
Both locally and in S3?
If it is done atomically, and there is no danger that because of a failure both files are lost, then what is wrong with overwriting the file? Why do we need to care about it?

I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.

So what is your suggestion?
If the content of the intersecting section of overlapping layers is different, then we can get different page reconstruction results depending on which trajectory of layer map traversal we choose. And the sequence of traversed layers is determined by the starting point - the target LSN. So assume that some page was last updated at lsn=LSN0. If get_page(LSN1) and get_page(LSN2) return different results for some LSN1>LSN0 and LSN2>LSN0, then we are in trouble, aren't we?
And as you mentioned, duplicate layers are an extreme case of overlapping layers.
And despite the fact that we are not able to store both of them, if their content is different then we can get wrong behavior no matter whether we keep the old version or replace it with the new one.

@hlinnaka
Contributor

Right, overlapping layers are supported, and you could indeed think of duplicate layers as just an extreme case of that. However, it breaks down because you cannot actually have two distinct layers with the same name. You can only have one file with the same filename on local disk, and you can only have one file with the same filename in S3.

Yes, but if the two files are identical, what is the problem with overwriting the old file with the new one? Both locally and in S3? If it is done atomically, and there is no danger that because of a failure both files are lost, then what is wrong with overwriting the file? Why do we need to care about it?

I don't want to bet on the file content being 100% the same. Yes, currently the code is fully deterministic, but I don't want to take a hard dependency on that.

So what is your suggestion?

  1. Write file with temporary filename
  2. Lock layer map
  3. Try to insert entry in layer map. If a duplicate layer exists, delete the temporary file and leave the layer map unmodified
  4. Rename the temporary file to have the correct filename
  5. Release lock

Holding the lock while you rename the file isn't ideal from a concurrency point of view, but I think it's good enough for now.
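
A minimal sketch of that sequence, holding the layer map write-lock across the rename; the lock, map type, and path parameters are stand-ins assumed for illustration, not the real pageserver structures:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::sync::RwLock;

fn publish_layer(
    layer_map: &RwLock<HashSet<String>>,
    temp_path: &str,
    final_path: &str,
    layer_name: &str,
) -> io::Result<()> {
    // 1. The layer file was already written to `temp_path` by the caller.
    // 2. Lock the layer map.
    let mut map = layer_map.write().unwrap();
    // 3. Try to insert; on a duplicate, delete the temp file and leave the
    //    layer map unmodified.
    if map.contains(layer_name) {
        fs::remove_file(temp_path)?;
        return Ok(());
    }
    map.insert(layer_name.to_owned());
    // 4. Rename to the final filename while still holding the lock, so no
    //    reader can observe the map entry before the file is in place.
    fs::rename(temp_path, final_path)?;
    // 5. Lock released when `map` goes out of scope.
    Ok(())
}
```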

If the content of the intersecting section of overlapping layers is different, then we can get different page reconstruction results depending on which trajectory of layer map traversal we choose. And the sequence of traversed layers is determined by the starting point - the target LSN. So assume that some page was last updated at lsn=LSN0. If get_page(LSN1) and get_page(LSN2) return different results for some LSN1>LSN0 and LSN2>LSN0, then we are in trouble, aren't we? And as you mentioned, duplicate layers are an extreme case of overlapping layers. And despite the fact that we are not able to store both of them, if their content is different then we can get wrong behavior no matter whether we keep the old version or replace it with the new one.

Two reasons:

  1. There is a difference between logically identical and physically identical files. The duplicates should indeed be logically identical, but they might not be physically identical. In other words, you can use either one to reconstruct the page, and you should end up with the same result, but you cannot read block 0 from one file, and block 1 from the other one. We might decide to materialize some pages differently, or we might have some non-deterministic behaviour in how we lay the bytes on the disk. The layer-file construction is currently deterministic AFAIK, but I do not want to rely on that in the future. We also store the file length in the index_part.json file, so if the old and the new files have different length, you'd get errors about that on restart. We have discussed adding a checksum there too.

  2. You can end up in a confused state in the layer map. Take the example I gave earlier: the old file had already been uploaded to S3 and evicted from local disk. Now you create a new layer with the same filename. You successfully create the layer, but skip the insertion into the layer map because there was already a RemoteLayer entry for the same filename. Now you have a file present locally, but the layer map claims it's a RemoteLayer. What does our on-demand download code do in that case, if you try to download the remote layer later? And because the layer map doesn't know that the file exists locally, it won't be taken into account in physical size calculations and layer eviction.

You could probably find answers to all of those questions and make it safe. But that's a lot of new assumptions and restrictions for future code that you'd need to document and make sure we uphold now and in the future. It is much simpler to just not overwrite the layer in the first place.

@knizhnik
Contributor Author

It is much simpler to just not overwrite the layer in the first place.

I agree with your arguments.
It seems to me that the simplest solution is not to avoid overwriting the layer, but quite the opposite - to overwrite it both on disk and in the layer map.
In this case we will have the new layer in both places, and they will be identical.

Right now we write new layers to the disk (using a temp file for an atomic rename), then fsync them to the disk, and then add them to the layer map.

What you have suggested requires rewriting all of this code. And I am not sure that the sequence you proposed is absolutely correct.
If you add a layer to the layer map before flushing it to the disk, then some other task (GC) can perform some action based on the presence of this layer, and after a subsequent crash this file may be lost. So we can lose data.

@koivunej
Member

(Made and quickly removed a bogus comment about insert_historic_noflush needing to use the buffer which is committed/flushed on drop by BufferedUpdates -- I just had a brief misunderstanding before checking the other methods at this level. You probably got email spam.)

@hlinnaka
Contributor

It is much simpler to just not overwrite the layer in the first place.

I agree with your arguments. It seems to me that the simplest solution is not to avoid overwriting the layer, but quite the opposite - to overwrite it both on disk and in the layer map. In this case we will have the new layer in both places, and they will be identical.

I guess that would work better, but it feels silly to overwrite a perfectly good file. I would still worry if the contents of the old and new file are not physically identical.

Right now we write new layers to the disk (using a temp file for an atomic rename), then fsync them to the disk, and then add them to the layer map.

What you have suggested requires rewriting all of this code. And I am not sure that the sequence you proposed is absolutely correct. If you add a layer to the layer map before flushing it to the disk, then some other task (GC) can perform some action based on the presence of this layer, and after a subsequent crash this file may be lost. So we can lose data.

hmm, ok I see that the sequence for writing the files during compaction is a bit more complicated than I remembered. It's currently like this:

  1. Create all new L1 delta files. Rename them in place one by one
  2. Fsync them all
  3. Acquire write-lock on layer map
  4. Insert all new L1 files to the layer map, and schedule uploads for them

If we rely on the fact that there can be only one compaction running at a time, we can do this:

  1. Create all new L1 delta files. For each file, before renaming it in place, check if there is a duplicate in the layer map. If there is, skip it.
  2. Fsync them all
  3. Acquire write-lock on layer map
  4. Insert all new L1 files to the layer map, and schedule uploads for them
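
A rough sketch of that adjusted sequence, relying on there being only one compaction running at a time; the names, types, and the string-keyed layer map are assumptions made for illustration only:

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::sync::RwLock;

struct NewDelta {
    temp_path: String,
    final_path: String,
    layer_name: String,
}

fn install_new_l1_layers(
    layer_map: &RwLock<HashSet<String>>,
    new_deltas: Vec<NewDelta>,
) -> io::Result<()> {
    let mut installed = Vec::new();
    // 1. Before renaming each new L1 file in place, check the layer map for
    //    a duplicate; if one exists, keep the existing layer and discard the
    //    freshly written file.
    for delta in new_deltas {
        if layer_map.read().unwrap().contains(&delta.layer_name) {
            fs::remove_file(&delta.temp_path)?;
            continue;
        }
        fs::rename(&delta.temp_path, &delta.final_path)?;
        installed.push(delta.layer_name);
    }
    // 2. Fsyncing the files and their directory is elided in this sketch.
    // 3./4. Take the write-lock once and insert the survivors; this is also
    //       where uploads to remote storage would be scheduled.
    let mut map = layer_map.write().unwrap();
    for name in installed {
        map.insert(name);
    }
    Ok(())
}
```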

Member

@koivunej koivunej left a comment


So this would restore the previous functionality but replace the L0, yet still error out? I don't understand that part.

EDIT: I think this would mean that we would have one L0 instance in the historic part of the layer map and a different instance in LayerMap::l0_layers, which I think would be bad, because we have eviction-related assumptions that in-memory layers map 1:1 to disk layers.

Other than that, it does seem OK as far as checking for duplicates and warning on adding. I have to admit it took me a while to see the negated !Self::compare_arced_layers(...), which I still find problematic. Tests for insert would probably give me confidence.

@hlinnaka
Contributor

hlinnaka commented Apr 28, 2023

If we rely on the fact that there can be only one compaction running at a time, we can do this:

  1. Create all new L1 delta files. For each file, before renaming it in place, check if there is a duplicate in the layer map. If there is, skip it.

  2. Fsync them all

  3. Acquire write-lock on layer map

  4. Insert all new L1 files to the layer map, and schedule uploads for them

See #4106 for how that would look. I ran the new test case that this PR adds with it, but haven't done any other testing.

@problame
Contributor

@hlinnaka why can't we do the rename thing while we hold the layer map write lock? Because it requires more code changes?

@hlinnaka
Contributor

@hlinnaka why can't we do the rename thing while we hold the layer map write lock? Because it requires more code changes?

Right, it would require more code changes. Maybe we should try it and see how bad it is, though. Since we're not rushing this to production right now, we can breathe a little and do this properly.

@problame
Contributor

Yeah, I think the "rename thing while we hold the layer map write lock" won't be too bad in terms of implementation effort.

Do we want to commit the revert ( #4104 ) in the meantime so the release is unblocked?

@koivunej
Member

Summarizing some of the discussion spread around everywhere here:

koivunej added a commit that referenced this pull request Apr 28, 2023
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.

Cc: #4094, #4088
vadim2404 pushed a commit that referenced this pull request Apr 28, 2023
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.

Cc: #4094, #4088
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
knizhnik pushed a commit that referenced this pull request May 15, 2023
Before finishing the delta file, and possibly overwriting an old
perfectly valid file, check if an identical file already exists in the
layer map.

This is an alternative for #4094.
Test case is copied from that PR.
@skyzh skyzh closed this in #4696 Jul 17, 2023
skyzh added a commit that referenced this pull request Jul 17, 2023
## Problem

Compactions might generate files of exactly the same name as before
compaction due to our naming of layer files. This could have already
caused some mess in the system, and is known to cause some issues like
#4088. Therefore, we now
consider duplicated layers in the post-compaction process to avoid
violating the layer map duplicate checks.

related previous works: closes #4094; error reported in: #4690, #4088

## Summary of changes

If a file already exists in the layer map before the compaction, do not
modify the layer map and do not delete the file. The file on disk at
that time should be the new one overwritten by the compaction process.

This PR also adds a test case with a fail point that produces exactly
the same set of files.

This bypassing behavior is safe because the produced layer files have
the same content / are the same representation of the original file.

An alternative might be directly removing the duplicate check in the
layer map, but I feel it would be good if we can prevent that in the
first place.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>