internal/metamorphic/crossversion/crossversion_test: TestMetaCrossVersion failed #2325
Potential fallout from the new objstorage interface?
Pretty weird: we try to delete a file, but then later, as part of a compaction, it looks like we're trying to work with an obsolete sstable.
Looks like a jobID of 1 is only used during Open. So we try to delete the file during Open, but then later schedule a rewrite compaction for the same file?
In Open, we ratchet the format major version (which can schedule rewrite compactions), and then we delete obsolete files at the end of the Open function (after the ratchet). I think I understand what's going on here. Not a major bug.
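If that reading is right, the problematic ordering inside Open is roughly the following. This is a minimal standalone sketch with hypothetical helper names (`listDir`, `ratchetFormatMajorVersion`, `scanObsoleteAndDelete`), not Pebble's actual code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the relevant steps in Open.
func listDir() []string { return []string{"000004.sst", "000007.sst"} }

func ratchetFormatMajorVersion() {
	// May schedule rewrite compactions, which run in the background and
	// can delete now-obsolete sstables.
	fmt.Println("ratcheted format major version")
}

func scanObsoleteAndDelete(listing []string) {
	// Consumes the earlier listing, so it can try to delete a file that a
	// background compaction already removed.
	fmt.Println("deleting obsolete files from stale listing:", listing)
}

func main() {
	listing := listDir()           // 1. directory listing taken early in Open
	ratchetFormatMajorVersion()    // 2. may kick off rewrite compactions
	scanObsoleteAndDelete(listing) // 3. works off the stale listing
}
```

Relisting between steps 2 and 3 shrinks the window, but as discussed below, compactions scheduled by the ratchet can still be running concurrently.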
No, that change is fine. I believe the bug is an interaction between (1) 7a29c5b, (2) https://github.com/cockroachdb/pebble/blame/master/open.go#L254, and (3) https://github.com/cockroachdb/pebble/blame/master/open.go#L472.
I think the fix is just to relist the files before scanning for obsolete ones. What do you think?
Good find, that makes sense.
I'm a little wary of racing with concurrent compactions even if we relist the files. The ratcheting of the format major version can schedule compactions, and those compactions may still be running in the background after the ratcheting completes. Those concurrent compactions may also delete obsolete files. Before 163fdf6, we wouldn't schedule new compactions until the very end of Open: https://github.com/cockroachdb/pebble/blame/master/open.go#L485-L486. Should we delete …
Yea, I think that makes sense, but it looks like we'd also have to delay writing the Options file. On the other hand, we have some contradictory behaviour here: https://github.com/cockroachdb/pebble/blob/master/objstorage/provider.go#L239 and here: https://github.com/cockroachdb/pebble/blob/master/objstorage/provider.go#L216. The caller of Remove ignores such errors. Should we keep the old behaviour and allow deletion of files which have already been deleted? What do you think? cc @RaduBerinde
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ 5d8daed0ac88:
Yes, I think we should fix that. Also, I have a TODO in …
@RaduBerinde We'd have to relist anyway, I think. The write-ahead logs (which are associated with memtables) could also become obsolete (due to flushes during WAL replay). And are you saying we should allow duplicate file deletions (like we currently do), or that we shouldn't? I can finish this either way, since the fix will be small.
Only tables go through objstorage; WALs would be unaffected by my changes. I do agree that objstorage should return not-exist. But in any case, we also need to get the list of tables from the objstorage provider (which by itself would also fix this).
Return a NotExistError when the object is not in our list. This is important for Remove, where such an error is ignored. Informs cockroachdb#2325
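As a rough illustration of the behavior that commit describes (illustrative names only, using the stdlib's `fs.ErrNotExist` rather than Pebble's actual oserror helpers):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// removeObject is an illustrative stand-in for the provider's Remove: if the
// object is not in our list, return an error satisfying
// errors.Is(err, fs.ErrNotExist) instead of silently succeeding.
func removeObject(known map[string]bool, name string) error {
	if !known[name] {
		return fmt.Errorf("object %q: %w", name, fs.ErrNotExist)
	}
	delete(known, name)
	return os.Remove(name)
}

func main() {
	known := map[string]bool{}
	// A caller that tolerates duplicate deletions ignores only the
	// not-exist case; any other failure still propagates.
	if err := removeObject(known, "000012.sst"); err != nil && !errors.Is(err, fs.ErrNotExist) {
		panic(err)
	}
	fmt.Println("duplicate delete tolerated")
}
```

The point is that a duplicate delete becomes observable and deliberately ignorable, rather than a silent success.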
I opened #2335 to return a not-exist error in that case. By the way, the failure from yesterday looks different: #2325 (comment)
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ 9b1142a5070e:
Oh true, WAL deletion should be unaffected. Yea, I think just allowing duplicate deletions of a file should fix this. I'll take a look at the other failure and determine what happened there.
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ ebc640d19e50:
Yes, I checked the second failure, which looked different, but it's the same error in the logs:
But at the end of the logs I also see:
We hit a panic while trying to … The actual bug in the code is fixed, but we can leave this open until I investigate the problem in the test harness.
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ efd802f16697:
^ This is a new failure.
There's a similar-looking error, but instead of deleting a file twice, we're trying to open a deleted file:
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ 79eb94777b28:
Haven't had a chance to investigate the new failure. Will look at it once I respond to the review on #2352.
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ bc4c9afe47a5:
Another missing sstable. This is concerning; we should identify the cause soon. @bananabrick do you still want to take it, or do you want to punt it to me, or maybe to @RaduBerinde given the objstorage context?
@jbowens I'm looking now. |
Probably caused by #2356, but not sure why.
@RaduBerinde Yea, it started failing when #2347 was merged. Posting what I gathered after a first look at the logs.
We see an sstable-created message for sstable …
After that we delete the sstable. Then another compaction, job 9, tries to read the file.
We know that file …
OOC, how did you get to those logs? I've been looking through the artifacts but I'm a little lost.
You can click on …
@RaduBerinde It wasn't #2356; the failures started when #2347 was merged.
In the latest failure log I see:
then
We probably need to …
Oh, but we are stopping compactions and waiting for any to complete, so there shouldn't be any changes to the live set or the files. And this is during Open, so nothing else is going on.
The live file numbers are determined from the latest version: https://github.com/cockroachdb/pebble/blob/master/version_set.go. So, …
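In essence (a simplified sketch, not the actual version_set.go logic), the obsolete-file scan diffs the on-disk file numbers against the live set of the latest version, which is why a stale view of either side can produce a bogus delete:

```go
package main

import "fmt"

// obsolete returns the on-disk file numbers that are absent from the live
// set of the latest version. Illustrative only.
func obsolete(onDisk []uint64, live map[uint64]bool) []uint64 {
	var out []uint64
	for _, n := range onDisk {
		if !live[n] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	onDisk := []uint64{4, 7, 9}
	live := map[uint64]bool{4: true, 9: true}
	fmt.Println(obsolete(onDisk, live)) // [7]
}
```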
Might be easier to reproduce this first. I'll try to do that.
Shouldn't this be a loop?
In the log I'm looking at, there are two compactions going on right before the delete.
Oh, I think that's right. Racing with compactions perfectly explains the bug. We're reacquiring the lock, but not ensuring that the condition is met. If there are two compactions going on, then one of them will Broadcast the condition variable, and we'll proceed with the scan step even when the …
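For reference, here is the standard `sync.Cond` pattern the fix needs, as a small standalone sketch (hypothetical setup, not Pebble's code): the wait must re-check its predicate in a loop, because the first Broadcast can arrive while a second compaction is still in flight.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	inProgress := 2 // two concurrent compactions, as in the failing log

	// Each "compaction" decrements the counter and broadcasts when done.
	for i := 0; i < 2; i++ {
		go func(d time.Duration) {
			time.Sleep(d)
			mu.Lock()
			inProgress--
			mu.Unlock()
			cond.Broadcast()
		}(time.Duration(i+1) * 10 * time.Millisecond)
	}

	mu.Lock()
	// Buggy form: `if inProgress > 0 { cond.Wait() }` — the first Broadcast
	// wakes us while the second compaction is still running, and we would
	// scan for obsolete files too early. The loop re-checks the predicate
	// after every wakeup.
	for inProgress > 0 {
		cond.Wait()
	}
	mu.Unlock()
	fmt.Println("all compactions done; safe to scan for obsolete files")
}
```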
This makes sense to me and will probably fix the bug, but we should be a bit more principled and try to reproduce it to confirm.
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ fccb83b60d5c:
internal/metamorphic/crossversion/crossversion_test.TestMetaCrossVersion failed with artifacts on refs/heads/master @ 02d779ffbc46:
I'm going to close this out, as this will most likely be fixed by #2360.