Verify commit-graph and multi-pack-index after writing #658
Conversation
Force-pushed from 7a22c4c to cd2f2fe
@jrbriggs pointed out that this would be better to do after every write to these files. That limits the window where there could be problems.
I like this suggestion. It also keeps our maintenance tasks more deterministic, in that it will run in response to generating these files instead of at a different set of (seemingly random) times. Do we have any ideas on what is causing the corruption? For instance, are we writing out corrupt files, or is something else corrupting them after we generate them? If we are generating corrupt files, then this approach should let us identify it right away. If something else is corrupting them, then we might not pick it up as quickly...
In the cases where I could get in contact with the user, the only ways the file could have been corrupted were on-disk bit flips or in-memory bit flips. We were literally validating the data right before sending it (via a buffer) to disk. The users that ran memcheck found issues with their RAM. The perf machine recently had an issue that was caught by the
If corruption occurs strictly after a write (e.g., `midx verify` would have passed immediately after the write but failed a day later), I don't think it's within the scope of this project to fix. If we can detect that corrupt data was written (e.g., a bit flip in memory was persisted to disk), we should mitigate.
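To make the distinction concrete: a verify run immediately after the write catches corruption persisted at write time, while later failures point at the disk or RAM. A minimal shell sketch of that pairing, using stock Git commands (the throwaway repo and paths below are illustrative scaffolding, not GVFS code):

```shell
#!/bin/sh
set -e
# Illustrative only: create a throwaway repo with one pack so the
# multi-pack-index commands have something to operate on.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m "seed"
git repack -a -d --quiet

# Write the multi-pack-index, then verify it immediately.
git multi-pack-index write --object-dir .git/objects

# A failure here means corrupt data was persisted at write time,
# rather than the file being corrupted later on disk.
git multi-pack-index verify --object-dir .git/objects \
    && echo "multi-pack-index verified after write"
```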
Force-pushed from cd2f2fe to 0772173
@derrickstolee Is it safe to assume the verify / write should be quick tasks?
Naming nits, and a question about the pattern.
```csharp
this.GetPackFilesInfo(out int afterCount, out long afterSize, out hasKeep);

GitProcess.Result verifyResult2 = this.RunGitCommand((process) => process.VerifyMultiPackIndex(this.Context.Enlistment.GitObjectsRoot));
```
`verifyAfterRepackResult`?
```csharp
metadata["TryDeleteFileResult"] = this.Context.FileSystem.TryDeleteFile(multiPackIndexPath);
activity.RelatedError(metadata, "multi-pack-index is corrupt after write. Deleting and rewriting.");

this.RunGitCommand((process) => process.WriteMultiPackIndex(this.Context.Enlistment.GitObjectsRoot));
```
Same comment about writing the midx without validating. This would still leave enlistments broken if it was corrupted. Deletion without rewrite would leave the enlistment functional, but slow. What are your thoughts on how we should handle this?
What if we were a bit more aggressive and had WriteMultiPackIndex do the verification / retry? This way we can be sure the write/verify are always paired together.
I agree it seems like it would be good to know if the rewrite was successful. Would it make sense to make the verification a warning if we succeed on the second try?
I guess the primary question to answer is whether we think regenerating the file will fix the corruption.
The examples we've seen all have the problem that the multi-pack-index or commit-graph files were corrupted, and the problem persisted only because we continued to trust that data when writing a new one. Deleting the file and rewriting it (based on the .pack and .idx files) has always fixed the issue.
If the rewrite did not fix the issue, then we would see failures from the same enlistment multiple times in a row. I plan to set alerts for any user hitting this even once, so we would be ready to intervene if this happens repeatedly.
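The delete-and-rewrite mitigation could be sketched roughly as below, in shell rather than the GVFS `GitProcess` wrappers; the object-directory default is an assumption, and this expects to run inside a repository that already has a multi-pack-index:

```shell
#!/bin/sh
# Sketch: on verification failure, delete the multi-pack-index and
# rebuild it from the .pack/.idx files, which are still trusted.
objdir="${1:-.git/objects}"

if ! git multi-pack-index verify --object-dir "$objdir"; then
    echo "multi-pack-index is corrupt; deleting and rewriting" >&2
    rm -f "$objdir/pack/multi-pack-index"
    # The rewrite reads only the pack .idx files, so it does not
    # carry corrupt data forward from the old multi-pack-index.
    git multi-pack-index write --object-dir "$objdir"
fi
```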
Makes sense. So we expect the validate/rewrite cycle to fix the issue in all cases we have seen so far? With monitors in place, I'm happy with this pattern.
Yes, this will automatically fix all cases we have seen so far.
They take about as long as it would take to write them from scratch. I've measured 11-16 seconds.
Force-pushed from ed8d7d9 to e698810
Run the `git commit-graph verify` and `git multi-pack-index verify` commands after any command that would change those files. If these fail, then delete the corrupt file and rewrite it.
We've had issues with users' data in the past, and this gives us a way to automatically detect and repair these scenarios. The immediate rewrite should work since we regenerate from the other Git data. The issues we've seen in the past are related to trusting the content in these files and carrying that data forward into future versions of the file.
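Putting the description above together, the verify-then-repair step for both files might look like this in shell using stock Git commands (the actual change goes through GVFS's `GitProcess` wrappers, and the object-directory path here is an assumption):

```shell
#!/bin/sh
# After any command that rewrites these files, verify each one.
# On failure, delete the corrupt file and write it again from the
# underlying commit and pack data.
objdir="${1:-.git/objects}"

git commit-graph write --reachable --object-dir "$objdir"
if ! git commit-graph verify --object-dir "$objdir"; then
    rm -f "$objdir/info/commit-graph"
    git commit-graph write --reachable --object-dir "$objdir"
fi

git multi-pack-index write --object-dir "$objdir"
if ! git multi-pack-index verify --object-dir "$objdir"; then
    rm -f "$objdir/pack/multi-pack-index"
    git multi-pack-index write --object-dir "$objdir"
fi
```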