Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify commit-graph and multi-pack-index after writing #658

Merged
merged 5 commits into from
Jan 3, 2019

Conversation

derrickstolee
Copy link
Contributor

@derrickstolee derrickstolee commented Jan 2, 2019

Run the git commit-graph verify and git multi-pack-index verify commands after any command that would change those files. If these fail, then delete the corrupt file and rewrite.

We've had issues with users data in the past, and this gives us a way to automatically detect and repair these scenarios. The immediate rewrite should work since we will regenerate from the other Git data. Issues we've seen in the past are related to trusting the content in these files and carrying that data forward to future versions of the file.

@derrickstolee derrickstolee force-pushed the verify-maintenance branch 2 times, most recently from 7a22c4c to cd2f2fe Compare January 2, 2019 21:17
@derrickstolee
Copy link
Contributor Author

@jrbriggs pointed out that this would be better to do after every write to these files. Limits the window where there could be problems.

@jamill
Copy link
Member

jamill commented Jan 3, 2019

@jrbriggs pointed out that this would be better to do after every write to these files. Limits the window where there could be problems.

I like this suggestion. It also keeps our maintenance tasks more deterministic, in that it will run in response to generating these files, instead of at a different set of (seemingly random) times.

Do we have any ideas on what is causing the corruption? For instance, are we writing out corrupt files or, or is something else corrupting them after we generate them? If we are generating corrupt files, then this approach should let us identify it right away. If something else is corrupting them, then we might not pick it up as quickly...

@derrickstolee
Copy link
Contributor Author

Do we have any ideas on what is causing the corruption? For instance, are we writing out corrupt files or, or is something else corrupting them after we generate them? If we are generating corrupt files, then this approach should let us identify it right away. If something else is corrupting them, then we might not pick it up as quickly...

In the cases where I can get in contact with the user, the only ways the file could have been corrupted were on-disk bit flips or in-memory bit flips. We literally were validating the data right before sending it (via a buffer) to disk. Those users that ran memcheck found issues with their RAM.

The perf machine recently had an issue that was caught by the git multi-pack-index repack command, but the problem may have existed much earlier. I'm not sure what caused it, but we could have noticed right away had we verified right afterward.

@jrbriggs
Copy link
Member

jrbriggs commented Jan 3, 2019

If corruption occurs strictly after a write (e.g., midx verify would have passed immediately after the write, but would have failed a day later), I don't think it's within the scope of this project to fix. If we can detect corrupt data was written (e.g., a bit flip in memory was persisted to disk), we should mitigate.

@derrickstolee derrickstolee changed the title [RFC] Add VerifyObjectStoreStep [RFC] Verify commit-graph and multi-pack-index after writing Jan 3, 2019
@jeschu1
Copy link
Member

jeschu1 commented Jan 3, 2019

@derrickstolee Is it safe to assume the verify / write should be quick tasks?

Copy link
Member

@jrbriggs jrbriggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nits, and a question about the pattern.

GVFS/GVFS.Common/Maintenance/PackfileMaintenanceStep.cs Outdated Show resolved Hide resolved
this.GetPackFilesInfo(out int afterCount, out long afterSize, out hasKeep);
this.GetPackFilesInfo(out int afterCount, out long afterSize, out hasKeep);

GitProcess.Result verifyResult2 = this.RunGitCommand((process) => process.VerifyMultiPackIndex(this.Context.Enlistment.GitObjectsRoot));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verifyAfterRepackResult?

GVFS/GVFS.Common/Maintenance/PackfileMaintenanceStep.cs Outdated Show resolved Hide resolved
metadata["TryDeleteFileResult"] = this.Context.FileSystem.TryDeleteFile(multiPackIndexPath);
activity.RelatedError(metadata, "multi-pack-index is corrupt after write. Deleting and rewriting.");

this.RunGitCommand((process) => process.WriteMultiPackIndex(this.Context.Enlistment.GitObjectsRoot));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about writing the midx without validating. This would still leave enlistments broken if it was corrupted. Deletion without rewrite would leave the enlistment functional, but slow. What are your thoughts on how we should handle this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we were a big more aggressive and had WriteMultiPackIndex do the verification / retry. This way we can be sure the Write/Verify are always paired together.

I agree it seems like it would be good to know if the rewrite was successful. Would it make sense to make the Verification a warning if we succeed on the second try?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the primary question to answer is if we think regenerating the file will fix the corruption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples we've seen all have the problem that the multi-pack-index or commit-graph files were corrupted, and the problem persisted only because we continued to trust that data when we write a new one. Deleting the file and rewriting (based on the .pack and .idx files) has always fixed the issue.

If the rewrite did not fix the issue, then we would see failures from the same enlistment multiple times in a row. I plan to set alerts for any user hitting this even once, so we would be ready to intervene if this happens repeatedly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. So we expect the validate/rewrite cycle to fix the issue in all cases we have seen so far? With monitors in place, I'm happy with this pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will automatically fix all cases we have seen so far.

@derrickstolee
Copy link
Contributor Author

@derrickstolee Is it safe to assume the verify / write should be quick tasks?

They take about as long as it would take to write them from scratch. I've measured 11-16 seconds.

@derrickstolee derrickstolee changed the title [RFC] Verify commit-graph and multi-pack-index after writing Verify commit-graph and multi-pack-index after writing Jan 3, 2019
@derrickstolee derrickstolee merged commit a8e17e0 into microsoft:master Jan 3, 2019
@jrbriggs jrbriggs added this to the S147 milestone Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants