-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-16423. S3Guard fsck: Check metadata consistency between S3 and metadatastore (log) #1208
Conversation
...s/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3GuardFsckViolationHandler.java
Outdated
Show resolved
Hide resolved
...s/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3GuardFsckViolationHandler.java
Outdated
Show resolved
Hide resolved
Missing things:
|
bd74814
to
3765377
Compare
...s/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3GuardFsckViolationHandler.java
Outdated
Show resolved
Hide resolved
6e8779b
to
79fa8b1
Compare
|
Overallthere are checks, but as it doesnt recurse from me it's hard to valide them. The UX can be improved. I propose:
operationsfailed on root entry.
with a path, I got the same message twice
missing file.
This is good. Add a test for it. s3a://bucket/.. This is bad. Add test and then fix.
a check of s3a://guarded-table/example/.. shows qualification is taking place; it is the root dir where things break. unknown bucket. I'd like the failure message, and no need for the usage string as it is unrelated
For a missing path in the valid fs, a failure. ()
I'd expect a check to flag the file is missing in s3guard and s3, and if there's a tombstone in s3 to catch that and consider it a check failure. now, prepared a dir with
on both root and etc/ it failed with a message about etags. these are directories. etags should not be needed.
Valid file is good. I think we can/should print size, timestamp, etag and version, for more info
now purge the ddb table and repeat Prune didn't work. this looks like a prune problem as it does the same for tombstones and DDB shows a lot of them. And I'd have expected to see some debug level logging.
Manually delete from the AWS console and fsck the file
recover into ddb then retry. now fails on version id mismatch
the two IDs match; I don't know what the problem is. And if it was a real failure, I'd like to know the rest of the details about the entry. |
d6d357a
to
e402651
Compare
e402651
to
cdbd0d2
Compare
working on the CLI, but overreporting errors, especially on versioning Scan of a dir did work, but it overreacts to
Reports no etag on directory entries where we don't expect one:
On a scan of a tree it reports version ID mismatches where s3 == null:
The ddb table has the version ID, but I'm assuming that the scan doesn't get them from S3 because we'd need to use HEAD over LIST. When I give the full path
Note also that the file gets scanned twice. This hints at the scanning playing up when the supplied path is a file, not a dir. Now I open the file with
side issue: what to do when the path supplied is for a file which has a tombstone in DDB and no file? Currently it's FNFE
Are we confident that this command will do a check if there is a file in S3 but tombstoned in MS? |
OK, latest patch is better.
|
e.g
The good news: the return code is 0; it passed the scan. So these are just info rather than warn |
Did another test run on an unversioned bucket where the the DDB table was built up with an ls -R, so filled up straight from S3. all checks happy (e.g modtime) but still warning of null etags on all directories, including the root one.
|
got a test run failure in a new test
|
and
|
The testCLIFsckWithParam test works standalone. Looks to me like a race condition -the fsck is being performed on an active bucket and files have been deleted between listed and queued for scanning and the actual scan.
But: the failure could be delayed and reported as a missing file, rather than triggering a fast failure. |
OK, latest review is good with etags; modtime is something we can worry about as an extra iteration. Tested on a store which is set up for auth listings and is clearly considered inconsistent. It warns me of this, maybe in too much detail.
It's noisy, but, well, that could be tuned a bit by cutting back on the amount of the S3AFileStatus fields to print. What is clear is: it found real problems where the DDB was incomplete. But: exit code was still 0. I think we should return something in the case where there is a mismatch of this scale. BTW, a |
Overall then, last iteration has a working CLI
|
🎊 +1 overall
This message was automatically generated. |
…sionID mismatch, added severity (not used enywhere yet); Change-Id: I4e327bb172663b5da247789d19053e6d54e88a1e
…sing the class defined in the enum instead. Change-Id: I2debc18e70af54ed08d0382bf42e0e11e3100603
Change-Id: Ifcaaf2ca2027e81f3be0dc1337b34aa315b8d5c1
…oolDynamoDB#testCLIFsckWithParam Change-Id: I6bbb331b6c0a41c61043e482b95504fda8a50596
Change-Id: I611e7421ba061d1048bd6bb182f5238f810a400a
…ed because of readability. Change-Id: I0660c181ec07e2c0addd906fdac41b92169283f5
Change-Id: I751b1520070836894b0667eff7861d0eb760a4a3
… and working. May need some more fine-tuning. Change-Id: I11df42693da9911738dbc031e74f418d487b2460
Change-Id: I8330f9c562ab3e3335a2aea7a85446643ce4fa8c
Change-Id: Iaff8875a7ca238639c105537c3268bfb212189e2
Change-Id: Ife89007fdc028aa49abe0ed6441f95e08078688f
Change-Id: I2ce69d66e348c4c0aded9bc8cf273e0c3a44f580
Change-Id: Ib933b0cfee6fd5dd9da0a062b5f81c26e94d383c
Change-Id: I98da340813d826acdf21e13698942c9cde09f192
42f44c6
to
633d6cd
Compare
last test run:
|
my branch with changes to the assertion https://github.com/steveloughran/hadoop/tree/incoming/HADOOP-16423-fsck-log-changes |
🎊 +1 overall
This message was automatically generated. |
Change-Id: Ic31cbd5925b92df6c421012a3b91497d16aa6bef
🎊 +1 overall
This message was automatically generated. |
ok, full test with ddb non-auth happy; repeating with auth Tomorrow I'll build the CLI and run the various manual operations which were failing |
+1As well as all the automated tests, did some manual command line operations.
All outcomes were as expected I'm happy with this FollowupOne of the changes with the HADOOP-16430 PR is that we now have an S3A FS method This can be a follow-on patch, because as usual it will need more tests, in the code, and someone to try out the command line. |
Created followup: https://issues.apache.org/jira/browse/HADOOP-16563 |
… metadatastore (log) (apache#1208). Contributed by Gabor Bota. Change-Id: I6bbb331b6c0a41c61043e482b95504fda8a50596 (cherry picked from commit 4e273a3)
… metadatastore (log) (apache#1208). Contributed by Gabor Bota. Change-Id: I6bbb331b6c0a41c61043e482b95504fda8a50596
… metadatastore (log) (apache#1208). Contributed by Gabor Bota. Change-Id: I6bbb331b6c0a41c61043e482b95504fda8a50596
No description provided.