Startup crash trying to read key from corrupted data file [JIRA: RIAK-1819] #212
Comments
Which version of Riak introduced this issue? We have a Riak installation on some apparently very unreliable hardware, where this has bitten us a couple of times over the last month. I'd like to downgrade that setup to a "pre-optimization" version.
@krestenkrab It seems I was wrong about the origin of the missing CRC checks when reading only keys from data files instead of hintfiles. The optimization I was thinking of introduced variable chunk sizes while reading the file, but the code all the way back to pre-1.4 was already skipping the value and not computing the CRC: https://github.com/basho/bitcask/blob/1.5.0/src/bitcask_fileops.erl#L370 So apparently we have been vulnerable to corruption in that scenario for a long time.
IIRC, this just preserves the original behavior, so I am not sure there is any before.
This was fixed in #214, included in Bitcask 1.7.2 and shipped with Riak 2.0.6. It will ship in the upcoming 2.1 point release.
I got the same error when one of our Riak 2.1.1 nodes went down (only the Riak process went down, not the machine itself), and now I can't start Riak. It fails with the following error:
I've updated Riak to 2.1.3 on the failed node and it started successfully.
On startup, when a hintfile is not available or does not pass the CRC check, Bitcask loads keys directly from the data file. When it does, it does not validate each record in the data file against its CRC. This changed as part of an optimization introduced last year that skips over values when reading the data file only to load keys. Unfortunately, this means it cannot use the CRC to verify the validity of the key or value, and it may end up with crash errors like the one below. Since this is only an optimization, and it only applies when hintfiles are not available, I propose removing it so that corruption is handled properly.
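To make the proposal concrete, here is a minimal sketch (in Python, not Bitcask's Erlang) of loading keys from a data file while checking every record against its CRC. The record layout, `encode_record`, and `fold_keys` are assumptions made up for illustration and are not taken from `bitcask_fileops.erl`. The point is only that the whole record, value included, has to be read before its key can be trusted.

```python
import struct
import zlib

# Hypothetical record layout (an assumption for illustration, not
# necessarily Bitcask's exact on-disk format):
#   crc32 (4 bytes) | timestamp (4) | key_size (2) | value_size (4) | key | value
# The CRC covers every byte after the CRC field itself.
HEADER = struct.Struct(">IIHI")


def encode_record(timestamp, key, value):
    """Build one record with a CRC over the header remainder, key and value."""
    body = struct.pack(">IHI", timestamp, len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body


def fold_keys(data):
    """Yield (key, offset) for each record whose CRC checks out.

    Stops at the first truncated or corrupt record instead of trusting
    the sizes stored in its header.
    """
    offset = 0
    while offset + HEADER.size <= len(data):
        crc, _ts, key_size, value_size = HEADER.unpack_from(data, offset)
        end = offset + HEADER.size + key_size + value_size
        if end > len(data):
            break  # truncated record
        body = data[offset + 4:end]  # everything after the CRC field
        if zlib.crc32(body) != crc:
            break  # corrupt record: do not load its key
        key_start = offset + HEADER.size
        yield data[key_start:key_start + key_size], offset
        offset = end
```

Stopping at the first bad record is one possible policy. Once a header is corrupt its size fields cannot be trusted, so the sketch does not try to resynchronize and skip ahead.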
It may also be worth wrapping the key conversion code in a try/catch to guard against loading a key in an unhandled format. This could happen as part of a downgrade or a similar situation.
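The guard could look roughly like the following sketch, again in Python, with `try`/`except` standing in for Erlang's try/catch. `decode_key`, its version-tag format, and `safe_decode_keys` are hypothetical and only illustrate the idea of rejecting an undecodable key rather than crashing at startup.

```python
def decode_key(raw):
    """Hypothetical key transform: the first byte is a format version tag."""
    if not raw or raw[0] != 1:
        raise ValueError("unknown key format")
    return raw[1:]


def safe_decode_keys(raw_keys):
    """Decode keys, collecting undecodable ones instead of crashing."""
    decoded, rejected = [], []
    for raw in raw_keys:
        try:
            decoded.append(decode_key(raw))
        except ValueError:
            # Unhandled format, e.g. written by a newer version before a downgrade.
            rejected.append(raw)
    return decoded, rejected
```

Whether rejected keys should be logged, skipped, or should cause the file to be marked as bad is left open here.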