Lock resolution failed #660
Comments
fixed issue with self rollback seen when looking into #660
I am not sure if this is 100% the same thing, but on Fluo 2.0.0-SNAPSHOT we ran into something that looks similar. This occurred on a single test server that was likely very overloaded at the time of failure. When trying to scan we see that exception
When looking at the raw data we see the lock to the primary
But maybe in this case the primary does exist?
Are there any recovery tools or processes for cleaning up a situation like this? Thanks for any advice anyone may have. cc @wjsl
@jaredwinick would the IllegalStateException happen every time the data was scanned? I want to confirm this is an issue with the persisted data and not a transient issue in the code. It seems like a problem with the data. What command did you run to see the following output:
I don't think so.
It’s a part of the data when we scan it.
> On Sep 1, 2021, at 10:01 AM, Keith Turner wrote:
> @jaredwinick would the IllegalStateException happen every time the data was scanned? I want to confirm this is an issue with the persisted data and not a transient issue in the code. It seems like a problem with the data.
> What command did you run to see the following output:
> record:Alerts:4:000021178 :10 [] 143265458-LOCK record:Alerts:4:000021178 10 WRITE NOT_DELETE NOT_TRIGGER c
> Are there any recovery tools or process for cleaning up a situation like this?
> I don't think so.
It was
The values are big, so I just copied what looked like the relevant part of the output.
There was a bug where lock recovery failed if a column family was empty and the qualifier was not. The underlying cause was a bug in the Column class: its isFamilySet() method returned false when the family was set to the empty string, which caused the lock recovery code to create an incorrect range. The Column class was relying on internal behavior of the Bytes class that probably changed and caused this bug.

This commit adds a new IT that recreates this bug. If the new IT is run without the fix to the Column class, it fails as follows.

```
Running org.apache.fluo.integration.impl.FailureIT
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 8.011 sec <<< FAILURE! - in org.apache.fluo.integration.impl.FailureIT
testRecoverEmptyColumn(org.apache.fluo.integration.impl.FailureIT) Time elapsed: 7.096 sec <<< ERROR!
java.lang.IllegalStateException: can not abort : bob bal 5 (UNKNOWN)
	at org.apache.fluo.integration.impl.FailureIT.testRecoverEmptyColumn(FailureIT.java:688)
```
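To illustrate the distinction the commit message describes, here is a minimal sketch of an "is set" check that confuses an empty family with an unset one, next to a check that does not. `SketchColumn` and both method names are hypothetical stand-ins written for this example; this is not Fluo's Column class, and the real fix may look different.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; NOT Fluo's Column class.
final class SketchColumn {
  private final byte[] family;     // null means "never set"
  private final byte[] qualifier;  // null means "never set"

  SketchColumn(String family, String qualifier) {
    this.family = family == null ? null : family.getBytes(StandardCharsets.UTF_8);
    this.qualifier = qualifier == null ? null : qualifier.getBytes(StandardCharsets.UTF_8);
  }

  // Length-based check: an empty family looks the same as an unset one.
  boolean isFamilySetByLength() {
    return family != null && family.length > 0;
  }

  // Explicit check: an empty family still counts as set.
  boolean isFamilySet() {
    return family != null;
  }

  public static void main(String[] args) {
    SketchColumn emptyFamily = new SketchColumn("", "bal");
    SketchColumn unset = new SketchColumn(null, null);
    System.out.println(emptyFamily.isFamilySetByLength()); // false
    System.out.println(emptyFamily.isFamilySet());         // true
    System.out.println(unset.isFamilySet());               // false
  }
}
```

Code that builds a scan range from a column would take a different branch depending on which of these two checks it calls, which is one way an empty family with a non-empty qualifier could end up with the wrong range.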
I think we tracked down the problem in #1123
During a long run of webindex, some data could not be processed for a long time because of #656. When I updated the Accumulo iterators on the cluster with the fix for #656, some data then failed to process with the following error.
Digging into this, there were locks pointing to a primary that did not exist.
Below is the place where the primary should be.
Below are some locks that point to a non-existent primary.
I think I have tracked down one possible cause of this. When a transaction rolls itself back it does not properly mark the primary column.
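As a rough illustration of why an unmarked primary is a problem, here is a simplified in-memory model, assuming a Percolator-style scheme where a secondary lock is resolved by looking up the state of its primary. Every name here (`LockResolutionSketch`, `PrimaryStatus`, `rollbackSelf`, `resolveSecondary`) is hypothetical and none of it is Fluo's actual code or data layout.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of lock resolution; NOT Fluo's implementation.
final class LockResolutionSketch {
  enum PrimaryStatus { LOCKED, COMMITTED, ROLLED_BACK, UNKNOWN }

  // Recorded state of each primary, keyed by an opaque primary identifier.
  private final Map<String, PrimaryStatus> primaryState = new HashMap<>();

  void lockPrimary(String primary) {
    primaryState.put(primary, PrimaryStatus.LOCKED);
  }

  // A transaction rolling itself back. If markPrimary is false, the lock
  // disappears without leaving any record of the outcome.
  void rollbackSelf(String primary, boolean markPrimary) {
    if (markPrimary) {
      primaryState.put(primary, PrimaryStatus.ROLLED_BACK);
    } else {
      primaryState.remove(primary);
    }
  }

  // Decide what to do with a secondary lock that points at the given primary.
  String resolveSecondary(String primary) {
    PrimaryStatus status = primaryState.getOrDefault(primary, PrimaryStatus.UNKNOWN);
    switch (status) {
      case COMMITTED:
        return "roll forward";
      case ROLLED_BACK:
        return "roll back";
      case LOCKED:
        return "primary still locked";
      default:
        throw new IllegalStateException("primary state unknown: " + primary);
    }
  }

  public static void main(String[] args) {
    LockResolutionSketch marked = new LockResolutionSketch();
    marked.lockPrimary("p1");
    marked.rollbackSelf("p1", true);
    System.out.println(marked.resolveSecondary("p1")); // roll back

    LockResolutionSketch unmarked = new LockResolutionSketch();
    unmarked.lockPrimary("p1");
    unmarked.rollbackSelf("p1", false);
    try {
      unmarked.resolveSecondary("p1");
    } catch (IllegalStateException e) {
      System.out.println("resolution failed");
    }
  }
}
```

In this model, any secondary lock left behind after an unmarked self rollback can never be resolved, because nothing records whether its transaction committed or aborted.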