vnode hangs in {eleveldb,write,3}
#273
Comments
I forgot to mention that we did an "in place" upgrade: the data files were kept, but Riak was upgraded to 3.0.12.
So there was an update to snappy as part of 3.0.12 - https://github.com/basho/eleveldb/releases/tag/riak_kv-3.0.12. The snappy upgrade was from 1.0.4 to 1.1.9 (6ef9202). There should be LOG files within the leveldb backend folder for the problem partition (i.e. "/var/lib/riak/leveldb/11417981541647679048466287755595961091061972992"), which are normally quite verbose, so it would be worth looking through those logs from the time of the issue.
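If it helps, a quick way to scan those LOG files for compression-related entries (a rough sketch; the path is the partition directory mentioned above, and leveldb normally rotates the previous log to LOG.old):

```
cd /var/lib/riak/leveldb/11417981541647679048466287755595961091061972992
ls -l LOG LOG.old*
grep -iE "compress|corrupt|error" LOG LOG.old* | tail -n 50
```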
Thanks, that was a good hint. It is definitely something to do with the compression. I'll keep digging.
I went back to look at the basic upgrade test https://github.com/basho/riak_test/blob/develop-3.0/tests/verify_basic_upgrade.erl, going from 3.0.9 to 3.0.12. This passed with the eleveldb backend, but only because the default compression algorithm is lz4. When I switched the test to force snappy compression, it failed (a config sketch is below).
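For reference, this is roughly how snappy can be forced for the eleveldb backend (a minimal sketch assuming the stock eleveldb cuttlefish settings; the riak_test change itself isn't shown):

```
## riak.conf
storage_backend = leveldb
leveldb.compression = on
leveldb.compression.algorithm = snappy
```

The equivalent advanced.config/app.config entry would be along the lines of `{eleveldb, [{compression, snappy}]}`.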
So it does look like there is a fairly clear incompatibility that we could have picked up in test. I haven't been able to dig into the snappy history in any meaningful way to see if there is an interim version that can safely bridge between the two. I'll have a think about workarounds that may help. One option would be to do the upgrade via rolling transfer rather than rolling restart (i.e. transferring data onto upgraded nodes, rolling one node in/out at a time), but this is a lot more time-consuming (and in that case it may be worth waiting for 3.0.16 to get the final improvements to transfers).
Also, with regards to the configuration of lz4: looking at the cuttlefish logic for compression in the eleveldb schema, there is a translation operation: https://github.com/basho/eleveldb/blob/develop/priv/eleveldb.schema#L174-L183. However, this translation isn't present in https://github.com/basho/eleveldb/blob/develop/priv/eleveldb_multi.schema#L115-L129. This might mean that compression_algorithm is ignored for multi-backends, as it never gets translated into the compression setting, and hence the backend defaults to snappy.
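For context, the translation in eleveldb.schema is roughly of this shape (a paraphrased sketch, not the exact schema code), and it is this step that eleveldb_multi.schema appears to be missing:

```erlang
%% Paraphrased sketch of the cuttlefish translation in priv/eleveldb.schema;
%% see the linked lines for the real code.
{translation,
 "eleveldb.compression",
 fun(Conf) ->
     case cuttlefish:conf_get("leveldb.compression", Conf) of
         false -> false;
         true  -> cuttlefish:conf_get("leveldb.compression.algorithm", Conf)
     end
 end}.
```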
Thanks.
Using https://docs.riak.com/riak/kv/latest/using/admin/commands/index.html#replace is what I meant by rolling transfer. This allows you to set up a new node with the new code, use cluster replace to swap out an existing node, then upgrade that node and use it to replace another node, and so on (a rough command sequence is sketched below).
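A rough outline of that workflow with riak-admin (node names are placeholders; check the linked docs for the exact procedure):

```
# On the new, already-upgraded node (riak@new):
riak-admin cluster join riak@existing
riak-admin cluster replace riak@old riak@new
riak-admin cluster plan      # review the planned ownership transfers
riak-admin cluster commit
riak-admin transfers         # wait for transfers to finish before the next node
```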
Thanks.
In Riak 3.0.12, vnodes started to hang in eleveldb:write/3 and the process mailbox could grow to 10K messages. After a while the whole node starts to behave erratically. The queue size of the vnode does not decrease, even without load.

Another interesting thing I noticed is that in the vnode state the compression is set to `snappy`, but in the config it is defined as `lz4`.

It's not fully confirmed, but it seems that 3.0.9 doesn't have the issue. Is there any incompatibility between snappy versions?
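For reference, vnode mailbox sizes and state can be inspected from `riak attach` (a rough sketch; it assumes riak_core_vnode_manager:all_vnodes/0 returns {Module, Index, Pid} tuples, and the 1000-message threshold is arbitrary):

```erlang
%% Find riak_kv vnodes whose mailbox has grown large.
BigQueues =
    [{Idx, Len}
     || {riak_kv_vnode, Idx, Pid} <- riak_core_vnode_manager:all_vnodes(),
        {message_queue_len, Len} <- [erlang:process_info(Pid, message_queue_len)],
        Len > 1000].
%% A specific vnode's state (e.g. to check the backend's compression setting)
%% can then be inspected with sys:get_state(Pid).
```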