Increase ibrowse Inactivity Timeout #367

Closed
rzezeski opened this issue Apr 23, 2014 · 5 comments

@rzezeski (Contributor)

After digging into issues #320, #330, and #358, it was discovered that
ibrowse's default inactivity_timeout, combined with its poor
load-balancing algorithm and the default pool and pipeline sizes, is
causing unnecessary socket churn. Even under load the client may not
work Yokozuna hard enough to fill all 10 pipelines, so connections hit
the inactivity timeout and are torn down. Increasing the pool or
pipeline size just makes things worse, because the ibrowse algorithm
fills the pool before using pipelines, causing it to make a new
connection for almost every request. My comment on #330 goes into a bit
more detail and includes evidence à la DTrace [1].
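
For reference, a minimal sketch of what bumping the timeout to 60s could look like from the Riak console. This assumes ibrowse honors an inactivity_timeout key both in its application environment and as a per-request option; the URL below is illustrative only, not an actual Yokozuna endpoint.

%% Raise the inactivity timeout globally via ibrowse's application
%% environment (value in milliseconds).
application:set_env(ibrowse, inactivity_timeout, 60000).

%% Or pass it per request, next to the pool/pipeline knobs mentioned above.
ibrowse:send_req("http://localhost:8093/solr/my_index/select?q=*:*",
                 [], get, [],
                 [{inactivity_timeout, 60000},
                  {max_sessions, 10},         %% connection pool size
                  {max_pipeline_size, 10}]).  %% requests pipelined per connection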

Action Items

  • verify no socket churn with higher timeout (60s)
  • benchmark before/after data load, curious if reduced churn helps
@rzezeski rzezeski added this to the 2.0-RC milestone Apr 23, 2014
@rzezeski (Contributor, Author)

Pinging @wbrown: let's focus all inactivity-timeout research on this issue now.

wbrown commented Apr 23, 2014

So, a few observations now that I've filled up my database and pushed my nodes to the limit.

  • The socket timeout appears to really hurt Riak PUT latency when the throttle is eased back because my inbound data deduplicates against data that's already in Riak. I see average PUT latencies go up to about 65ms, 99ms, and 216ms.
  • Another observation is that with the default pool of 10 sockets, my searches get bottlenecked by all the PUTs being indexed in Yokozuna. On the client side, I solve a lot of concurrency issues by spinning up multiple socket connections to Riak, and I can do this on demand -- i.e. searches get their own Riak connections. But Yokozuna indexes and updates are essentially sharing the same connection pool.

This is exacerbated by the multithreaded search-result paging code that I have. The basic flow goes (a rough sketch follows the list):

  • Do a small search against Yokozuna with a maxRows of 100. This helps small searches finish quicker when they have small result sets.
  • Retrieve the num_found key and, if it is more than 100, divide the remainder into chunks of up to 1000 items.
  • A number of threaded search workers are spun up; each worker grabs a chunk and does a streaming search.
  • The results are yielded to the search invoker, allowing the resulting keys to be returned as fast as the invoker can process them.
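
Not my actual client code, but the chunking step works out to roughly the following (an Erlang sketch of the arithmetic only; the function name is made up):

%% Given num_found from the initial 100-row search, return the
%% {Start, Rows} pairs for the follow-up streaming searches.
%% e.g. chunks(2594) -> [{100,1000},{1100,1000},{2100,494}]
chunks(NumFound) when NumFound =< 100 ->
    [];
chunks(NumFound) ->
    Remainder = NumFound - 100,
    NumChunks = (Remainder + 999) div 1000,
    [{100 + I * 1000, min(1000, Remainder - I * 1000)}
     || I <- lists:seq(0, NumChunks - 1)].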

Search performance wise, I get:

  • For a search that returns one result, I get an average latency of 13.69ms.
  • With ~2.5K results, which works out to about four searches -- one small 100-maxRows search and three 1000-maxRows searches -- the average latency per search ranges from 176.90ms to 242.92ms, with the three bigger searches running concurrently. This ends up at a total wall-clock time of 730ms to 1072ms.
  • When Yokozuna's ibrowse socket pool gets filled up, this goes up dramatically, to about 50s to 60s!

So, in this case, the ibrowse load balancing algorithm bites us on both ends -- we don't want more connections in the pool, but when all connections in the pool are busy, it severely impacts actual retrieval of the data via searches.
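
For what it's worth, ibrowse does have an escape hatch for exactly this busy-pool case: a caller can spawn a dedicated worker process and bypass the load balancer entirely. A rough, untested sketch of giving a search its own connection (not something Yokozuna does today; the URL is illustrative):

%% Spawn a connection process outside the shared pool and issue the
%% search against it directly, so index/update traffic cannot starve it.
{ok, Pid} = ibrowse:spawn_worker_process("http://localhost:8093"),
Reply = ibrowse:send_req_direct(
            Pid, "http://localhost:8093/solr/my_index/select?q=*:*",
            [], get, [], [{inactivity_timeout, 60000}]),
ibrowse:stop_worker_process(Pid).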

@rzezeski rzezeski self-assigned this Apr 24, 2014

wbrown commented Apr 24, 2014

Full Dataset from a Standing Start
Some extra data points -- unloaded, and after the machine has been idle a while, I see:

Searches against the same dataset, different value in one field:

  • Results: 2594 Batches: (1x 100, 3x 1000 ), Average Latency per Batch: 212ms, Total Time: 937.92ms
  • Results: 2572 Batches: (1x 100, 3x 1000 ), Average Latency per Batch: 143ms, Total Time: 615.38ms

Cluster Status

  • Two nodes at n=2 replication, using bitcask backend
  • 200GB of compressed on-disk storage used per node on filer, for actual data set size of 400GB, breakdown is:
wbrown@stratus:/internal/riak/altostratus$ du -skhc ./*
27G ./anti_entropy
135G    ./bitcask
689K    ./cluster_meta
337K    ./kv_vnode
17K ./riak_kv_exchange_fsm
114K    ./ring
25G ./yz
15G ./yz_anti_entropy
201G    total
  • Solr is at 59.93 million documents per node.

Current Performance of Data Import
I just kicked off another data import job, and wow, my system is pretty hammered, but still performant considering:

  • Riak PUT latencies are at 28ms to 158ms, median appears to be at around 50ms.
  • Import rates are at about 1000 keys PUT a second, with the following latency figures:
node_puts_total : 57504552
node_put_fsm_time_mean : 85170
node_put_fsm_time_median : 1705
node_put_fsm_time_95 : 19494
node_put_fsm_time_99 : 4048624
node_put_fsm_time_100 : 5971479
  • Each node is doing 10K ZFS read IOPS against the filer over the network.
  • Solr is reading 300MB to 600MB per second from the filer, per node over NFS, for an aggregate read throughput of 600MB to 1.2GB per second. (!!!? Infiniband ho!) I think Solr may be doing its rewrite of the indexes in response to a trigger -- or just reloading data, as it appears to shrink while idle.
  • zswap really, really helps with Bitcask backends, especially if you set the swappiness low and have a small swap partition (1GB in this case)

Some interesting error messages:

2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]

Failure to index due to timeouts:

2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,304}]}]

Error in entropy:

2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...

New Test
I will probably redo this multi-day dataset reload, with the following changes to see if it improves things:

  • LevelDB instead of Bitcask -- each Riak process is currently taking 16GB, completely squeezing out the disk cache.
  • Detune the HugePages allocation a little -- I've never seen it go above about 2GB of HugePages:
HugePages_Total:    3072
HugePages_Free:       80
HugePages_Rsvd:       68
HugePages_Surp:        0

Although the current allocation is because of memory pressure:

2014-04-23 01:56:49.304 [info] <0.515.0>@yz_solr_proc:handle_info:135 solr stdout/err: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a7000000, 35651584, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages

Anything you'd like to see changed or tested on my next go-round at this?

rzezeski added a commit that referenced this issue May 9, 2014
A combination of ibrowse's inefficient load-balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are increased.
This patch is a temporary workaround to reduce socket churn and thus
TIME-WAIT load.  It does so by increasing the inactivity timeout to 60
seconds across the board.  Below is a chart showing the amount of
socket churn, in connections per minute, for the different timeout
values, both at idle and while under load from basho bench.  These
numbers were calculated by a DTrace script which counted the number of
new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 60s     | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen,
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0, when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues;
this is but one of them.

For more background see the following issues:

#367
#358
#330
#320
rzezeski added a commit that referenced this issue May 9, 2014
A combination of ibrowse's inefficient load-balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are increased.
This patch is a temporary workaround to reduce socket churn and thus
TIME-WAIT load.  It does so by increasing the inactivity timeout to 600
seconds across the board.  Below is a chart showing the amount of
socket churn, in connections per minute, for the different timeout
values, both at idle and while under load from basho bench.  These
numbers were calculated by a DTrace script which counted the number of
new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 600s    | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen,
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0, when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues;
this is but one of them.

For more background see the following issues:

#367
#358
#330
#320
@rzezeski (Contributor, Author)

@wbrown Sorry for the late comments. Trying to keep pace with you :).

Some interesting error messages:

2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]

Wow, those message queues are much too large. There should be a throttle mechanism in the AAE code to prevent this overload, but perhaps it is not working properly. I also remember you saying in another ticket that you increased your overload threshold for the cluster because of Infiniband. I think this is a bad idea: you want to avoid overly large message queues in Erlang, and increasing the overload threshold will just allow more queues to get larger.
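
If it helps on the next run, a quick way to spot runaway mailboxes from the Riak console is something like the following (a hypothetical one-liner, not part of any Riak tooling):

%% List processes whose mailbox holds more than 100k messages, largest
%% first.  process_info/2 returns undefined for processes that have
%% already exited; the generator pattern simply skips those.
lists:reverse(lists:keysort(2,
    [{P, Len} || P <- erlang:processes(),
                 {message_queue_len, Len} <-
                     [erlang:process_info(P, message_queue_len)],
                 Len > 100000])).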

Failure to index due to timeouts:

2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_com

This simply indicates that Solr isn't keeping up with the load.

Error in entropy:

2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...

This is #324; it is an acceptable error, as AAE will just retry later.

@rzezeski (Contributor, Author)

I'm closing this issue since it was about the inactivity timeout, which has been raised for 2.0.0 until a better solution can be implemented in 2.x.x.
