Increase ibrowse Inactivity Timeout #367

Closed
rzezeski opened this issue Apr 23, 2014 · 5 comments

@rzezeski (Contributor)

After digging into issues #320, #330, and #358, it was discovered that
ibrowse's default inactivity_timeout, combined with its poor
load-balancing algorithm and the default pool and pipeline sizes, is
causing unnecessary socket churn. Even under load the client may not
work Yokozuna hard enough to fill all 10 pipelines, so connections hit
the inactivity timeout and are torn down. Increasing the pool or
pipeline size just makes things worse, because the ibrowse algorithm
fills the pool before using pipelines, causing it to make a new
connection for almost every request. My comment on #330 goes into a bit
more detail and includes evidence à la DTrace [1].
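
For reference, a minimal sketch of what bumping the timeout to 60s could look like from the Riak console. This assumes ibrowse honors an inactivity_timeout key both in its application environment and as a per-request option; the URL below is illustrative only, not an actual Yokozuna endpoint.

%% Raise the inactivity timeout globally via ibrowse's application
%% environment (value in milliseconds).
application:set_env(ibrowse, inactivity_timeout, 60000).

%% Or pass it per request, next to the pool/pipeline knobs mentioned above.
ibrowse:send_req("http://localhost:8093/solr/my_index/select?q=*:*",
                 [], get, [],
                 [{inactivity_timeout, 60000},
                  {max_sessions, 10},         %% connection pool size
                  {max_pipeline_size, 10}]).  %% requests pipelined per connection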

Action Items

  • verify no socket churn with higher timeout (60s)
  • benchmark before/after data load, curious if reduced churn helps
@rzezeski rzezeski added this to the 2.0-RC milestone Apr 23, 2014
@rzezeski (Contributor, Author)

Pinging @wbrown: let's focus all inactivity-timeout research on this issue now.

wbrown commented Apr 23, 2014

So, a few observations now that I've filled up my database and pushed my nodes to the limit.

  • The socket timeout appears to really hurt Riak PUT latency when the throttle is eased back because my inbound data deduplicates against data that's already in Riak. I see average PUT latencies go up to about 65ms, 99ms, and 216ms.
  • Another observation is that with the default pool of 10 sockets, my searches get bottlenecked by all the PUTs being indexed in Yokozuna. On the client side, I solve a lot of concurrency issues by spinning up multiple socket connections to Riak, and I can do this on demand -- i.e. searches get their own Riak connections. But Yokozuna indexes and updates are essentially sharing the same connection pool.

This is exacerbated by the multithreaded search-result paging code that I have. The basic flow goes (a rough sketch follows the list):

  • Do a small search against Yokozuna with a maxRows of 100. This helps small searches finish quicker when they have small result sets.
  • Retrieve the num_found key and, if it is more than 100, divide the remainder into chunks of up to 1000 items.
  • A number of threaded search workers are spun up; each worker grabs a chunk and does a streaming search.
  • The results are yielded to the search invoker, allowing the resulting keys to be returned as fast as the invoker can process them.
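
Not my actual client code, but the chunking step works out to roughly the following (an Erlang sketch of the arithmetic only; the function name is made up):

%% Given num_found from the initial 100-row search, return the
%% {Start, Rows} pairs for the follow-up streaming searches.
%% e.g. chunks(2594) -> [{100,1000},{1100,1000},{2100,494}]
chunks(NumFound) when NumFound =< 100 ->
    [];
chunks(NumFound) ->
    Remainder = NumFound - 100,
    NumChunks = (Remainder + 999) div 1000,
    [{100 + I * 1000, min(1000, Remainder - I * 1000)}
     || I <- lists:seq(0, NumChunks - 1)].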

Search performance wise, I get:

  • For a search that returns one result, I get an average latency of 13.69ms.
  • With ~2.5K results, which works out to about four searches -- one small 100-maxRows search and three 1000-maxRows searches -- the average latency per search ranges from 176.90ms to 242.92ms, with the three bigger searches running concurrently. This ends up at a total wall-clock time of 730ms to 1072ms.
  • When Yokozuna's ibrowse socket pool gets filled up, this goes up dramatically, to about 50s to 60s!

So, in this case, the ibrowse load balancing algorithm bites us on both ends -- we don't want more connections in the pool, but when all connections in the pool are busy, it severely impacts actual retrieval of the data via searches.
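
For what it's worth, ibrowse does have an escape hatch for exactly this busy-pool case: a caller can spawn a dedicated worker process and bypass the load balancer entirely. A rough, untested sketch of giving a search its own connection (not something Yokozuna does today; the URL is illustrative):

%% Spawn a connection process outside the shared pool and issue the
%% search against it directly, so index/update traffic cannot starve it.
{ok, Pid} = ibrowse:spawn_worker_process("http://localhost:8093"),
Reply = ibrowse:send_req_direct(
            Pid, "http://localhost:8093/solr/my_index/select?q=*:*",
            [], get, [], [{inactivity_timeout, 60000}]),
ibrowse:stop_worker_process(Pid).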

@rzezeski rzezeski self-assigned this Apr 24, 2014

wbrown commented Apr 24, 2014

Full Dataset from a Standing Start
Some extra data points -- unloaded, and after the machine has been idle a while, I see:

Searches against the same dataset, different value in one field:

  • Results: 2594 Batches: (1x 100, 3x 1000 ), Average Latency per Batch: 212ms, Total Time: 937.92ms
  • Results: 2572 Batches: (1x 100, 3x 1000 ), Average Latency per Batch: 143ms, Total Time: 615.38ms

Cluster Status

  • Two nodes at n=2 replication, using bitcask backend
  • 200GB of compressed on-disk storage used per node on filer, for actual data set size of 400GB, breakdown is:
wbrown@stratus:/internal/riak/altostratus$ du -skhc ./*
27G ./anti_entropy
135G    ./bitcask
689K    ./cluster_meta
337K    ./kv_vnode
17K ./riak_kv_exchange_fsm
114K    ./ring
25G ./yz
15G ./yz_anti_entropy
201G    total
  • Solr is at 59.93 million documents per node.

Current Performance of Data Import
I just kicked off another data import job, and wow, my system is pretty hammered, but still performant considering:

  • Riak PUT latencies are at 28ms to 158ms, median appears to be at around 50ms.
  • Import rates are at about 1000 keys PUT a second, with the following latency figures:
node_puts_total : 57504552
node_put_fsm_time_mean : 85170
node_put_fsm_time_median : 1705
node_put_fsm_time_95 : 19494
node_put_fsm_time_99 : 4048624
node_put_fsm_time_100 : 5971479
  • Each node is doing 10K ZFS read IOPS against the filer over the network.
  • Solr is reading 300MB to 600MB per second from the filer, per node over NFS, for an aggregate read throughput of 600MB to 1.2GB per second. (!!!? Infiniband ho!) I think Solr may be doing its rewrite of the indexes in response to a trigger -- or just reloading data, as it appears to shrink while idle.
  • zswap really, really helps with Bitcask backends, especially if you set the swappiness low and have a small swap partition (1GB in this case)

Some interesting error messages:

2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]

Failure to index due to timeouts:

2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_command,3,[{file,"src/riak_core_vnode.erl"},{line,304}]}]

Error in entropy:

2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...

New Test
I will probably redo this multi-day dataset reload, with the following changes to see if it improves things:

  • LevelDB instead of Bitcask -- each Riak process is currently taking 16GB, completely squeezing out the disk cache.
  • Detune the HugePages allocation a little -- I've never seen it go above about 2GB of HugePages:
HugePages_Total:    3072
HugePages_Free:       80
HugePages_Rsvd:       68
HugePages_Surp:        0

Although the current allocation is because of memory pressure:

2014-04-23 01:56:49.304 [info] <0.515.0>@yz_solr_proc:handle_info:135 solr stdout/err: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007a7000000, 35651584, 2097152, 0) failed; error='Cannot allocate memory' (errno=12); Cannot allocate large pages, falling back to regular pages

Anything you'd like to see changed or tested on my next go-round at this?

rzezeski added a commit that referenced this issue May 9, 2014
A combination of ibrowse's inefficient load-balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are increased.
This patch is a temporary workaround to reduce socket churn and thus
TIME-WAIT load.  It does so by increasing the inactivity timeout to 60
seconds across the board.  Below is a chart showing the amount of
socket churn, in connections per minute, for the different timeout
values, both at idle and while under load from basho bench.  These
numbers were calculated by a DTrace script which counted the number of
new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 60s     | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen,
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0, when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues;
this is but one of them.

For more background see the following issues:

#367
#358
#330
#320
rzezeski added a commit that referenced this issue May 9, 2014
A combination of ibrowse's inefficient load-balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are increased.
This patch is a temporary workaround to reduce socket churn and thus
TIME-WAIT load.  It does so by increasing the inactivity timeout to 600
seconds across the board.  Below is a chart showing the amount of
socket churn, in connections per minute, for the different timeout
values, both at idle and while under load from basho bench.  These
numbers were calculated by a DTrace script which counted the number of
new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 600s    | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen,
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0, when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues;
this is but one of them.

For more background see the following issues:

#367
#358
#330
#320
@rzezeski (Contributor, Author)

@wbrown Sorry for the late comments. Trying to keep pace with you :).

Some interesting error messages:

2014-04-22 04:31:12.126 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1823.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{yz_index_hashtree,do_insert,5}},{message_queue_len,301184}] [{old_heap_block_size,0},{heap_block_size,22177879},{mbuf_size,0},{stack_size,14},{old_heap_size,0},{heap_size,14484411}]
2014-04-22 07:34:14.843 [info] <0.98.0>@riak_core_sysmon_handler:handle_event:92 monitor large_heap <0.1827.0> [{initial_call,{yz_index_hashtree,init,1}},{almost_current_function,{hashtree,should_insert,3}},{message_queue_len,1355002}] [{old_heap_block_size,0},{heap_block_size,95360811},{mbuf_size,0},{stack_size,19},{old_heap_size,0},{heap_size,65138871}]

Wow, those message queues are much too large. There should be a throttle mechanism in the AAE code to prevent this overload, but perhaps it is not working properly. I also remember you saying in another ticket that you increased your overload threshold for the cluster because of Infiniband. I think this is a bad idea: you want to avoid overly large message queues in Erlang, and increasing the overload threshold will just allow more queues to get larger.
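
If it helps on the next run, a quick way to spot runaway mailboxes from the Riak console is something like the following (a hypothetical one-liner, not part of any Riak tooling):

%% List processes whose mailbox holds more than 100k messages, largest
%% first.  process_info/2 returns undefined for processes that have
%% already exited; the generator pattern simply skips those.
lists:reverse(lists:keysort(2,
    [{P, Len} || P <- erlang:processes(),
                 {message_queue_len, Len} <-
                     [erlang:process_info(P, message_queue_len)],
                 Len > 100000])).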

Failure to index due to timeouts:

2014-04-22 04:32:56.469 [error] <0.733.0>@yz_kv:index:205 failed to index object {{<<"obs">>,<<"obs">>},<<"AnxNK4SL8CVtGiK5/1qEtrFibWC8gWI/mSl6I2ozbss=:3">>} with error {"Failed to index docs",{error,req_timedout}} because [{yz_solr,index,3,[{file,"src/yz_solr.erl"},{line,175}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,252}]},{yz_kv,index,3,[{file,"src/yz_kv.erl"},{line,192}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1440}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1428}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1223}]},{riak_kv_vnode,handle_command,3,[{file,"src/riak_kv_vnode.erl"},{line,468}]},{riak_core_vnode,vnode_com

This simply indicates that Solr isn't keeping up with the load.

Error in entropy:

2014-04-22 04:32:56.469 [error] emulator Error in process <0.2580.1755> on node 'riak@cumulus.fabric' with exit value: {function_clause,[{yz_entropy,iterate_entropy_data,[<<3 bytes>>,[{continuation,<<91 bytes>>},{limit,100},{partition,14}],#Fun<yz_index_hashtree.5.12742521>,{error,{error,req_timedout}}],[{file,"src/yz_entropy.erl"},{line,44}]},{yz_index_hashtree,'-fold_keys/2-lc$^0/1-0-',3,[{file...

This is #324; it is an acceptable error, as AAE will just retry later.

@rzezeski (Contributor, Author)

I'm closing this issue since it was about the inactivity timeout, which has been raised for 2.0.0 until a better solution can be implemented in 2.x.x.
