Increase ibrowse Inactivity Timeout #367
Comments
Pinging @wbrown, let's focus all inactivity timeout research on this issue now.
So, a few observations now that I've filled up my database and pushed my nodes to the limit.
This is exacerbated by the multithreaded search result paging code that I have. The basic flow goes:
Search performance-wise, I get:
So, in this case, the
Full Dataset from a Standing Start
Searches against the same dataset, different value in one field:
Cluster Status
Current Performance of Data Import
Some interesting error messages:
Failure to index due to timeouts:
Error in entropy:
New Test
Although the current allocation is because of memory pressure:
Anything you'd like to see changed or tested on my next go-round at this?
A combination of ibrowse's inefficient load balancing algorithm and its default socket inactivity timeout of 10 seconds can cause TIME-WAIT load. It becomes worse as the pool or pipeline sizes are increased. This patch is a temporary workaround to reduce socket churn and thus TIME-WAIT load. It does so by increasing the inactivity timeout to 60 seconds across the board. Below is a chart showing the amount of socket churn in connections per minute for the different timeout values, both at idle and while under load from basho bench. These numbers were calculated by a DTrace script which counted the number of new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 60s     | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the absolute most complex configuration management code I have ever seen, and this was the easiest way to make sure the timeout is set correctly. This is just a workaround until after 2.0, when other HTTP clients and pools may be tested. ibrowse seems to have many issues; this is but one of them. For more background see the following issues: #367 #358 #330 #320
A combination of ibrowse's inefficient load balancing algorithm and its default socket inactivity timeout of 10 seconds can cause TIME-WAIT load. It becomes worse as the pool or pipeline sizes are increased. This patch is a temporary workaround to reduce socket churn and thus TIME-WAIT load. It does so by increasing the inactivity timeout to 600 seconds across the board. Below is a chart showing the amount of socket churn in connections per minute for the different timeout values, both at idle and while under load from basho bench. These numbers were calculated by a DTrace script which counted the number of new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 600s    | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the absolute most complex configuration management code I have ever seen, and this was the easiest way to make sure the timeout is set correctly. This is just a workaround until after 2.0, when other HTTP clients and pools may be tested. ibrowse seems to have many issues; this is but one of them. For more background see the following issues: #367 #358 #330 #320
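For reference, here is a minimal sketch of what raising the timeout via application env might look like from an Erlang shell. The env key name `inactivity_timeout` and the millisecond unit are assumptions on my part; the actual patch may set this through a different key or location.

```erlang
%% Sketch only: raise the ibrowse inactivity timeout to 600s (600,000 ms).
%% Assumption: the workaround reads this value from the ibrowse application
%% env under the key `inactivity_timeout`; the real key/location may differ.
ok = application:set_env(ibrowse, inactivity_timeout, 600000).
```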
@wbrown Sorry for the late comments. Trying to keep pace with you :).
Wow, those message queues are much too large. There should be a throttle mechanism in the AAE code to prevent this overload, but perhaps it is not working properly. I also remember you saying in another ticket that you increased your overload threshold for the cluster because of InfiniBand. I think this is a bad idea. You want to avoid overly large message queues in Erlang; increasing the overload threshold will just allow more queues to get larger.
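As a quick way to spot runaway queues like these, something along the following lines can be pasted into an Erlang shell on the node. It only uses the standard `erlang:processes/0` and `erlang:process_info/2` calls; the 10,000-message threshold is an arbitrary example value, not a recommended limit.

```erlang
%% Sketch: list processes whose message queue exceeds an example threshold,
%% largest queues first. Dead processes (process_info -> undefined) are skipped
%% by the generator's pattern match.
Threshold = 10000,
Busy = [{Pid, Len, erlang:process_info(Pid, registered_name)}
        || Pid <- erlang:processes(),
           {message_queue_len, Len} <- [erlang:process_info(Pid, message_queue_len)],
           Len > Threshold],
lists:reverse(lists:keysort(2, Busy)).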
This simply indicates that Solr isn't keeping up with the load.
This is #324; it is an acceptable error, as AAE will just retry later.
I'm closing this issue since it was about the inactivity timeout, which has been raised for 2.0.0 until a better solution can be implemented in 2.x.x.
After digging into issues #320, #330, and #358 it was discovered that ibrowse's default inactivity_timeout, combined with its poor load balancing algorithm and the default pool and pipeline sizes, is causing unnecessary socket churn. Even under load, the client may not drive Yokozuna hard enough to fill all 10 pipelines and keep the connections from timing out. Increasing the pool or pipeline size just makes things worse, as the ibrowse algorithm first wants to fill the pool before using pipelines, causing it to make a new connection for almost every request. My comment on #330 goes into a bit more detail and includes evidence à la DTrace [1].
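To make the pool-before-pipeline behavior concrete, here is a hedged sketch of the per-request options involved. `max_sessions`, `max_pipeline_size`, and `inactivity_timeout` are ibrowse option names, but the values and URL below are illustrative rather than Yokozuna's actual configuration.

```erlang
%% With a pool of 10 connections (max_sessions) and pipelines of depth 10
%% (max_pipeline_size), ibrowse prefers to open all 10 connections before
%% pipelining on any one of them. If the request rate never keeps all 10
%% busy, idle connections hit the inactivity timeout (10s by default) and
%% are torn down, producing the socket churn described above.
Opts = [{max_sessions, 10},
        {max_pipeline_size, 10},
        {inactivity_timeout, 600000}],  % the 600s workaround value
{ok, _Status, _Headers, _Body} =
    ibrowse:send_req("http://localhost:8093/", [], get, [], Opts).
```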
Action Items