
Soften JVM overload failures #358

Merged: 5 commits into develop on Apr 22, 2014
Conversation

coderoshi (Contributor)

After a week of replicating #330, I've concluded that there is little that we can do about an overloaded Solr JVM. What we can do is reduce the flood of failure messages by making some simple adjustments in key configuration files:

solrconfig.xml

As http://wiki.apache.org/solr/SolrPerformanceProblems suggests, slow or overly frequent commits can overload the cache warmer. Since all the other autowarmCount settings were already 0, I followed suit with the fieldValueCache.

Since all cache auto warmers were set to zero, I also bumped the default maxWarmingSearchers from 2 (suitable for read-only slaves, as the config comments mention) to 4. I found that under heavy loads of large objects I'd get an exceeded limit of maxWarmingSearchers=X message, which this resolved.
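
For reference, here's a minimal sketch of the relevant solrconfig.xml section; the cache classes and sizes are illustrative placeholders rather than the exact values we ship, the point is just the autowarmCount and maxWarmingSearchers settings:

```
<query>
  <!-- autowarmCount="0" everywhere: nothing is copied into a newly opened searcher -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="0"/>

  <!-- raised from the default of 2; see discussion below -->
  <maxWarmingSearchers>4</maxWarmingSearchers>
</query>
```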

default_schema.xml

For objects with many fields, I set the catch-all field's indexed attribute to false by default, as per #316. This helped tremendously in reducing incidental field indexing.

jetty.xml

I bumped up the maxIdleTime from 50 seconds to 60, since that's the default timeout for many of our PB clients. It's also important that this value matches the ibrowse client timeout, explained below.

I kept the acceptors at their default of 2. Solr recommends 2 acceptors per CPU, so this is something we should document. A better long-term option is to make this a riak.conf setting, or possibly even to set the acceptor count automatically by interrogating the system, but I'm not sure that's something we want to tackle for beta.
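
For context, here is a sketch of the jetty.xml connector settings in question. The connector class and host/port wiring below follow the stock Solr 4 jetty.xml and are assumptions; the two settings that matter here are maxIdleTime and Acceptors:

```
<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.bio.SocketConnector">
      <Set name="host"><SystemProperty name="jetty.host"/></Set>
      <Set name="port"><SystemProperty name="jetty.port" default="8093"/></Set>
      <!-- raised from 50000 to 60000 ms to match the ibrowse client timeout -->
      <Set name="maxIdleTime">60000</Set>
      <!-- left at the default of 2; Solr's guidance is roughly 2 acceptors per CPU -->
      <Set name="Acceptors">2</Set>
    </New>
  </Arg>
</Call>
```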

yz_solr.erl

Issue #320 was rife with org.eclipse.jetty.io.EofException. This is because ibrowse times out at 30 seconds by default and closes the socket, while Jetty's idle timeout defaulted to 50 seconds, so Solr complained about the unexpectedly closed socket. This change at least stops that complaint.

ibrowse.conf

I made no changes to ibrowse.conf. Though we could increase the number of pooled and open sockets, with something like {dest, "localhost", 10014, 100, 100000, []}., it doesn't resolve the underlying problem of a slow, overloaded JVM. It might be an advanced setting for isolated cases, but in all of my tests increasing the connection pool just prolonged the inevitable crash of a JVM that wasn't indexing large objects fast enough.

riak.conf

I followed many of the JVM tuning suggestions for larger heaps (http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning): tweaking page sizes, per-heap-space settings, ratios, and so on.

-d64 -Xms8g -Xmx32g -XX:+UseStringCache -XX:+UseCompressedOops -XX:NewRatio=3 \
-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8  \
-XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m \
-XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=70 -XX:CMSTriggerPermRatio=80 \
-XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts

I found some values that allowed me to support slightly larger object sizes on my test cluster, for my particular test set (large JSON objects, 1-2M, with 40-10k indexes of various types, updated by 8-25 client threads), but promoting those options to the Yokozuna defaults seems too prescriptive.

I analyzed opened sockets in my tests, and when the JVM takes a long time, the connection remains active until the process is complete. If the data coming in is too large and fast, with insufficient JVM resources to manage the data being indexed in a timely manner, you'll see ports in use. The only solution I could see to this problem is ensuring that the JVM runs fast enough that the connections aren't used for more than a few seconds, which depends entirely on Solr being fast.

Other change

There's also a minor change in here. While making these yz_solr changes, Dmitri found an error where his key was named for the state of Oregon, causing the delete query to fail on a reserved word (_yz_rk:OR), so I wrapped the key in quotes.

Documentation

Issue #330 gives us a glimpse of some problems an ill-fitting JVM can cause. We're going to need to find and share plenty of documentation about how and where a user can go to investigate their resource requirements and troubleshoot problems: basho/basho_docs#1012.

@coderoshi coderoshi added this to the 2.0-beta milestone Apr 9, 2014
@coderoshi coderoshi self-assigned this Apr 9, 2014
@coderoshi coderoshi modified the milestones: 2.0-RC, 2.0-beta Apr 9, 2014
@@ -320,10 +321,10 @@ encode_commit() ->
 %% @doc Encode a delete operation into a mochijson2 compatiable term.
 -spec encode_delete(delete_op()) -> term().
 encode_delete({key,Key}) ->
-    Query = ?YZ_RK_FIELD_S ++ ":" ++ ibrowse_lib:url_encode(binary_to_list(Key)),
+    Query = ?YZ_RK_FIELD_S ++ ":\"" ++ ibrowse_lib:url_encode(binary_to_list(Key)) ++ "\"",
Contributor

I wonder if we are breaking UTF-8 here? I don't think we have tests for indexing/searching/deleting UTF-8 keys. I'll write up a separate issue.

Contributor

This solves the reserved word problem by using a phrase query but the more appropriate solution down the line is to use the raw or term query parsers.

http://wiki.apache.org/solr/QueryParser

@rzezeski (Contributor) commented Apr 9, 2014

@wbrown If you have the chance I would love to hear if this patch improves the issues you noticed in #330 & #320.

@wbrown commented Apr 9, 2014

@rzezeski Awesome, I've read these notes. So, this is informative and tells me that if I want good Solr performance, I need to architect for it. i.e. highly clocked machines. I'll give the patch set a shot. Do I apply this against the 2.0.0-pre20 download?

@rzezeski (Contributor) commented Apr 9, 2014

@wbrown I'm not sure if it will apply cleanly to pre20 download, but you can certainly try.

@rzezeski rzezeski mentioned this pull request Apr 11, 2014
@@ -104,7 +104,7 @@
 <dynamicField name="*_set" type="string" indexed="true" stored="false" multiValued="true" />

 <!-- catch-all field -->
-<dynamicField name="*" type="text_general" indexed="true" stored="false" multiValued="true" />
+<dynamicField name="*" type="text_general" indexed="false" stored="false" multiValued="true" />
Contributor

If we are not going to index the catch-all field then set the type to ignored and drop the other attributes so they are inherited from the type. IMO this makes the drop semantic obvious. It also may prevent unnecessary analyzing since it uses the StrField type (I would hope indexed=false prevents analysis but use a non-analyzed type just in case).
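
A sketch of what that would look like, borrowing the ignored type from the stock Solr example schema (attribute values here are the conventional ones, not necessarily what we'd ship):

```
<!-- non-analyzed, non-indexed, non-stored: anything mapped to this type is dropped -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>

<!-- catch-all field inherits the drop semantics from the type -->
<dynamicField name="*" type="ignored"/>
```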

@rzezeski (Contributor)

Since all cache auto warmers were set to zero, I also bumped the default maxWarmingSearchers from 2 (suitable for read-only slaves, as the config comments mention) to 4. I found that under heavy loads of large objects I'd get an exceeded limit of maxWarmingSearchers=X message, which this resolved.

My reading of this is we should NOT raise the limit to 4. If Solr is
overloaded and can't keep up with the load then allowing more
concurrent warming searchers is only going to drag it down more. I
would rather shed load than avoid error messages in the log.

before the warming searcher from the first commit has enough time to warm up, then there can be multiple searchers all competing for resources at the same time, EVEN THOUGH ONE OF THEM WILL BE THROWN AWAY AS SOON AS THE NEXT ONE IS READY. (emphasis mine)

This sentence makes me want to set the max to 1. If Solr can't keep up
with the commit load then throwing more concurrent load on the pile
will not help. Given that we auto soft-commit at 1s interval what
sense does it make to pile up warmers that are just going to
invalidate each other? That is, if a commit takes 5s to warm then I
don't want 4 concurrent warmers running, 3 of which are going to
be quickly invalidated once the fourth finishes. Yes, updates will take
longer to be visible but in this case I think it's worth trading
harvest for yield because piling up concurrent warmers could end up
taking down the node; or worse, slowing it down just enough to make
latency balloon.

If you encounter this error a lot, you can (in theory) increase the number in your maxWarmingSearchers, but that is risky to do unless you are confident you have the system resources (RAM, CPU, etc...) to do it safely. A MORE CORRECT WAY TO DEAL WITH THE SITUATION IS TO REDUCE HOW FREQUENTLY YOU SEND COMMITS. (emphases mine)

I think this is the crux of the issue. In the future I would like
Yokozuna to expose a soft/hard commit call with the ability to turn
off the auto commits. It won't make sense for everyone but it could
allow more efficient use of resources for those that have more regular
data load patterns.

@coderoshi Did the exceeded limit of maxWarmingSearchers=X error
cause any actual issues like crashing the node or cause failed writes?
Even if it failed writes I think I'm still inclined to prefer that
behavior over adding more load. I'm tempted to play around with this a
bit myself. Is there code I could run to reproduce your benchmark?

@coderoshi (Contributor, Author)

@rzezeski no, it didn't cause any failures, it was just filling the logs.

Here's a script I was running. https://gist.github.com/coderoshi/329b93ee7987b1beecdf

I later ran some tests directly against Solr, in order to take Riak/YZ out of the equation, for straight Solr tuning:

require 'securerandom'
require 'json'

# big_hash is the large test document built earlier in the script
bh = big_hash.clone
bh['_yz_id'] = SecureRandom.hex.to_s
doc = {"commit" => {}, "add" => {"doc" => bh}}.to_json
File.open("tmp/#{bh['_yz_id']}.json", 'w+') {|f| f.write(doc)}
`curl 'http://192.168.1.135:10034/solr/herp/update' -H'Content-type:application/json' --data-binary @tmp/#{bh['_yz_id']}.json 2> /dev/null`

@hectcastro (Contributor)

Putting this together as a way to verify my understanding of the maxWarmingSearchers changes proposed:

  • We're setting all of the cache autowarmCount settings to 0 in order to minimize the duration of auto warming so that maxWarmingSearchers is not exceeded
  • Here we are trading the value of caching for speed of making new values available in query results
  • We have no solr.QuerySenderListener configured by default, so combining that with setting all of the autowarmCount settings to 0, we've done pretty much all we can in minimizing the warming done before searchers are moved out of the "warming" state
  • maxWarmingSearchers was bumped from 2 to 4 because we removed as much warming overhead as possible
  • We may want maxWarmingSearchers to be > 1 because it is something that happens per index (having lots of indexes and setting this to 1 would hurt overall ability to provide fresh query results across all indexes)

@rzezeski (Contributor)

We may want maxWarmingSearchers to be > 1 because it is something that happens per index (having lots of indexes and setting this to 1 would hurt overall ability to provide fresh query results across all indexes)

Are you sure this applies across all cores? Each core gets its own solrconfig.xml so that is surprising behavior to me.

@rzezeski (Contributor)

maxWarmingSearchers was bumped from 2 to 4 because we removed as much warming overhead as possible

My understanding is it was bumped to avoid logging noise. AFAICT running
out of warmers is not a hard error but rather a safety valve to
prevent overloading the core. Reading the following wiki entry makes
me think lower would be better given our 1s auto soft-commit and
preference for availability. This becomes even more important if
warming searchers is per-core. If we have 16 cores that could be 48
threads crunching.

http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F

@coderoshi (Contributor, Author)

So my reasoning for zeroing out autowarmCount was in response to the Solr documentation's suggestions for slow commit times, and the fact that autowarmCount="0" is actually the default in current Solr.

As for bumping up maxWarmingSearchers, that was, as @rzezeski suggested, solely to reduce the complaints in the log.

I do believe we should keep autowarmCount at zero, however, I'm not so strident about keeping maxWarmingSearchers at 4. Increasing the size didn't seem to make much of a difference for my tests, so I figured what the hell?

If log noise doesn't bother anyone, I'm fine with dropping maxWarmingSearchers back to 2 or even 1.

@rzezeski (Contributor)

I do believe we should keep autowarmCount at zero

I agree. Given our 1s auto commit I think auto warming may be hurting
more than helping. This is something I've meant to benchmark for a
while now.

The real solution, in the future, is to provide APIs for manipulating
certain aspects of the solrconfig.xml because things like auto
commit, caches and auto warming are all highly dependent on query
profile. One size does not fit all here. I foresee plenty of complaints
in the future about the hardcoded solrconfig.xml but the 2.0 feature
ship has sailed so until then we can try to protect ourselves (if
users REALLY want a custom setup they should be able to override
manually at filesystem level).

@wbrown commented Apr 21, 2014

@rzezeski I grabbed the patch from this commit, and applied it to the riak 2.0beta1 sources that went up for download. I even have a nice Arch Linux packaging script that will build me a Riak 2.0 beta1 package with that patch installed.

Unfortunately, I ended up beating my head against the desk for hours. What had worked before doesn't appear to work anymore, and it possibly has to do with the bucket types stuff. It worked in the past by setting the bucket's search_index value to the index and schema in question.

So I got the latest bleeding edge Python client that had bucket types support added, worked my way through some of the bugs and lack of documentation there, and transitioned my indexing engine classes to use bucket types. I have indexes being created in Solr, buckets being assigned to bucket types, and bucket objects being grabbed from the bucket type object to ensure that we use the Yokozuna indexing. But I am absolutely not seeing any documents going into Yokozuna/Solr.

My setup steps are:

  • Initialize new Riak cluster.
  • Create bucket types from riak-admin with the following script:
bts=(obs entity raw)
for bt in "${bts[@]}";
do
    bin/riak-admin bucket-type create $bt '{"props":{}}'
    bin/riak-admin bucket-type activate $bt
done
  • Upon initialization of my application, it does the following:
        ### create our schema if it doesn't exist
        try:
            self.riak.get_search_schema(schemaName)
        except riak.RiakError:
            self.riak.create_search_schema(schemaName,
                open(schemaFile).read())

        ### our bucket type is the same name as our bucket name
        indexedBucketType = self.riak.bucket_type( self.bucketName )
        self.bucket = indexedBucketType.bucket( self.bucketName )

        ### create our index in Yokozuna if it doesn't exist
        if not self.bucketName in [ x['name']
                            for x in self.riak.list_search_indexes() ]:
            self.riak.create_search_index( RKeyString( self.bucketName ),
                                           RKeyString( schemaName ),
                                           self.replicas )
            print "* Index created for", self.bucketName
            time.sleep(5)

        ### Set the bucket type to our desired n_val.
        if indexedBucketType.get_property( "n_val", None ) != self.replicas:
            indexedBucketType.set_property( "n_val", self.replicas )

        ### set our bucket type's index to our desired index
        try:
            searchIndex = indexedBucketType.get_property("search_index", None)
        except KeyError:
            indexedBucketType.set_property("search_index", self.bucketName)
            searchIndex = indexedBucketType.get_property("search_index", None)
        if searchIndex != self.bucketName:
            indexedBucketType.set_property("search_index", self.bucketName)

Am I missing any obvious steps here?

When I store to an indexed bucket, I ensure that I acquire the bucket object via the bucket_type instantiation method:

        if indexEngine:
            self.getBucket = self.riak.bucket_type( bucketName ).bucket
        else:
            self.getBucket = self.riak.bucket

        self.bucket = self.getBucket(self.name)

When I double check the object representation with print, it is giving the expected representation -- a bucket type of raw and a bucket of raw.

@rzezeski (Contributor)

There seem to be dialyzer issues on develop, so I'm going to figure those out before doing the final review on this issue.

@wbrown commented Apr 22, 2014

@rzezeski @coderoshi So, reporting in now that I've successfully gotten this to work. The problems were some subtle issues in the bleeding-edge version of the Python client I had to use for bucket_type support, which affected my code. The client isn't ready for prime time, unfortunately, with some breaking API changes. I will need to sit down and figure out what issues and patches I can file against the client when I have more time than I do at the moment.

It is feeling like for high-load situations like mine, JVM tuning is absolutely essential. I took the JVM settings outlined in the initial comment opening this PR and applied them. To make them work, I had to enable Huge Pages and allocate about 2048 huge pages at 2MB each. This is with a slightly reduced -Xms4g -Xmx8g setting, as my development machines have 32GB each. I also had to raise the kernel shared memory maximum.

Things are looking better now:

node_puts : 22254
node_puts_total : 1189134
node_put_fsm_time_mean : 4309
node_put_fsm_time_median : 1325
node_put_fsm_time_95 : 3574
node_put_fsm_time_99 : 119076
node_put_fsm_time_100 : 259319

The above numbers put me in the ballpark of 300 to 400 new indexed keys a second, with an apparent median latency of 1325us

I'm seeing client-side put latency of 4.84ms to 10.96ms with a load of about 10-30 active client-side threads, with Yokozuna and a Bitcask backend. This is interleaved with a client get latency of about 1.01ms to 2.28ms. Not bad at all considering that the client is connecting to my Riak nodes over Gigabit.

Looking at the network side of the picture:

[riak@cumulus ~]$ netstat -anp | grep :8093 | grep TIME_WAIT | wc -l
49

[riak@cumulus ~]$ netstat -anp | grep :8093 | grep ESTABLISHED | wc -l
20

So, about 50 in TIME_WAIT and 20 in ESTABLISHED, and the numbers aren't growing.

We'll see how this does once I've fully reloaded my dataset thus far into the new database. It's been my observation that Yokozuna usually starts out great, and then slows down. A part of this improvement is that I've improved my selection and versioning algorithms since I last used Yokozuna, and had been having good performance with 2i -- but this is quite promising. Yokozuna allows a lot more flexibility in query terms than 2i, as well as multiple query terms.

Thank you -- and let me know if there's anything I can test or check out here. I am quite motivated to see a performant Yokozuna in 2.0 release. I'll keep you guys in the loop as I rebuild my dataset.

@jonmeredith (Contributor)

Thanks for the detailed report @wbrown.


@wbrown commented Apr 22, 2014

Update after the load test/data reloading had been running for about 14 hours. Shortly before I went to bed, I went ahead and started a second client feeding another set of data in.

I'm at about 30 million keys indexed now. With two observation engine clients and a GeoIP database client populating Riak over Gigabit on a single workstation, my client-side latency stats are:

  • GET ranges from 5.07ms to 20.26ms with an apparent median of ~11ms
  • PUT ranges from 7.20ms to 42.53ms with an apparent median of ~30ms

This is with a current rate of about 562 keys a second PUT into Riak.

node_put_fsm_active : 5
node_put_fsm_active_60s : 34074
node_put_fsm_in_rate : 429
node_put_fsm_out_rate : 427

Latency measured by Riak is pretty good, though the 99th and 100th percentile times aren't great at 327ms and 2679ms:

node_put_fsm_time_mean : 13050
node_put_fsm_time_median : 1413
node_put_fsm_time_95 : 11724
node_put_fsm_time_99 : 327825
node_put_fsm_time_100 : 2679843

GET latencies are still pretty good from the Riak side, though the 99th and 100th are horrible at 173ms and 2799ms:

node_get_fsm_time_mean : 5374
node_get_fsm_time_median : 254
node_get_fsm_time_95 : 5424
node_get_fsm_time_99 : 173151
node_get_fsm_time_100 : 2799747

With 30 million keys, that's enough to outsize the 32GB of RAM in my two test nodes, and 32GB of RAM in my Infiniband ZFS filer. Each node is up to about 129GB of data stored, with a compressratio of 2.35x, for an actual dataset size of 303GB, so we're getting a pretty good picture of 'real world' dataset load.

Riak with Bitcask is at 9.1GB of RAM and growing, while Solr is at 2.57GB of RAM.

So, things are overall improved versus my first attempt basically causing Riak, Yokozuna, and Solr to crash and burn. However, some confusing flies in the ointment:

  • I am still seeing a fair amount of socket turnover. The numbers are even, but there's still connections dropped and connected despite being under load.
[root@cumulus wbrown]# netstat -anp | grep :8093  | grep EST | grep java | wc -l
10
[root@cumulus wbrown]# netstat -anp | grep :8093  | grep TIME_WAIT  | wc -l
45

Looking closer:

[root@cumulus wbrown]# netstat -anp | grep :8093  | grep EST | grep java; echo "Waiting 10s"; sleep 10; netstat -anp | grep :8093  | grep EST | grep java;
tcp6       0      0 127.0.0.1:8093          127.0.0.1:56964         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:43465         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:40233         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:40236         ESTABLISHED 24086/java          
tcp6     946      0 127.0.0.1:8093          127.0.0.1:45451         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:47951         ESTABLISHED 24086/java          
tcp6    5962      0 127.0.0.1:8093          127.0.0.1:38766         ESTABLISHED 24086/java          
Waiting 10s
tcp6       0      0 127.0.0.1:8093          127.0.0.1:40469         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:50734         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:42569         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:45451         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:60645         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:48799         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:39427         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:47883         ESTABLISHED 24086/java          
tcp6       0    147 127.0.0.1:8093          127.0.0.1:38766         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:51918         ESTABLISHED 24086/java 

We can see that the connections have completely turned over in a 10s interval, with a single connection maintained. It's almost as if the timeout settings are not working. Another interesting note to make -- the connections via lo are IPv6 rather than IPv4, so I wonder if that's making an impact. There shouldn't be a near complete connection turnover in 10s if the timeouts are set to 60s.

A side question: I'm really big on metrics. You cannot do performance analysis without metrics. Do we have any Yokozuna latency metrics exposed anywhere, or do I need to set up Riak on OmniOS for dtrace? :) It'd be nice to have them in the riak-admin status pane. I can open an issue for this if that would be helpful.

@rzezeski (Contributor)

@wbrown First, thank you so much for all the great information from
the field. The detail of your reporting is much appreciated. That
said, here is a reply to one of your comments:

I am still seeing a fair amount of socket turnover. The numbers are even, but there's still connections dropped and connected despite being under load.
...
We can see that the connections have completely turned over in a 10s interval, with a single connection maintained. It's almost as if the timeout settings are not working. Another interesting note to make -- the connections via lo are IPv6 rather than IPv4, so I wonder if that's making an impact. There shouldn't be a near complete connection turnover in 10s if the timeouts are set to 60s.

Yes, this is what I was trying to explain in my
comment about the ibrowse load balancer the other day. It
defaults to a 10s inactivity_timeout. That, along with its poor
algorithm, causes connections to constantly hit this 10s timeout. This
timeout was NOT adjusted in this patch. I tried setting it in the
global config but ibrowse didn't seem to pick it up. I want to try
again by setting the timeout in the code, but I may open a more
focused PR as this one is already making too many changes at once.

@rzezeski (Contributor)

My findings are that a higher timeout doesn't prevent the
EofException; it simply requires a higher-latency request to
hit it. I verified this by setting both the ibrowse request timeout
and the Jetty idle timeout to 5 seconds and applying load. It managed
to produce plenty of EofExceptions. The only way to completely
prevent the EofException was to make the ibrowse request timeout
substantially higher than the Jetty idle timeout. This allows Jetty
to be in control of closing connections. This can be explored in a
different patch if need be.

I also found that these changes don't solve connection churn, as that
is related to a combination of the inactivity_timeout and the bad load
balancing algorithm [1]. This, in theory, could be fixed by raising
the inactivity_timeout, but I will leave that to a focused
PR rather than appending to this one.

Overall, the 60 second timeout did make things quieter since it makes
it harder to time out. I also noticed a bump in throughput of about
27%, from 1.1 writes/s to 1.4 writes/s. This patch actually causes the
median latency to rise slightly because it allows requests to
take longer than 30 seconds. The 99th is lower, probably due to fewer
crashes and thus less busy work on the server.

|            | Throughput (ops/s) | Median | 99th   |
|------------|--------------------|--------|--------|
| develop    | 1.1                | 17.1 s | 44.5 s |
| patch #358 | 1.4                | 19.5 s | 40.3 s |

In general, I think timeouts need to be looked at more closely,
especially in regards to user semantics. If a user sets a timeout of 5
seconds on the write request, should the index operation be allowed to
block the vnode for 60 seconds?

With that I'm going to consider the time outs reviewed so that I can
get to the other changes made in this PR.

@rzezeski (Contributor)

The riak tests look good:

Test Results:
yz_wm_extract_test-bitcask   : pass
yz_stat_test-bitcask         : pass
yz_solr_start_timeout-bitcask: pass
yz_siblings-bitcask          : pass
yz_security-bitcask          : pass
yz_schema_admin-bitcask      : pass
yz_rs_migration_test-bitcask : pass
yz_ring_resizing-bitcask     : pass
yz_pb-bitcask                : pass
yz_monitor_solr-bitcask      : pass
yz_mapreduce-bitcask         : pass
yz_languages-bitcask         : pass
yz_index_admin-bitcask       : pass
yz_fallback-bitcask          : pass
yz_errors-bitcask            : pass
yz_dt_test-bitcask           : pass
yokozuna_essential-bitcask   : pass
aae_test-bitcask             : pass
---------------------------------------------
0 Tests Failed
18 Tests Passed
That's 100.0% for those keeping score

I'm not aware of `ON` being a reserved word in Lucene query
syntax. Given the original reporter mentioned `OR`, I've updated the
test to use that.
@rzezeski (Contributor)

I don't recall ON being a reserved word in Lucene query syntax, so I changed it to OR. The test still passes.

Test Results:
yz_languages-bitcask: pass
---------------------------------------------
0 Tests Failed
1 Tests Passed
That's 100.0% for those keeping score

@rzezeski (Contributor)

+1 61d632

List of the things I reviewed:

  • compile/unit
  • dialyzer
  • riak test
  • review timeouts
  • review OR change
  • review solrconfig.xml changes
  • review schema changes
  • review jetty.xml changes

@rzezeski (Contributor)

+1 61d6326

borshop added a commit that referenced this pull request Apr 22, 2014
Soften JVM overload failures

Reviewed-by: rzezeski
@rzezeski (Contributor)

@borshop merge

@borshop borshop merged commit 61d6326 into develop Apr 22, 2014
@rzezeski rzezeski mentioned this pull request Apr 23, 2014
2 tasks
rzezeski added a commit that referenced this pull request Apr 23, 2014
Add `json_obj_valgen` to the Yokozuna BB driver. I used this to test
indexing "large" JSON objects while investigating #358.  The config
line to use this value generator looks like so:

```
{value_generator, {function, yz_driver, json_obj_valgen,
[<NumFields>]}}.
```

This will generate a static JSON object (i.e. generated once at start
up) which contains `<NumFields>` integer fields.  All workers will
index the identical object.  Here is an example config from one of my
testing runs:

```
{mode, max}.
{concurrent, 32}.
{driver, yz_driver}.
{code_paths, ["/root/work/yz-bb/misc/bench"]}.
{secure, false}.
{bucket, {<<"data">>, <<"largeobj">>}}.
{index, <<"largeobj">>}.
{pb_conns, [{"10.0.1.201", 8087}]}.
{http_conns, []}.
{duration, infinity}.
{key_generator, {to_binstr, "~B", {partitioned_sequential_int, 0,
1000}}}.
{value_generator, {function, yz_driver, json_obj_valgen, [4000]}}.
{operations, [{load_pb, 1}]}.
```

This uses 32 workers to write the same JSON object containing 4000
integer fields over the key space 0 to 999.
rzezeski added a commit that referenced this pull request May 9, 2014
A combination of ibrowse's inefficient load balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are
increased.  This patch is a temporary workaround to reduce socket
churn and thus TIME-WAIT load.  It does so by increasing the
inactivity timeout to 60 seconds across the board.  Below is a chart
showing the amount of socket churn in connections per minute for
the different timeout values, both at idle and while under load from
basho bench.  These numbers were calculated by a DTrace script which
counted the number of new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 60s     | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0 when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues,
this is but just one.

For more background see the following issues:

#367
#358
#330
#320
@rzezeski rzezeski mentioned this pull request May 9, 2014
rzezeski added a commit that referenced this pull request May 9, 2014
A combination of ibrowse's inefficient load balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are
increased.  This patch is a temporary workaround to reduce socket
churn and thus TIME-WAIT load.  It does so by increasing the
inactivity timeout to 600 seconds across the board.  Below is a chart
showing the amount of socket churn in connections per minute for
the different timeout values, both at idle and while under load from
basho bench.  These numbers were calculated by a DTrace script which
counted the number of new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 600s    | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0 when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues,
this is but just one.

For more background see the following issues:

#367
#358
#330
#320
rzezeski added a commit that referenced this pull request May 20, 2014
@rzezeski rzezeski deleted the er/throughput branch May 22, 2014 19:41