
Soften JVM overload failures #358

Merged: 5 commits into develop on Apr 22, 2014
Conversation

coderoshi (Contributor)

After a week of replicating #330, I've concluded that there is little that we can do about an overloaded Solr JVM. What we can do is reduce the flood of failure messages by making some simple adjustments in key configuration files:

solrconfig.xml

As http://wiki.apache.org/solr/SolrPerformanceProblems suggests, slow or overly frequent commits can overload the cache warmer. Since all the other autowarmCount settings were already 0, I followed suit with the fieldValueCache.

Since all cache auto warmers were set to zero, I also bumped the default maxWarmingSearchers from 2 (suitable for read-only slaves, as the config comments mention) to 4. I found that under heavy loads of large objects I'd get an exceeded limit of maxWarmingSearchers=X message, which this resolved.
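
For reference, here's a minimal sketch of the relevant solrconfig.xml section; the cache classes and sizes are illustrative placeholders rather than the exact values we ship, the point is just the autowarmCount and maxWarmingSearchers settings:

```
<query>
  <!-- autowarmCount="0" everywhere: nothing is copied into a newly opened searcher -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <fieldValueCache class="solr.FastLRUCache" size="512" autowarmCount="0"/>

  <!-- raised from the default of 2; see discussion below -->
  <maxWarmingSearchers>4</maxWarmingSearchers>
</query>
```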

default_schema.xml

For objects with many fields, I set the catch-all field's indexed attribute to false by default, as per #316. This helped tremendously in reducing incidental field indexing.

jetty.xml

I bumped up the maxIdleTime from 50 seconds to 60, since that's the default timeout for many of our PB clients. It's also important that this value matches the ibrowse client timeout, explained below.

I kept the acceptors at their default of 2. Solr recommends 2 acceptors per CPU, so this is something we should document. A better long-term option is to make this a riak.conf setting, or possibly even to set the acceptor count automatically by interrogating the system, but I'm not sure that's something we want to tackle for beta.
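
For context, here is a sketch of the jetty.xml connector settings in question. The connector class and host/port wiring below follow the stock Solr 4 jetty.xml and are assumptions; the two settings that matter here are maxIdleTime and Acceptors:

```
<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.bio.SocketConnector">
      <Set name="host"><SystemProperty name="jetty.host"/></Set>
      <Set name="port"><SystemProperty name="jetty.port" default="8093"/></Set>
      <!-- raised from 50000 to 60000 ms to match the ibrowse client timeout -->
      <Set name="maxIdleTime">60000</Set>
      <!-- left at the default of 2; Solr's guidance is roughly 2 acceptors per CPU -->
      <Set name="Acceptors">2</Set>
    </New>
  </Arg>
</Call>
```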

yz_solr.erl

Issue #320 was rife with org.eclipse.jetty.io.EofException. This is because ibrowse times out at 30 seconds by default and closes the socket, while Jetty's idle timeout defaulted to 50 seconds, so Solr complained about the unexpectedly closed socket. This change at least stops that complaint.

ibrowse.conf

I made no changes to ibrowse.conf. Though we could increase the number of pooled and open sockets, with something like {dest, "localhost", 10014, 100, 100000, []}., it doesn't resolve the underlying problem of a slow, overloaded JVM. It might be an advanced setting for isolated cases, but in all of my tests increasing the connection pool just prolonged the inevitable crash of a JVM that wasn't indexing large objects fast enough.

riak.conf

I followed many of the JVM tuning suggestions for larger heaps (http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning): tweaking page sizes, per-heap-space settings, ratios, and so on.

-d64 -Xms8g -Xmx32g -XX:+UseStringCache -XX:+UseCompressedOops -XX:NewRatio=3 \
-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8  \
-XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m \
-XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=70 -XX:CMSTriggerPermRatio=80 \
-XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts

I found some values that allowed me to support slightly larger object sizes on my test cluster, for my particular test set (large JSON objects, 1-2M, with 40-10k indexes of various types, updated by 8-25 client threads), but promoting those options to the Yokozuna defaults seems too prescriptive.

I analyzed opened sockets in my tests, and when the JVM takes a long time, the connection remains active until the process is complete. If the data coming in is too large and fast, with insufficient JVM resources to manage the data being indexed in a timely manner, you'll see ports in use. The only solution I could see to this problem is ensuring that the JVM runs fast enough that the connections aren't used for more than a few seconds, which depends entirely on Solr being fast.

Other change

There's also a minor change in here. While making these yz_solr changes, Dmitri found an error where his key was named for the state of Oregon, causing the delete query to fail on a reserved word (_yz_rk:OR), so I wrapped the key in quotes.

Documentation

Issue #330 gives us a glimpse of some problems an ill-fitting JVM can cause. We're going to need to find and share plenty of documentation about how and where a user can go to investigate their resource requirements and troubleshoot problems: basho/basho_docs#1012.

@coderoshi coderoshi added this to the 2.0-beta milestone Apr 9, 2014
@coderoshi coderoshi self-assigned this Apr 9, 2014
@coderoshi coderoshi modified the milestones: 2.0-RC, 2.0-beta Apr 9, 2014
@@ -320,10 +321,10 @@ encode_commit() ->
 %% @doc Encode a delete operation into a mochijson2 compatiable term.
 -spec encode_delete(delete_op()) -> term().
 encode_delete({key,Key}) ->
-    Query = ?YZ_RK_FIELD_S ++ ":" ++ ibrowse_lib:url_encode(binary_to_list(Key)),
+    Query = ?YZ_RK_FIELD_S ++ ":\"" ++ ibrowse_lib:url_encode(binary_to_list(Key)) ++ "\"",
Contributor

I wonder if we are breaking UTF-8 here? I don't think we have tests for indexing/searching/deleting UTF-8 keys. I'll write up a separate issue.

Contributor

This solves the reserved word problem by using a phrase query but the more appropriate solution down the line is to use the raw or term query parsers.

http://wiki.apache.org/solr/QueryParser

@rzezeski (Contributor) commented Apr 9, 2014

@wbrown If you have the chance I would love to hear if this patch improves the issues you noticed in #330 & #320.

@wbrown commented Apr 9, 2014

@rzezeski Awesome, I've read these notes. So, this is informative and tells me that if I want good Solr performance, I need to architect for it. i.e. highly clocked machines. I'll give the patch set a shot. Do I apply this against the 2.0.0-pre20 download?

@rzezeski (Contributor) commented Apr 9, 2014

@wbrown I'm not sure if it will apply cleanly to pre20 download, but you can certainly try.

@rzezeski rzezeski mentioned this pull request Apr 11, 2014
@@ -104,7 +104,7 @@
 <dynamicField name="*_set" type="string" indexed="true" stored="false" multiValued="true" />

 <!-- catch-all field -->
-<dynamicField name="*" type="text_general" indexed="true" stored="false" multiValued="true" />
+<dynamicField name="*" type="text_general" indexed="false" stored="false" multiValued="true" />
Contributor

If we are not going to index the catch-all field then set the type to ignored and drop the other attributes so they are inherited from the type. IMO this makes the drop semantic obvious. It also may prevent unnecessary analyzing since it uses the StrField type (I would hope indexed=false prevents analysis but use a non-analyzed type just in case).
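
A sketch of what that would look like, borrowing the ignored type from the stock Solr example schema (attribute values here are the conventional ones, not necessarily what we'd ship):

```
<!-- non-analyzed, non-indexed, non-stored: anything mapped to this type is dropped -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>

<!-- catch-all field inherits the drop semantics from the type -->
<dynamicField name="*" type="ignored"/>
```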

@rzezeski (Contributor)

Since all cache auto warmers were set to zero, I also bumped the default maxWarmingSearchers from 2 (suitable for read-only slaves, as the config comments mention) to 4. I found that under heavy loads of large objects I'd get an exceeded limit of maxWarmingSearchers=X message, which this resolved.

My reading of this is we should NOT raise the limit to 4. If Solr is
overloaded and can't keep up with the load then allowing more
concurrent warming searchers is only going to drag it down more. I
would rather shed load than avoid error messages in the log.

before the warming searcher from the first commit has enough time to warm up, then there can be multiple searchers all competing for resources at the same time, EVEN THOUGH ONE OF THEM WILL BE THROWN AWAY AS SOON AS THE NEXT ONE IS READY. (emphasis mine)

This sentence makes me want to set the max to 1. If Solr can't keep up
with the commit load then throwing more concurrent load on the pile
will not help. Given that we auto soft-commit at 1s interval what
sense does it make to pile up warmers that are just going to
invalidate each other? That is, if a commit takes 5s to warm then I
don't want 4 concurrent warmers running, 3 of which are going to
be quickly invalidated once the fourth finishes. Yes, updates will take
longer to be visible but in this case I think it's worth trading
harvest for yield because piling up concurrent warmers could end up
taking down the node; or worse, slowing it down just enough to make
latency balloon.

If you encounter this error a lot, you can (in theory) increase the number in your maxWarmingSearchers, but that is risky to do unless you are confident you have the system resources (RAM, CPU, etc...) to do it safely. A MORE CORRECT WAY TO DEAL WITH THE SITUATION IS TO REDUCE HOW FREQUENTLY YOU SEND COMMITS. (emphases mine)

I think this is the crux of the issue. In the future I would like
Yokozuna to expose a soft/hard commit call with the ability to turn
off the auto commits. It won't make sense for everyone but it could
allow more efficient use of resources for those that have more regular
data load patterns.

@coderoshi Did the exceeded limit of maxWarmingSearchers=X error
cause any actual issues like crashing the node or cause failed writes?
Even if it failed writes I think I'm still inclined to prefer that
behavior over adding more load. I'm tempted to play around with this a
bit myself. Is there code I could run to reproduce your benchmark?

@coderoshi (Contributor, Author)

@rzezeski no, it didn't cause any failures, it was just filling the logs.

Here's a script I was running. https://gist.github.com/coderoshi/329b93ee7987b1beecdf

I later ran some tests directly against Solr, in order to take Riak/YZ out of the equation, for straight Solr tuning:

require 'securerandom'
require 'json'

# big_hash is the large test document built earlier in the script
bh = big_hash.clone
bh['_yz_id'] = SecureRandom.hex.to_s
doc = {"commit" => {}, "add" => {"doc" => bh}}.to_json
File.open("tmp/#{bh['_yz_id']}.json", 'w+') {|f| f.write(doc)}
`curl 'http://192.168.1.135:10034/solr/herp/update' -H'Content-type:application/json' --data-binary @tmp/#{bh['_yz_id']}.json 2> /dev/null`

@hectcastro (Contributor)

Putting this together as a way to verify my understanding of the maxWarmingSearchers changes proposed:

  • We're setting all of the cache autowarmCount settings to 0 in order to minimize the duration of auto warming so that maxWarmingSearchers is not exceeded
  • Here we are trading the value of caching for speed of making new values available in query results
  • We have no solr.QuerySenderListener configured by default, so combining that with setting all of the autowarmCount settings to 0, we've done pretty much all we can in minimizing the warming done before searchers are moved out of the "warming" state
  • maxWarmingSearchers was bumped from 2 to 4 because we removed as much warming overhead as possible
  • We may want maxWarmingSearchers to be > 1 because it is something that happens per index (having lots of indexes and setting this to 1 would hurt overall ability to provide fresh query results across all indexes)

@rzezeski (Contributor)

We may want maxWarmingSearchers to be > 1 because it is something that happens per index (having lots of indexes and setting this to 1 would hurt overall ability to provide fresh query results across all indexes)

Are you sure this applies across all cores? Each core gets its own solrconfig.xml so that is surprising behavior to me.

@rzezeski (Contributor)

maxWarmingSearchers was bumped from 2 to 4 because we removed as much warming overhead as possible

My understanding is it was bumped to avoid logging noise. AFAICT running
out of warmers is not a hard error but rather a safety valve to
prevent overloading the core. Reading the following wiki entry makes
me think lower would be better given our 1s auto soft-commit and
preference for availability. This becomes even more important if
warming searchers is per-core. If we have 16 cores that could be 48
threads crunching.

http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F

@coderoshi (Contributor, Author)

So my reasoning for zeroing out autowarmCount was in response to the Solr documentation's suggestions for slow commit times, and the fact that autowarmCount="0" is actually the default in current Solr.

As for bumping up maxWarmingSearchers, that was, as @rzezeski suggested, solely to reduce the complaints in the log.

I do believe we should keep autowarmCount at zero, however, I'm not so strident about keeping maxWarmingSearchers at 4. Increasing the size didn't seem to make much of a difference for my tests, so I figured what the hell?

If log noise doesn't bother anyone, I'm fine with dropping maxWarmingSearchers back to 2 or even 1.

@rzezeski (Contributor)

I do believe we should keep autowarmCount at zero

I agree. Given our 1s auto commit I think auto warming may be hurting
more than helping. This is something I've meant to benchmark for a
while now.

The real solution, in the future, is to provide APIs for manipulating
certain aspects of the solrconfig.xml because things like auto
commit, caches and auto warming are all highly dependent on query
profile. One size does not fit all here. I foresee plenty of complaints
in the future about the hardcoded solrconfig.xml but the 2.0 feature
ship has sailed so until then we can try to protect ourselves (if
users REALLY want a custom setup they should be able to override
manually at filesystem level).

@wbrown commented Apr 21, 2014

@rzezeski I grabbed the patch from this commit, and applied it to the riak 2.0beta1 sources that went up for download. I even have a nice Arch Linux packaging script that will build me a Riak 2.0 beta1 package with that patch installed.

Unfortunately, I ended up beating my head against the desk for hours. What had worked before doesn't appear to work anymore, and it possibly has to do with the bucket types stuff. It worked in the past by setting the bucket's search_index value to the index and schema in question.

So I got the latest bleeding edge Python client that had bucket types support added, worked my way through some of the bugs and lack of documentation there, and transitioned my indexing engine classes to use bucket types. I have indexes being created in Solr, buckets being assigned to bucket types, and bucket objects being grabbed from the bucket type object to ensure that we use the Yokozuna indexing. But I am absolutely not seeing any documents going into Yokozuna/Solr.

My setup steps are:

  • Initialize new Riak cluster.
  • Create bucket types from riak-admin with the following script:
bts=(obs entity raw)
for bt in "${bts[@]}";
do
    bin/riak-admin bucket-type create $bt '{"props":{}}'
    bin/riak-admin bucket-type activate $bt
done
  • Upon initialization of my application, it does the following:
        ### create our schema if it doesn't exist
        try:
            self.riak.get_search_schema(schemaName)
        except riak.RiakError:
            self.riak.create_search_schema(schemaName,
                open(schemaFile).read())

        ### our bucket type is the same name as our bucket name
        indexedBucketType = self.riak.bucket_type( self.bucketName )
        self.bucket = indexedBucketType.bucket( self.bucketName )

        ### create our index in Yokozuna if it doesn't exist
        if not self.bucketName in [ x['name']
                            for x in self.riak.list_search_indexes() ]:
            self.riak.create_search_index( RKeyString( self.bucketName ),
                                           RKeyString( schemaName ),
                                           self.replicas )
            print "* Index created for", self.bucketName
            time.sleep(5)

        ### Set the bucket type to our desired n_val.
        if indexedBucketType.get_property( "n_val", None ) != self.replicas:
            indexedBucketType.set_property( "n_val", self.replicas )

        ### set our bucket type's index to our desired index
        try:
            searchIndex = indexedBucketType.get_property("search_index", None)
        except KeyError:
            indexedBucketType.set_property("search_index", self.bucketName)
            searchIndex = indexedBucketType.get_property("search_index", None)
        if searchIndex != self.bucketName:
            indexedBucketType.set_property("search_index", self.bucketName)

Am I missing any obvious steps here?

When I store to an indexed bucket, I ensure that I acquire the bucket object via the bucket_type instantiation method:

        if indexEngine:
            self.getBucket = self.riak.bucket_type( bucketName ).bucket
        else:
            self.getBucket = self.riak.bucket

        self.bucket = self.getBucket(self.name)

When I double check the object representation with print, it is giving the expected representation -- a bucket type of raw and a bucket of raw.

@rzezeski (Contributor)

There seem to be dialyzer issues on develop, so I'm going to figure those out before doing the final review on this issue.

@wbrown commented Apr 22, 2014

@rzezeski @coderoshi So, reporting in now that I've successfully gotten this to work. The problems were some subtle issues in the bleeding-edge version of the Python client I had to use for bucket_type support, which affected my code. The client isn't ready for prime time, unfortunately, with some breaking API changes. I will need to sit down and figure out what issues and patches I can file against the client when I have more time than I do at the moment.

It is feeling like for high-load situations like mine, JVM tuning is absolutely essential. I took the JVM settings outlined in the initial comment opening this PR and applied them. To make them work, I had to enable Huge Pages and allocate about 2048 huge pages at 2MB each. This is with a slightly reduced -Xms4g -Xmx8g setting, as my development machines have 32GB each. I also had to raise the kernel shared memory maximum.

Things are looking better now:

node_puts : 22254
node_puts_total : 1189134
node_put_fsm_time_mean : 4309
node_put_fsm_time_median : 1325
node_put_fsm_time_95 : 3574
node_put_fsm_time_99 : 119076
node_put_fsm_time_100 : 259319

The above numbers put me in the ballpark of 300 to 400 new indexed keys a second, with an apparent median latency of 1325us

I'm seeing client-side put latency of 4.84ms to 10.96ms with a load of about 10-30 active client-side threads, with Yokozuna and a Bitcask backend. This is interleaved with a client get latency of about 1.01ms to 2.28ms. Not bad at all considering that the client is connecting to my Riak nodes over Gigabit.

Looking at the network side of the picture:

[riak@cumulus ~]$ netstat -anp | grep :8093 | grep TIME_WAIT | wc -l
49

[riak@cumulus ~]$ netstat -anp | grep :8093 | grep ESTABLISHED | wc -l
20

So, about 50 in TIME_WAIT and 20 in ESTABLISHED, and the numbers aren't growing.

We'll see how this does once I've fully reloaded my dataset thus far into the new database. It's been my observation that Yokozuna usually starts out great, and then slows down. A part of this improvement is that I've improved my selection and versioning algorithms since I last used Yokozuna, and had been having good performance with 2i -- but this is quite promising. Yokozuna allows a lot more flexibility in query terms than 2i, as well as multiple query terms.

Thank you -- and let me know if there's anything I can test or check out here. I am quite motivated to see a performant Yokozuna in 2.0 release. I'll keep you guys in the loop as I rebuild my dataset.

@jonmeredith (Contributor)

Thanks for the detailed report @wbrown.


@wbrown commented Apr 22, 2014

Update after the load test/data reloading had been running for about 14 hours. Shortly before I went to bed, I went ahead and started a second client feeding another set of data in.

I'm at about 30 million keys indexed now. With two observation engine clients and a GeoIP database client populating Riak over Gigabit on a single workstation, my client-side latency stats are:

  • GET ranges from 5.07ms to 20.26ms with an apparent median of ~11ms
  • PUT ranges from 7.20ms to 42.53ms with an apparent median of ~30ms

This is with a current rate of about 562 keys a second PUT into Riak.

node_put_fsm_active : 5
node_put_fsm_active_60s : 34074
node_put_fsm_in_rate : 429
node_put_fsm_out_rate : 427

Latency measured by Riak is pretty good, though the 99th and 100th percentile times aren't great at 327ms and 2679ms:

node_put_fsm_time_mean : 13050
node_put_fsm_time_median : 1413
node_put_fsm_time_95 : 11724
node_put_fsm_time_99 : 327825
node_put_fsm_time_100 : 2679843

GET latencies are still pretty good from the Riak side, though the 99th and 100th are horrible at 173ms and 2799ms:

node_get_fsm_time_mean : 5374
node_get_fsm_time_median : 254
node_get_fsm_time_95 : 5424
node_get_fsm_time_99 : 173151
node_get_fsm_time_100 : 2799747

With 30 million keys, that's enough to outsize the 32GB of RAM in my two test nodes, and 32GB of RAM in my Infiniband ZFS filer. Each node is up to about 129GB of data stored, with a compressratio of 2.35x, for an actual dataset size of 303GB, so we're getting a pretty good picture of 'real world' dataset load.

Riak with Bitcask is at 9.1GB of RAM and growing, while Solr is at 2.57GB of RAM.

So, things are overall improved versus my first attempt basically causing Riak, Yokozuna, and Solr to crash and burn. However, some confusing flies in the ointment:

  • I am still seeing a fair amount of socket turnover. The numbers are even, but there's still connections dropped and connected despite being under load.
[root@cumulus wbrown]# netstat -anp | grep :8093  | grep EST | grep java | wc -l
10
[root@cumulus wbrown]# netstat -anp | grep :8093  | grep TIME_WAIT  | wc -l
45

Looking closer:

[root@cumulus wbrown]# netstat -anp | grep :8093  | grep EST | grep java; echo "Waiting 10s"; sleep 10; netstat -anp | grep :8093  | grep EST | grep java;
tcp6       0      0 127.0.0.1:8093          127.0.0.1:56964         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:43465         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:40233         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:40236         ESTABLISHED 24086/java          
tcp6     946      0 127.0.0.1:8093          127.0.0.1:45451         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:47951         ESTABLISHED 24086/java          
tcp6    5962      0 127.0.0.1:8093          127.0.0.1:38766         ESTABLISHED 24086/java          
Waiting 10s
tcp6       0      0 127.0.0.1:8093          127.0.0.1:40469         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:50734         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:42569         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:45451         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:60645         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:48799         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:39427         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:47883         ESTABLISHED 24086/java          
tcp6       0    147 127.0.0.1:8093          127.0.0.1:38766         ESTABLISHED 24086/java          
tcp6       0      0 127.0.0.1:8093          127.0.0.1:51918         ESTABLISHED 24086/java 

We can see that the connections have completely turned over in a 10s interval, with a single connection maintained. It's almost as if the timeout settings are not working. Another interesting note to make -- the connections via lo are IPv6 rather than IPv4, so I wonder if that's making an impact. There shouldn't be a near complete connection turnover in 10s if the timeouts are set to 60s.

A side question: I'm really big on metrics. You cannot do performance analysis without metrics. Do we have any Yokozuna latency metrics exposed anywhere, or do I need to set up Riak on OmniOS for dtrace? :) It'd be nice to have them in the riak-admin status pane. I can open an issue for this if that would be helpful.

@rzezeski (Contributor)

@wbrown First, thank you so much for all the great information from
the field. The detail of your reporting is much appreciated. That
said, here is a reply to one of your comments:

I am still seeing a fair amount of socket turnover. The numbers are even, but there's still connections dropped and connected despite being under load.
...
We can see that the connections have completely turned over in a 10s interval, with a single connection maintained. It's almost as if the timeout settings are not working. Another interesting note to make -- the connections via lo are IPv6 rather than IPv4, so I wonder if that's making an impact. There shouldn't be a near complete connection turnover in 10s if the timeouts are set to 60s.

Yes, this is what I was trying to explain in my
comment about the ibrowse load balancer the other day. It
defaults to a 10s inactivity_timeout. That, along with its poor
algorithm, causes connections to constantly hit this 10s timeout. This
timeout was NOT adjusted in this patch. I tried setting it in the
global config but ibrowse didn't seem to pick it up. I want to try
again by setting the timeout in the code, but I may open a more
focused PR as this one is already making too many changes at once.

@rzezeski (Contributor)

My findings are that a higher timeout doesn't prevent the
EofException; it simply requires a higher-latency request to
hit it. I verified this by setting both the ibrowse request timeout
and the Jetty idle timeout to 5 seconds and applying load. It managed
to produce plenty of EofExceptions. The only way to completely
prevent the EofException was to make the ibrowse request timeout
substantially higher than the Jetty idle timeout. This allows Jetty
to be in control of closing connections. This can be explored in a
different patch if need be.

I also found that these changes don't solve connection churn, as that
is related to a combination of the inactivity_timeout and the bad load
balancing algorithm [1]. This, in theory, could be fixed by raising
the inactivity_timeout, but I will leave that to a focused
PR rather than appending to this one.

Overall, the 60 second timeout did make things quieter since it makes
it harder to time out. I also noticed a bump in throughput of about
27%, from 1.1 writes/s to 1.4 writes/s. This patch actually causes the
median latency to rise slightly because it allows requests to
take longer than 30 seconds. The 99th is lower, probably due to fewer
crashes and thus less busy work on the server.

|            | Throughput (ops/s) | Median | 99th   |
|------------|--------------------|--------|--------|
| develop    | 1.1                | 17.1 s | 44.5 s |
| patch #358 | 1.4                | 19.5 s | 40.3 s |

In general, I think timeouts need to be looked at more closely,
especially in regards to user semantics. If a user sets a timeout of 5
seconds on the write request, should the index operation be allowed to
block the vnode for 60 seconds?

With that I'm going to consider the time outs reviewed so that I can
get to the other changes made in this PR.

@rzezeski (Contributor)

The riak tests look good:

Test Results:
yz_wm_extract_test-bitcask   : pass
yz_stat_test-bitcask         : pass
yz_solr_start_timeout-bitcask: pass
yz_siblings-bitcask          : pass
yz_security-bitcask          : pass
yz_schema_admin-bitcask      : pass
yz_rs_migration_test-bitcask : pass
yz_ring_resizing-bitcask     : pass
yz_pb-bitcask                : pass
yz_monitor_solr-bitcask      : pass
yz_mapreduce-bitcask         : pass
yz_languages-bitcask         : pass
yz_index_admin-bitcask       : pass
yz_fallback-bitcask          : pass
yz_errors-bitcask            : pass
yz_dt_test-bitcask           : pass
yokozuna_essential-bitcask   : pass
aae_test-bitcask             : pass
---------------------------------------------
0 Tests Failed
18 Tests Passed
That's 100.0% for those keeping score

I'm not aware of `ON` being a reserved word in Lucene query
syntax. Given the original reporter mentioned `OR`, I've updated the
test to use that.
@rzezeski (Contributor)

I don't recall ON being a reserved word in Lucene query syntax, so I changed it to OR. The test still passes.

Test Results:
yz_languages-bitcask: pass
---------------------------------------------
0 Tests Failed
1 Tests Passed
That's 100.0% for those keeping score

@rzezeski (Contributor)

+1 61d632

List of the things I reviewed:

  • compile/unit
  • dialyzer
  • riak test
  • review timeouts
  • review OR change
  • review solrconfig.xml changes
  • review schema changes
  • review jetty.xml changes

@rzezeski (Contributor)

+1 61d6326

borshop added a commit that referenced this pull request Apr 22, 2014
Soften JVM overload failures

Reviewed-by: rzezeski
@rzezeski (Contributor)

@borshop merge

@borshop borshop merged commit 61d6326 into develop Apr 22, 2014
@rzezeski rzezeski mentioned this pull request Apr 23, 2014
2 tasks
rzezeski added a commit that referenced this pull request Apr 23, 2014
Add `json_obj_valgen` to the Yokozuna BB driver. I used this to test
indexing "large" JSON objects while investigating #358.  The config
line to use this value generator looks like so:

```
{value_generator, {function, yz_driver, json_obj_valgen,
[<NumFields>]}}.
```

This will generate a static JSON object (i.e. generated once at start
up) which contains `<NumFields>` integer fields.  All workers will
index the identical object.  Here is an example config from one of my
testing runs:

```
{mode, max}.
{concurrent, 32}.
{driver, yz_driver}.
{code_paths, ["/root/work/yz-bb/misc/bench"]}.
{secure, false}.
{bucket, {<<"data">>, <<"largeobj">>}}.
{index, <<"largeobj">>}.
{pb_conns, [{"10.0.1.201", 8087}]}.
{http_conns, []}.
{duration, infinity}.
{key_generator, {to_binstr, "~B", {partitioned_sequential_int, 0,
1000}}}.
{value_generator, {function, yz_driver, json_obj_valgen, [4000]}}.
{operations, [{load_pb, 1}]}.
```

This uses 32 workers to write the same JSON object containing 4000
integer fields over the key space 0 to 999.
rzezeski added a commit that referenced this pull request May 9, 2014
A combination of ibrowse's inefficient load balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are
increased.  This patch is a temporary workaround to reduce socket
churn and thus TIME-WAIT load.  It does so by increasing the
inactivity timeout to 60 seconds across the board.  Below is a chart
showing the amount of socket churn in connections per minute for
the different timeout values, both at idle and while under load from
basho bench.  These numbers were calculated by a DTrace script which
counted the number of new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 60s     | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0 when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues,
this is but just one.

For more background see the following issues:

#367
#358
#330
#320
@rzezeski rzezeski mentioned this pull request May 9, 2014
rzezeski added a commit that referenced this pull request May 9, 2014
A combination of ibrowse's inefficient load balancing algorithm and
its default socket inactivity timeout of 10 seconds can cause TIME-WAIT
load.  It becomes worse as the pool or pipeline sizes are
increased.  This patch is a temporary workaround to reduce socket
churn and thus TIME-WAIT load.  It does so by increasing the
inactivity timeout to 600 seconds across the board.  Below is a chart
showing the amount of socket churn in connections per minute for
the different timeout values, both at idle and while under load from
basho bench.  These numbers were calculated by a DTrace script which
counted the number of new connections being accepted on port 8093.

| Timeout | Socket Churn At Idle | Socket Churn Under Load |
|---------|----------------------|-------------------------|
| 10s     | ~59 conns/min        | ~29 conns/min           |
| 600s    | ~9 conns/min         | ~6 conns/min            |

The timeout is set via application env because ibrowse has the
absolute most complex configuration management code I have ever seen
and this was the easiest way to make sure the timeout is set
correctly.  This is just a workaround until after 2.0 when other HTTP
clients and pools may be tested.  ibrowse seems to have many issues,
this is but just one.

For more background see the following issues:

#367
#358
#330
#320
rzezeski added a commit that referenced this pull request May 20, 2014
@rzezeski rzezeski deleted the er/throughput branch May 22, 2014 19:41