[Performance On Large clusters] Reduce updates on large services #4720

Conversation

@pierresouchay (Contributor) commented Sep 28, 2018

Checks now update services/nodes only when they have really been modified, to avoid too many updates on very large clusters.

In a large cluster with a few thousand nodes, the anti-entropy mechanism performs lots of changes (several per second) even though nothing has really changed. This patch aims to improve that, in order to increase Consul's scalability when, for instance, many blocking requests are made on health endpoints.

For the record, anti-entropy runs every 6-7 minutes on our side, and it alone causes more than 15 changes in catalog/nodes; even our large services (around 1k nodes, which is still a lot) keep being touched while they are perfectly stable.
What this means is that our large services are constantly updated and every health watcher is notified.

This has a huge impact on the load of the Consul servers and completely breaks performance.

This optimization reduces the load on the Consul servers: watchers are no longer notified for nothing when anti-entropy, or any request that does not change the content of a node/service/check, is performed.
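Conceptually, the change comes down to comparing the incoming registration with what is already stored and skipping the write (and the index bump that wakes watchers) when nothing differs. Below is a minimal sketch of that pattern; the types, fields and maps are purely illustrative, not Consul's actual memdb-backed state store:

package sketch

import "reflect"

// Illustrative types only; the real catalog entries live in Consul's
// memdb-backed state store, not in plain maps.
type serviceEntry struct {
	ID      string
	Address string
	Port    int
	Tags    []string
}

// isSame compares everything that matters for the catalog content and
// deliberately ignores the Raft indexes (CreateIndex/ModifyIndex).
func (e *serviceEntry) isSame(other *serviceEntry) bool {
	return e.ID == other.ID &&
		e.Address == other.Address &&
		e.Port == other.Port &&
		reflect.DeepEqual(e.Tags, other.Tags)
}

// ensureService skips the write entirely when the incoming registration is
// identical to what is already stored, so the service's ModifyIndex does not
// move and blocking queries are not woken up for nothing.
func ensureService(store map[string]*serviceEntry, modifyIndex map[string]uint64, idx uint64, svc *serviceEntry) {
	if existing, ok := store[svc.ID]; ok && existing.isSame(svc) {
		return // no-op registration: keep the old index, watchers stay asleep
	}
	store[svc.ID] = svc
	modifyIndex[svc.ID] = idx // bumping this is what wakes up blocking queries
}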

We have had lots of performance issues these days, so we would really appreciate a quick review.

Thank you very much

@pierresouchay (Contributor, Author)

@banks this modification might greatly improve the performance of your in-progress cache implementation on large clusters with services that have many instances.

@pierresouchay (Contributor, Author) commented Oct 3, 2018

@banks @mkeeler @pearkes We applied this patch to one of our mid-sized clusters (3400 nodes, ~1300 distinct services). It reduced the network bandwidth needed by our servers by more than half!

We also tested it on our preprod clusters with similarly nice results, and the improvement grows considerably with the number of nodes!

On a large cluster this is a game changer, in the same way that the per-service index was a very important change for scalability.

This patch basically lets services that do not change much avoid being updated for nothing, so clients are notified only when real changes occur. Here are our results...

Rate of Changes with wait=10m on a stable Service

This shows the number of updates on a stable service with 32 instances (anti-entropy no longer changes the service unless really needed). It means that if you use wait=10m with the correct index on a blocking query against a stable service, you will only get a result after the full 10m. On our side we have more than 1300 real services (x2 because of proxies), so all watchers get far fewer notifications.
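For reference, a minimal watcher issuing such a blocking query with the Go API client might look like the sketch below; the service name "web" and the surrounding loop are illustrative, WaitIndex and WaitTime are the standard api.QueryOptions fields:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocks for up to 10 minutes; returns earlier only if the
		// service's index moves past lastIndex. With this patch, a stable
		// service no longer bumps its index on every anti-entropy sync,
		// so this loop wakes up far less often.
		entries, meta, err := client.Health().Service("web", "", true, &api.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  10 * time.Minute,
		})
		if err != nil {
			log.Fatal(err)
		}
		lastIndex = meta.LastIndex
		fmt.Printf("woke up: %d healthy instances (index=%d)\n", len(entries), lastIndex)
	}
}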

One single small service (32 instances) watched by many apps, rate per 10m:

screen shot 2018-10-03 at 17 30 51

On all services at the same time, here is the result (rate per 10m)

screen shot 2018-10-03 at 17 27 52

Real Life impact on this Datacenter

OK, that was the theory; here are the raw results.

99th percentile on Read Stale delay

While we used to see up to 1s of latency on this DC, the latency dropped to less than 256ms.

screen shot 2018-10-03 at 16 47 33

Load Average / CPU Load

Decreasing really nicely on the servers
screen shot 2018-10-03 at 16 44 30

Network bandwidth on server

Reduced by more than half...
screen shot 2018-10-03 at 17 46 32

Over 7 days (the last bar on the right is the new version with this patch)

Req/s on a single service

Req/sec on the server for the service shown above (32 nodes). Some data points are missing before the deployment (MEP) because of a temporary breakage of our metrics system.

screen shot 2018-10-04 at 23 55 29

Various Consul metrics over 7 days (the last bar is the MEP with this patch)

screen shot 2018-10-04 at 23 59 40

Conclusion

This patch has a real impact on large clusters. I am also going to test it tomorrow on larger DCs (up to 7k nodes); we expect even bigger improvements.

With this patch, for the first time in weeks we managed to stay within the SLA we provide to our internal clients, so I really think this is a very important optimization (and not that big or intrusive a change).

Could you please have a look?

Kind regards

@pearkes (Contributor) commented Oct 5, 2018

Thanks @pierresouchay, this looks like some great investigative work and the benefits seem awesome 😄. Sorry we haven't had a chance to review it in detail yet, but we will likely come back around after a release we're leading up to in the next week or so (1.3).

@pierresouchay (Contributor, Author)

Thank you very much @pearkes,

I get it, that's perfectly fine. However, would it be possible to have a look at #3551? It has been more or less ready for a while, with all requested fixes provided, and we have had to rebase it and resolve conflicts with each release for almost a year.

That one carries almost no risk since it does not touch the servers at all nor modify any kind of API...

Kind regards

@mkeeler mkeeler requested a review from a team October 5, 2018 18:17
@mkeeler (Member) left a review comment

@pierresouchay This looks really good.

Is it safe to say the performance gains come from not waking up anything blocking on the catalog, because data is no longer reinserted when it is the same? Or, put the other way, that the benefit is not mostly from avoiding the overhead of reindexing within memdb?

Marking this as "Requesting Changes" because I have a few questions and requests for an additional code comment or two, but the actual code looks ready to me.

entry.CreateIndex = serviceNode.CreateIndex
entry.ModifyIndex = serviceNode.ModifyIndex
if entry.IsSame(serviceNode) {
    modified = false
@mkeeler (Member):

I assume there is a reason why we cannot just return here, like you did for the IsSame check on the node. If so, it would be good to have a comment indicating why not: why the entry still needs to be inserted, but the "index" table doesn't require updating.

@pierresouchay (Contributor, Author):

Not really, it was just to avoid changing too many tests.

Another possible reason was that Consul adds some values into the DB, for instance values pre-filled by ToServiceNode(), such as Weights. While the semantics do not change, this way we are sure we keep the same kind of data before and after the patch (and upgrade semantics stay untouched).

Added comments in the next patch.

@mkeeler (Member):

As for the default values coming from ToServiceNode, that call happens prior to the IsSame check so if any of the defaults change then IsSame should return false and cause reinsertion.

As for the other reason of wanting to return ErrMissingNode, why can't we just move the node check block to before the ToServiceNode call?

@pierresouchay (Contributor, Author):

You are right, I wanted to keep the change as small as possible, but you are definitely right; changing that...
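Roughly, the ordering suggested above would look like the sketch below. This is only a sketch: the function name, signatures and table lookups are assumptions, not the actual state-store code.

package sketch

import (
	"errors"

	"github.com/hashicorp/consul/agent/structs"
	"github.com/hashicorp/go-memdb"
)

var errMissingNode = errors.New("node registration is missing") // stand-in for the real error

func ensureServiceSketch(tx *memdb.Txn, idx uint64, node string, svc *structs.NodeService) error {
	// 1. Check the node first, so the early return below can never mask a
	//    missing node (preserves the ErrMissingNode behaviour).
	existingNode, err := tx.First("nodes", "id", node)
	if err != nil {
		return err
	}
	if existingNode == nil {
		return errMissingNode
	}

	// 2. Build the entry to insert, including any defaulted values
	//    (e.g. Weights), so the comparison sees exactly what would be stored.
	entry := svc.ToServiceNode(node)

	// 3. If an identical entry is already stored, return early: no insert,
	//    no index bump, and therefore no spurious wake-up of watchers.
	existing, err := tx.First("services", "id", node, svc.ID)
	if err != nil {
		return err
	}
	if sn, ok := existing.(*structs.ServiceNode); ok && entry.IsSame(sn) {
		return nil
	}

	// 4. Otherwise insert the entry and bump the "services" index as before
	//    (the index-table update is omitted in this sketch).
	entry.ModifyIndex = idx
	return tx.Insert("services", entry)
}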

(Resolved review thread on agent/structs/structs.go, now outdated)
@pierresouchay (Contributor, Author)

@mkeeler I added the comment and made the contract of IsSame() a no-brainer for future users of the function.

@mkeeler (Member) commented Oct 5, 2018

@pierresouchay It looks like GitHub isn't showing the updated comment at the bottom of this page, so I will ask it here again to make sure it's easily visible.

I am not certain we can't return early when the IsSame check for services returns true. As for wanting to keep backwards compatibility with returning ErrMissingNode, I think you are right. But then the next question becomes: why not move that whole check to before the ToServiceNode call?

As for the other reason, ensuring default values get inserted: shouldn't the IsSame check pick up those changes and ensure that the reinsertion happens?

@pierresouchay (Contributor, Author)

@mkeeler I am also a bit lost with thread handling on GitHub :-)

You are right, my last patch fixes it: I removed the modified attribute, so I think you will be happy. I wanted to change as few lines as possible, but your approach is more reasonable.

@mkeeler (Member) left a review comment

Looks good to me.

@banks banks added this to the 1.3.0 milestone Oct 9, 2018
@pierresouchay pierresouchay force-pushed the checks_do_not_modify_services_when_not_needed branch from a1ed401 to aab2d16 Compare October 9, 2018 16:33
@mkeeler (Member) commented Oct 10, 2018

@pierresouchay That Travis failure is legit. I think you need to update the ServiceNode.IsSame check: the f-envoy branch, which just got merged, adds a couple of new fields to ServiceNode that you will need to compare in that function as well.

@pierresouchay (Contributor, Author)

@mkeeler @banks I have a very strange error.

On my Mac Laptop, I cannot reproduce the same error as Travis:

I have:

go test -timeout 30s github.com/hashicorp/consul/api -run '^TestAPI_CatalogConnect$'
...
--- FAIL: TestAPI_CatalogConnect (9.19s)
    retry.go:116: catalog_test.go:433: Unexpected response code: 500 (1 error(s) occurred:

        * ProxyDestination must be non-empty for Connect proxy services)

FAIL
FAIL	github.com/hashicorp/consul/api	9.206s

I have no clue about the reason why (spent 3 hours on it)...
My version is:

go version
go version go1.11 darwin/amd64

The weird thing is that I have the same exact failure on master with be52793aad664c771637689e8666d5bbf26ac028 :

git rev-parse --verify HEAD && go clean && go test -timeout 30s github.com/hashicorp/consul/api -run '^TestAPI_CatalogConnect$'
be52793aad664c771637689e8666d5bbf26ac028
2018/10/11 01:26:54 CONFIG JSON: {"node_name":"node-72de5520-ce8e-ccd0-4bfd-72a951eabe51","node_id":"72de5520-ce8e-ccd0-4bfd-72a951eabe51","performance":{"raft_multiplier":1},"bootstrap":true,"server":true,"data_dir":"/tmp/consul-test/TestAPI_CatalogConnect-consul163839338/data","segments":null,"disable_update_check":true,"log_level":"debug","bind_addr":"127.0.0.1","addresses":{},"ports":{"dns":29501,"http":29502,"https":29503,"serf_lan":29504,"serf_wan":29505,"server":29506},"acl_enforce_version_8":false,"connect":{"ca_config":{"cluster_id":"11111111-2222-3333-4444-555555555555"},"enabled":true,"proxy":{"allow_managed_api_registration":true}}}
bootstrap = true: do not enable unless necessary
==> Starting Consul agent...
[...]
--- FAIL: TestAPI_CatalogConnect (9.20s)
    retry.go:116: catalog_test.go:433: Unexpected response code: 500 (1 error(s) occurred:

        * ProxyDestination must be non-empty for Connect proxy services)

FAIL
FAIL	github.com/hashicorp/consul/api	9.214s

So either my version of Go 1.11 (I also tried Go 1.10.3 with the same result) is completely broken (I just installed it a few hours ago), or something really weird has changed in the tests.

I'll try tomorrow from a Linux machine, but I have no clue what is causing this weirdness. Do any of you have an idea?

@banks (Member) commented Oct 11, 2018

@pierresouchay I suspect it's because you are running the API tests without using the Makefile.

The API tests run the consul binary from your $PATH as the server, so you probably have an older consul build in your path.

This is really gross and we should be able to fix it, but for now you either need to run make dev first, or run the test using make test GOTEST_PKGS="./api".

The API test reliance on the consul binary in path is a sad thing that has caught most Consul devs out at least once...

@banks (Member) commented Oct 11, 2018

I checked out your branch and it passes locally for me - CI is green too now 🎉 thanks for the time spent!

@banks banks merged commit 51b33ef into hashicorp:master Oct 11, 2018
@pierresouchay (Contributor, Author)


Well spotted, I had the exact same issue while trying to implement api/agent in #3551 (I spent so much time trying to figure out why my changes were not taken into account)
