Performance & Metrics changes #56

mikeyyuen · 2020-02-18T18:55:07Z

Changes that Bloomberg has been using in production to support 3,000+
nodes with 100 ESMs.

- Adds metrics for health checks and performance
- Adds support for > 64 ESM instances
- Improvements to allow checks to run every 1s safely
- Improvements to prevent spurious node status updates
- Improvements to allow disabling node coordinate updates
- Fixes an issue where check updates can fail to propagate to ESM followers
- Extra logging
- Exposes new config options
- Adds ability to skip node coordinate updates if they are small (reducing spurious updates)
- Updated to go mod for vendored dependencies

Please let us know what you think.
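
To illustrate the "skip node coordinate updates if they are small" item above, here is a minimal sketch, not the code from this PR. The `Coordinate` type, the `shouldSkipUpdate` helper, and the `minDelta` threshold are all hypothetical names chosen for illustration. It assumes a plain Euclidean distance, which is simpler than what real network coordinates use.

```go
package main

import (
	"fmt"
	"math"
)

// Coordinate is a simplified, hypothetical stand-in for a network coordinate.
// It is not the type that consul-esm actually uses.
type Coordinate struct {
	Vec []float64
}

// distance returns the Euclidean distance between two coordinates.
// Coordinates with mismatched dimensions are treated as infinitely far apart
// so that such an update is never skipped.
func distance(a, b Coordinate) float64 {
	if len(a.Vec) != len(b.Vec) {
		return math.Inf(1)
	}
	var sum float64
	for i := range a.Vec {
		d := a.Vec[i] - b.Vec[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// shouldSkipUpdate reports whether the move from prev to next is smaller than
// minDelta (an assumed, illustrative threshold), in which case the write to
// the catalog could be skipped.
func shouldSkipUpdate(prev, next Coordinate, minDelta float64) bool {
	return distance(prev, next) < minDelta
}

func main() {
	prev := Coordinate{Vec: []float64{0.010, 0.020}}
	small := Coordinate{Vec: []float64{0.0101, 0.0201}}
	large := Coordinate{Vec: []float64{0.050, 0.020}}

	fmt.Println(shouldSkipUpdate(prev, small, 0.001)) // small move: skip the write
	fmt.Println(shouldSkipUpdate(prev, large, 0.001)) // large move: perform the write
}
```

The idea is that a node whose coordinate barely moves between probes generates no catalog write, which is one plausible way to cut down on spurious updates at the scale described.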

hashicorp-cla · 2020-02-18T18:55:11Z

All committers have signed the CLA.

lornasong · 2020-02-19T22:28:27Z

Hi @mikeyyuen - thanks so much for this PR. Looking forward to reviewing this!

lornasong · 2020-02-21T20:43:15Z

@mikeyyuen, thanks again for the pull request!

Would you be able to update/rebase the changes? It looks like the fork is ~34 commits behind, which has led to some conflicts. We've also stopped committing our vendor directory so updating as such would help resolve some of those conflicts.

Let me know if you have any questions. Thank you

edevil · 2020-04-30T16:04:25Z

Hello. This work would be useful for us. Is anyone working on this? @mikeyyuen Is help required?

lornasong · 2020-04-30T20:42:51Z

@edevil thanks for the comment. Actually as timing would have it, I started working on rebasing this just a couple days ago and am currently working through conflicts. Don't hesitate to let me know if you have any questions or thoughts.

@mikeyyuen just an FYI (see above). If you have any questions or concerns, please let me know.

Thank you

edevil · 2020-05-02T09:37:04Z

@lornasong That's great news! Let me know if I can be of help.

lornasong · 2020-05-06T22:23:43Z

Hi @mikeyyuen, we really appreciate your changes! They will greatly benefit our community. Since we haven't heard from you in a couple months, we decided to carry your changes and make progress with edits ourselves.

I've initially rebased and resolved the conflicts with #58. Note, it looks like unit tests aren't passing which I'm happy to resolve.

Would you be able to take a look at #58 and make sure that you think it accurately reflects your work? Once it looks good, would it be possible for you to sign the CLA again?

For context on re-signing the CLA: our CLA check whitelists only PRs originating from a fork in a whitelisted org. Since I've branched off of your fork within HashiCorp's repo, your original CLA doesn't 'transfer over'.

If you have any questions/concerns or other alternatives, please reach out. Thank you!

mikeyyuen · 2020-05-09T09:54:28Z

Taking a look now! Thank you for rebasing.

I can force-push your changes onto this PR, which should allow you to get a clean CLA (I don't think the corporate CLA will sign properly unless it comes from the Bloomberg org).

mikeyyuen · 2020-05-12T16:48:58Z

I was able to fix a bunch of the tests; they were just missing the new config parameters. There is still something fundamentally flaky about tests that go near agent_test.go, and I'm not sure exactly what it is. I can take another look tomorrow, but @lornasong, would you mind having a look too?

Also is there something I need to do to trigger circleci from this PR?

lornasong · 2020-05-12T21:47:02Z

@mikeyyuen thanks again for the review and push!

Yes, we definitely have some flaky tests. I typically see them in leader_test but let me know if there's a particular error you see in agent_test. I wasn't able to reproduce it but would be happy to look into it. Regardless, your commit triggered CircleCI and the tests are passing, thanks!

I'd like to get your opinion. It is likely we'll want to add some changes to your PR. For example, we are planning to have our products converge on OpenTelemetry and would update this in your PR. We could either make changes and ask you to force-push them, or you could update your PR to merge into a non-master branch and we could take it from there. I'm guessing you have a lot on your plate and we'd like to avoid adding to it.

Please let us know how you'd like to move forward. Thanks!

mikeyyuen · 2020-05-12T22:08:51Z

Is OpenTelemetry backward compatible? If so I’d suggest landing this on master to avoid any more rebasing!

If not, either way would work for me. Let me know!

edevil · 2020-05-25T08:04:11Z

Any news on this?

lornasong · 2020-05-26T21:01:53Z

@edevil, thanks for checking in and apologies for not being more proactive with updates.

Last week, I reviewed the changes and spoke with the team. As an update, here’s what we’re thinking of doing:

- Merge the PR to a non-master branch
- We’d like to split out the telemetry changes from the rest of the changes
- For non-telemetry changes, make a few additional commits and then merge into master
- For telemetry changes, get further input since we want to move from go-metrics to OpenTelemetry. From initial team feedback, it seems like it might not be backwards compatible since go-metrics supports more features than OpenTelemetry (e.g. statsd). More research is needed here.

@edevil, please let me know if there are specific changes you’re planning to use. That will help us prioritize. I will plan to switch to focusing back on this PR tomorrow.

@mikeyyuen, if you have any feedback, please let me know.

Thanks!

mikeyyuen · 2020-05-26T21:17:00Z

Thanks for the update. I think for us statsd is kind of critical. It's the main way we have for getting OSS metrics into a common metrics backend. I’d definitely hope anything Consul uses down the line supports it!

edevil · 2020-05-27T10:45:14Z

Thank you for the update @lornasong. We use Prometheus, which I think is already supported by OpenTelemetry, so that will not be a problem in our case.

As for the rest of the changes, since they are informed by the experience of running a large scale cluster, we find them very valuable.

lornasong · 2020-05-27T16:06:07Z

@mikeyyuen, thanks for the feedback. That's helpful to know that statsd is critical for your team. I'll make sure this is included as we sort out how to handle telemetry.

@edevil, thanks for sharing more about your use-case. That's good to know that you're using Prometheus and interested in the large-scale cluster features.

lornasong · 2020-05-29T16:58:48Z

For non-telemetry changes, please route discussion to #63. Additional details and context provided there.

I will provide an update for telemetry-specific changes once I’ve understood how to move forward regarding the CLA.

Thank you

lornasong · 2020-06-10T17:46:06Z

An update here:

- The change to upgrade from Go version 1.12 to 1.13 was moved into a separate PR: Update Go Version From 1.12 to 1.13 #64
- PR for telemetry-specific changes: Bloomberg PR: Telemetry improvements #67. Please continue any discussions here!
- I have the go-ahead from our legal team to merge Bloomberg PR: Non-telemetry improvements #63 and Bloomberg PR: Telemetry improvements #67 when ready - details will be within those PRs.

Thank you

lornasong mentioned this pull request May 5, 2020

Performance & Metrics changes (carries #56) #58

Closed

lornasong added enhancement waiting for reply labels May 6, 2020

mikeyyuen force-pushed the to_upstream branch from 0b456ec to 950800c Compare May 12, 2020 16:08

Fixes for coordinate tests

78a25dc

lornasong added thinking and removed waiting for reply labels May 18, 2020

lornasong changed the base branch from master to bloomberg-dev May 27, 2020 16:10

lornasong merged commit ebf5aec into hashicorp:bloomberg-dev May 27, 2020

lornasong mentioned this pull request May 28, 2020

Bloomberg PR: Non-telemetry improvements #63

Merged

lornasong removed the thinking label May 28, 2020

lornasong mentioned this pull request Jun 8, 2020

Update Go Version From 1.12 to 1.13 #64

Merged

lornasong mentioned this pull request Jun 10, 2020

Bloomberg PR: Telemetry improvements #67

Merged

mikeyyuen deleted the to_upstream branch June 21, 2020 11:18

lornasong added this to the 0.4.0 milestone Jul 24, 2020
