Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance & Metrics changes #56

Merged
merged 2 commits into from
May 27, 2020

Conversation

mikeyyuen
Copy link
Contributor

Changes that Bloomberg have been using in production to support 3,000+
nodes with on 100 ESM's.

  • Adds metrics for health checks and performance
  • Adds support for > 64 ESM Instances
  • Improvements to allow a checks to run ever 1s safely
  • Improvements to prevent spurious node status updates
  • Improvements to allow disabling node coordinate updates
  • Fix an issue where check updates can fail to propagate to ESM Followers
  • Extra Logging
  • Exposed new config options
  • Adds ability to skip node coordinate updates if they are small (reducing spurious updates)
  • Updated to go mod for vendored dependencies

Please let us know what you think.

@hashicorp-cla
Copy link

hashicorp-cla commented Feb 18, 2020

CLA assistant check
All committers have signed the CLA.

@lornasong
Copy link
Member

Hi @mikeyyuen - thanks so much for this PR. Looking forward to reviewing this!

@lornasong
Copy link
Member

@mikeyyuen, thanks again for the pull request!

Would you be able to update/rebase the changes? It looks like the fork is ~34 commits behind, which has led to some conflicts. We've also stopped committing our vendor directory so updating as such would help resolve some of those conflicts.

Let me know if you have any questions. Thank you

@edevil
Copy link
Contributor

edevil commented Apr 30, 2020

Hello. This work would be useful for us. Is anyone working on this? @mikeyyuen Is help required?

@lornasong
Copy link
Member

@edevil thanks for the comment. Actually as timing would have it, I started working on rebasing this just a couple days ago and am currently working through conflicts. Don't hesitate to let me know if you have any questions or thoughts.

@mikeyyuen just an FYI (see above). If you have any questions or concerns, please let me know.

Thank you

Changes that Bloomberg have been using in production to support 3,000+
nodes with on 100 ESM's.

 - Adds metrics for health checks and performance
 - Adds support for > 64 ESM Instances
 - Improvements to allow a checks to run ever 1s safely
 - Improvements to prevent spurious node status updates
 - Improvements to allow disabling node coordinate updates
 - Fix an issue where check updates can fail to propagate to ESM Followers
 - Extra Logging
 - Exposed new config options
 - Adds ability to skip node coordinate updates if they are small (reducing spurious updates)
 - Updated to go mod for vendored dependencies
@edevil
Copy link
Contributor

edevil commented May 2, 2020

@lornasong That's great news! Let me know if I can be of help.

@lornasong
Copy link
Member

Hi @mikeyyuen, we really appreciate your changes! They will greatly benefit our community. Since we haven't heard from you in a couple months, we decided to carry your changes and make progress with edits ourselves.

I've initially rebased and resolved the conflicts with #58. Note, it looks like unit tests aren't passing which I'm happy to resolve.

Would you be able to take a look at #58 and make sure that you think it accurately reflects your work? Once it looks good, would it be possible for you to sign the CLA again?

For context on re-signing the CLA: our CLA whitelists only by PRs originating from a fork in a whitelisted org. Since I've branched off of your fork within HashiCorp's repo, your original CLA doesn't 'transfer over'.

If you have any questions/concerns or other alternatives, please reach out. Thank you!

@mikeyyuen
Copy link
Contributor Author

Taking a look now! Thank you for rebasing.

I can forrce push your changes on to this PR that should allow you to get a clean CLA (I don't think the corporate CLA will sign properly unless it comes from the Bloomberg org)

@mikeyyuen
Copy link
Contributor Author

I was able to fix a bunch of the tests, they were just missing the new config parameters, there is still something fundamentally flaky about tests that go near agent_test.go, I'm not sure exactly what it is. I can take another look tomorrow, but @lornasong if you don't mind having a look too?

Also is there something I need to do to trigger circleci from this PR?

@lornasong
Copy link
Member

@mikeyyuen thanks again for the review and push!

Yes, we definitely have some flaky tests. I typically see them in leader_test but let me know if there's a particular error you see in agent_test. I wasn't able to reproduce it but would be happy to look into it. Regardless, your commit triggered CircleCI and the tests are passing, thanks!

I'd like to get your opinion. It is likely we'll want to add some changes to your PR. For example, we are planning to have our products convene on using OpenTelemetry and would update this in your PR. We could either make changes and ask you to force-push them or another option would be to update your PR to merge to a non-master branch and we could take it from there. I'm guessing you have a lot on your plate and we'd like to avoid adding to it.

Please let us know how you'd like to move forward. Thanks!

@mikeyyuen
Copy link
Contributor Author

Is OpenTelemetry backward compatible? If so I’d suggest landing this on master to avoid any more rebasing!

If not either way would work for me let me know!

@edevil
Copy link
Contributor

edevil commented May 25, 2020

Any news on this?

@lornasong
Copy link
Member

@edevil, thanks for checking in and apologies for not being more proactive with updates.

Last week, I reviewed the changes and spoke with the team. As an update, here’s what we’re thinking of doing:

  • Merge the PR to a non-master branch
  • We’d like to split out the telemetry changes from the rest of the changes
  • For non-telemetry changes, make a few additional commits and then merge into master
  • For telemetry changes, get further input since we want to move from go-metrics to OpenTelemetry. From initial team feedback, it seems like it might not be backwards compatible since go-metrics supports more features than OpenTelemetry (e.g. statsd). More research is needed here.

@edevil, please let me know if there are specific changes you’re planning to use. That will help us prioritize. I will plan to switch to focusing back on this PR tomorrow.

@mikeyyuen, if you have any feedback, please let me know.

Thanks!

@mikeyyuen
Copy link
Contributor Author

Thanks for the update. I think for us statsd is kind of critical. Its the main way we have for getting oss metrics into a common metrics backend. I’d definitely hope anything Consul uses down the like supports it!

@edevil
Copy link
Contributor

edevil commented May 27, 2020

Thank you for the update @lornasong. We use Prometheus, which I think is already supported by OpenTelemetry, so that will not be a problem in our case.

As for the rest of the changes, since they are informed by the experience of running a large scale cluster, we find them very valuable.

@lornasong
Copy link
Member

@mikeyyuen, thanks for the feedback. That's helpful to know that statsd is critical for your team. I'll make sure this is included as we sort out how to handle telemetry.

@edevil, thanks for sharing more about your use-case. That's good to know that you're using Prometheus and interested in the large-scale cluster features.

@lornasong lornasong changed the base branch from master to bloomberg-dev May 27, 2020 16:10
@lornasong lornasong merged commit ebf5aec into hashicorp:bloomberg-dev May 27, 2020
@lornasong
Copy link
Member

For non-telemetry changes, please route discussion to #63. Additional details and context provided there.

I will provide an update for telemetry-specific changes once I’ve understood how to move forward regarding the CLA.

Thank you

@lornasong
Copy link
Member

An update here:

Thank you

@mikeyyuen mikeyyuen deleted the to_upstream branch June 21, 2020 11:18
@lornasong lornasong added this to the 0.4.0 milestone Jul 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants