-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance & Metrics changes #56
Conversation
Hi @mikeyyuen - thanks so much for this PR. Looking forward to reviewing this! |
@mikeyyuen, thanks again for the pull request! Would you be able to update/rebase the changes? It looks like the fork is ~34 commits behind, which has led to some conflicts. We've also stopped committing our vendor directory so updating as such would help resolve some of those conflicts. Let me know if you have any questions. Thank you |
Hello. This work would be useful for us. Is anyone working on this? @mikeyyuen Is help required? |
@edevil thanks for the comment. Actually as timing would have it, I started working on rebasing this just a couple days ago and am currently working through conflicts. Don't hesitate to let me know if you have any questions or thoughts. @mikeyyuen just an FYI (see above). If you have any questions or concerns, please let me know. Thank you |
Changes that Bloomberg have been using in production to support 3,000+ nodes with on 100 ESM's. - Adds metrics for health checks and performance - Adds support for > 64 ESM Instances - Improvements to allow a checks to run ever 1s safely - Improvements to prevent spurious node status updates - Improvements to allow disabling node coordinate updates - Fix an issue where check updates can fail to propagate to ESM Followers - Extra Logging - Exposed new config options - Adds ability to skip node coordinate updates if they are small (reducing spurious updates) - Updated to go mod for vendored dependencies
@lornasong That's great news! Let me know if I can be of help. |
Hi @mikeyyuen, we really appreciate your changes! They will greatly benefit our community. Since we haven't heard from you in a couple months, we decided to carry your changes and make progress with edits ourselves. I've initially rebased and resolved the conflicts with #58. Note, it looks like unit tests aren't passing which I'm happy to resolve. Would you be able to take a look at #58 and make sure that you think it accurately reflects your work? Once it looks good, would it be possible for you to sign the CLA again? For context on re-signing the CLA: our CLA whitelists only by PRs originating from a fork in a whitelisted org. Since I've branched off of your fork within HashiCorp's repo, your original CLA doesn't 'transfer over'. If you have any questions/concerns or other alternatives, please reach out. Thank you! |
Taking a look now! Thank you for rebasing. I can forrce push your changes on to this PR that should allow you to get a clean CLA (I don't think the corporate CLA will sign properly unless it comes from the Bloomberg org) |
I was able to fix a bunch of the tests, they were just missing the new config parameters, there is still something fundamentally flaky about tests that go near agent_test.go, I'm not sure exactly what it is. I can take another look tomorrow, but @lornasong if you don't mind having a look too? Also is there something I need to do to trigger circleci from this PR? |
@mikeyyuen thanks again for the review and push! Yes, we definitely have some flaky tests. I typically see them in I'd like to get your opinion. It is likely we'll want to add some changes to your PR. For example, we are planning to have our products convene on using OpenTelemetry and would update this in your PR. We could either make changes and ask you to force-push them or another option would be to update your PR to merge to a non-master branch and we could take it from there. I'm guessing you have a lot on your plate and we'd like to avoid adding to it. Please let us know how you'd like to move forward. Thanks! |
Is OpenTelemetry backward compatible? If so I’d suggest landing this on master to avoid any more rebasing! If not either way would work for me let me know! |
Any news on this? |
@edevil, thanks for checking in and apologies for not being more proactive with updates. Last week, I reviewed the changes and spoke with the team. As an update, here’s what we’re thinking of doing:
@edevil, please let me know if there are specific changes you’re planning to use. That will help us prioritize. I will plan to switch to focusing back on this PR tomorrow. @mikeyyuen, if you have any feedback, please let me know. Thanks! |
Thanks for the update. I think for us statsd is kind of critical. Its the main way we have for getting oss metrics into a common metrics backend. I’d definitely hope anything Consul uses down the like supports it! |
Thank you for the update @lornasong. We use Prometheus, which I think is already supported by OpenTelemetry, so that will not be a problem in our case. As for the rest of the changes, since they are informed by the experience of running a large scale cluster, we find them very valuable. |
@mikeyyuen, thanks for the feedback. That's helpful to know that statsd is critical for your team. I'll make sure this is included as we sort out how to handle telemetry. @edevil, thanks for sharing more about your use-case. That's good to know that you're using Prometheus and interested in the large-scale cluster features. |
For non-telemetry changes, please route discussion to #63. Additional details and context provided there. I will provide an update for telemetry-specific changes once I’ve understood how to move forward regarding the CLA. Thank you |
An update here:
Thank you |
Changes that Bloomberg have been using in production to support 3,000+
nodes with on 100 ESM's.
Please let us know what you think.