Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#2110 Review load balancing and independent fetching the list of services in Kube provider #2111

Conversation

antikorol
Copy link
Contributor

@antikorol antikorol commented Jun 26, 2024

Fixes #2110

Proposed Changes

Move the creation of the services list from the class field to the method, to prevent list modification from different threads, which caused the partially modified list of services to be returned.

…thod, to prevent modification list from different threads
@raman-m raman-m added bug Identified as a potential bug Service Discovery Ocelot feature: Service Discovery Load Balancer Ocelot feature: Load Balancer Consul Service discovery by Consul Oct'24 October 2024 release labels Jun 27, 2024
@raman-m raman-m added this to the Summer'24 milestone Jun 27, 2024
Copy link
Member

@raman-m raman-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the significant concurrency issue in the Kube class mentioned in the #2110 (comment) has been resolved, yet a genuine fixing solution for the RoundRobin class is still required.

Please note that our development process necessitates both unit and acceptance testing. It is advisable to replicate the issue with an acceptance test before implementing the bug-fixing solution.
We need to replicate the bug using an acceptance test. I can draft a preliminary version of the test, and you can then refine it to create the final version.

src/Ocelot.Provider.Kubernetes/Kube.cs Show resolved Hide resolved
src/Ocelot.Provider.Kubernetes/Kube.cs Outdated Show resolved Hide resolved
src/Ocelot.Provider.Kubernetes/Kube.cs Show resolved Hide resolved
@raman-m raman-m changed the title Fix for race condition on fetching the list of services in Kubernetes Service Provider #2110 Fix for race condition on fetching the list of services in Kubernetes service discovery provider Jun 30, 2024
@antikorol
Copy link
Contributor Author

antikorol commented Jun 30, 2024

It appears that the significant concurrency issue in the Kube class mentioned in the #2110 (comment) has been resolved, yet a genuine fixing solution for the RoundRobin class is still required.

Please note that our development process necessitates both unit and acceptance testing. It is advisable to replicate the issue with an acceptance test before implementing the bug-fixing solution. We need to replicate the bug using an acceptance test. I can draft a preliminary version of the test, and you can then refine it to create the final version.

@raman-m I'm not quite sure which fix you are referring to. In this case, there will no longer be a List of the form [service, null] because the issue with modifying the List of services from different threads has been fixed. Or do you mean adding a check that if the service or the HostAndPort property is null, then log the error and return a 404 status code for that request?

@raman-m
Copy link
Member

raman-m commented Jul 9, 2024

Or do you mean adding a check that if the service or the HostAndPort property is null, then log the error and return a 404 status code for that request?

Yes, I do. Could you include this check within the RoundRobin class plz?
Thank you for adding the unit test. I'm assessing the efficiency of writing an acceptance test, which is expected to be quite complicated. Have you tested the bug fix in your environment?

@antikorol antikorol force-pushed the bugfix/race-condition-on-fetching-services-in-service-discovery-provider branch from 75a3090 to c360228 Compare July 9, 2024 19:45
@antikorol
Copy link
Contributor Author

Or do you mean adding a check that if the service or the HostAndPort property is null, then log the error and return a 404 status code for that request?

Yes, I do. Could you include this check within the RoundRobin class plz? Thank you for adding the unit test. I'm assessing the efficiency of writing an acceptance test, which is expected to be quite complicated. Have you tested the bug fix in your environment?

Added new error to handle this case when service or HostAndPort is null.
The fix has been tested in dev/preprod environments, and the fix was checked with load testing at around 30 rps.

@raman-m
Copy link
Member

raman-m commented Jul 10, 2024

Thank you for conducting the tests!
Is the official hotfix release package, version 23.3.4, needed to update your all environments and conduct the final bugfix test?
Can you update environments with manually built DLLs?

Additionally, I'm going to review the code once more...
Having unit/integration tests is likely sufficient to deem the code acceptable.

Copy link
Member

@raman-m raman-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary concern is the redundancy of the ServiceIsInvalidError class.

@raman-m raman-m requested review from RaynaldM and ggnaegi July 10, 2024 11:15
@raman-m raman-m added hotfix Gitflow: Hotfix issue, PR related to hotfix branch high High priority labels Jul 10, 2024
@antikorol
Copy link
Contributor Author

Thank you for conducting the tests! Is the official hotfix release package, version 23.3.4, needed to update your all environments and conduct the final bugfix test? Can you update environments with manually built DLLs?

At your discretion, I can release it with my fix for now and remove it in the future.

@raman-m raman-m removed the hotfix Gitflow: Hotfix issue, PR related to hotfix branch label Jul 10, 2024
@raman-m
Copy link
Member

raman-m commented Jul 10, 2024

Great, Roman!
I'd like to merge it into the develop branch, after which you can create a temporary build from your forked repository and deploy it to the Prod environment, okay?

Could you provide more details about your Production environment?
Please include information on the hosting server (Docker, IIS, self-hosted), the number of Ocelot instances, CPU & memory resources per Ocelot instance, requests per second, and the technical stack (specific Ocelot features utilized), etc.

Is the Ocelot gateway deployed on a public or private network?
Also, could you specify the industry that the Prod product is part of?

@antikorol antikorol force-pushed the bugfix/race-condition-on-fetching-services-in-service-discovery-provider branch from c360228 to bbed7d1 Compare July 10, 2024 12:30
@antikorol
Copy link
Contributor Author

Is the Ocelot gateway deployed on a public or private network?

Unfortunately, the game is still in closed beta testing, and the loads on prod are much times lower than what we simulate during tests. Ocelot is deployed as a regular deployment in k8s.
But If I find any issues, I will let you know :)

@raman-m raman-m removed this from the Summer'24 milestone Jul 10, 2024
@raman-m raman-m changed the title #2110 Fix for race condition on fetching the list of services in Kubernetes service discovery provider #2110 Review load balancing and independent fetching the list of services in Kube provider Aug 5, 2024
@raman-m
Copy link
Member

raman-m commented Aug 5, 2024

I've opted for the third option and implemented the Retry pattern. The previously unstable scenario test is now performing much better. I would consider it quite stable. So, I've resolved the issue with the 404 status, as mentioned above. However, sometimes Ocelot returns a 500 status, and the underlying cause is currently unclear to me.
image
Investigating this issue necessitates further development and execution of tests. However, this error is quite rare, occurring at a rate of 1/100, 1/500, or even 1/1000, which means it demands extensive testing in production environments, as production is certainly not the same as testing.
In conclusion, I believe we can proceed with the release at this hotfix quality. 😎

@raman-m
Copy link
Member

raman-m commented Aug 5, 2024

Development Complete❕

Most important changes are


@ggnaegi commented on Aug 4, 2024

Please review!

src/Ocelot.Provider.Kubernetes/Kube.cs Show resolved Hide resolved
src/Ocelot.Provider.Kubernetes/Kube.cs Outdated Show resolved Hide resolved
src/Ocelot.Provider.Kubernetes/Kube.cs Outdated Show resolved Hide resolved
src/Ocelot.Provider.Kubernetes/KubeServiceCreator.cs Outdated Show resolved Hide resolved
src/Ocelot/Infrastructure/DesignPatterns/Retry.cs Outdated Show resolved Hide resolved
src/Ocelot/LoadBalancer/LoadBalancers/LeastConnection.cs Outdated Show resolved Hide resolved
src/Ocelot/Values/ServiceHostAndPort.cs Outdated Show resolved Hide resolved
src/Ocelot/Values/ServiceHostAndPort.cs Outdated Show resolved Hide resolved

public bool Equals(ServiceHostAndPort other) => this == other;
public override bool Equals(object obj)
=> obj != null && obj is ServiceHostAndPort o && this == o;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public override bool Equals(object obj)
        => obj is ServiceHostAndPort o && this == o;

merge into pattern?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one? Sorry, I can't get you!

I have attempted to adhere to all documentation and best practices from XML documentation links for the equality operator. Thus, I've combined various practices to write more concise code. The expression this == o signifies that I am utilizing the native class operator ==.
image
Certainly, all properties comparison predicate can be transferred from operator == to Equals(ServiceHostAndPort). Ultimately, it seems challenging to distill further patterns from this specific version of the code. 😄 Not a fan of the MS docs examples 🤣 so I've extracted patterns from MS docs definitely 🙊

Copy link
Member

@raman-m raman-m Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggnaegi wrote

merge into pattern?

Swiss humor? Yeah! 🥲 I know you don't like my Retry pattern because you like Polly. 😏
Let's ask another big fan of Polly... not @RaynaldM 🤣 ... and this is Mr. Seth @ks1990cn Hello Tanmay! 😉

Copy link
Member

@raman-m raman-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggnaegi I've addressed all issues from your code review. Awaiting your approval or next suggestions...
@RaynaldM Welcome!

@raman-m
Copy link
Member

raman-m commented Aug 7, 2024

@ggnaegi What else?
Did I miss something from your code review?

@raman-m raman-m requested review from raman-m and ggnaegi August 7, 2024 09:35
Copy link
Member

@ggnaegi ggnaegi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fine for me, thanks!

@raman-m raman-m merged commit 19a8e2f into ThreeMammals:develop Aug 7, 2024
1 check passed
@raman-m
Copy link
Member

raman-m commented Aug 7, 2024

TODO

@raman-m raman-m removed the Oct'24 October 2024 release label Aug 12, 2024
@antikorol
Copy link
Contributor Author

antikorol commented Aug 19, 2024

@raman-m commented on Aug 5, 2024

Please share your thoughts
P.S. I believe the 1st option is entirely impractical, and I am deliberating between the 2nd and 3rd options, convinced that the Retry pattern must be adopted.

Hello @raman-m Raman.
Sorry. I was on vacation and didn't notice, among the many notifications, that there was a question directed at me. However, I see the solution has already been implemented :)

@raman-m
Copy link
Member

raman-m commented Aug 20, 2024

@antikorol Right! The hotfix has been implemented! However I've asked you to start testing.
Will you wait for official 23.3.4 package release?

@antikorol
Copy link
Contributor Author

@antikorol Right! The hotfix has been implemented! However I've asked you to start testing. Will you wait for official 23.3.4 package release?

@raman-m
According to company policy, we are not allowed to publish packages to Artifactory that are not related to our project, and we can only use specific sources for NuGet packages. As soon as you release a new version, I will update and thoroughly test the fixes.

However, I copied your changes related to fix into my custom Kubernetes provider, which we are currently using as a fix, and so far everything is working well.

Thank you

raman-m added a commit that referenced this pull request Oct 3, 2024
…Blue Olympic Balumbes release

* #2084 Apply default config file paths in `GetMergedOcelotJson` when providing the `folder` argument of `AddOcelot` (#2120)

* Adding unit test first

* Fixing default global config file not being found in folder

* Adding PR trait to test

* Backing out whitespace changes

* Code review by @raman-m

* Create Configuration feature folder and move test classes

* Adjust namespace and review what we have

* Acceptance tests for #2084 user scenario

---------

Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>

* Bump Steeltoe.Discovery.Eureka from 3.2.5 to 3.2.8 in /src/Ocelot.Provider.Eureka (#2122)

* Bump Steeltoe.Discovery.Eureka in /src/Ocelot.Provider.Eureka

Bumps [Steeltoe.Discovery.Eureka](https://github.com/SteeltoeOSS/Steeltoe) from 3.2.5 to 3.2.8.
- [Release notes](https://github.com/SteeltoeOSS/Steeltoe/releases)
- [Changelog](https://github.com/SteeltoeOSS/Steeltoe/blob/main/Steeltoe.Release.ruleset)
- [Commits](SteeltoeOSS/Steeltoe@3.2.5...3.2.8)

---
updated-dependencies:
- dependency-name: Steeltoe.Discovery.Eureka
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Bump Steeltoe.Discovery.ClientCore from 3.2.5 to 3.2.8

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>

* #2110 Review load balancing and independent fetching the list of services in `Kube` provider (#2111)

* Move the creation of the services list from the class field to the method, to prevent modification list from different threads

* Early return after data checking

* Add unit test for concurrent get list of services

* Add logging for invalid service configuration error in RoundRobin load balancer

* Code review by @raman-m

* Workaround for mistakes made during acceptance testing of load balancing versus service discovery, where tests designed for parallel requests were mistakenly executed sequentially. This resulted in load balancers being loaded by sequential `HttpClient` calls, which was a significant oversight.

* Let's DRY StickySessionsTests

* Add acceptance tests, but...
RoundRobin is not actually RoundRobin 😁 -> 😆

* Independent static indexing iterators per route via service names

* Stabilize `CookieStickySessions` load balancer.
Review tests after refactoring of `RoundRobin` load balancer

* Refactor Lease operation for load balancing.
Review LeastConnection load balancer

* Leasing mechanism in Round Robin load balancer

* Acceptance tests, final version

* Apply Retry pattern for K8s endpoint integration

* Fix IDE warnings and messages

* Follow suggestions and fix issues from code review by @ggnaegi

* Bump KubeClient from 2.4.10 to 2.5.8

* Fix warnings

* Final version of `Retry` pattern

---------

Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>

* Downgrade the Warning to Information on missing `Content-Length` header in `MultiplexingMiddleware` (#2146)

* fix: downgrade the warning to information on missing content-length header

* chore: add route name to logs

* test: fixing multiplexing middleware tests

* Code review by @raman-m

---------

Co-authored-by: Paul Roy <paul.roy@astriis.com>
Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>

* Correct the broken link to the GraphQL sample's `README.md` (#2149)

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>

* #2116 Escaping unsafe pattern values of `Regex` constructor ​​derived from URL query parameter values containing special `Regex` chars (#2150)

* regex escape handling for url templates

* refactored regex method to lamda version

* Quick code review by @raman-m

* added acceptance test for url regex bug

* moved acceptance test to routing tests

* Convert to theory: define 2 test cases

---------

Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>

* #2119 Review load balancing (2nd round) and redesign `DefaultConsulServiceBuilder` with `ConsulProviderFactory` refactoring to make it thread safe and friendly (#2151)

* Review tests

* History of Service Discovery testing: add traits

* LoadBalancer traits

* #2119 Steps to Reproduce

* Reuse service handlers of `ConcurrentSteps`

* Reuse service counters of `ConcurrentSteps`

* Add LoadBalancer namespace and move classes

* Move `Lease`

* Move `LeaseEventArgs`

* Analyze load balancers aka `ILoadBalancerAnalyzer` interface objects

* Prefer using named local methods as delegates over anonymous methods for awesome call stack, ensuring the delegate's typed result matches the typed balancer's creator. Additionally, employ an IServiceProvider workaround.

* Review load balancing. Assert service & leasing counters as concurrent step. Final version of acceptance test.

* Fixed naming violation for asynchronous methods: `Lease` -> `LeaseAsync`

* Fix ugly reflection issue of dymanic detection in favor of static type property

* Propagate the `ConsulRegistryConfiguration` object through `HttpContext` in the scoped version of the default service builder, utilizing the injected `IHttpContextAccessor` object.
Update `ConsulProviderFactory`.
Update docs.
Update tests.

* Add tests from clean experiment

* Final review of the tests

* Review `IHttpContextAccessor` logic.
Convert anonymous delegates to named ones in placeholders processing

* Tried to enhance more, but failed

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Ben Bartholomew <70723971+ben-bartholomew@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Roman <61905975+antikorol@users.noreply.github.com>
Co-authored-by: Paul Roy <paul.achess.roy@gmail.com>
Co-authored-by: Paul Roy <paul.roy@astriis.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Finn <26823828+int0x81@users.noreply.github.com>
@raman-m
Copy link
Member

raman-m commented Oct 8, 2024

@antikorol FYI → Ocelot 23.3.4 Nuget package

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Identified as a potential bug high High priority Kubernetes Service discovery by Kubernetes Load Balancer Ocelot feature: Load Balancer Service Discovery Ocelot feature: Service Discovery
Projects
None yet
Development

Successfully merging this pull request may close these issues.

After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500
3 participants