-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500 #2110
Comments
Hello Roman! It's difficult to determine what happened. Could you please upload the entire solution to GitHub for review? Alternatively, you could attach the following artifacts to this thread:
This information will likely assist in identifying the issue.
You didn't observe the same errors in the Ocelot logs for version 23.2.x, did you? Since you're utilizing Kubernetes, could you provide more details about your K8s server deployment? |
Regarding the exception details...
Are you a C# developer? Can you debug C# code? Do you use the official Ocelot NuGet package? Deploying the Release version of the Ocelot package DLLs means there is no debug information for the called lines in the Call Stack trace. However, we can analyze the code of the Ocelot/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs Lines 19 to 38 in 8c0180a
It appears the issue may lie here: Ocelot/src/Ocelot/LoadBalancer/LoadBalancers/RoundRobin.cs Lines 32 to 33 in 8c0180a
There's a missing null check for line 33. Theoretically, the service provider could return a null object, which should be accounted for.
To conclude, it seems that for some reason, the |
@antikorol The subsequent step in the investigation involves searching for subsequent messages in Ocelot's log. Ocelot/src/Ocelot.Provider.Kubernetes/KubeServiceBuilder.cs Lines 22 to 33 in 8c0180a
We need to search for the following message in the log:
In {
"Logging": {
"LogLevel": {
"Default": "Debug",
"Microsoft.AspNetCore": "Warning"
}
}
} Deploy and review the log for the message:
If this message appears in the log, it indicates that the Kubernetes setup is correct, and we can proceed to investigate the issue more thoroughly. |
It seems that the requests to the Kubernetes endpoint were failing around the 20th to 30th attempts, as indicated by the code in these lines: Ocelot/src/Ocelot.Provider.Kubernetes/Kube.cs Lines 34 to 36 in 8c0180a
This suggests that the Kubernetes client connectivity might be unstable. It would be advisable to switch from the Kube to the PollKube provider and observe the behavior again. The error should no longer occur if this change is made. Additionally, the PollingInterval should be set to less than half the average duration between the occurrences of the 500 error.
Let's figure out what's happening and possibly provide a hotfix for your user scenario to eliminate the Null Reference Error logging. We need to handle this 500 status case and process it accordingly. |
Below is the log with debug information.
|
Dear Roman, |
Please be aware that any custom logic may compromise stability, and the Ocelot team cannot be held accountable for it. It appears your existing custom logic is only compatible with version 23.2. It would be helpful to examine the service definition in your Kubernetes configuration. Could you provide it, please? |
No, I removed that logic because your changes already cover the case I needed with multiple ports. |
Good! After a quick review:
I wonder that your Ocelot app started at all! Did you read Configuration docs? So, I don't see configuration part in these lines L27-L42 ❗ builder.WebHost
.ConfigureAppConfiguration((context, config) =>
{
var env = context.HostingEnvironment;
config.SetBasePath(env.ContentRootPath)
.AddJsonFile("appsettings.json", true, true)
.AddJsonFile($"appsettings.{env.EnvironmentName}.json", true, true)
.AddOcelot(env) // happy path
.AddEnvironmentVariables();
}) So, it's very important to configure via .ConfigureAppConfiguration((context, config) => config.AddOcelot()) This is minimal required configuring code! But better to use official template from docs. |
Good! You wanna use the feature: Downstream Scheme vs Port Names Also you enabled the port finder via "DownstreamScheme": "http". |
There was update from 18.0.0 to 23.2.2 a long time ago and now to 23.3.3 According to the Microsoft documentation, it is not mandatory to explicitly specify appsettings.json or its variations depending on the environment.
yes, what I should provide? |
I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same. I don't understand what you're referring to.
I apologize, but uncommenting line L53 does not result in any replacement. You need to implement actual service injection using the official helpers: Custom Load Balancers. Have you consulted this documentation? Even if you substitute the services with custom ones, it won't aid in identifying the failing line, as this necessitates the inclusion of Debug version DLLs in your Docker file, as seen here: L21. Therefore, you must build with the debug configuration |
Please provide the Kubernetes service definition in text, JSON format, or the Kubernetes client response JSON. From the log you've shared, it appears you have defined only one service; however, the number of ports in the definition is unclear. Additionally, I'm curious as to why every twentieth Kubernetes client request fails or returns an invalid response, causing the Ocelot provider to throw Lastly, I trust that you have the expertise to debug the provider, identify the issue, and potentially write tests or even submit a pull request with a fix. |
Sorry for the confusion. The files are identical. I didn't want to change the project build type. In the release build, I can see lines in the stack trace in my code. I wanted to understand in which exact line the error occurred so that I could add logs later. Surprisingly, the error didn't reproduce when I used just a copy |
I am awaiting a PR that will resolve the issue... |
@raman-m
This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it, RoundRobinCreator is also registered as a singleton. At the same time, we use the Kube.GetAsync method call in async code, where we can easily encounter a "Collection was modified" exception. Maybe somewhere deep in the List.Clear or elsewhere, due to resize or clear array, we might lost references and encounter an "Object reference not set to an instance..." exception.What do you think about this part of code? |
Look at the actual registration method Ocelot/src/Ocelot.Provider.Kubernetes/OcelotBuilderExtensions.cs Lines 9 to 17 in 8c0180a
So, the Kube class is not registered at all. Only KubernetesProviderFactory.Get delegate is registered:
Kube is instantiated on every request. The logic is quite complicated. And I think Kube is scoped service.Also, IKubeApiClient is scoped registered service. The problem could be bad/empty response from IKubeApiClient
The issue seems to be with the load balancing logic, which involves a singleton What's happening in my opinion. I believe the problem is on the line 33
It must be rewritten! |
You are right that problem at this row. But the main problem in concurrent issue that I described before in the class Kube because the collection of services is declared at the class field level.
I spent a lot of time for confirming the issue, I hope it will be enough. The second element of list is null I can see it in class RoundRoubin Example of logs:
If the logs aren't entirely clear, I created a code sample that shows if the list is modified from different threads, we can get null where it shouldn't be. In the sample, I always add one element to the list, but in different threads, and I end up with the list size being more than 1 and the second element is null. And we don't get an exeption that the collection was modified. |
@antikorol Roman, If you open a fix PR, I will prepare an additional hotfix release. However, I'm concerned that we need to refactor the FYI, I've included this bug issue in the Summer'24 milestone. This represents the least favorable delivery scenario. I am optimistic that we will manage to deliver the fix within a week or two, or alternatively, we can assign the issue to the Annual'23 release which is early bird aka current release. |
Tasks
|
A quick fix is to move the creation of the List from the class fields to a method. |
Thank you, Roman! |
Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful: var endpoint = await _kubeApi
.ResourceClient(client => new EndPointClientV1(client))
.GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace); |
|
Thank you for the JSON samples! I will utilize them to write an acceptance test. |
Roman, I have the next question regarding load balancing in your project... If the project is small and the number of downstream services is constant (with all service IPs known), you can use static service definitions in the So, it's just a question of... |
In my opinion, Kubernetes is a good choice for a startup/initial stage of the project as a simple and quick way to implement service discovery (it can be a five-minute job). In the future, after analyzing the load, you can implement another solution. Using a static IP for a pod, in my view, is an anti-pattern. At the same time, to reduce the load on Kubernetes, I can simply use the PollKube. However, unfortunately, this bug affects this provider too, but this fix resolves it. :)
I hope it will be huge in the future :D, but even now there is scaling during peak load hours. |
@antikorol Hi Roman! Be aware that there are stable and unstable theory cases; however, both exhibit the 500 status issue. To reveal the 500 error in the green test, you must use the "Run Until Failed" option. Additionally, both cases, when in Debug mode, might display the exception under discussion, because debug mode show exceptions being thrown. |
@raman-m Hello Raman
This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement on my work laptop. I see that your CI failed on this test case. Can I simply cherry-pick the commit and check it on your CI? |
We use Windows and Linux only for now. Don't use MacOS please. |
@antikorol, you've highlighted an important aspect concerning MacOS builds. We will soon establish automated Windows CI-CD builds as certain features of Ocelot necessitate it. Linux builds satisfy 99.9% of our requirements, and the .NET SDK versus Linux is highly stable since Microsoft conducts extensive testing of the .NET SDK on Linux OS. I am in the process of ascertaining whether the issue lies with the .NET SDK on MacOS or if it is an issue unique to Ocelot? |
@raman-m Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias. If I don't use loopback here, the tests pass (the method is called with different ports). |
Understood! As a MacOS user, you can suggest a directive preprocessor correction to ensure this helper runs successfully in MacOS environments. MacOS machines should indeed have
You can manually configure aliases such as |
Hello! I have a few questions regarding the response JSON from Kubernetes when a service instance goes offline:
Links to related Kubernetes documentation would be appreciated. |
Hello.
|
If the service has not been deleted but has gone offline, Kubernetes clients should still return one or more service instances as addresses in subsets. If there were two instances and one went offline without being deleted, the question is: What happens to the list of services in the JSON response? Will it still be present or marked with a special tag, annotation, etc.? From the details of your issue and the logs provided, it appears there was one instance online. For some reason, this instance went offline, resulting in an empty JSON addresses/subsets list, causing Ocelot's |
…ices in `Kube` provider (#2111) * Move the creation of the services list from the class field to the method, to prevent modification list from different threads * Early return after data checking * Add unit test for concurrent get list of services * Add logging for invalid service configuration error in RoundRobin load balancer * Code review by @raman-m * Workaround for mistakes made during acceptance testing of load balancing versus service discovery, where tests designed for parallel requests were mistakenly executed sequentially. This resulted in load balancers being loaded by sequential `HttpClient` calls, which was a significant oversight. * Let's DRY StickySessionsTests * Add acceptance tests, but... RoundRobin is not actually RoundRobin 😁 -> 😆 * Independent static indexing iterators per route via service names * Stabilize `CookieStickySessions` load balancer. Review tests after refactoring of `RoundRobin` load balancer * Refactor Lease operation for load balancing. Review LeastConnection load balancer * Leasing mechanism in Round Robin load balancer * Acceptance tests, final version * Apply Retry pattern for K8s endpoint integration * Fix IDE warnings and messages * Follow suggestions and fix issues from code review by @ggnaegi * Bump KubeClient from 2.4.10 to 2.5.8 * Fix warnings * Final version of `Retry` pattern --------- Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com>
Hello Roman,
While running acceptance tests in a local testing environment, I observed an increased likelihood of encountering 404/500 status errors as more services went offline. This issue may stem from the heavy load on the Ocelot/test/Ocelot.AcceptanceTests/ServiceDiscovery/KubernetesServiceDiscoveryTests.cs Line 149 in 19a8e2f
The root cause remains unidentified, necessitating further testing in production environments, as my tests were only conducted locally. In conclusion, the Kube provider seems unstable under heavy loads with an error ratio of approximately 1/500, suggesting that the issue may still lie within the IKubeApiClient integration.
P.S.Consider switching to the PollKube provider, which reduces the load on the |
…Blue Olympic Balumbes release * #2084 Apply default config file paths in `GetMergedOcelotJson` when providing the `folder` argument of `AddOcelot` (#2120) * Adding unit test first * Fixing default global config file not being found in folder * Adding PR trait to test * Backing out whitespace changes * Code review by @raman-m * Create Configuration feature folder and move test classes * Adjust namespace and review what we have * Acceptance tests for #2084 user scenario --------- Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com> * Bump Steeltoe.Discovery.Eureka from 3.2.5 to 3.2.8 in /src/Ocelot.Provider.Eureka (#2122) * Bump Steeltoe.Discovery.Eureka in /src/Ocelot.Provider.Eureka Bumps [Steeltoe.Discovery.Eureka](https://github.com/SteeltoeOSS/Steeltoe) from 3.2.5 to 3.2.8. - [Release notes](https://github.com/SteeltoeOSS/Steeltoe/releases) - [Changelog](https://github.com/SteeltoeOSS/Steeltoe/blob/main/Steeltoe.Release.ruleset) - [Commits](SteeltoeOSS/Steeltoe@3.2.5...3.2.8) --- updated-dependencies: - dependency-name: Steeltoe.Discovery.Eureka dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Bump Steeltoe.Discovery.ClientCore from 3.2.5 to 3.2.8 --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com> * #2110 Review load balancing and independent fetching the list of services in `Kube` provider (#2111) * Move the creation of the services list from the class field to the method, to prevent modification list from different threads * Early return after data checking * Add unit test for concurrent get list of services * Add logging for invalid service configuration error in RoundRobin load balancer * Code review by @raman-m * Workaround for mistakes made during acceptance testing of load balancing versus service discovery, where tests designed for parallel requests were mistakenly executed sequentially. This resulted in load balancers being loaded by sequential `HttpClient` calls, which was a significant oversight. * Let's DRY StickySessionsTests * Add acceptance tests, but... RoundRobin is not actually RoundRobin 😁 -> 😆 * Independent static indexing iterators per route via service names * Stabilize `CookieStickySessions` load balancer. Review tests after refactoring of `RoundRobin` load balancer * Refactor Lease operation for load balancing. Review LeastConnection load balancer * Leasing mechanism in Round Robin load balancer * Acceptance tests, final version * Apply Retry pattern for K8s endpoint integration * Fix IDE warnings and messages * Follow suggestions and fix issues from code review by @ggnaegi * Bump KubeClient from 2.4.10 to 2.5.8 * Fix warnings * Final version of `Retry` pattern --------- Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com> * Downgrade the Warning to Information on missing `Content-Length` header in `MultiplexingMiddleware` (#2146) * fix: downgrade the warning to information on missing content-length header * chore: add route name to logs * test: fixing multiplexing middleware tests * Code review by @raman-m --------- Co-authored-by: Paul Roy <paul.roy@astriis.com> Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com> * Correct the broken link to the GraphQL sample's `README.md` (#2149) Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com> * #2116 Escaping unsafe pattern values of `Regex` constructor derived from URL query parameter values containing special `Regex` chars (#2150) * regex escape handling for url templates * refactored regex method to lamda version * Quick code review by @raman-m * added acceptance test for url regex bug * moved acceptance test to routing tests * Convert to theory: define 2 test cases --------- Co-authored-by: Raman Maksimchuk <dotnet044@gmail.com> * #2119 Review load balancing (2nd round) and redesign `DefaultConsulServiceBuilder` with `ConsulProviderFactory` refactoring to make it thread safe and friendly (#2151) * Review tests * History of Service Discovery testing: add traits * LoadBalancer traits * #2119 Steps to Reproduce * Reuse service handlers of `ConcurrentSteps` * Reuse service counters of `ConcurrentSteps` * Add LoadBalancer namespace and move classes * Move `Lease` * Move `LeaseEventArgs` * Analyze load balancers aka `ILoadBalancerAnalyzer` interface objects * Prefer using named local methods as delegates over anonymous methods for awesome call stack, ensuring the delegate's typed result matches the typed balancer's creator. Additionally, employ an IServiceProvider workaround. * Review load balancing. Assert service & leasing counters as concurrent step. Final version of acceptance test. * Fixed naming violation for asynchronous methods: `Lease` -> `LeaseAsync` * Fix ugly reflection issue of dymanic detection in favor of static type property * Propagate the `ConsulRegistryConfiguration` object through `HttpContext` in the scoped version of the default service builder, utilizing the injected `IHttpContextAccessor` object. Update `ConsulProviderFactory`. Update docs. Update tests. * Add tests from clean experiment * Final review of the tests * Review `IHttpContextAccessor` logic. Convert anonymous delegates to named ones in placeholders processing * Tried to enhance more, but failed --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> Co-authored-by: Ben Bartholomew <70723971+ben-bartholomew@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Roman <61905975+antikorol@users.noreply.github.com> Co-authored-by: Paul Roy <paul.achess.roy@gmail.com> Co-authored-by: Paul Roy <paul.roy@astriis.com> Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com> Co-authored-by: Finn <26823828+int0x81@users.noreply.github.com>
Fixed in version 23.3.4 ✅ |
Expected Behavior / New Feature
Resolving a service address to route an API request
Actual Behavior / Motivation for New Feature
Sometimes, about every 20th - 30th request finished with status code 500
Steps to Reproduce the Problem
Specifications
The text was updated successfully, but these errors were encountered: