ClusterClientReception Ask missing response to Client after a few thousand #3417
Comments
I have additional details on this. This is reproducible only if
When we make the actual worker node a seed node unto itself, we are unable to reproduce this, even though we continue to use the custom HyperionSerializer. I.e., we are not missing any message responses back to the cluster client when all our nodes are their own seed nodes. If it helps to know, when seed nodes are separated out, due to the custom serializer the message actually bounces between seed nodes and worker nodes twice. When the message arrives, it first goes to the seed node, which deserializes it and sends it to the actual worker node, where it is deserialized again. Then, when the result is sent back through ClusterClientReceptionist, the message goes first to the seed node, which deserializes it and in turn sends it to the cluster client. My thought is that this is probably because the initial contacts we defined are the seed nodes, so I am guessing the message always has to pass through the seed node. But I don't know if this leads to dropped responses (back to the client). |
More information:- |
@leo12chandu have you experienced any network disconnections during that test? |
Yes and no. Oddly, I see all 3 combinations, even within the same geography. Because I am running with a volume of 4000-10000 messages, I do see some disconnects, but they get reconnected and the responses flow back through to the client just fine (all of them). However, in some cases (when the nodes and client are in different geographies, like the US and India), I see no disconnects but the response messages from the server to the cluster client are missing/dropped. I know the server is receiving and running all the requests because I have logged them on the server side. In rare cases (again when the nodes and client are in different geographies), I do see the following error on the server side with 9 dead letters. But this error does not always show up. Sometimes there is no error, but the cluster client does not receive the messages. Is there an option to flush the messages after Client.Tell so the client definitely receives them? It kind of feels like ClusterClientReceptionist is the problem child, but I could be wrong. |
Hi, I have put together a test application to reproduce this issue if it makes it easier. A zipped project can be found at the below location. Instructions to reproduce:-
You can try the above exercise with 1 instance of the UI app too, but just make sure to choose 10000 batches in the "No of Batches" textbox. |
I've been able to reproduce the result that not all of these messages get processed successfully via the I haven't dug into the code yet, but I have a couple of theories on what is going on here... |
Ah. Thanks for looking. You don't think there is such a thing as flushing the socket in the distributedPubSubMediator or something, do you? Assuming ClusterClient uses DistributedPubSub |
I found the issue - problem is that the default buffer size allowed by the akka.net/src/contrib/cluster/Akka.Cluster.Tools/Client/reference.conf Lines 77 to 83 in 6f32f6a
I was able to bump this value up to 10000 via the following C# code:
var settings = ClusterClientSettings.Create(actorSystem).WithBufferSize(10000).WithInitialContacts(initialContacts);
clusterClientActor = actorSystem.ActorOf(Akka.Cluster.Tools.Client.ClusterClient.Props(settings), "client");
And that fixed the issue. What's happening here is you're filling up the outbound buffer before the akka.net/src/contrib/cluster/Akka.Cluster.Tools/Client/ClusterClient.cs Lines 467 to 484 in 6f32f6a
So a larger buffer size might be all you need, but a better way of doing this would be to not send anything until your |
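For reference, the same limit can presumably also be raised in configuration instead of in code. This is only a sketch: it assumes the setting is exposed as `akka.cluster.client.buffer-size` in the Akka.Cluster.Tools `reference.conf` cited above, that its shipped default is 1000, and that 10000 is the maximum accepted value, as stated later in this thread.

```hocon
# Assumed key path and default; check the reference.conf of your Akka.NET version.
akka.cluster.client {
  # Number of messages the ClusterClient buffers while it has no
  # receptionist connection. Messages beyond this limit are dropped.
  buffer-size = 10000
}
```

A larger buffer only widens the window before messages are dropped. It does not remove the need for the client to establish contact with a receptionist.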
Should note that reversing the order of the |
Here's some code I came up with to help delay the sending of any initial messages until the
// spin until we've made at least one contact point
while (true)
{
var points = await ClusterClientActor
.Ask<ContactPoints>(GetContactPoints.Instance, TimeSpan.FromSeconds(1)).ConfigureAwait(false);
if (points.ContactPointsList.Count > 0)
break;
}
var taskAsk = this.ClusterClientActor.Ask<BatchResult>(
new Akka.Cluster.Tools.Client.ClusterClient.Send("/user/coordinator",
new BatchRequest() { BatchToExecute = obj, EnablePersistence = enablePersistence, ExceptionAtOperation = exceptionAtOperation }),
cancelToken);
I was able to successfully send thousands of messages using this without changing the buffer size. However, I still noticed some dropped messages when I tried sending 20,000 messages down the pipe here, so this might merit me digging into it a little deeper... |
For future reference, both of these logging statements should probably be warnings so it's easier for developers to understand why their messages aren't being delivered: Going to submit a second PR to address those. |
Thanks a ton for looking into this and coming up with solutions!!! The solution of increasing the buffer size alone works wonderfully (although a buffer size of 20,000 fails, but that's not a biggie). However, delaying until the cluster client finds at least one receptionist is not working for me. About 40-50% of the messages were dropped when I tried 10,000. The code I tried is below. Not working:-
Working:- I am going to end up just increasing the buffer size for now. Thank You again!!!!! This eliminates a big blocker for us. |
@leo12chandu the 10000 buffer size limit is a hard-coded constraint built into the I'll keep digging into this issue and see why the code for waiting until a receptionist is available didn't work... |
@Aaronontheweb - I think I found another issue with the ClusterClient dropping messages when the actors perform a blocking operation for a second, such as a database call. To reproduce this, you can use the same test application I put together above, except that in BatchActor.cs you add Thread.Sleep(1000) to mimic the database call. Now follow the same instructions as above. You will see that out of 1000 messages sent, we get only about 150 responses; the rest are dropped. BatchActor.cs
|
Please don't use Thread.Sleep to mimic "work"; it's not representative at all. In this case, any task scheduled to run on that thread will be blocked. |
@Danthar in the case @leo12chandu is describing, it's probably ok. Sounds like the |
I think I may have found another situation where ClusterClient is dropping messages. When the message size is large both on the receiving and sending side, it drops after processing certain messages. I've removed the Thread.Sleep(). Instead the changes I made with the tool now are
This works fine when I use ClusterDirect which essentially sends messages directly to actors without ClusterClient. So something is definitely up with ClusterClient. |
@Aaronontheweb - Any luck reproducing this? |
@leo12chandu so I made the changes you suggested and ran everything using the latest Akka.NET 1.3.9 nightlies. One solution made it to completion (9000) and another was missing about ~50 messages. The repro solution is currently running Akka.NET v1.3.3 so I'd recommend upgrading to at least Akka.NET v1.3.8. |
So I've run this several times now and I always make it to completion. I think the ~50 missing on that one run may have been an issue with me copying the result set from the repro app too early. Try upgrading to at least Akka.NET v1.3.8, and failing that, try the Akka.NET Nightly builds for 1.3.9 - I think whatever was causing this issue has been fixed. |
Interesting. Let me try updating to 1.3.8 and also 1.3.9, and try both the large message issue and the blocking operation issue. You didn't happen to increase the number of messages on the TCP port to avoid port exhaustion or anything like that, did you? |
We upgraded the version of DotNetty we're using in 1.3.8, but we also have made some bugfixes to |
That is really weird. I've updated and tried both 1.3.8 as well as 1.3.9 nightly build, and I am able to reproduce dropped messages in both the blocking call case as well as large messages case. Here is the complete code with the large message and blocking call implemented. (without packages and bin/debug to reduce file size) |
So the issue ended up being a couple of things: When the akka.net/src/contrib/cluster/Akka.Cluster.Tools/Client/reference.conf Lines 11 to 24 in 6f32f6a
Prior to adding the Please let me know if there are any other issues here! |
When creating a new issue, please make sure the following information is part of your issue description. (if applicable). Thank You!
1.3.4
Windows 7 & Visual Studio 2017
Are there other ways to send/receive/ask messages to a cluster from outside the cluster, apart from ClusterClientReceptionist?
We are currently registering actors with ClusterClientReceptionist and using ClusterClient to "ask" messages to the cluster from outside. However, when we test with 10,000 (small) messages, after a few thousand the client does not receive a reply for them through Ask, even though the server node processes those messages. Is there a way to flush the messages back to the cluster after Client.Tell?
ClusterClientReceptionist with 2 seed and 2 worker (non-seed) nodes seems to evenly distribute the load between both worker nodes. I am curious whether ClusterClientReceptionist is the recommended way to distribute load across nodes or whether there is a better way to handle it (including the ability to send/ask messages from outside of the cluster).