Using blocking publisher confirms with concurrent publishers #959
Thank you for your time. Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, or discussions of potential issues (as defined by this team). We get at least a dozen questions through various venues every single day, often light on details. Please post this to rabbitmq-users. Thank you.
Some of these approaches cannot possibly work because concurrent publishing on a shared channel is explicitly not supported. Opening a channel and enabling confirms mode on it in a critical section, but not doing the actual publishing there, cannot possibly work either and will result in connection-level errors, visible in RabbitMQ logs, due to incorrect framing.
This basic integration test is a good starting point. It does not use confirms or streaming confirms but it does demonstrate what kind of concurrent access scenarios are expected (or not).
There is not a single instance of concurrent publishing on a shared channel in any of these methods. Please read the code - I reuse channels just as stated in the documentation. Even in the "per thread" variant there is no path where currently executing code could use another thread's channel. 6.x does not even work on your suggested integration test - are you assuming that, to use v6, an application should call SetMinThreads? You can't be serious about that. You can't demand that a modern microservice application limit its multithreading just to use rabbitmq v6. No other library we're using in our microservices platform imposes such limitations - rabbitmq 5, mssql, pgsql, cassandra, redis, consul. There is something wrong with your library, multiple people are trying to prove it to you, and you guys just keep denying it and closing tickets.
@MichaelLogutov please be patient. If it seems like we don't understand your code, you should take the time to explain. Version 6.0 of this library uses the .NET Thread Pool. You must ensure that enough threads are available in the thread pool when your application starts by doing something like this: With the above change, your code works fine using version 6.2.1 of this library. Similar changes fixed #860
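The startup change referred to above would look something like the following. This is a hedged sketch: the value 150 is illustrative, not a recommendation, and should be sized to the expected number of concurrently blocking publishers.

```csharp
using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // Raise the thread-pool minimums at application startup so that
        // many concurrently blocking WaitForConfirms* calls do not starve
        // the pool while it slowly injects new threads.
        ThreadPool.GetMinThreads(out int worker, out int io);
        ThreadPool.SetMinThreads(Math.Max(worker, 150), Math.Max(io, 150));
    }
}
```

SetMinThreads only raises the floor below which the pool will not throttle thread creation; it does not eagerly create threads, so the cost of an over-generous value is modest.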
@michaelklishin could you suggest a better title for this issue? Something like "Use of ThreadPool in 6.X should be more clearly documented".

@stebet now that the thread pool is being used, how can we better guide users so that they do not make the mistake seen here and in #860? Can we detect if default ThreadPool settings are in place and throw an exception?

@MichaelLogutov here is the output of a run on my Arch Linux workstation.
If highly concurrent publishing is desired, I'd suggest using channels (from System.Threading.Channels) as intermediate buffers, and creating worker tasks that consume from those channels and use batch publishing if enough messages are waiting. That way you'll also reduce IO overhead by utilizing fewer connections/models more efficiently. I'll write up a sample that could be added to the documentation.

@michaelklishin @lukebakken would it make sense to remove the "WaitForConfirms*" methods from the 7.0 version and rather document how users could utilize streaming confirms? It'd simplify a lot of logic in the current client and probably prevent incorrect usage of confirms.
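The buffering pattern described above could be sketched roughly as follows. This is an illustration, not the promised sample: the `OutgoingMessage` type, the batch size of 500, and the bound of 10,000 are assumptions, and the worker owns the `IModel` exclusively since channels are not safe for concurrent publishing.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;
using RabbitMQ.Client;

// Illustrative message envelope; not part of RabbitMQ.Client.
public class OutgoingMessage
{
    public string Exchange;
    public string RoutingKey;
    public byte[] Body;
}

public class BatchingPublisher
{
    private readonly Channel<OutgoingMessage> _queue =
        Channel.CreateBounded<OutgoingMessage>(10_000);

    // Called from request-handling code; applies backpressure when full.
    public ValueTask EnqueueAsync(OutgoingMessage msg) =>
        _queue.Writer.WriteAsync(msg);

    // Single worker task that owns the channel (IModel) exclusively.
    // Assumes model.ConfirmSelect() was called before starting the worker.
    public async Task RunAsync(IModel model)
    {
        var reader = _queue.Reader;
        while (await reader.WaitToReadAsync())
        {
            var batch = model.CreateBasicPublishBatch();
            int count = 0;
            // Drain whatever is currently waiting into one batch.
            while (count < 500 && reader.TryRead(out var msg))
            {
                batch.Add(msg.Exchange, msg.RoutingKey, false, null, msg.Body);
                count++;
            }
            batch.Publish();
            // One blocking wait per batch instead of one per message.
            model.WaitForConfirmsOrDie(TimeSpan.FromSeconds(10));
        }
    }
}
```

The key property is that the blocking confirm wait is amortized over a whole batch, so request threads never block on the broker at all.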
@lukebakken Thank you for taking the time to read through my code and explain. @stebet It would be great to see a working prototype with the same throughput as v5. You guys are doing a great job and I'm sure you did a lot of work on v6; it's just that I'm responsible for all "platform" libraries in our company, and even though I'm a long-time rabbitmq library user, I can't come up with any solution that makes v6 as effective as v5 was in highly concurrent web applications. And based on other tickets, I'm not the only one.
I've analyzed the two programs to some degree.

Regarding the thread-pool threads issue: in v5, all the functions were sync, single-threaded and blocking. In v6, all the network traffic is offloaded to the thread pool (still sync & blocking). So in conclusion, it's a combination of us blocking the threads in our functions and the way the thread pool is used for parallelism.

PS: If you use

Regarding @lukebakken's output:
This is a really bad place to be. You should never block a threadpool thread. |
Hmm, I see now that the wording was chosen very poorly... But yes, it is still not optimal to block the threads calling the public API (as they could be, and in this case are, coming from the thread pool); this is why a fully async API is being investigated for 7.0.
@bollhals Thanks, your investigation helps to shed some light on the problem. Setting SetMinThreads(150, 150) helped, but I still get about 10 times more latency (on average) for sending one message compared to v5. Changing Task.Run to Task.Factory.StartNew with LongRunning, without setting SetMinThreads, works too, but is still about 10 times slower than v5. I'd say it's not very clear whether it's even recommended to use v6 in web applications, for these reasons:
Just some quick input here to think about. |
I've changed the way the test is performed, making each task publish 100 messages. So 100 tasks in parallel, each sending 100 messages. The results are more adequate: using per_thread I was able to achieve the same speed on v6 as on v5, so I stand corrected on point 3 of my previous message.
@bollhals that should be possible assuming that the client and the calling code are async-oriented/aware. See Strategies for Using Publisher Confirms. Confirms coming from the server are inherently asynchronous. As is the publishing part. |
About streaming confirms. In our platform rabbitmq plays a crucial role delivering asynchronous business events between microservices. It's important for the application to send the event after the business operation is completed. So we use only durable queues and exchanges, mirroring, and persistent messages with publish confirmation. We even created a reserve channel - a microservice that can receive, via http, queue messages that the app failed to send to the rabbitmq server. We're even evaluating XA transactions at the moment.
@MichaelLogutov I'm not buying this argument. Switching to streaming confirms is always a possibility: the confirms arrive asynchronously and at an unknown moment in time anyway; the question is what the publisher does while it has to wait. This client is increasingly moving towards an async-oriented API and there is no reason why this "task" (used in the theoretical sense) cannot be performed without blocking. Sticking to the blocking API and expecting good throughput and latency is a losing proposition: most users of the client want it to move in the exact opposite direction. "One-by-one-and-blocking" is a confirm use case that cannot be substantially optimized and will not be a priority for the maintainers and users of this client.
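A hedged sketch of what streaming-confirm tracking can look like on this client: correlate each message's publish sequence number with a TaskCompletionSource and complete it from the channel's BasicAcks/BasicNacks events. The class name and structure are assumptions for illustration, not an API of RabbitMQ.Client.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;
using RabbitMQ.Client;

public class ConfirmTracker
{
    private readonly ConcurrentDictionary<ulong, TaskCompletionSource<bool>> _pending =
        new ConcurrentDictionary<ulong, TaskCompletionSource<bool>>();

    public ConfirmTracker(IModel model)
    {
        model.ConfirmSelect();
        model.BasicAcks  += (sender, ea) => Complete(ea.DeliveryTag, ea.Multiple, ack: true);
        model.BasicNacks += (sender, ea) => Complete(ea.DeliveryTag, ea.Multiple, ack: false);
    }

    // Call immediately before BasicPublish on the same (non-shared) channel;
    // NextPublishSeqNo is the tag the next publish will receive.
    public Task<bool> Track(IModel model)
    {
        var tcs = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[model.NextPublishSeqNo] = tcs;
        return tcs.Task; // completes with true on ack, false on nack
    }

    private void Complete(ulong tag, bool multiple, bool ack)
    {
        if (!multiple)
        {
            if (_pending.TryRemove(tag, out var tcs)) tcs.TrySetResult(ack);
            return;
        }
        // A "multiple" confirm covers every outstanding tag up to and including it.
        foreach (var seq in _pending.Keys)
            if (seq <= tag && _pending.TryRemove(seq, out var tcs))
                tcs.TrySetResult(ack);
    }
}
```

The publisher then awaits the returned task (or fires and records it) instead of blocking a thread in WaitForConfirms.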
This is a wild assumption: that "Web applications" all look the same and can only use one publishing pattern. 100% of major changes in this client were driven by the needs of its users (and both code and benchmarking/profiling data were contributed by said users):
@stebet @bollhals and others have contributed very meaningful improvements in some of these areas, and this has continued for 7.0. This seems to have had a negative effect on a specific workload that uses blocking publisher confirms in N threads. I guess no user of this (generally recommended against) workload showed up to try it out and report the issue when the changes were discussed months ago. This is how open source software works: those who participate in its development get their ideas in. @MichaelLogutov please have some respect for the work that was contributed to this client for the needs of other users, or you will be banned from ever filing another issue in this project again. Yes, I am dead serious.
Making a task publish a batch of messages before waiting is strategy number two outlined in the docs. It is a middle ground between the publish-one-and-block-waiting option and using streaming confirms. If there are specific suggestions about the defaults this library uses, we would consider them for a future release.
@michaelklishin I never meant any disrespect to anybody involved in this project. Being in charge of more than 20 platform libraries used in 350+ microservices, I know what it takes to make breaking changes. You guys don't have an army of internal client testers like Microsoft, so it's understandable (to me at least) that such cases can happen. It's just that reading a couple of different tickets about the same root cause, and seeing them closed, led me to feel you don't want to acknowledge the problem. I apologize if I misunderstood that. I was actually very happy back when I first saw what was being proposed for v6. And I 100% support going fully async, even if it requires more breaking changes than one would like (because that's sometimes what it takes when you have limited resources and a big goal). Now, if I understand you correctly, what you're suggesting is that to get guaranteed delivery of a message, we (clients) will have to implement our own logic, spinning up channels and organizing awaitable TaskCompletionSources to track delivery of specific messages, as a workaround. Ok then, I'll try to go this way. I'll also incorporate some logic for calculating min threads into applications using the new library and test it under heavy load - maybe it will be ok.
Your understanding is largely correct in that right now publisher confirms are a mechanism you integrate into your apps, not a feature flag you flip. We'd like to get there one day, but right now channel pooling and confirm tracking are up to the publishing apps. It's not clear whether a library can provide One True Way™ of using publisher confirms. A lot of RabbitMQ team members would like to see something like that in most if not all clients. In the meantime, there's a clear need for some examples that demonstrate how to address a few scenarios, such as a moderately high volume RPS Web application with publisher confirms enabled. We can also make the library less painful to use with all defaults for those users.
I just double checked the code, and if we implement our own logic for the CountdownEvent that incorporates async awaiting (and possibly the delivery tag tracking), we could implement an async wait for confirmations. (Or we use something other than CountdownEvent.)
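An async-friendly replacement for the blocking CountdownEvent could look roughly like this (a minimal sketch, with no reset or cancellation support): callers await a Task instead of blocking a thread-pool thread.

```csharp
using System.Threading;
using System.Threading.Tasks;

public class AsyncCountdownEvent
{
    private int _count;
    private readonly TaskCompletionSource<bool> _tcs =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    public AsyncCountdownEvent(int initialCount)
    {
        _count = initialCount;
    }

    // Decrement the count; the last signal completes the awaited task.
    public void Signal()
    {
        if (Interlocked.Decrement(ref _count) == 0)
            _tcs.TrySetResult(true);
    }

    // Awaiting this does not occupy a thread while waiting.
    public Task WaitAsync() => _tcs.Task;
}
```

RunContinuationsAsynchronously matters here: without it, continuations would run inline on whatever thread delivers the final confirm, typically the client's I/O loop.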
I understand it all. In theory, this is so. The main thing is that the theory does not contradict the practice.
A quick update. I've released the channel-per-thread implementation on one of our production microservices (netcore 3.1, 35 containers, ~2000 RPS overall, about 50 RPS per container with a 6k hard CPU limit) and it seems to work the same as v5 did - no improvement, but no degradation either. That's without using SetMinThreads.
Regarding channel-per-thread. Do you create dedicated threads? |
No. I've used ThreadLocal. And a single publish connection for application. |
Ok, had to roll back the version. In the middle of the night all publishing stopped due to this error:
I have one connection per application, and all channels are stored in a ThreadLocal and accessed like this:
It seems to me you just created more and more new channels as threads with new identifiers were created in the thread pool, and the old channels remained open (not disposed).
Apparently so. Anyway, this shows that channel-per-thread is not feasible in some applications.
It's more about the ThreadLocal usage: every thread that ever accessed it created a new channel attached to the connection. In your case the connection eventually ran out of channels (I think the absolute maximum is ushort.MaxValue, but it can also be limited on the server side). To be fair, this is not a particular issue of this library, but yes, using ThreadLocal or ThreadStatic for this purpose is probably not a good idea if the application runs for a long time. Using a concurrent collection for pooling, or an object pool, probably makes more sense for your application.
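A sketch of the pooling approach suggested above, using a ConcurrentBag instead of ThreadLocal so channels are reused across threads and returned after each publish. The class and method names are illustrative, not part of RabbitMQ.Client.

```csharp
using System;
using System.Collections.Concurrent;
using RabbitMQ.Client;

public class ChannelPool : IDisposable
{
    private readonly IConnection _connection;
    private readonly ConcurrentBag<IModel> _channels = new ConcurrentBag<IModel>();

    public ChannelPool(IConnection connection)
    {
        _connection = connection;
    }

    // Take a pooled channel, or open a fresh one if none is available.
    public IModel Rent()
    {
        while (_channels.TryTake(out var channel))
        {
            if (channel.IsOpen) return channel;
            channel.Dispose(); // drop channels closed by errors
        }
        var fresh = _connection.CreateModel();
        fresh.ConfirmSelect(); // assumes confirms are wanted on every channel
        return fresh;
    }

    // Return the channel after publish + confirm; closed ones are discarded,
    // so the pool size is bounded by peak concurrency, not thread churn.
    public void Return(IModel channel)
    {
        if (channel.IsOpen) _channels.Add(channel);
        else channel.Dispose();
    }

    public void Dispose()
    {
        while (_channels.TryTake(out var channel)) channel.Dispose();
    }
}
```

Unlike ThreadLocal, nothing here ties a channel's lifetime to a pool thread's identity, which is what caused the channel leak described above.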
100% agree |
I have pushed a |
Status update. I've changed the code to "channel-per-publish" (create a channel for each publish, publish, wait for the confirm, dispose the channel) and released again. It's working without errors, but even the median time for publish+confirm is higher on v6 than on v5. I've released v6, then v5, then v6 again, and here are the results (our 5-node rabbitmq cluster is heavily loaded today, but the difference is still visible on all percentiles):
With the "channel-per-publish" approach you'll see decreased throughput due to the channel creation/close overhead. I really do recommend the pooling approach, as I said in my last comment. Also, for a direct comparison of v5 to v6, it's possible that there is a negative difference because in v6 more threads are involved in sending and waiting for confirmation. This is due to the offload to a worker thread for sending, which brings limited benefit if you immediately wait for a single confirmation afterwards. Do you know how many concurrently open channels you have? (In your initial post you said ~100?) One alternative would be to change how you wait for confirmation. Is the method that sends the message async? (Or could it be modified to be?)
I don't have telemetry for that in production right now. The first post was about a sandbox app.
But for that to work with the suggested channel pooling, do I need to implement some sort of cross-channel ack/nack tracking? Or are acks/nacks properly sent to the channel that published the message in the first place?
AFAIK yes, they're sent back to the channel they were published on. (I assume that's why the WaitForConfirmsOrDie methods are also placed on the channel and not on the connection.)
Moving to channel pooling helped, but it's still slower than v5.
Thanks to @bollhals, #999 should be a decent stop-gap measure in addressing this before the API eventually switches to be fully async/await-oriented. @MichaelLogutov would you have a chance to take a look at #999 and give it a try? Note that it has some limitations mentioned in the comments. |
Hey, thanks! I can't do it right now due to time constraints on another project, but I will definitely test it a bit later.
Hello.
We're using rabbitmq library version 5 in web applications to publish messages from requests. As you may guess, it's a highly concurrent environment and it's quite common to have 100-500 requests per second, each sending a message to rabbit.
Trying to update to version 6 of the library (6.2.1 at this time) shows a lot of timeout errors while sending messages. The problem can be reproduced locally and I've attached a simple solution with two projects:
I've also added a simple docker compose file to spin up a single local rabbitmq node on 127.0.0.11
Each project contains identical code which performs these steps:
Step 3 is implemented in 3 different ways and can be switched by passing a specific argument when running the project:
Running all tests on v5 gives me very good performance - 7-9 msec per message (100 messages total in parallel).
Running all tests on v6 never completed without timeout errors.
Can someone please look into this? Maybe I'm doing something wrong? How does one use rabbitmq v6 in a highly concurrent environment like the web?
RabbitMQ.zip