-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nonblocking callback #459
base: master
Are you sure you want to change the base?
Nonblocking callback #459
Conversation
Move all callback operations into an auxiliary `AraWorker` class that runs in a separate thread started by the `CallbackModule` class, which is now just a shell that puts messages into a queue for the worker thread to process. The callback plugin now does not block the main Ansible thread.
Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 23s |
Running the thread as "daemon" makes the previous cleanup unecessary.
Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 01s |
Prevents program from exiting before all ARA work is finished. (meant to commit this earlier - whoops!)
Hey @erik-overdahl and thanks for the PR ! I've looked at the CI failure and I think I know what's going on but it's a bit tricky. This is the error:
It's these tasks: ara/tests/integration/lookups.yaml Lines 71 to 91 in dc4344e
In other words, we are generating a failure for the sake of testing that we can query and search for it within the playbook. I don't have a good suggestion at this time but does my explanation make sense ? Edit: I haven't got around to reviewing and testing the code yet, apologies for the delay |
That is my understanding as well. I tried adding a |
Was reading your conversations with much interest. Where I'm a bit afraid of and I feel confirmed after the latest findings is that the reporting is no longer live. As long as the playbook runs to the end everything might be fine. But if the playbook run will be aborted or crashs due to any other reason then there is the possibility that later tasks are already performed before former tasks are fully reported to Ara. Thus the question of this change is speed vs accuracy. For me speed is important but more important is accuracy. I prefer correct results in Ara then faster playbook runs. (btw that's why I tell everyone in my company to not use the free strategy, because this led to problems in Ara too). |
I do not agree with your assessment. All results will still be logged by ARA. A failing task "crashes" the playbook but the Ansible program underneath continues to run. The ARA worker added in this PR runs as a daemon thread, so the program does not exit until the thread is done with its work. So there is no problem with accuracy or missing results unless the Ansible program itself crashes, and if your playbooks are causing that to happen, you have bigger problems to deal with. Additionally, in a more realistic Ansible scenario, tasks on hosts do not complete immediately like they do in our simple benchmark, giving ARA more time to complete the HTTP calls. In a normal situation, the reporting should line up closely to the actual run of the playbook. I attempted to reproduce this in my second set of benchmarks, and indeed you'll notice that the time differences between the runs with ARA and those without are much less severe than in the simple benchmark where all tasks complete immediately. Refactoring the callback to be nonblocking addresses more than just performance. The synchronous nature of callbacks in the Ansible architecture means that, in its current form, ARA makes callbacks for things like timing the execution of tasks completely useless. Not only is this bad in itself, but it is also not transparent; people using those callbacks may enable ARA and still be relying on those timings to be accurate representations of the actual task executions times on the hosts, unaware of the slowdown coming from ARA. Avoiding the free strategy because ARA cannot handle it is backwards thinking, imo. If there is a language feature that is perfectly fine by itself but you cannot use it because your logging system cannot handle it, then the problem is with the logging system. ARA should be implemented better. No one should even have to think about including this in their Ansible toolkit and they should not have to change their Ansible playbooks to accommodate ARA. |
that is exactly what I meant. Maybe a user kills it, or the controller (e.g. a VM) crashs due to any reason, must not be related to the playbook itself. But in such cases I want to have a propper Ara reporting as well.
But wouldn't it better then if directly Ansible supports kind of async callbacks in general? |
If the machine you are running Ansible on crashes, there isn't anything you can do, and current ARA also has the risk of losing results in this case. A task may finish, but if we are still waiting on the POST requests in the callback to finish when the crash happens, then you just lost all results you didn't send yet. If any exception happens in the Ansible process, the exception is caught at the top level and the process shuts down gracefully - see the EDIT: after running some tests, you are indeed correct that user cancelling will cause lost data when this patch is applied. I can fix this by bringing back the I agree that Ansible SHOULD implement the callback plugins in a non-blocking manner, but they don't, so it is up to the callback plugins themselves to not block. |
Even if the user cancels the run. Brings back the `__del__` method on CallbackModule removed in 6397d24
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 49s |
Right. I might describe ara's attempts to record data as "best effort", to some extent, due to the nature of how the callback interface works. ara's callback gets "woken up" when it receives a hook but if it stops receiving hooks, it can't do anything. There isn't much we can do about that, unfortunately.
Over the years, ara has helped with finding issues in the Ansible callback interface and while some were fixed, others were closed because fixing them would be too complicated or break stuff. One of those is about the blocking nature of the callback interface but that was more than 5 years ago now: ansible/ansible#27705 We could revisit that and ask about it upstream but the challenge is that any meaningful change would probably break compatibility and only be available in newer versions of ansible-core since it wouldn't be backported. Anyway, before we settled on the current threading implementation, every request was synchronous. Now, there are still some amount of synchronous requests because, for example, to create a play we need a playbook ID and so we need to make sure the playbook has been created before we try to create the play. For what it's worth, I went through the exercise of picturing a high level overview of how the sequence currently works from the perspective of the ara callback: |
I think to call the current durations of tasks There are certain tradeoffs in ara for the sake of simplicity and sometimes it comes at the expense of performance. Don't get me wrong, performance is important and I understand what you mean and how it would be more accurate to record the task duration as 3m 22s instead 3m 22.5s but that's neither useless or the end of the world.
ara should work reliably with the free strategy. It historically hasn't because I did not personally have a use case for it and so a lot of ara was built around assumptions of running synchronously with the serial strategy. There are some fixes for the free strategy in 1.6.0 and if there are still issues, we should document and fix them. |
Yes, upon further reflection, I agree with you that ARA calls being reported in execution time is mostly a non-issue. I was thinking of it as reporting the execution time of a task on a host, but that is plainly not the case when the number of hosts is greater than I personally think that the performance gains from this patch are worth it, but I understand if you do not want to take on the added complexity. It isn't much more complexity, but it isn't none either. I can try to update that integration test to wait until the record is available, if you like (although I'm not sure of the best way to do this - you can't really retry until the record appears, because the test would never fail if the record wasn't actually there, so then your option is, what, a timeout? but that seems flaky, especially in a test, which you would hope is 100% deterministic). Perhaps the most useful datapoint on ARA performance is this comment from the issue related to the benchmark blog post which claims that the 30% gains they saw from using Ansible with Mitogen were wiped out by ARA. I would like to have a wider range of benchmarks. The numbers in the original ARA blogpost make the situation look a lot worse than it actually is. But I think it would be difficult to develop something useful; there are a lot of potential scenarios in a realistic workload. Re: the free strategy - One of the things I found when I sprinkled logging statements throughout the callback code was that the ARA threading strategy ends up killing and creating a lot of ThreadPoolExecutors as the |
Hi @erik-overdahl, I just wanted to say I haven't forgotten about your PR and that I'll get back to it when the time is right. The callback has grown organically over the years: it has had features and fixes patched on along some reverse engineering of internal ansible-core's "API"s. It could be a good idea to take a step back, look at what it does and what we've learned over the years to come up with something a bit cleaner. ara didn't always have an API and the callback used to make database queries directly with flask so a lot has changed since then :p Performance, threading, queueing or asynchronicity can be part of an eventual refactor but I mean to say that a lot of what the callback does could benefit from being put in a library -- things like creating a playbook, a task, a host or a result, for example. I want to have a better idea of what this looks like before going down the road of optimizing performance such that we don't step on each other's feet. I appreciate your patience in the meantime ! |
If the risk of losing updates on crash is an issue, I think we can address those easily enough. As @erik-overdahl mentioned we should be able to delay the shutdown until the queue drains. A typical approach I see in other applications (including my own) with similar concerns is that SIGINT/SIGTERM will initiate a graceful shutdown (wait for things to terminate cleanly, and in this case for the queue to drain), and a second SIGINT/SIGTERM causes an immediate exit. We can also put in a max queue size. For example if we make the max size 0, then the request has to be passed synchronously to the thread, but once the thread has it ansible can continue and the thread would make the API call in the background. Any further attempts to push something onto the queue will block until the API call completes. I also just want to add my own real world experience. The benchmarks in the original post were overly optimistic for my experience. My playbooks have lots of very short tasks that run across hundreds of hosts. I saw playbooks taking over 3x longer when ARA was enabled. A playbook that took about 7 minutes was taking almost 30 minutes. Though this was a while ago, and I cant remember details. I might have been using django auth, and might have been using sqlite, of which I now use neither, and am using the async wrapper I posted in #171 (comment). |
I have tested the Async Wrapper. It does not work for us because it changes the timestamps in ARA (or in other words, these do not correspond to the actual time of execution). But I think the idea of the async wrapper (thanks for sharing @phemmer) is a good one. |
This PR makes it so that the ARA callback module no longer blocks the flow of Ansible playbook execution at all. Now the only blocking happens for the requests sent at the end of the playbook (this is unavoidable). This is accomplished by moving all logic for the callback to a separate thread and pushing received callback messages to the worker thread via a queue.
I ran some very simple benchmarks to confirm that this does in fact improve performance. I would like to have more robust measurements, but right now these are what I had time for.
All measurements were taken with 10 hosts performing 5 tasks across 10 forks. Times are reported in milliseconds and averaged over 5 runs.
First, the test used in the ARA benchmark article - each task is just
debug
The second test used the
ansible.builtin.shell
module to sleep for 1 - 3 seconds in order to mimic more realistic conditionsThe nonblocking client is faster than the current ARA master branch, and, importantly, makes a big performance difference in the situations where the previous threading-patch cannot help ( using sqlite as the database backend).