Replace RabbitMQ (concrete design notes) #42

Open
wants to merge 10 commits into
base: master
from

Conversation

unkcpz
Member

@unkcpz unkcpz commented Nov 11, 2024

  • Used AEP template from AEP 0
  • Status is draft
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided github handles for authors

#30 gives the explanation and motivation for removing RabbitMQ. I extend it here as a concrete implementation AEP.
The prototype lives at https://github.com/unkcpz/tatzelwurm. It is private at the moment and shared only with aiida-core team members. Let me know if you want to take a look.

Go to the built AEP page if you want to read it as a whole post.

@unkcpz unkcpz mentioned this pull request Nov 11, 2024
5 tasks
@unkcpz
Member Author

unkcpz commented Nov 11, 2024

I am aware of chrisjsewell/aiida-process-coordinator#4; the comments there by @giovannipizzi and @sphuber are addressed at various points in the design notes.
I have picked out the ones I think are important and reply to them explicitly here.

The only remaining question that sprung to mind was the usage of broadcasting where processes broadcasted state changes. There are some basic usages of this functionality, for example verdi process watch which allows to watch a process' state changes, but these are not that important. What is important though, is how workchains used this functionality to quickly start running again once the child processes they were waiting for reach the terminated state:

This was raised by @sphuber. I think that if task priorities and different types of runners are implemented, the problem can be addressed gracefully.
Once the child processes a workchain is waiting for have finished, that workchain gets a relatively higher priority than workchains whose child processes are still running, so it is picked up and run first.
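
To make the priority idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: `WorkchainTask`, `TaskPool`, and the two priority levels are hypothetical names that do not come from the prototype, and a real coordinator would persist tasks instead of keeping them in a dict.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical priority levels; the lower value is picked up first.
PRIORITY_READY = 0    # every awaited child process has terminated
PRIORITY_WAITING = 1  # at least one awaited child is still running


@dataclass
class WorkchainTask:
    pk: int
    # pks of the child processes this workchain is still waiting for
    children: set[int] = field(default_factory=set)

    @property
    def priority(self) -> int:
        return PRIORITY_WAITING if self.children else PRIORITY_READY


class TaskPool:
    """Toy in-memory task pool (an assumption, not the prototype's API)."""

    def __init__(self) -> None:
        # A dict keeps insertion order, which serves as the tie-breaker.
        self._tasks: dict[int, WorkchainTask] = {}

    def submit(self, task: WorkchainTask) -> None:
        self._tasks[task.pk] = task

    def child_terminated(self, child_pk: int) -> None:
        # Stop waiting for this child in every workchain that awaited it.
        for task in self._tasks.values():
            task.children.discard(child_pk)

    def next_task(self) -> WorkchainTask | None:
        if not self._tasks:
            return None
        # min() returns the first minimal item, so equal priorities
        # are served in submission order.
        task = min(self._tasks.values(), key=lambda t: t.priority)
        del self._tasks[task.pk]
        return task


pool = TaskPool()
pool.submit(WorkchainTask(pk=1, children={10, 11}))  # children still running
pool.submit(WorkchainTask(pk=2, children={20}))
pool.child_terminated(20)  # workchain 2 has nothing left to wait for
first = pool.next_task()   # workchain 2 is picked before workchain 1
```

No broadcast is needed in this sketch: the coordinator only has to learn that a child terminated, and the parent is promoted the next time a runner asks for work.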

The following points were raised by @giovannipizzi:

I suggest to use PostgreSQL for stress testing

In the prototype, before I mocked the task pool for serializing/deserializing tasks, I created fake async/sync sleep tasks in the worker. It handled more than 100,000 tasks without any noticeable delay (the bottleneck was then that my terminal could not keep up with printing so many log lines).
Under heavy load I saw the CPU usage scale to more than 1 (i.e. more than one core), which I attribute to Rust's tokio; this is not possible in Python because of the GIL.
For a more realistic benchmark I use SurrealDB, which offers both in-memory and on-disk storage and supports SQL-style queries. It is said to have potentially very good performance.
It should also be very easy to test the scenarios you are interested in yourself by just downloading the binaries.

is there still a concept of "slots" per worker, like now? (i.e. how many AiiDA processes a worker can take care of)?

In my plan, once runner types and task types are distinguished, I think a single async runner is enough to handle more than 10,000 tasks per second, which is far more than the current use cases of AiiDA require (though perhaps not future ones, if AiiDA plans to compete with workflow engines for bioinformatics and ML use cases). The slots concept then applies only to the blocking function runner.
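
A rough Python sketch of that split, under assumptions: it uses `asyncio` rather than the Rust/tokio runtime of the prototype, and `BLOCKING_SLOTS`, `async_task`, `blocking_task`, and `run_all` are hypothetical names. Async tasks share one event loop with no slot limit, while blocking functions go through a thread pool whose size plays the role of the slots.

```python
from __future__ import annotations

import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

BLOCKING_SLOTS = 2  # hypothetical slot count for the blocking function runner


async def async_task(i: int) -> int:
    # Stand-in for an awaitable step, e.g. waiting on a remote job.
    await asyncio.sleep(0.01)
    return i


def blocking_task(i: int) -> int:
    # Stand-in for a blocking function that would stall the event loop.
    time.sleep(0.01)
    return i


async def run_all(n_async: int, n_blocking: int) -> tuple[list[int], list[int]]:
    loop = asyncio.get_running_loop()
    # At most BLOCKING_SLOTS blocking tasks run at once; the rest queue up.
    with ThreadPoolExecutor(max_workers=BLOCKING_SLOTS) as executor:
        blocking = [
            loop.run_in_executor(executor, blocking_task, i)
            for i in range(n_blocking)
        ]
        # Async tasks are not limited by slots: all run on the one loop.
        async_results = await asyncio.gather(
            *(async_task(i) for i in range(n_async))
        )
        blocking_results = await asyncio.gather(*blocking)
    return list(async_results), list(blocking_results)


async_out, blocking_out = asyncio.run(run_all(n_async=1000, n_blocking=4))
```

Here the 1,000 async sleeps overlap on a single thread, whereas the four blocking calls are processed two at a time.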

@unkcpz unkcpz force-pushed the aep/remove-rabbit branch 2 times, most recently from b40433d to cbfdee8 Compare November 13, 2024 15:47
@unkcpz unkcpz force-pushed the aep/remove-rabbit branch 2 times, most recently from 633eff3 to 665712a Compare November 13, 2024 16:39
@unkcpz unkcpz changed the title Add AEP: Remove RabbitMQ (concrete design notes) Replace RabbitMQ (concrete design notes) Nov 14, 2024