Make sure sidekiq workers increase their db pool size #362
Conversation
I wasn't able to fully reproduce the problem, but the logic makes sense to me
lib/dynflow/rails/configuration.rb
Outdated
@@ -96,7 +96,11 @@ def rake_task_with_executor?
      end

      def increase_db_pool_size?
-       !::Rails.env.test? && !remote?
+       !::Rails.env.test? && !remote? || sidekiq_worker?
Should we add parens to make the order clearer? Like, I assume it's supposed to be:
!::Rails.env.test? && (!remote? || sidekiq_worker?)
Ah, good point, done
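For context, Ruby's `&&` binds more tightly than `||`, so without the parentheses the condition would bump the pool for sidekiq workers even in the test environment. A sketch of the resulting predicate with the suggested grouping (assuming the method names from the diff above):

```ruby
# Without parens the line parses as (!::Rails.env.test? && !remote?) || sidekiq_worker?
# With parens, the test-environment guard applies to both branches:
def increase_db_pool_size?
  !::Rails.env.test? && (!remote? || sidekiq_worker?)
end
```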
I think I brought this up in IRC and not here, but there is a design consideration to think through. For Rails-based projects this can have potentially unintended consequences if not accounted for with respect to tuning.

If we take Foreman as the use case: we deploy the main Foreman instance as a Puma Rails application with 2 workers by default, and each Puma worker gets its own full DB pool. For the baseline Dynflow + Sidekiq deployment we have an executor, 2 workers and a hosts queue worker, each with a DB pool size of 5. That implies 4*5 (20) needed connections. If we add a 5-connection buffer per Dynflow process, that goes up accordingly.

All that is to say: if having a buffer per worker is needed to account for things like threads inside a Dynflow worker being created by application code, or additional database reads happening outside the main transaction, I would request that this be a value that is based off configuration rather than a hidden value. This would allow both tuning of the buffer and automated calculation of the PostgreSQL connection limit and DB pool size in configuration code, rather than having to know to add 5 in all the right places. Thoughts?
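A back-of-the-envelope tally of the numbers mentioned in the comment above (the process count, pool size and buffer are the values assumed there, not configuration):

```ruby
# Illustrative connection math for the deployment sketched above.
processes = 4   # executor + 2 workers + hosts queue worker
pool_size = 5   # AR pool size per process
buffer    = 5   # extra connections this change would add per process

processes * pool_size             # => 20 connections without the buffer
processes * (pool_size + buffer)  # => 40 connections with a per-process buffer
```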
Don't we deploy "just" orchestrator, worker and hosts queue worker?
Eh, this is bogus. Sequel's connections should be outside of the AR connection pool in question here, so it doesn't really explain where those +5 AR connections come from; it is even possible that we only need those additional connections in the orchestrator instance, but not in the workers. Anyway, it would still be good to bear in mind that Sequel has its own connection pool with some additional connections.
If we know how many additional connections we need and we know that by tuning it up we won't gain anything and that by tuning it down we would cause issues, why have it configurable?
As an interesting side-effect of this, Sequel's connection pool gets increased as well.
I suspect that loading Rails code somewhere creates AR connections. You can try to set Foreman's concurrency to 4 instead of 5 to see if it becomes +4 AR connections.
This is possible, but probably not really relevant to us. When the dynflow world is ready, there doesn't seem to be any active AR connection.
The +5 is a constant given by Dynflow, so fiddling with settings elsewhere won't change it. I suspect it was just a buffer for when active record was used outside of the worker threads in the old model (for example for delayed plan deserialization or running execution plan hooks). If we had a simple reproducer then we could just try lowering it until it would break. Sadly the only reproducer I'm aware of has too many moving parts (syncing >=20 repos at once with katello).
So we end up with 20 connections per worker? Ouch. And to educate me, Sequel is the library underlying Dynflow's database connections and lives outside of the Rails database context? Would you be able to summarize the breakdown of connections that get created via Sequel and those via Rails? (i.e. orchestrator connections are via X, worker connections via Y, tasks spawn Sequel connections or Rails connections?)
Correct. When running with rails we piggyback on the db config passed to rails and set up sequel based on that.
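Roughly what that piggybacking could look like (a sketch only, not the actual Dynflow initializer; `DYNFLOW_DB` and the option names are assumptions, and `connection_db_config` is the Rails 6.1+ API):

```ruby
require 'sequel'

# Reuse the database configuration Rails already loaded and open a separate
# Sequel connection, with its own pool, based on it (PostgreSQL assumed).
rails_db = ActiveRecord::Base.connection_db_config.configuration_hash

DYNFLOW_DB = Sequel.connect(
  adapter:         'postgres',
  host:            rails_db[:host],
  database:        rails_db[:database],
  user:            rails_db[:username],
  password:        rails_db[:password],
  max_connections: rails_db[:pool]
)
```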
Dynflow on its own uses only Sequel. No matter if the process is an orchestrator or a worker, it will have at least one Sequel connection to register itself as a world in the dynflow_coordinator_records table. In general any application code (anything defined outside dynflow/dynflow) uses AR, or to be more precise, doesn't share the Sequel connections we already have.

Worker: Every time a step is executed in the worker, we try to flip its state in the db, so we need a connection. I don't think there's anything else related to Sequel going on in the worker, so we're at one Sequel connection per worker thread on top of the registration one.

Orchestrator: Dynflow uses the actor model from concurrent-ruby heavily; if I'm reading things right, the actors are backed by a cached thread pool, which is by default unbounded. The orchestrator should currently have 5 actors which can access the db. In addition to this the orchestrator can also run user code (execution plan hooks) and it updates Tasks, so it should need 2 AR connections.

Tasks: This uses both, as it serves as a bridge between dynflow and the rest, so up to 1 + 1 per thread.
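A rough tally of the orchestrator numbers described above (treating the world-registration connection as separate from the actors is an assumption, not something stated in the comment):

```ruby
# Orchestrator connection tally, purely illustrative:
db_touching_actors  = 5                       # per the comment above
orchestrator_sequel = 1 + db_touching_actors  # world registration + the actors (assumed additive)
orchestrator_ar     = 2                       # execution plan hooks + task updates
```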
The state flipping that happens in the worker, is that happening at the overall worker level or per task running on the worker? (I am assuming the latter, given what you said above.)
Does that mean that each of the 5 actors has an unbounded thread pool and each thread could connect to AR?
Per thread here, is that threads within each unit of concurrency or the concurrency level itself? And is this per worker? Or part of the orchestrator? Or part of the Rails stack?
It feels like we are starting to mix names here, since all the projects involved use the same names for different things, so I'll start using more specific terms.
State flipping happens on a worker thread level (concurrency settings for sidekiq == number of worker threads). So if 5 steps (the atomic unit of work in dynflow) are running in a single worker process which has 5 worker threads, then the states of all 5 steps could be flipped in parallel, requiring 5 Sequel connections. At the same time, each of those steps could touch AR stuff, requiring 5 AR connections in total. Now the worker process might be doing some dynflow stuff outside of the worker threads, requiring at least 1 additional Sequel connection, giving us a grand total of concurrency + 1 Sequel connections and concurrency AR connections (6 and 5 in this example).
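The same worker-process totals spelled out as arithmetic (the concurrency of 5 is just the example used above):

```ruby
# Connection needs of a single worker process in the example above:
concurrency   = 5                # sidekiq concurrency == number of worker threads
sequel_needed = concurrency + 1  # one per in-flight step state flip + one outside the worker threads
ar_needed     = concurrency      # application code each running step may touch
```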
There is one shared, but unbounded, thread pool in the worker process; the actors run in threads borrowed from that pool. However, because each actor only runs one thing at a time and there's a set number of actors, there would be at most 5 AR connections.
It is shared among all actors of a single process.
I think it still happens inside actors, but it goes over AR. To be honest I'm not fully sure if this would be 1 AR connection or more; these things kinda happen all over the place and it is hard to track what's happening where.
Just thinking out loud: can there be a situation where the connection isn't released back to the pool? Is there instrumentation to see these pool sizes?
We've had those in the past. There shouldn't be any such situation, but we can't rule it out with 100% certainty.
It depends how you define instrumentation. If poking at the internal state of objects from someone else's library counts, then yes.
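A hedged example of the kind of poking that is possible from a console today (`DYNFLOW_DB` is a hypothetical name for whatever Sequel::Database instance dynflow holds; `connection_pool.stat` is ActiveRecord's public API since Rails 5.1):

```ruby
# One-off inspection of pool usage; good enough for spot checks,
# though not the over-time graphing mentioned below.
ar_pool = ActiveRecord::Base.connection_pool
p ar_pool.size   # configured pool size
p ar_pool.stat   # => {size:, connections:, busy:, dead:, idle:, waiting:, checkout_timeout:}

# Sequel's threaded pool exposes similar counters (DYNFLOW_DB is hypothetical):
p DYNFLOW_DB.pool.max_size  # configured maximum
p DYNFLOW_DB.pool.size      # connections currently instantiated
```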
Ideally you'd be able to graph it over time. In my experience that's often very useful.
That would be nice to have, but it would also require code we currently don't have. To get this back on track, we know that we need at least …
I appreciate your patience on this @adamruzicka. One thing I was aiming to do was to make sure we end up with the right math for tuning an installation, from a db pool size and available PostgreSQL connections point of view. With this change, and the current code, is the following math true?
For a standard Foreman installation:
For a standard Foreman + Katello:
Thanks for the detailed explanation. I would just like to make sure the connection math is right, but I am good with this change.
@ehelms the orchestrator is a bit special, we set concurrency to 1 for it. If it were higher, it could have undesirable side effects. At the same time, there are more actors running inside it than inside the regular workers, and probably just a single connection to the db should be used outside of any actor, so the formula is a bit different. I'd say something along the lines of …
One nit. Apart from that 👍
lib/dynflow/rails.rb
Outdated
@@ -38,8 +38,8 @@ def initialize!
      init_world.tap do |world|
        @world = world
        config.run_on_init_hooks(false, world)
        config.increase_db_pool_size(world) if config.increase_db_pool_size?
This is checked inside of the method already.
Suggested change:
-      config.increase_db_pool_size(world) if config.increase_db_pool_size?
+      config.increase_db_pool_size(world)
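A sketch of the pattern the nit points at (not the literal upstream code): when the method guards itself, the caller-side check is redundant.

```ruby
# Hypothetical shape of the configuration method: the guard lives inside,
# so callers can simply invoke it unconditionally.
def increase_db_pool_size(world)
  return unless increase_db_pool_size?
  # ... grow the ActiveRecord connection pool for this process ...
end
```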
Thanks @adamruzicka! 👍
And thanks @ehelms and @ekohl for driving the discussion, and @ehelms and @jlsherrill for the reviews ❤️
Historically dynflow distinguished between executors (those who perform work)
and clients (those who produce work). Executors increased their db pool size so
that every thread of the executor could get its own connection. Clients didn't
do this, because they didn't really spawn their own thread so whatever
configuration was set was considered correct.
When we introduced Sidekiq we ended up in a slightly odd situation where only
orchestrators were considered executors and other Sidekiq workers were treated
as clients. This also meant the clients didn't increase their db pool size and,
if pressed heavily enough, they could deplete the pool. This change makes the
non-orchestrator workers increase the size of their db pools as well.
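A minimal sketch of what increasing the pool can amount to in an ActiveRecord app (an assumed approach for illustration, not necessarily dynflow's exact code; `required_pool_size` is a hypothetical value such as sidekiq concurrency plus a buffer):

```ruby
# Re-establish the AR connection with a larger pool so every worker thread,
# plus a small buffer, can check out its own connection.
config = ActiveRecord::Base.connection_db_config.configuration_hash.dup  # Rails 6.1+ API
config[:pool] = required_pool_size  # hypothetical: e.g. sidekiq concurrency + 5
ActiveRecord::Base.establish_connection(config)
```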
To verify the pool size is increased, have an action print
ActiveRecord::Base.connection_pool.size from its #run. Alternatively, with this
patch [1] I was able to drain all the available connections from the pool.

[1] - 0a0e566
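For example, a throwaway action along these lines could do the printing (the class name is made up; only ActiveRecord::Base.connection_pool.size comes from the description above):

```ruby
# Hypothetical one-off action to check the pool size from inside a worker.
class PoolSizeCheck < Dynflow::Action
  def run
    output[:ar_pool_size] = ActiveRecord::Base.connection_pool.size
    puts "AR pool size: #{output[:ar_pool_size]}"
  end
end
```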