Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure sidekiq workers increase their db pool size #362

Merged
merged 2 commits into from
Aug 28, 2020

Conversation

adamruzicka
Copy link
Contributor

@adamruzicka adamruzicka commented Aug 4, 2020

Historically dynflow distinguished between executors (those who perform work)
and clients (those who produce work). Executors increased their db pool size so
that every thread of the executor could get its own connection. Clients didn't
do this, because they didn't really spawn their own thread so whatever
configuration was set was considered correct.

When we introduced Sidekiq we ended up in a slightly odd situation where only
orchestrators were considered executors and other Sidekiq workers were treated
as clients. This also meant, the clients didn't increase their db pool size and
if pressed heavily enough, they could deplete the pool. This change makes other
non-orchestrator workers increase size of their db pools.

To verify pool size is increased, have an action print ActiveRecord::Base.connection_pool.size from its #run. Alternatively with this patch[1] I was drain all the available connections from the pool.

[1] - 0a0e566

Copy link
Contributor

@jlsherrill jlsherrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to fully reproduce the problem, but the logic makes sense to me

@@ -96,7 +96,11 @@ def rake_task_with_executor?
end

def increase_db_pool_size?
!::Rails.env.test? && !remote?
!::Rails.env.test? && !remote? || sidekiq_worker?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add parens to make the order clearer? like i assume its supposed to be:

!::Rails.env.test? && (!remote? || sidekiq_worker?)

Copy link
Contributor Author

@adamruzicka adamruzicka Aug 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point, done

Historically dynflow distinguished between executors (those who perform work)
and clients (those who produce work). Executors increased their db pool size so
that every thread of the executor could get its own connection. Clients didn't
do this, because they didn't really spawn their own thread so whatever
configuration was set was considered correct.

When we introduced Sidekiq we ended up in a slightly odd situation where only
orchestrators were considered executors and other Sidekiq workers were treated
as clients. This also meant, the clients didn't increase their db pool size and
if pressed heavily enough, they could deplete the pool. This change makes other
non-orchestrator workers increase size of their db pools.
@ehelms
Copy link
Contributor

ehelms commented Aug 18, 2020

I think I brought this up in IRC and not here, but some consideration thinking in this design.

For Rails based projects, this can have potentially unintended consequences if not accounted for with respect to tuning. If we take Foreman as the use case. We deploy the main Foreman instance as a Puma Rails application with 2 workers by default. Each worker gets a full set of DB pools based on the database.yml configuration and is driven by the number of threads per worker. So, out of the box Foreman itself needs 2*16 (32) PostgreSQL connections. I believe we set this to 5 currently (which implies we can overcommit there as well).

For the baseline, Dynflow + Sidekiq deployment we have an executor, 2 workers and a hosts queue worker each with a DB pool size of 5. That implies 4*5 (20) needed connections. If we add a 5 connection buffer for Dynflow that goes up to 4*(5 + 5) (40) connections we need to be accounting for. Out of the box, PostgreSQL is tuned for 100 open connections. For Katello, which adds Candlepin and Pulp 3, more connections come about.

All that is to say, if having a buffer per worker is needed to account for things like threads inside a Dynflow worker being created bay application code, or additional Database reads happening outside the main transaction I would request that this be a value that is based off configuration rather than a hidden value. This would allow both tuning of the buffer and automated calculation of PostgreSQL connection size and DB pool size in configuration code rather than having to know to add 5 in all the right places.

Thoughts?

@ekohl @evgeni you may find this change of interest

@adamruzicka
Copy link
Contributor Author

adamruzicka commented Aug 18, 2020

For the baseline, Dynflow + Sidekiq deployment we have an executor, 2 workers and a hosts queue worker

Don't we deploy "just" orchestrator, worker and hosts queue worker?

All that is to say, if having a buffer per worker is needed to account for things like threads inside a Dynflow worker being created bay application code, or additional Database reads happening outside the main transaction

If I'm reading the docs right, Sequel uses a pool of 4 connections by default. That means we should need up to $concurrency + 4 connections in total, just to shed some light onto where the additional connections come from.

Eh, this is bogus, Sequel's connections should be outside of the AR connection pool in question here so it doesn't really explain where those +5 AR connections come from, it is even possible that we only need those additional connections in the orchestrator instance, but not in the workers. Anyway it would still be good to bear in mind that Sequel has its own connection pool with some additional connections.

I would request that this be a value that is based off configuration rather than a hidden value

If we know how many additional connections we need and we know that by tuning it up we won't gain anything and that by tuning it down we would cause issues, why have it configurable?

@adamruzicka
Copy link
Contributor Author

As an interesting side-effect of this, Sequel's connection pool gets increased to $concurrency + 5, which arguably is the right thing to do, but leaves us with grand total of ($concurrency + 5) * 2 connections per worker.

@ekohl
Copy link
Contributor

ekohl commented Aug 19, 2020

Eh, this is bogus, Sequel's connections should be outside of the AR connection pool in question here so it doesn't really explain where those +5 AR connections come from, it is even possible that we only need those additional connections in the orchestrator instance, but not in the workers. Anyway it would still be good to bear in mind that Sequel has its own connection pool with some additional connections.

I suspect that loading Rails code somewhere creates AR connections. You can try to set Foreman's concurrency to 4 instead of 5 to see if it becomes +4 AR connections.

@adamruzicka
Copy link
Contributor Author

I suspect that loading Rails code somewhere creates AR connections.

This is possible, but probably not really relevant to us. When the dynflow world is ready, there doesn't seem to be any active AR connection.

You can try to set Foreman's concurrency to 4 instead of 5 to see if it becomes +4 AR connections.

The +5 is a constant given by Dynflow, so fiddling with settings elsewhere won't change it. I suspect it was just a buffer for when active record was used outside of the worker threads in the old model (for example for delayed plan deserialization or running execution plan hooks). If we had a simple reproducer then we could just try lowering it until it would break. Sadly the only reproducer I'm aware of has too many moving parts (syncing >=20 repos at once with katello)

@ehelms
Copy link
Contributor

ehelms commented Aug 19, 2020

As an interesting side-effect of this, Sequel's connection pool gets increased to $concurrency + 5, which arguably is the right thing to do, but leaves us with grand total of ($concurrency + 5) * 2 connections per worker.

So we end up with 20 connections per worker? Ouch.

And to educate me, Sequel is the library underlying Dynflow's database connections and lives outside of Rails database context? Would you be able to summarize the breakdown of connections that get created via Sequel and those via Rails? (i.e. orchestrator connections are via X, worker connections via Y, tasks spawn Sequel connections or Rails connections?)

@adamruzicka
Copy link
Contributor Author

Sequel is the library underlying Dynflow's database connections and lives outside of Rails database context?

Correct. When running with rails we piggyback on the db config passed to rails and set up sequel based on that.

Would you be able to summarize the breakdown of connections that get created via Sequel and those via Rails?

Dynflow on its own uses only Sequel. No matter if the process is an orchestrator or a worker, it will have at least one Sequel connection to register itself as a world in dynflow_coordinator_records table. In general any application code (anything defined outside dynflow/dynflow) uses AR, or to be more precise, doesn't share the Sequel connections we already have.

Worker: Every time a step is executed in the worker, we try to flip its state in the db, so we need a connection. I don't think there's anything else related to Sequel going on in the worker, so we're at $concurrency + 1. At the same time, workers are the part which runs all the user code, so here we need at least $concurrency active record connections.

Orchestrator: Dynflow uses the actor model from concurrent-ruby heavily, if I'm reading things right the actors are backed by a cached thread pool, which is by default unbounded. The orchestrator should currently have 5 actors which can access the db. In addition to this the orchestrator can also run user code (execution plan hooks) and it updates Tasks, so it should need 2 AR connections

Tasks: This uses both as it serves as a bridge between dynflow and the rest, so up to 1 + 1 per thread.

@ehelms
Copy link
Contributor

ehelms commented Aug 20, 2020

Worker: Every time a step is executed in the worker, we try to flip its state in the db, so we need a connection. I don't think there's anything else related to Sequel going on in the worker, so we're at $concurrency + 1. At the same time, workers are the part which runs all the user code, so here we need at least $concurrency active record connections.

The state flipping that happens by the worker, is that happening at the overall worker level or per task running on the worker? (I am assuming that given you said $concurrency + 1 its at the worker level, but wanted to double check). For example, if 5 tasks are running at the same time on a single worker, is there still only a single update happening at a time?

Orchestrator: Dynflow uses the actor model from concurrent-ruby heavily, if I'm reading things right the actors are backed by a cached thread pool, which is by default unbounded. The orchestrator should currently have 5 actors which can access the db. In addition to this the orchestrator can also run user code (execution plan hooks) and it updates Tasks, so it should need 2 AR connections

Does that mean that each of the 5 actors has an unbounded thread pool and each thread could connect to AR?
Does each actor get it's own database configuration or is it shared?
Does running user code (execution plan hooks) and updating tasks happens outside of the actors? (hence the 2 AR connections)

Tasks: This uses both as it serves as a bridge between dynflow and the rest, so up to 1 + 1 per thread.

Per thread here, is that threads within each unit of concurrency or the concurrency level itself? And is this per worker? Or part of the orchestrator? Or part of the Rails stack?

@adamruzicka
Copy link
Contributor Author

It feels like we are starting to mix names here, since all the projects involved use the same names for different things so I'll start using more specific terms.

The state flipping that happens by the worker, is that happening at the overall worker level or per task running on the worker?

State flipping happens on a worker thread level (concurrency settings for sidekiq == number of worker threads). So if 5 steps (the atomic unit of work in dynflow) are running in a single worker process which has 5 worker threads, then states of all 5 steps could be flipped in parallel, requiring 5 sequel connections. At the same time, each of those steps could touch AR stuff, requiring 5 AR connections in total. Now the worker process might be doing some dynflow stuff outside of the worker threads, requiring at least 1 additional sequel connection, giving us the grand total of 5 + 5 + 1 connections.

Does that mean that each of the 5 actors has an unbounded thread pool and each thread could connect to AR?

There is one shared, but unbounded, thread pool in the worker process, the actors run in threads borrowed from that pool. However because each actor can run only once and there's a set number of actors, there would be at most 5 AR connections.

Does each actor get it's own database configuration or is it shared?

It is shared among all actors of a single process.

Does running user code (execution plan hooks) and updating tasks happens outside of the actors? (hence the 2 AR connections)

I think it still happens inside actors, but it goes over AR. To be honest I'm not fully sure if this would be 1 AR connection or more, these things kinda happen all over the place and it is hard to track what's happening where

@ekohl
Copy link
Contributor

ekohl commented Aug 20, 2020

Just thinking out loud: can there be a situation where the connection isn't released back to the pool? Is there instrumentation to see these pool sizes?

@adamruzicka
Copy link
Contributor Author

can there be a situation where the connection isn't released back to the pool?

We've had those in the past. There shouldn't be any such situation, but we can't rule it out with 100% certainty.

Is there instrumentation to see these pool sizes?

It depends how you define instrumentation. If poking at internal state of objects from someone else's library counts then yes

@ekohl
Copy link
Contributor

ekohl commented Aug 20, 2020

Ideally you'd be able to graph it over time. In my experience that's often very useful.

@adamruzicka
Copy link
Contributor Author

Ideally you'd be able to graph it over time. In my experience that's often very useful.

That would be nice to have, but also would require code we currently don't have.

To get this back on track, we know that we need at least $concurrency + c connections for AR and $concurrency + c for Sequel, where the additional constant c is arguable. Currently, we are using whatever is set in /etc/foreman/database.yml, which is 5 by default. Even if the connections were being checked in and out correctly, this wouldn't be enough. Could we get this in, so we have $concurrency + 5 connections available for each of the db libraries and try to figure out if connections are being checked in and out correctly and if we could lower c in parallel? Since the connections are not actually opened until they are needed, it feels that in this case overcommiting would be safer than undercommiting

@ehelms
Copy link
Contributor

ehelms commented Aug 20, 2020

I appreciate your patience on this @adamruzicka . One thing I was aiming to do was to try to ensure we ended up with the right math for tuning an installation from a db pool size and available postgresql connections point of view.

With this change, and current code, is the following math true?

Max # of postgresql connections = (workers.reduce { |worker| worker['concurrency'] * 2 +1 }) + (orchestrator['concurrency'] + 1)

For a standard Foreman installation:

13 = (5*2 + 1) + (1 + 1)

For a standard Foreman + Katello:

24 = ((5*2 + 1) * 2) + (1 + 1)

Copy link
Contributor

@ehelms ehelms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for detailed explanation. I would just like to ensure the math of connections but am good with this change.

@adamruzicka
Copy link
Contributor Author

@ehelms the orchestrator is a bit special, we set concurrency to 1 for it. If it was higher, it could have undesirable side effects. At the same time, there are more actors running inside it than inside the regular workers and probably just a single connection to the db should be used outside of any actor, so the formula is a bit different. I'd say something along the lines of $number_of_actors + $number_of_actors_doing_activerecord_stuff + c, with numbers it would give us 5 + 2 + 1.

Copy link
Contributor

@ezr-ondrej ezr-ondrej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit. Apart of it 👍

@@ -38,8 +38,8 @@ def initialize!
init_world.tap do |world|
@world = world
config.run_on_init_hooks(false, world)
config.increase_db_pool_size(world) if config.increase_db_pool_size?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is checked inside of the method already.

Suggested change
config.increase_db_pool_size(world) if config.increase_db_pool_size?
config.increase_db_pool_size(world)

Copy link
Contributor

@ezr-ondrej ezr-ondrej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adamruzicka ! 👍

@ezr-ondrej
Copy link
Contributor

And thanks @ehelms and @ekohl for the discussion drive and @ehelms and @jlsherrill for the reviews ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants