Make sure sidekiq workers increase their db pool size #362
Conversation
I wasn't able to fully reproduce the problem, but the logic makes sense to me
lib/dynflow/rails/configuration.rb
Outdated
@@ -96,7 +96,11 @@ def rake_task_with_executor?
      end

      def increase_db_pool_size?
-       !::Rails.env.test? && !remote?
+       !::Rails.env.test? && !remote? || sidekiq_worker?
Should we add parens to make the order clearer? Like, I assume it's supposed to be:
!::Rails.env.test? && (!remote? || sidekiq_worker?)
Ah, good point, done
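For context, Ruby's `&&` binds more tightly than `||`, so without the parentheses the condition would bump the pool for sidekiq workers even in the test environment. A sketch of the resulting predicate with the suggested grouping (assuming the method names from the diff above):

```ruby
# Without parens the line parses as (!::Rails.env.test? && !remote?) || sidekiq_worker?
# With parens, the test-environment guard applies to both branches:
def increase_db_pool_size?
  !::Rails.env.test? && (!remote? || sidekiq_worker?)
end
```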
I think I brought this up in IRC and not here, but there is a design consideration to think through. For Rails-based projects this can have potentially unintended consequences if not accounted for with respect to tuning.

If we take Foreman as the use case: we deploy the main Foreman instance as a Puma Rails application with 2 workers by default, and each Puma worker gets its own full DB pool. For the baseline Dynflow + Sidekiq deployment we have an executor, 2 workers and a hosts queue worker, each with a DB pool size of 5. That implies 4*5 (20) needed connections. If we add a 5-connection buffer per Dynflow process, that goes up accordingly.

All that is to say: if having a buffer per worker is needed to account for things like threads inside a Dynflow worker being created by application code, or additional database reads happening outside the main transaction, I would request that this be a value that is based off configuration rather than a hidden value. This would allow both tuning of the buffer and automated calculation of the PostgreSQL connection limit and DB pool size in configuration code, rather than having to know to add 5 in all the right places. Thoughts?
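A back-of-the-envelope tally of the numbers mentioned in the comment above (the process count, pool size and buffer are the values assumed there, not configuration):

```ruby
# Illustrative connection math for the deployment sketched above.
processes = 4   # executor + 2 workers + hosts queue worker
pool_size = 5   # AR pool size per process
buffer    = 5   # extra connections this change would add per process

processes * pool_size             # => 20 connections without the buffer
processes * (pool_size + buffer)  # => 40 connections with a per-process buffer
```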
Don't we deploy "just" orchestrator, worker and hosts queue worker?
Eh, this is bogus. Sequel's connections should be outside of the AR connection pool in question here, so it doesn't really explain where those +5 AR connections come from; it is even possible that we only need those additional connections in the orchestrator instance, but not in the workers. Anyway, it would still be good to bear in mind that Sequel has its own connection pool with some additional connections.
If we know how many additional connections we need and we know that by tuning it up we won't gain anything and that by tuning it down we would cause issues, why have it configurable?
As an interesting side-effect of this, Sequel's connection pool gets increased as well.
I suspect that loading Rails code somewhere creates AR connections. You can try to set Foreman's concurrency to 4 instead of 5 to see if it becomes +4 AR connections.
This is possible, but probably not really relevant to us. When the dynflow world is ready, there doesn't seem to be any active AR connection.
The +5 is a constant given by Dynflow, so fiddling with settings elsewhere won't change it. I suspect it was just a buffer for when active record was used outside of the worker threads in the old model (for example for delayed plan deserialization or running execution plan hooks). If we had a simple reproducer then we could just try lowering it until it would break. Sadly the only reproducer I'm aware of has too many moving parts (syncing >=20 repos at once with katello).
So we end up with 20 connections per worker? Ouch. And to educate me, Sequel is the library underlying Dynflow's database connections and lives outside of the Rails database context? Would you be able to summarize the breakdown of connections that get created via Sequel and those via Rails? (i.e. orchestrator connections are via X, worker connections via Y, tasks spawn Sequel connections or Rails connections?)
Correct. When running with rails we piggyback on the db config passed to rails and set up sequel based on that.
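Roughly what that piggybacking could look like (a sketch only, not the actual Dynflow initializer; `DYNFLOW_DB` and the option names are assumptions, and `connection_db_config` is the Rails 6.1+ API):

```ruby
require 'sequel'

# Reuse the database configuration Rails already loaded and open a separate
# Sequel connection, with its own pool, based on it (PostgreSQL assumed).
rails_db = ActiveRecord::Base.connection_db_config.configuration_hash

DYNFLOW_DB = Sequel.connect(
  adapter:         'postgres',
  host:            rails_db[:host],
  database:        rails_db[:database],
  user:            rails_db[:username],
  password:        rails_db[:password],
  max_connections: rails_db[:pool]
)
```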
Dynflow on its own uses only Sequel. No matter if the process is an orchestrator or a worker, it will have at least one Sequel connection to register itself as a world in the dynflow_coordinator_records table. In general any application code (anything defined outside dynflow/dynflow) uses AR, or to be more precise, doesn't share the Sequel connections we already have.

Worker: Every time a step is executed in the worker, we try to flip its state in the db, so we need a connection. I don't think there's anything else related to Sequel going on in the worker, so we're at one Sequel connection per worker thread on top of the registration one.

Orchestrator: Dynflow uses the actor model from concurrent-ruby heavily; if I'm reading things right, the actors are backed by a cached thread pool, which is by default unbounded. The orchestrator should currently have 5 actors which can access the db. In addition to this the orchestrator can also run user code (execution plan hooks) and it updates Tasks, so it should need 2 AR connections.

Tasks: This uses both, as it serves as a bridge between dynflow and the rest, so up to 1 + 1 per thread.
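A rough tally of the orchestrator numbers described above (treating the world-registration connection as separate from the actors is an assumption, not something stated in the comment):

```ruby
# Orchestrator connection tally, purely illustrative:
db_touching_actors  = 5                       # per the comment above
orchestrator_sequel = 1 + db_touching_actors  # world registration + the actors (assumed additive)
orchestrator_ar     = 2                       # execution plan hooks + task updates
```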
The state flipping that happens in the worker, is that happening at the overall worker level or per task running on the worker? (I am assuming the latter, given what you said above.)
Does that mean that each of the 5 actors has an unbounded thread pool and each thread could connect to AR?
Per thread here, is that threads within each unit of concurrency or the concurrency level itself? And is this per worker? Or part of the orchestrator? Or part of the Rails stack?
It feels like we are starting to mix names here, since all the projects involved use the same names for different things, so I'll start using more specific terms.
State flipping happens on a worker thread level (concurrency settings for sidekiq == number of worker threads). So if 5 steps (the atomic unit of work in dynflow) are running in a single worker process which has 5 worker threads, then the states of all 5 steps could be flipped in parallel, requiring 5 Sequel connections. At the same time, each of those steps could touch AR stuff, requiring 5 AR connections in total. Now the worker process might be doing some dynflow stuff outside of the worker threads, requiring at least 1 additional Sequel connection, giving us a grand total of concurrency + 1 Sequel connections and concurrency AR connections (6 and 5 in this example).
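The same worker-process totals spelled out as arithmetic (the concurrency of 5 is just the example used above):

```ruby
# Connection needs of a single worker process in the example above:
concurrency   = 5                # sidekiq concurrency == number of worker threads
sequel_needed = concurrency + 1  # one per in-flight step state flip + one outside the worker threads
ar_needed     = concurrency      # application code each running step may touch
```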
There is one shared, but unbounded, thread pool in the worker process; the actors run in threads borrowed from that pool. However, because each actor only runs one thing at a time and there's a set number of actors, there would be at most 5 AR connections.
It is shared among all actors of a single process.
I think it still happens inside actors, but it goes over AR. To be honest I'm not fully sure if this would be 1 AR connection or more; these things kinda happen all over the place and it is hard to track what's happening where.
Just thinking out loud: can there be a situation where the connection isn't released back to the pool? Is there instrumentation to see these pool sizes?
We've had those in the past. There shouldn't be any such situation, but we can't rule it out with 100% certainty.
It depends how you define instrumentation. If poking at the internal state of objects from someone else's library counts, then yes.
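A hedged example of the kind of poking that is possible from a console today (`DYNFLOW_DB` is a hypothetical name for whatever Sequel::Database instance dynflow holds; `connection_pool.stat` is ActiveRecord's public API since Rails 5.1):

```ruby
# One-off inspection of pool usage; good enough for spot checks,
# though not the over-time graphing mentioned below.
ar_pool = ActiveRecord::Base.connection_pool
p ar_pool.size   # configured pool size
p ar_pool.stat   # => {size:, connections:, busy:, dead:, idle:, waiting:, checkout_timeout:}

# Sequel's threaded pool exposes similar counters (DYNFLOW_DB is hypothetical):
p DYNFLOW_DB.pool.max_size  # configured maximum
p DYNFLOW_DB.pool.size      # connections currently instantiated
```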
Ideally you'd be able to graph it over time. In my experience that's often very useful.
That would be nice to have, but it would also require code we currently don't have. To get this back on track, we know that we need at least …
I appreciate your patience on this @adamruzicka. One thing I was aiming to do was to make sure we end up with the right math for tuning an installation, from a db pool size and available PostgreSQL connections point of view. With this change, and the current code, is the following math true?
For a standard Foreman installation:
For a standard Foreman + Katello:
Thanks for the detailed explanation. I would just like to make sure the connection math is right, but I am good with this change.
@ehelms the orchestrator is a bit special, we set concurrency to 1 for it. If it were higher, it could have undesirable side effects. At the same time, there are more actors running inside it than inside the regular workers, and probably just a single connection to the db should be used outside of any actor, so the formula is a bit different. I'd say something along the lines of …
One nit. Apart from that 👍
lib/dynflow/rails.rb
Outdated
@@ -38,8 +38,8 @@ def initialize!
      init_world.tap do |world|
        @world = world
        config.run_on_init_hooks(false, world)
        config.increase_db_pool_size(world) if config.increase_db_pool_size?
This is checked inside of the method already.
Suggested change:
-      config.increase_db_pool_size(world) if config.increase_db_pool_size?
+      config.increase_db_pool_size(world)
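A sketch of the pattern the nit points at (not the literal upstream code): when the method guards itself, the caller-side check is redundant.

```ruby
# Hypothetical shape of the configuration method: the guard lives inside,
# so callers can simply invoke it unconditionally.
def increase_db_pool_size(world)
  return unless increase_db_pool_size?
  # ... grow the ActiveRecord connection pool for this process ...
end
```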
Thanks @adamruzicka! 👍
And thanks @ehelms and @ekohl for driving the discussion, and @ehelms and @jlsherrill for the reviews ❤️
Historically dynflow distinguished between executors (those who perform work)
and clients (those who produce work). Executors increased their db pool size so
that every thread of the executor could get its own connection. Clients didn't
do this, because they didn't really spawn their own thread so whatever
configuration was set was considered correct.
When we introduced Sidekiq we ended up in a slightly odd situation where only
orchestrators were considered executors and other Sidekiq workers were treated
as clients. This also meant the clients didn't increase their db pool size and,
if pressed heavily enough, they could deplete the pool. This change makes the
non-orchestrator workers increase the size of their db pools as well.
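A minimal sketch of what increasing the pool can amount to in an ActiveRecord app (an assumed approach for illustration, not necessarily dynflow's exact code; `required_pool_size` is a hypothetical value such as sidekiq concurrency plus a buffer):

```ruby
# Re-establish the AR connection with a larger pool so every worker thread,
# plus a small buffer, can check out its own connection.
config = ActiveRecord::Base.connection_db_config.configuration_hash.dup  # Rails 6.1+ API
config[:pool] = required_pool_size  # hypothetical: e.g. sidekiq concurrency + 5
ActiveRecord::Base.establish_connection(config)
```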
To verify the pool size is increased, have an action print
ActiveRecord::Base.connection_pool.size from its #run. Alternatively, with this
patch [1] I was able to drain all the available connections from the pool.

[1] - 0a0e566
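For example, a throwaway action along these lines could do the printing (the class name is made up; only ActiveRecord::Base.connection_pool.size comes from the description above):

```ruby
# Hypothetical one-off action to check the pool size from inside a worker.
class PoolSizeCheck < Dynflow::Action
  def run
    output[:ar_pool_size] = ActiveRecord::Base.connection_pool.size
    puts "AR pool size: #{output[:ar_pool_size]}"
  end
end
```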