-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
makara connection pool management is not thread-safe #151
Comments
Thanks for taking the time to write this up @aks I don't suppose you want to give it a shot? 😄 |
Well, maybe. I've already forked the makara repo, and have added a hijack on the However, that change was insufficient. There are still AR timeouts occurring in our sidekiq processes, which are thread-full, and the stacktraces on those timeouts do not include makara code anywhere. Whether or not the timeouts and lack-of-thread-awareness in makara are related is still TBD. So, there appear to be two issues:
I've reviewed the ActiveRecord 4.2.6 and 5 code connection adapters and connection handling, to get an idea of how hard it would be to update/rewrite makara to become an alternative connection handler for AR. In the current AR 4.2.6+ or 5 code, each model has a connection handler, which is supposed to determine the appropriate connection using a connection spec. Once a connection is selected, it is cached. In order to support dynamic query balancing, each model would have to use a dynamic connection handler, which would choose a connection on each query, instead of using a static, cached connection. This appears to require a change to AR itself. Would you happen to be aware anyone seriously looking into these AR issues? Are you or anyone on your team doing so? The alternative to using an AR "insert" like makara is a proxy service, like Since we are already using As you might guess, right now I'm looking for the Path of Least Effort, much as the cat in Heinlein's book was always looking for the "Door Into Summer". I'm really hopeful that you might have a good answer to help us find that door. |
I kind of like how switch_point does it https://github.com/eagletmt/switch_point/blob/master/lib/switch_point/proxy.rb#L22-L33 by making an active record subclass just for getting access to a connection_pool, similar to https://github.com/customink/secondbase/blob/master/lib/second_base/base.rb or https://github.com/instructure/shackles/blob/master/lib/shackles/connection_handler.rb kind of extends the connection handler |
The |
@aks how have you guys mitigated this? We're evaluating makara with https://github.com/ankane/distribute_reads + sidekiq, and in our staging environment seeing that no Postgres data is being collected. My hunch is that this is related to makara's lack of thread safety. This seems like a deal breaker for anyone using makara, right? EDIT: I think this is just a reporting issue, what used to show up as |
@aks This prevented us from adopting Makara. Is there any movement on this? It would be great to see this fixed! |
@jeremy with your knowledge of the AR internals, any thoughts on this? |
@aks @jwg2s @bleonard We are evaluating using makara for read-write splitting and came across this thread. Do you have updates/suggestions/mitigation options for this issue? I don't see other well-maintained gems for real-write splitting in production, anyone has other suggestions that worked well in production environment? FYI We use AWS Aurora MySQL as our backing store and our application is hosted on Heroku. |
Check out the fresh_connection gem, or the pg-pool-II proxy.
…________________________________
From: Rajagopal <notifications@github.com>
Sent: Wednesday, August 1, 2018 5:36 PM
To: taskrabbit/makara
Cc: Alan Stebbens; Mention
Subject: Re: [taskrabbit/makara] makara connection pool management is not thread-safe (#151)
@aks<https://github.com/aks> @jwg2s<https://github.com/jwg2s> @bleonard<https://github.com/bleonard> We are evaluating using makara for read-write splitting and came across this thread. Do you have updates/suggestions/mitigation options for this issue?
I don't see other well-maintained gems for real-write splitting in production, anyone has other suggestions that worked well in production environment?
FYI We use AWS Aurora MySQL as our backing store and our application is hosted on Heroku.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#151 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AAAPlqgGkzAjoR5m8KSvKljms4mfxvMeks5uMkmAgaJpZM4MdAFx>.
|
@rajagopals recently went through the same thing, needed to migrate off of pgpool/pgbouncer and the thread-safety issue ultimately prevented us from adopting Makara. We wound up using active_record_slave on top of an Aurora Postgres cluster. No issues yet in a multi-threaded production environment, handling ~40k queries per minute. YMMV |
Hey @aks, can you explain more about how the issue manifests itself, and how to reproduce it? |
Any updates here? |
@patrykk21 Are you investigating it? Please do share your findings! |
@jeremy I tried replicating the issue with both gems on our project locally without success. I'm generating a new clean Rails 5.1 app to have more control over the environment and trying to break it. Do you have any insight or any specific tests that would break it? |
@patrykk21 How'd you try to replicate it? |
@jeremy Well, using our app with multiple simultaneous requests through sidekiq and some manual tests through threads, expecting to have some timeout errors but it actually never occurred. |
@patrykk21 Hmm, that would be a matter of getting lucky to see an issue. We'd need to examine the code to see how to get concurrent threads into a state that causes specific problems. Considering you're using Sidekiq and there are other libraries that do suggest they're thread-safe (possibly by leaning more heavily on Active Record's connection handlers), I'd go with an alternative for now. |
Has anyone had any luck in reproducing the issues seen? We are also running with a similar configuration (puma / sidekiq). It would be nice to use this, but this issue concerns us. I looked at the alternatives and they do not seem to have a feature I like (failing over to master on high replica lag). If anyone has reproduced it we'd be willing to help resolve the issue. |
Consider that makara is using Thread.current, and that no Sidekiq middleware is in place to inject the correct context, this may be a thread reuse problem. See the RequestStore project (and the corresponding sidekiq solution). |
We were not seeing it, just wary of moving. I plan on setting up some tests to validate that your change fixes it. This would be great if it does because it means we can move forward :). |
Reading through @aks description of the problem again, I believe my fix doesn't address this. It seems the root that issue is not related to thread reuse, but rather the pooling mechanism itself. My fix is solving a different problem. |
Oops, yup you are correct. Well, I guess I'll take a stab at reproducing the pooling issue. |
Is the basic write-up of the problem accurate? This issue generates concern about Makara thread-safety but hasn't reach any clear answer in 2 years.
I believe Makara's @slave and @master pools are private to a single Makara Adapter; aka Active Record connection. This comes from inheriting from Makara::Proxy. As long as Active Record is limiting the connection to a single thread at a time - the Makara data structures inside that connection (including the actual underlying connections provided by real connection adapters) are never access concurrently. Have I misunderstood? Does Makara actually share underlying pools across different MakaraAdapters and is therefore subject to multi-threading concerns for it's internal data structures? |
This is the first part of the issue that @aks is describing. ActiveRecord 3 does not limit access or changes to the pool - or any individual connection in the pool - to a single thread. |
Active Record (versions 4, 5, 6) does limit access to a connection to one thread as long as ActiveRecord connection access patterns are honored by applications (that's critical for safely using most database connection implementations). @aks mentioned this in his initial report. See here for the model - note this pool has nothing to do with the term "pool" used inside Makara (which is an unfortunate naming collision): Given single-thread access to a MakaraAdapter at a time; I don't see any concern that it's internal data structures will be access concurrently. Can you elaborate on where you think concurrent use is possible assuming application threads that correctly follow ActiveRecord access patterns? |
@jeffdoering Firstly, I didn't realize AR 3 was off the table for this discussion :) My use case happens to be with a legacy app that is still on AR 3. Secondly, the problem is with the Makara pool itself (as you noted, this is distinct from AR connection pool). Unlike AR 4+, the Makara pool is managed as a singleton Array that is not thread safe. Individual connections provided by the AR pool (in a thread-safe manner), are then cached and held in the singleton Makara pool in an un-thread-safe manner. From an architectural standpoint, Makara doesn't inject itself downstream of the AR pool, rather it sits side-by-side and delegates calls to it as it sees fit. |
If this all applies equally to AR 3; great. @aks suggested in the original post that AR 3 itself might not be thread-safe in which case Makara thread-safety would be a side-note. In terms of Makara architecture:
As long as Active Record and the consuming application use the ActiveRecord::ConnectionPool correctly; only one thread should interact with an instance of MakaraAbstractAdapter at a time and the internal data structures of the MakaraAbstractAdapter (including it's Makara::Pool objects) do not need to be thread-safe. However; I wouldn't be following this issue or the Makara code in such detail if everything worked fine. We are seeing some problematic behavior with Makara that I believe is related to its interaction with the ActiveRecord::ConnectionPool. It's not specifically a thread-safety issue but rather a problem where Makara "corrupts" the state of the ActiveRecord::ConnectionPool such that threads cannot safely acquire connections from the pool. I'll file a detailed issue shortly. |
@jeffdoering -- Have you filed a detailed issue yet? We're seeing issues with our ConnectionPool becoming corrupted during database failovers in sidekiq only with makara and we are unable to recover without restarting the servers.
I saw you had a patch into rails core... Does it resolve the issue? |
I had not detailed the issue (oops); but now I have: Unfortunately my patch to rails core was a much narrower issue that I came across as a side-effect of this deep-dive. I have to eat crow some on the thread above. Makara's Proxy does have a thread-safety issue but the primary issue with failing to fully control real connector interactions with the ActiveRecord ConnectionPool can break things even without threading concerns. See the new issue for full details. In terms of this issue; the thread safety should be fixed but without a comprehensive solution to the encapsulation problem the pattern in general is unsafe. It's pretty easy to call legal methods on connections and connection pools and break the pool when Makara is used (by throwing errors and/or leaking underlying real connections into the pool). |
@jeremy @aks @bleonard @sax I want to chime in here and say that I am very surprised by this entire thread. When I was at Wanelo, we ended up adopting Makara specifically because it was the only read/write splitting gem that appeared thread safe. I am CC-ing @sax here so that if I forget any issues we had — Eric, please chime in. But I want to testify that I have successfully used Makara (with PostgreSQL adapter) on two very high traffic Rails applications that were both running sidekiq and puma (in a multi-threaded configuration). • One site used 5 threads in Puma, 20 in Sidekiq, and reached 200K RPMs traffic to Rails app TL;DRNeither of the high-traffic sites I mentioned had any issues with multi-threading, and both sites very actively used Makara's automatic recovery and blacklisting and connection pooling features. Just my 2c. |
Why do you think that When you use multiple threads you should create a much larger pool than the default. Rails defaults to pool size of 10 and I believe assumes a single thread. So create a pool of 30 if you use three threads and see how things go. If you do this, beware, especially with PostgreSQL,. about too many connections hitting your server. The solution here is Once again, I suspect Makara has nothing to do with this error. |
I don't think it's thread safe with MySQL, we are seeing an issue where the current_user is changing mid session. Hasn't happened since we removed makara. |
The connection pool management is not thread-safe. So, when running makara in thread-intensive code, such as sidekiq, subtle errors creep in.
The fix is not entirely straightforward: AR 4 and 5 has reorganized the connection pool management to be thread-safe, with lots of mutex usage. Makara has none of that, and it's connection pool management is not thread-safe.
In fact, makara uses an array of connections for each pool, which is traversed each time a connection decision is being made. Connections are added up to the maximum connection, but there are no protections against simultaneous access and changes across threads.
In addition, in the master and replica (slave) connection pools, there are connections with different states in each array: blacklisted or not.
We could protect the connection pool array with a semaphore, but traversing an array behind a semaphore is a Bad Idea because it blocks all other threads from traversing that same array. If the semaphore supported read-locking in addition to write-locking then multi-thread traversals would work in parallel, except for adding or removing connections.
However, IMHO, it would be better to have multiple arrays of connection pools, with each kind of connection being queued into a separate pool (array) of connections. So, blacklisted connections would be maintained in a blacklisted connection pool, and the remaining set of available connections would be in the available connection pool. Then, each pool could be managed as a thread-safe
queue
object.Makara should be updated to make best use of the latest AR connection pool code, becoming thread-safe in the process.
The text was updated successfully, but these errors were encountered: