
StripeClient: CLOSE_WAIT possibly causing servers crash after a few hours #850

Closed

fabiolnm opened this issue Sep 19, 2019 · 9 comments

@fabiolnm

Software Versions

  • Ruby: 2.6.4
  • stripe-ruby gem: 5.1.1

Original problems

  • Too many open files @ rb_sysopen - /proc/version
  • failed to get urandom

Investigation

sudo lsof shows hundreds of lines like:

server_name>api-34-202-153-183.stripe.com:https (CLOSE_WAIT)
sudo lsof | grep CLOSE_WAIT | grep stripe | wc -l
10468
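
To see whether that count keeps growing, it helps to sample it periodically. A minimal Ruby sketch reusing the same pipeline (run with enough privileges to see the target process's descriptors):

```ruby
# Sample the number of Stripe sockets stuck in CLOSE_WAIT once a minute.
loop do
  count = `lsof | grep CLOSE_WAIT | grep stripe | wc -l`.strip
  puts "#{Time.now}: #{count} Stripe sockets in CLOSE_WAIT"
  sleep 60
end
```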

Reproducing CLOSE_WAIT on localhost

rails c

Stripe::Charge.retrieve charge_id, stripe_account: connected_account_id

Wait ~1 minute and run

lsof | grep CLOSE_WAIT

ruby       6216 fabio   17u     IPv4 0x4b0d201b3705f30f        0t0        TCP 192.168.0.12:51558->api-34-200-27-109.stripe.com:https (CLOSE_WAIT)

Process 6216 is the Rails console:

fabio$ ps -ef | grep ruby
  501  6216  6782   0 10:58am ttys000    0:08.82 /Users/fabio/.rbenv/versions/2.6.4/bin/ruby bin/rails c

lsof keeps listing it until the Rails console is closed.


@rebelvc

rebelvc commented Sep 19, 2019

To add more context: the crash is happening on our Sidekiq servers. We use Sidekiq to

  1. retrieve payment processing fees, and
  2. update the metadata of Stripe payments.

Sometimes we process a large volume of these background calls when we run recurring donations.

@ob-stripe
Contributor

Thanks for the report. We were able to reproduce the issue and are looking into the best way to fix this. We'll post an update here as soon as possible.

@brandur-stripe
Contributor

Sorry about the trouble here @fabiolnm @rebelvc. Could you comment on what your concurrency setup looks like? Are you using multiple processes and/or multiple threads and/or multiple fibers?

We have some likely ideas about where the problem may be, but are trying to narrow it down.

@fabiolnm
Author

fabiolnm commented Sep 19, 2019

We usually have one Sidekiq instance with SIDEKIQ_CONCURRENCY set to 20 (threads).

Since yesterday, to mitigate the problem, we have been running 2 Sidekiq instances (on different servers), each with SIDEKIQ_CONCURRENCY set to 10.

@brandur-stripe
Contributor

Thanks! Just wanted to let you know that I'm working on a fix for this.

I think my original design assumption was correct in that a connection left in CLOSE_WAIT is reusable: if I start 10 threads and have them make requests with more than 60 seconds between them (so that each connection enters CLOSE_WAIT), every subsequent request correctly reuses the connection that was originally created for its thread. The number of connections stays stable no matter how long I run the program.
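
(An illustrative sketch of that experiment, not the exact test program; Stripe::Balance.retrieve stands in for any API call and an API key is assumed to be configured.)

```ruby
require "stripe"

Stripe.api_key = ENV.fetch("STRIPE_API_KEY")

# 10 long-lived threads, each pausing past the server's ~60s idle timeout
# between requests so its connection drops into CLOSE_WAIT.
threads = 10.times.map do
  Thread.new do
    5.times do
      Stripe::Balance.retrieve rescue nil
      sleep 70
    end
  end
end
threads.each(&:join)
# Each thread keeps its connection manager, so subsequent requests
# re-establish and reuse the same slot; the socket count stays stable.
```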

However, one pattern I now realize I'm not handling is threads being used ephemerally, i.e. where a new thread is spun up for every task, or threads are occasionally rotated through as old ones die; connections on no-longer-used threads will never be reclaimed.
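
(A minimal sketch of that ephemeral-thread pattern; the charge ID is hypothetical.)

```ruby
require "stripe"

Stripe.api_key = ENV.fetch("STRIPE_API_KEY")

100.times do
  # A brand-new thread per task: the connection manager is keyed to the
  # thread, and nothing reclaims it after the thread dies.
  Thread.new do
    Stripe::Charge.retrieve("ch_123") # hypothetical ID
  rescue Stripe::StripeError
    # A failed request is fine here; a connection was still opened.
  end.join
end
# Once the server times out the idle connections (~a minute), each dead
# thread's socket shows up in lsof as CLOSE_WAIT and is never cleaned up.
```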

I think the best thing to do is to ensure that there's a cleanup mechanism active in the code that GCs connection managers that haven't seen use in some time. Most of the infrastructure for this change is already in place, so it should be relatively quick to write.
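
Roughly along these lines (an illustrative sketch only, not the actual patch; `last_used`, `clear`, and the thread-context structure are assumed names):

```ruby
# Prune connection managers that haven't been used within a threshold,
# closing whatever sockets they still hold.
CONNECTION_MANAGER_GC_EXPIRY = 300 # seconds; the patch below starts with 300

def garbage_collect_connection_managers(thread_contexts)
  now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  thread_contexts.each do |context|
    manager = context.default_connection_manager
    next if manager.nil?
    next if now - manager.last_used < CONNECTION_MANAGER_GC_EXPIRY

    manager.clear # close the manager's remaining connections
    context.default_connection_manager = nil
  end
end
```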

brandur-stripe pushed a commit that referenced this issue Sep 19, 2019
Introduces a system for garbage collecting connection managers in an
attempt to solve #850.

Previously, the number of connection managers (and by extension the
number of connections that they were holding) would stay stable if a
program used a stable number of threads. However, if threads were used
disposably, the number of active connection managers out there could
continue to grow unchecked, and each of those could be holding one or
more dead connections which are no longer open, but still holding a file
descriptor waiting to be unlinked and disposed of by Ruby's GC.

This PR introduces a connection manager garbage collector that runs
periodically whenever a new connection manager is created. Connection
managers get a timestamp to indicate when they were last used, and the
GC runs through each one and prunes any that haven't seen use within a
certain threshold (currently, 300 seconds). This should have the effect
of removing connection managers as they're not needed anymore, and thus
resolving the socket leakage seen in #850.

I had to make a couple of implementation tweaks to get this working
correctly. Namely:

* The `StripeClient` class now tracks thread contexts instead of
  connection managers. This is so that when we're disposing of a
  connection manager, we can set `default_connection_manager` on its
  parent thread context to `nil` so that it's not still tracking a
  connection manager that we're trying to get rid of.

* `StripeClient` instances can still be instantiated as before, but no
  longer internalize a reference to their own connection manager,
  instead falling back to the one in the current thread context. The
  rationale is that when trying to dispose of a connection manager, we'd
  also have to dispose of its reference in any outstanding
  `StripeClient` instances that might still be tracking it, and that
  starts to get a little unwieldy. I've left `#connection_manager` in
  place for backwards compatibility, but marked it as deprecated.
@brandur-stripe
Contributor

Alright, I believe I have a working patch in #851.

brandur-stripe pushed further commits that referenced this issue on Sep 19 and Sep 20, 2019, repeating the commit message above; the later pushes lower the GC threshold from 300 seconds to 120 seconds.
@brandur-stripe
Contributor

@fabiolnm @rebelvc I just released what I think is a fix in 5.2.0.

I wrote a program that spins up and then churns through many threads over an extended period of time, and was able to verify that the number of connection managers (and, by extension, connections) managed by StripeClient stayed stable.

Based on a review of the code, I think this is likely to address the problem you're seeing, but I'm not absolutely sure, so we could use your help verifying it. Would you mind updating to the new version and letting us know what you're seeing?
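
(For reference, the verification could look something like this sketch; it is illustrative rather than the actual program, with Stripe::Balance.retrieve standing in for any API call.)

```ruby
require "stripe"

Stripe.api_key = ENV.fetch("STRIPE_API_KEY")

# Count this process's open Stripe sockets via lsof.
def stripe_socket_count
  `lsof -p #{Process.pid}`.lines.count { |line| line.include?("stripe") }
end

10.times do
  # Each batch uses brand-new threads, then throws them away.
  20.times.map do
    Thread.new do
      Stripe::Balance.retrieve
    rescue Stripe::StripeError
      nil
    end
  end.each(&:join)

  puts "Stripe sockets held: #{stripe_socket_count}"
  sleep 150 # longer than the GC threshold, so idle managers get pruned
end
```

On 5.1.1 the printed count climbs with every batch; on 5.2.0 it should stay level.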

@rebelvc

rebelvc commented Sep 20, 2019

@brandur-stripe thank you so much for this fix. We have deployed it to production. Before the fix, we had increased the number of Sidekiq servers so each one had less chance of overloading.

Now I ran sudo lsof | grep CLOSE_WAIT | grep stripe | wc -l
and the result is 203, instead of in the thousands as before.

@brandur-stripe
Contributor

@rebelvc Excellent, thanks for checking! That number you're seeing should also stay relatively stable over time (maybe not exactly stable, but pretty level), so let me know if you're seeing something else.

One other thing I'll note for clarity: seeing CLOSE_WAIT isn't necessarily a bad thing in itself. It means a connection was closed because it had been idle for too long, but it can still potentially be reused if the owning thread wakes up and issues a new request. Even if each thread is making requests to api.stripe.com fairly actively, it may still be holding a connection to files.stripe.com or connect.stripe.com that's used much less frequently. So some of those 203 CLOSE_WAIT sockets you're seeing may still belong to threads that can do work over those connections if the need arises.

Going to close this out for now, but thanks for reporting and the quick feedback.
