
Support pre-fork servers #9

Closed
atombender opened this issue Feb 8, 2015 · 84 comments

@atombender

If you use this gem with a multi-process Rack server such as Unicorn, surely each worker will be returning just a fraction of the correct results (e.g., number of requests served, total time), thus making the exposed metrics fairly meaningless?

The easiest way to solve this is to create a block of shared memory in the master process that all workers share, instead of using instance variables.

@brian-brazil

I believe this problem came up with some users of the (experimental) Python client, and some proposed work on the HAProxy exporter.

What you really want to do is separately scrape each worker, and then aggregate it up in the Prometheus server. This gives the most useful/correct results, and also gives you per-process statistics for things like memory and CPU. I'm not familiar with Unicorn or Rack; do they offer anything that'd allow for this?

Shared memory is an option, though it has disadvantages: it'd be a source of contention, and you can't do per-process analysis.

@juliusv
Member

juliusv commented Feb 8, 2015

From the Unicorn Design doc:

"Unicorn is a traditional UNIX prefork web server. [...] Load balancing between worker processes is done by the OS kernel. All workers share a common set of listener sockets and does non-blocking accept() on them. The kernel will decide which worker process to give a socket to and workers will sleep if there is nothing to accept()."

Given that, I'm not sure how doable it is to scrape individual workers. Even if it was possible, the benefit of having individual insight and avoiding contention has to be weighed against potentially having an order of magnitude more time series.

I'd expect that on average, the workers on a single machine will behave similarly, except admittedly in pathological situations like this one pointed out in the Unicorn docs: "When your application goes awry, a BOFH can just "kill -9" the runaway worker process without worrying about tearing all clients down, just one".

It also states that "Unicorn deployments are typically only 2-4 processes per-core", which would mean multiplying the number of exposed time series on a typical large machine with e.g. 32 cores by 64x to 128x.

@grobie Did you have any plans wrt. supporting preforked servers already?

@atombender
Author

@brian-brazil: It's possible to collect per-process information with shared memory: Just organize the memory by PID. You'll have to reap dead PIDs, though. If you're clever you can avoid locking altogether, by having each process write to its own dedicated segment of the shared memory.
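A minimal sketch of that dedicated-segment idea, assuming a tmpfs-backed file stands in for the shared memory region and each worker is handed a small integer ID (all names below are illustrative, not part of the gem):

```ruby
# Hedged sketch: one fixed-width slot per worker in a shared, tmpfs-backed file.
# Each worker only ever writes its own slot, so no locking is needed.
SLOT_SIZE  = 8                      # one 64-bit counter per worker (assumption)
STATS_PATH = "/dev/shm/app_stats"   # illustrative stand-in for shared memory

class WorkerSlot
  def initialize(worker_id)
    @offset = worker_id * SLOT_SIZE
    @file   = File.open(STATS_PATH, File::RDWR | File::CREAT)
    @count  = 0
  end

  # Called only from this worker's request path.
  def increment
    @count += 1
    @file.pwrite([@count].pack("Q<"), @offset)
  end
end
```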

Scraping individual workers, as @juliusv points out, would mostly be useful to the extent that it would capture misbehaving workers that consume too much memory or CPU.

Having run Unicorn in production for years, I would agree that this doesn't happen often enough to justify the overhead of monitoring individual workers. Unicorn's master process will "kill -9" processes that don't respond within a per-process request timeout. In my experience, this tends to cover the pathological cases.

In terms of scaling, one of our clusters has 300 Unicorn workers on each box (anything from 4-25 per app). I wouldn't want to have Prometheus poll that many workers every few seconds.

@juliusv
Member

juliusv commented Feb 8, 2015

👍 Sounds like we need a shared-memory solution or similar.

@brian-brazil

It's possible to collect per-process information with shared memory: Just organize the memory by PID. You'll have to reap dead PIDs, though. If you're clever you can avoid locking altogether, by having each process write to its own dedicated segment of the shared memory.

You need to be a little careful here, as you can't simply throw away data from old PIDs. You'd need to merge in their counters to ensure the totals stay monotonically increasing. Organizing on >128-byte boundaries (the size of a cache line) should avoid locking issues during collection.
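A sketch of that monotonicity constraint, assuming a hypothetical accumulator that folds a dead worker's counts into a retained total rather than discarding them:

```ruby
# Hedged sketch: counts from dead PIDs are folded into a "retired" total
# so the aggregated counter never moves backwards across worker restarts.
class MonotonicCounter
  def initialize
    @live    = Hash.new(0)  # pid => latest count reported by that worker
    @retired = 0            # counts carried over from workers that have exited
  end

  def record(pid, value)
    @live[pid] = value
  end

  # Reap a dead worker without losing its increments.
  def reap(dead_pid)
    @retired += @live.delete(dead_pid).to_i
  end

  def total
    @retired + @live.values.sum
  end
end
```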

How would we deal with gauges? Both the set and inc/dec styles are problematic. The set style doesn't make sense when there are multiple processes (you really need it per-process), and with the inc/dec pattern used to count in-progress requests, you probably want to reset to 0 for a PID when that PID dies.

I wouldn't want to have Prometheus poll that many workers every few seconds.

60s is a more than sufficient monitoring interval for most purposes. Benchmarking will determine the actual impact.

This seems like it'll get very complicated to implement and use, as most uses of instrumentation would need to take the process model into account to get useful and correct results out the other end.
How do other instrumentation systems solve this problem?

@atombender
Author

You need to be a little careful here, as you can't simply throw away data from old PIDs.

How would this differ from the hypothetical non-shared case where each worker exposes a metrics port? If the worker dies and starts again, all its counters will restart at zero. In a hypothetical shared-memory implementation where each worker has a dedicated segment that is reported separately, it would be the same thing, no?

Anyway, the problem is moot because I just realized that Unicorn uniquely identifies workers by a sequential number, which is stable. So if worker 3 dies, the next worker spawned is also given the number 3. That means no reaping is needed, nor is any kind of PID mapping necessary, which I was worried about.

I did a little digging into the Unicorn sources and realized that Unicorn uses a gem called raindrops to accomplish exactly what I have been describing so far. Raindrops uses shared memory to collect and aggregate worker metrics (Unicorn also seems to use Raindrops directly to detect idle workers), but it looks like it only cares about socket metrics, not anything related to HTTP. Worst case, if it cannot be used as-is, I could look into extracting the relevant code from Raindrops into this gem.
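For reference, the Raindrops API being described looks roughly like this; the counter has to be created before Unicorn forks so the slots end up in memory shared by every worker (the slot layout is illustrative):

```ruby
require "raindrops"

# Allocated in the master, before fork, so the slots live in shared memory.
REQUESTS = Raindrops.new(2)   # slot 0: requests started, slot 1: requests finished

# In any worker, lock-free increments on the shared slots:
REQUESTS.incr(0)
# ... handle the request ...
REQUESTS.incr(1)

# At scrape time, any process sees the cluster-wide totals:
started, finished = REQUESTS[0], REQUESTS[1]
```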

@brian-brazil

How would this differ from the hypothetical non-shared case where each worker exposes a metrics port?

When the data comes from each worker, there's no persistence of data beyond a worker's lifetime, and the rate() function works as expected. With shared memory, where the client is smart and doing aggregation, you need to not lose increments from dead workers, as counters need to be kept monotonic.

Prometheus has the paradigm that instrumentation is dumb, and all the logic is done in the prometheus server. You need to be careful when adding logic to clients that it maintains the same semantics as a dumb client.

So if worker 3 dies, the next worker spawned is also given the number 3.

That makes things easier alright. And as I presume everything is single-threaded, we don't need locking for metric updates.

Thinking on this over the past bit, I think we can make this work but only for Counter and Counter-like metrics (Summary without percentile and Histogram).

@juliusv
Member

juliusv commented Feb 8, 2015

@brian-brazil Gauges are indeed a problem. I see four possibilities in principle for them:

  1. Allow the user to configure a merging strategy (averaging, summing, ...) for each gauge. The actual merging work wouldn't need to happen continuously, but only upon scrape.
  2. Make an exception for gauges and distinguish them by a "worker" label.
  3. Use "last write wins". But that's not useful at all for any common gauge use cases (queue lengths, etc.).
  4. Don't support gauges in this kind of environment, expecting that the bulk of relevant metrics for this use case would be counters and summaries anyways.

Option 1 is probably what we'd want if we don't split metrics by worker otherwise.

@atombender If indeed one Unicorn server is always expected to have the same number of stably identified worker processes (at least between complete Unicorn restarts, which would also wipe the shared memory), your approach sounds sensible to me. If I understand the raindrops approach correctly, it uses a single counter for all workers ("counters are shared across all forked children and lock-free"), so there wouldn't even be any counter resets anywhere in that approach if only a single worker is killed?
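A sketch of option 1's merge-on-scrape idea, assuming each gauge is declared with an aggregation strategy and per-worker values are only combined when /metrics is hit (all names are illustrative):

```ruby
# Hedged sketch: per-worker gauge values are merged only at scrape time,
# using a strategy chosen when the gauge is declared.
module GaugeMerge
  STRATEGIES = {
    sum: ->(values) { values.sum },
    avg: ->(values) { values.empty? ? 0 : values.sum.to_f / values.size },
    max: ->(values) { values.max || 0 },
  }.freeze

  def self.merge(per_worker_values, strategy)
    STRATEGIES.fetch(strategy).call(per_worker_values)
  end
end

GaugeMerge.merge([3, 5, 0, 2], :sum)  # => 10, e.g. in-progress requests across workers
```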

@juliusv
Member

juliusv commented Feb 8, 2015

(if you're only reading by mail - I edited my last comment because Markdown made a 5 out of my 1...)

@atombender
Author

Note that I am not suggesting that the client should aggregate anything. I am suggesting that each worker has a slot in the shared memory segment that it is responsible for updating. When a worker gets a /metrics request, it simply returns the contents of all slots, separately labeled by worker ID.

Shouldn't that completely work around the gauge problem? I was initially thinking that the client would aggregate workers, because, as I said earlier, I think per-worker metrics aren't very useful. But if it poses a problem for gauges, I think it's better to "over-report" and do aggregation on the server.

(Pardon my obtuseness, but I'm not too familiar with how Prometheus works yet.)
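A sketch of what that per-worker-slot scrape could render, assuming a hypothetical slots hash of worker ID to request count; the Prometheus server can then aggregate the worker label away at query time, e.g. with sum without(worker) (rate(http_requests_total[5m])):

```ruby
# Hedged sketch: on GET /metrics, every worker's slot is rendered under its own
# "worker" label; aggregation is left entirely to the Prometheus server.
def render_metrics(slots)   # slots: { worker_id => request_count }
  lines = ["# TYPE http_requests_total counter"]
  slots.each do |worker_id, count|
    lines << %(http_requests_total{worker="#{worker_id}"} #{count})
  end
  lines.join("\n") + "\n"
end

puts render_metrics({ 0 => 1042, 1 => 987, 2 => 1101 })
```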

@brian-brazil

Don't support gauges in this kind of environment, expecting that the bulk of relevant metrics for this use case would be counters and summaries anyways.

This is my expectation. It sounds like an entirely online-serving use case, with no thread pools or the like where you'd usually want gauges. I can see some use cases for gauges (e.g. if you were batching up requests to send on elsewhere), but the counter uses are the primary ones.

@juliusv
Member

juliusv commented Feb 8, 2015

@atombender Ah, ok. That'd definitely work and require fewer scrapes, though you'll still end up with a lot of dimensionality in your time series that might not prove to be of much value vs. their cost in transfer, storage, and queries.

@brian-brazil Yep.

@brian-brazil

When a worker gets a /metrics request, it simply returns the contents of all slots, separately labeled by worker ID.

That'd work; I thought we were trying to avoid that due to all the additional time series.

Shouldn't that completely work around the gauge problem? I was initially thinking that the client would aggregate workers, because, as I said earlier, I think per-worker metrics aren't very useful. But if it poses a problem for gauges, I think it's better to "over-report" and do aggregation on the server.

Reporting everything is what I generally favour, assuming it works out resource-wise.

That resolves the counter problem, but it doesn't solve the gauge problem, as gauge use cases tend to work under the assumption that the gauge is reset when the process dies/starts (think of a gauge that represents in-progress requests). If Unicorn can clear them when a process dies, that should handle that.

@brian-brazil

The other limitation is that callback-like collectors won't work with a shared-memory model. This would affect things like runtime and process statistics that you want to gather at scrape time.

@grobie
Member

grobie commented Feb 9, 2015

@grobie Did you have any plans wrt. supporting preforked servers already?

When we had the use cases at SoundCloud, we discussed three options: a) scraping each worker independently, b) letting the master process handle metrics (the current dumb lock implementation in the Ruby client probably needs some improvement), or c) letting each worker push metrics to a Pushgateway.

I believe we run all our Ruby services on JRuby by now, using the Java client. Going forward with b) sounds good to me.

@jeffutter

I am very interested in this issue. I am wondering whether it is also seen with threaded servers (like Puma)? I think they may work in a way where memory isn't shared, depending on where/when things get initialized.

If there is a demo app in the repo, I'll play around with it today.

Is anyone aware of any other rack middlewares that share global state that we could look to for solutions?

@jeffutter

If raindrops is the right path to look at for this... they have a sample middleware. Their code isn't on GitHub, but here is a clone of that middleware: https://github.com/tmm1/raindrops/blob/master/lib/raindrops/middleware.rb

@grobie
Member

grobie commented Nov 6, 2015

Thanks, I'll have a look.

@jeffutter

I have been playing around with raindrops, but can't seem to get the processes writing to the same counters. I'll keep at it. In the meantime, I found this: https://github.com/TheClimateCorporation/unicorn-metrics, which might show a similar use case.

@jeffutter

For those interested in this, I have made a little progress. It seems that when using raindrops, the Raindrops counters need to be declared before the workers fork. Otherwise, each worker will end up instantiating separate instances of the counters. If you declare them beforehand, it actually works as expected.

This makes the Prometheus middleware a little inconvenient, as you would need to know all possible endpoints and other counters beforehand. This is the approach that the unicorn-metrics project above takes.
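For illustration, that up-front declaration would live in the Unicorn config, in the master, before any worker exists (the counter layout and slot mapping are placeholders):

```ruby
# config/unicorn.rb (hedged sketch): shared counters must be allocated in the
# master process before workers fork, or each worker gets its own private copy.
require "raindrops"

worker_processes 4
preload_app true

# Declared up front: every endpoint you intend to count needs a slot already.
ENDPOINT_SLOTS = { "/" => 0, "/search" => 1, "/checkout" => 2 }.freeze
ENDPOINT_HITS  = Raindrops.new(ENDPOINT_SLOTS.size)
```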

I have begun looking at a different approach, where each worker exposes its own metrics at /_metrics and there would be a middleware at /metrics that queries the others and combines the results. I currently can't find a way to figure out what the other workers' TCP/Unix sockets are from inside a worker. There is a Unicorn::HttpServer::LISTENERS constant, but it seems to only contain a reference to the current worker.

Hopefully this information is helpful to someone; I would really love to see Prometheus being useful on Ruby servers.

@brian-brazil

The approach I'm taking over in Python is to have each process keep a store of its metrics in a per-process file on disk, and then read them all in at scrape time.
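A rough Ruby equivalent of that file-per-process idea, assuming each worker is the sole writer of its own JSON file and the scraping process sums them all (paths and names are illustrative):

```ruby
# Hedged sketch: one metrics file per process, read and summed at scrape time.
require "json"

METRIC_DIR = "/tmp/prom_metrics"   # assumed to already exist; one file per worker PID

def write_own_metrics(counters)    # e.g. { "http_requests_total" => 123 }
  File.write(File.join(METRIC_DIR, "worker_#{Process.pid}.json"), counters.to_json)
end

def collect_all_metrics
  Dir[File.join(METRIC_DIR, "worker_*.json")].each_with_object(Hash.new(0)) do |path, acc|
    JSON.parse(File.read(path)).each { |name, value| acc[name] += value }
  end
end
```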

I can see a challenge in distributing the scrape to workers if a worker is single-threaded. You want scrapes to be independent of other processing.

@jeffutter

I implemented backing the metrics with PStores. This seems to have the desired effect, albeit causing a slowdown, since it writes stats to disk on every request. My branch is here: https://github.com/jeffutter/prometheus-client_ruby/tree/multi-worker. I tested it out with Apache Bench doing 2000 requests with 10 concurrent requests and 4 Unicorn workers. After running that, the /metrics endpoint reported the correct number of hits; I'm not sure about the other types of stats.

On this branch, with the above setup, I'm getting ~350 requests/sec, while on the master branch I'm getting over 1,700. Now, while this sounds like a big difference, the duration of the request that returns OK/200 is about 13ms. I'm guessing that under real-world load, with real request times, this difference will be far less significant.

That being said, there may be better backing stores to use than PStore, or better ways to store it. Also, it should probably be an option to back it with a Hash when not running in a prefork environment.

Let me know what you think.
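For context, PStore is the transactional, file-backed store from Ruby's standard library; a counter increment through it looks roughly like this, which is also where the per-request disk write (and the slowdown above) comes from:

```ruby
require "pstore"

# Hedged sketch: every increment runs inside a PStore transaction, which
# serialises the data back to disk: durable, but paid for on each request.
store = PStore.new("/tmp/prometheus_metrics.pstore")   # illustrative path

store.transaction do
  store[:http_requests_total] = (store[:http_requests_total] || 0) + 1
end

store.transaction(true) { puts store[:http_requests_total] }  # read-only transaction
```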

@brian-brazil

That's pretty much the same approach as I'm using in Python; however, I'm only letting this behaviour kick in when explicitly asked for, as it has a number of limitations, including performance (see prometheus/client_python#66).

Scrape-time performance doesn't matter too much; what's really important is that metric increments etc. be fast. Optimally, that'd mean something where writes hit the kernel, but not the disk.

@jeffutter

On my branch (https://github.com/jeffutter/prometheus-client_ruby/tree/multi-worker) I have updated it so there are backing "stores". I have included one with the original behavior (Hash) and one with the new behavior (PStore). If you pass the prefork: true argument to the middlewares, it will use the PStore backend; otherwise it uses the original. I imagine this could be extended in the future to support other on-disk or even DB (Redis?) backends.
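Based on that description, usage on the branch might look something like this in a config.ru; the prefork: option comes from the comment above, and the Rack middleware class names are assumed from the client of that era, so treat this as a sketch rather than a confirmed API:

```ruby
# config.ru: hedged sketch of the described opt-in, not a confirmed API.
require "prometheus/client/rack/collector"
require "prometheus/client/rack/exporter"

use Prometheus::Client::Rack::Collector, prefork: true   # PStore-backed store
use Prometheus::Client::Rack::Exporter,  prefork: true
run ->(env) { [200, { "Content-Type" => "text/plain" }, ["OK"]] }
```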

Also (I'm not sure why), the benchmarks have improved a little. The Hash one is giving me 2,100 requests/sec, while the PStore one is up to 1,400. Perhaps it's just a system resource difference from before; I need a more scientific benchmark. However, at 1,400 requests/sec I doubt this will cause a bottleneck under any real web load.

Currently there are a bunch of broken tests. If this solution seems like something that might be considered for merging, I will fix up the tests, update the docs, and clean everything up.

@jeffutter

For continuity's sake: discussion of this topic has continued over on the mailing list here: https://groups.google.com/d/topic/prometheus-developers/oI3qPhaRY7I/discussion

@mortenlj

Hi!

It seems discussion about this issue died out on the mailing list sometime before Christmas last year. Is there any movement towards a solution to this problem?

@grobie
Member

grobie commented Aug 12, 2016

I'm not aware of anyone working on a solution for this problem at the moment. Ideas and contributions are very welcome.

@lzap

lzap commented Nov 8, 2017

What would be the best workaround for Rails app running in Passenger? I am considering statsd_exporter and pushgateway. I expect the latter to be easier to integrate with but slower.

@SamSaffron

SamSaffron commented Nov 9, 2017 via email

@kvitajakub

Hi, is there any progress on this please?

Looks like it should have an issue or feature label in addition to the question one.

@SamSaffron

I started my Discourse extraction here:

https://github.com/discourse/prometheus_exporter

Supports multiple processes with optional custom transport, recovery, and chunked encoding. It should get a proper release in a few weeks.

Works fine with Passenger, Unicorn, and Puma in clustered mode. You can also aggregate Sidekiq.

@lzap

lzap commented Jan 11, 2018

Thanks. How many metrics do you send for each request in production? I am a little bit concerned about that HTTP/JSON transport.

@SamSaffron

@lzap I'll update the readme. I can send about 10k strings in 200ms over a single request... if you add a round trip of to_json and JSON.parse, it's around 600ms for 10k messages.

I persist the HTTP connection for 30 seconds (configurable).

@SamSaffron

Just putting this out here, in case anyone is still facing the problem

https://samsaffron.com/archive/2018/02/02/instrumenting-rails-with-prometheus

@grobie changed the title from "Doesn't work with multiple Rack processes?" to "Support pre-fork servers" on May 4, 2018
@fajpunk

fajpunk commented Oct 14, 2018

Hi! I notice the Help wanted tag on this issue, but it's unclear from reading the comments exactly what help is wanted; all I see is a few proposed solutions and alternate libraries. Has a solution been chosen, and if so, is there a branch where work is being done? I'd like to help if I can.

@dmagliola
Collaborator

dmagliola commented Oct 23, 2018

An update on this issue: We (GoCardless) are currently working on this. In the process, we're also planning to bring the Prometheus Client more in sync with the current best practices as recommended here.

We've opened an issue with an RFC explaining what we're planning to do.
And we have a PR with the first steps.

This PR doesn't solve everything, and specifically it doesn't deal with the multi-process issue, but it's the first step in allowing that to happen, by introducing "pluggable backends" where the metric data is stored, allowing consumers of the library to pick which is best for their needs, and to write their own if necessary.

We appreciate comments and feedback on these proposals and look forward to working with you on this!

@dmagliola
Collaborator

Another update: we've updated PR #95 to (among a lot of other things) add a "backend" that does work with pre-fork servers, sharing data between processes on a scrape.
We've also done plenty of benchmarking and performance improvements to it, and would love everyone's comments on it!

The new backend we're proposing is pretty simple. It basically uses files to keep the data, but thanks to the wonders of file-system caching, we're not really blocking on disk, and it's actually surprisingly fast. A counter increment with this backend takes about 9μs.
For context:

  • the MMap Store that @juliusv was proposing in the multiprocess branch is about 6μs per increment
  • a Hash with a Mutex around it is about 4μs
  • and a bare Hash, with nothing around it, is about 1μs.

We still haven't abandoned the idea of using MMaps in the near future, but we've found some stability issues with that solution, whereas the File-based backend is very stable and reliable, so we're proposing to go with that for the time being.

NOTE: It's a big PR, it was a big refactoring, and there's a lot going on. However, the commits are coherent and have good descriptions, so we recommend reading it commit-by-commit rather than the full diff all at once, which will not be very parseable.
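For readers landing here later: selecting the file-based backend from that PR ends up looking roughly like this in the released client (the directory, metric name, and aggregation choice are illustrative):

```ruby
require "prometheus/client"
require "prometheus/client/data_stores/direct_file_store"

# Hedged sketch of configuring the file-based backend introduced in PR #95.
Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: "/tmp/prometheus_metrics")

# Gauges need an explicit cross-process aggregation when read back at scrape time:
registry  = Prometheus::Client.registry
in_flight = registry.gauge(:in_flight_requests,
                           docstring: "Requests currently being served",
                           store_settings: { aggregation: :sum })
```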

@philandstuff

Is this fixed by #95 now?

@dmagliola
Collaborator

@philandstuff In short, yes, but the latest release still doesn't reflect that. There are still a few changes planned before cutting a new release, some of which are breaking changes, so we'd like to get them done before that.
But... Soon :-)

@dmagliola
Collaborator

We're happy to report, this is now fixed! 🎉
(Sorry for the delay, we should've updated this a while ago)

We haven't yet launched a 1.0 (coming soon!), but you can start using it with our recent v0.10.0.pre.alpha.2 release.

NOTE: We expect there might still be a couple of small breaking changes in the API between now and the 1.0, but we've been using this code in production for months and it works great, so if you are OK with potentially having to do minor adjustments once we release 1.0, you can already use this.

@wshihadeh

@dmagliola does the fix also cover deploying the service in an HA mode? In that case, the running processes do not have access to the shared memory :(

@dmagliola
Collaborator

@wshihadeh I'm not sure I understand the question fully...
HA mode would be different servers, correct?
In that case, each server would have separate time series in Prometheus anyway, and the aggregation happens when querying Prometheus.

Or are you talking about something different?

@wshihadeh

Yes, I am talking about that. To explain it more: suppose the application is a web server that is deployed in multiple containers or servers and exposes the metrics endpoint. I don't see why we need two Prometheus tasks for such a setup (one that scrapes metrics from the load balancer or service IP should be enough). In that case, the metrics will not be totally accurate, and that is why I think there is a need for shared storage between the containers or servers, something like Redis.

@philandstuff

I don't see why we need two Prometheus tasks for such a setup

Prometheus really, really wants to scrape individual instances. There are a number of reasons for this, but the main one is that this is the overwhelming convention in the Prometheus world. If you want to break this convention, you will find yourself fighting the tooling instead of working with it.

You should use some sort of service discovery to ensure that Prometheus automatically detects new instances when they are created. That way, you only need a single scrape_config but Prometheus will still scrape each instance separately.
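A minimal sketch of that setup, assuming file-based service discovery; the job name and target file are placeholders maintained by whatever tooling creates your instances:

```yaml
# prometheus.yml (hedged sketch): one scrape_config, many discovered instances,
# each application instance scraped individually.
scrape_configs:
  - job_name: "rails_app"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/rails_app.json   # written by your SD tooling
```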
