This became nextstrain/nextstrain.org#511.
Logging in nextstrain.org is currently ad-hoc and sometimes useful, sometimes not enough. The goal is to make what's logged more systematic and comprehensively useful while at the same time improving our ability to inspect and act on the logs.
Below are some proposed changes to make. I expect to break each of these items out into separate issues in due course.
-
Standardize on debug or bunyan for emitting logs.
debug
is already used by Express and several related deps (setDEBUG='*'
to see these logs), which is a point in its favor, but it's relatively limited (e.g. no log levels, no structured logs, no child loggers).bunyan
provides a more traditional and modern set of logging features, including a focus on structured logs which greatly helps with analysis.Retire our existing logging functions (e.g.
utils.verbose()
) in favor of the above. -
Switch from ad-hoc messages that embed limited context to structured logs with increased context across the board (e.g. all logs include req context, incl. critical request ids).
The ad-hoc messages don't all have to go away—they're still useful for skimming by eye, particularly if well-written—but they shouldn't be the only thing produced by a log line.
-
Archive raw logs to S3 for retrospective debugging/analysis.
Archives should be raw, straight from Heroku's Logplex service.
Two quick examples of things archives are useful for:
- Investigating reports like "this weird thing happened last week"
- Analyzing the impact of a bug over time, e.g. "how many requests were impacted by this over the last month?"
Keep the retention period fairly short so we're not storing (potentially sensitive) data we don't need. Maybe retain for 30 days? (which is also the minimum duration you pay for with S3)
Papertrail supports S3 archiving, but we don't currently ingest raw logs into Papertrail due to account limitations (see our incoming log filters).
-
Ingest logs into analysis system for real-time inspection.
Currently this is Papertrail, a Heroku add-on. It's got a nice straightforward interface. However, we're using a pretty stringent free tier (10MB/day, 2d retention for search) so have to omit lots of useful logs (e.g. all GET requests!). Retention period is also pretty short. Paid tiers that would accommodate our usage largely unfiltered get pricey.
Logtail is another (glossier) app that also has a Heroku add-on, with a free tier that's a bit more generous than Papertrail (5GB/month, 8d retention for search). Even it, however, would require a bit of filtering on ingest and prices increase quickly from free. Interestingly, the Heroku add-on's free tier is more generous than the non-Heroku free tier of the same product.
Grafana Logs is part of Grafana's hosted offering, Grafana Cloud, with a generous free tier of 50GB logs and 14d retention. No Heroku add-on, so we'd have to manually configure the Heroku log drain → Grafana connection. This seems straightforward to do with Vector (a log-processing multi-tool), though means we'll have to run Vector somewhere persistent. The smallest EC2 instance or Heroku dyno would almost certainly be fine, both at around $7/month.
A big benefit of using Grafana is that we'd also get an industry-leading suite of metrics, monitoring, dashboarding, and alerting tools. For the Seattle Flu infra, Grafana was a very nice place to ship our metrics and logs to. We don't have non-Heroku-generated metrics for nextstrain.org right now, but there are several I think would be good to start tracking. Downside is it does include an additional moving part we'd have to manage (the connection between Heroku and it). For this reason primarily, it could make sense to stick with Papertrail (or switch to Logtail) and pay a bit more (or ask for a gratis bump up as a non-profit).