On logging #644

rec · 2023-08-08T10:02:29Z

rec
Aug 8, 2023

Why?

Our current logging is hand-written and very low on features, as in "none".

I believe that end-users, internal and external developers will all want more features.

Sources of logging

the sddb core code
our server code
Flask or FastAPI's own logging or printing
our client code
third party models' logging or printing
warnings as we import modules

Possible features

It is easy to suggest features, harder to estimate their value to the user, and perhaps even harder to estimate how much work they are. Here's a list in very roughly value from most to least.

log levels
preservation of logs, including log rotation
integration with logging from third parties (Python standard logging!!!, print statements)
structured logging: even though this is for Java, this article is good
alerting on certain log levels or other conditions
log routing (teeing, routing between processes or machines, from server to client, or the other way)
Ease of programmer use
Prettiness of output

Likely non-goals and non-features

CPU speed
log size

Notes on features

The last feature, prettiness has a non-zero value, and not just to impress others: we will be seeing a lot of these logfiles!, but we should decide which logging system to use based on the other features, and then use whatever prettiness we get. :-)

As for "ease of programmer use", there seem to be only two styles:

Import/create:

import logging
logger = logging.getLogger(__name__)

logger.critical('Meltdown initiated, flee')

Import-only

from loguru import logger

logger.info('Your cat has fleas')

I should add that the second format still manages to get __name__ and even the line in the file that's being called, using inspect magic and some clever caching, in seemingly every library that offers it.

A good reference

This article is actually pretty good when it comes to the two libraries I know well, Python's logging and loguru.

My next steps will be to go through the remaining three libraries in the article and any others I find and compare them with my checklist above.

IIRC, I loved loguru totally except we had a bad time with integration with the Python standard logger which would be a dealbreaker, but this was several years ago.

thejumpman2323 · 2023-08-08T11:58:19Z

thejumpman2323
Aug 8, 2023

I second this

we should have some sort of TimeRotating log file
Right log levels
Persisted logs into disk

0 replies

nenb · 2023-08-09T08:45:52Z

nenb
Aug 9, 2023

I'm ashamed to admit that logging has always been something of an afterthought for me, even though I often rely on it heavily afterwards. My knowledge of what's useful/what's not so useful has also been cobbled together from a bunch of (possibly inconsistent) sources. Thank you for proposing this, please do track the knowledge that you have/learn here or in an issue so that we (me!) can finally have a more 'structured' approach to logging. (See what I did there 😉).

In terms of things that seem useful to me at the moment:

introduce some guidelines on how we will 'log' in the project and what we must do so that we don't spam or break the logs of other packages that use us as a library (and also so that we can benefit from it internally ie don't just ignore the logs as they are too noisy as I have done in a bunch of past projects...)
introduce some conventions for keys for structured logging within our project so that we can parse errors quickly and easily eg all requests to the server have a 'client' key and an 'action' key in the logs, so that we can quickly parse for all requests from a particular client or for a particular action
(very vague but) introduce some guidelines so that logging never interferes with the performance of our application in a default setting (ie don't spam the logs a million times when training unless explicitly told to)
(Potentially important use case) Users need to be able to track where 'AI-generated' code enters their workflow. For example, AI model used on what data. (Structured) Logging may have a role to play here, to track what data has been modified and where.

That's all I can think of for now. Looking forward to tracking this!

0 replies

blythed · 2023-08-09T09:04:09Z

blythed
Aug 9, 2023
Maintainer

Another key issue, and which will be critical for the end-user, is redirecting logs from the dask workers, which basically "run-blind" to somewhere, where the logs can be viewed and monitored in real-time. Currently we're using MongoDB to do this, redirecting the logs in real time to a collection. See here.

1 reply

rec Aug 9, 2023
Author

I read this code at that link a few weeks ago and totally forgot it was there for this document!

I hope that this functionality is covered by the two bullet points above:

alerting on log levels or other conditions
log routing (teeing, routing between processes or machines, from server to client, or the other way)

I think all the logging libraries under consideration handle some form of redirection, though how much of a transport layer there is between machines is very variable.

At least two of the logging libraries under consideration handle alerting, both using the notifiers library, which seems pretty good.

rec · 2023-08-09T09:09:09Z

rec
Aug 9, 2023
Author

@nen, your four points are sufficiently strong that I did a naughty and edited your comment to give them numbers!

I totally missed 2, particularly, which is so important. What's the point of structured logging without a consistent fields, huh?

Clearly a document for developers is a key deliverable, it's part of 1, 2, 3, and 4.

We get 4 almost for free if we just include file and line numbers in each log record.

1 and 3 are the same huge point - "don't spam the logs"!

I spent considerable time on Google logging, which is amazing.

Considerably later, I spent a few months doing it in another project, where the main program generated 2 gigs of log every day, only one person could really read them, and yet no one wanted to commit to removing any log messages.

/shaking my head

Google and most companies at the time they started had two sorts of logs: free-form human-readable logs of everything that gets printed to stderr or stdout, and very structured logs written using protocol buffers.

The two were totally different in every way: functionality, API, and managerial. (You had to get serious permissions to add something to the structured logs, but none to write crap into the programmer logs.)

Google considered their structured logging to be so critical that they got two famous figures from computer science to work on them, Peter J. Weinberger and Rob Pike. (I spent a lot of time with Peter and he's a bit of a curmudgeon in a good way but ego-free and unpretentious. My only interaction with Pike was a disagreement about error handling in Go in the early phases, he can be a bit spikey, and I still think I was right. :-D)

Nearly all the value in Google comes from their targeted ads: nearly all of that comes from structured log analysis.

It is likely we can get away with only one type of log because we are starting fresh.

In order to do this, we will have to hijack stdout and stderr from our dependencies, as well as their unstructured "classic logging" calls, and wrap them in structured logging calls.

Redirecting "classic logging" is known technology, we can do it mechanically with only a little look at the source of our dependencies.

Redirecting "stdout and stderr" for dependencies is always doable, but might take a bit of research into each dependency to resolve that.

2 replies

blythed Aug 9, 2023
Maintainer

Ok, I'm not sure if I emphasized the issue with the dask logging sufficiently. We have multiple tasks, potentially running simultaneously on remote dask workers. Client-side, the user may want to interrogate the progress of these tasks, using a task-id, and take action based on that. See here for a discussion of that dask/distributed#2033.

rec Aug 9, 2023
Author

I was responding to @nenb's comments, I haven't gotten to yours yet! :-)

rec · 2023-08-09T09:31:29Z

rec
Aug 9, 2023
Author

I wanted to add some more important random points before I forget them:

GDPR and log retention and the right to be forgotten - the easiest way to deal with this is never to store any personally identifiable information, but what about IP addresses, or MAC addresses? Not sure what the law is, we should get an opinion.
We should automatically be logging any Python crashes, which is fairly easy, and we should figure out how to log SEGVs or other such misery from our C++ dependencies, which is much less easy.
It's sort of implied above, but we'll need to have a very clear naming convention for our logs, so that ideally each logfile has a guaranteed unique name. Using the MAC address in the logfile is a way to do that, but see GDPR above.

0 replies

fnikolai · 2023-09-13T10:36:27Z

fnikolai
Sep 13, 2023
Maintainer

For single-node deployments, the custom logging approach is ok.

For large-scale and production-ready deployments we need more advanced methods.
It is particularly important to separate metrics from logs.

Metrics show the "performance" of a service, and are typically used for monitoring the progress (e.g. Grafana dashboards).
Logs focus on specific events and are typically focused on troubleshooting. I also concur with @rec that logs should be structured so that integration with other tools can be easily done (e.g., they should be in the form action:"write", location:"./somewhere" rather than writing file to ./somewhere)

I would propose integration with Prometheus[1] for metrics, and with Loki[2] for logs. For tools are optimized for running queries, provide web interfaces, and handle nuances such as rotation, replication, etc.

[1] https://github.com/prometheus/client_python
[2] https://pypi.org/project/python-logging-loki/

0 replies

rec · 2023-09-14T07:56:58Z

rec
Sep 14, 2023
Author

The term that I have heard used before is "monitoring variables".

I haven't heard the term "metrics" before, and I'm moderately against it, as it has so many other meanings, and we already have Metric all over the code! Also, a "metric" sounds numeric, but monitoring variables certainly include variables of string type.

Logs focus on specific events and are typically focused on troubleshooting.

welll....

Google became fabulously wealthy because of log analysis. Spell correction? From log analysis. Targeted ads? From log analysis.

There are actually two somewhat separate uses for logs and Google had completely different "logs" in fact, stored and processed in radically different ways.

There were "programmer's logs" which were basically print statements to a UTF-8 text file separated by carriage returns. At Google, there were three of these logs, depending on error levels (info, warn, error). Some automatic monitoring of this was done - for example, new error messages appearing would trigger stuff. No personally identifiable information of customers could go into these logs.

And then there were "the logs", basically event logs, which were highly structured, stored in protocol buffer format, encrypted, and held behind "bastion" machines where only a handful of people had keys. Analysis could only be done through a special purpose language which did not actually allow you to retrieve personally identifiable information. Google took this seriously enough that they had two famous people, Peter Weinberger and R. Pike, design the logs system.

The GDPR and personally identifiable information program is so thorny that I think we shouldn't deal with it at all! :-)

I think we should tell people somewhere that all the logs are only intended to be collected on people working with your company, and not from the general public at all and if they do, they won't be GDPR compliant, and we aren't responsible.

I don't think we should have separate programmers' and event logs, because it's too much work. I think we should write everything to structured logs.

I think we should keep the structure logging as simple and minimal as possible. I also think we should have unit testing of it, though: when we add it, we should add at the same time a way that a unit test can easily say, "Now test that the previous steps generated logs rather like this".

I would like to add that we are in a fairly easy and luxurious place, because our individual operations are very heavy, and there aren't very many of them. In some systems, the efficiency of the logs is of key importance, because generating them is quite heavy compared to the tininess of each transaction.

Given that, we could get even more simplified: we could have our monitoring variables simply as part of our logging, where we just emit all sorts of variables to our logging system, and then a separate monitoring system exposes some subset of those variables to the outside world to be monitored.

I am talking of course about our internal API.

We should be entirely be using other people's code as much as possible for log gathering, rotation, compression, etc etc etc. but we want to make our own internal API for logging/monitoring as bone simple as possible, and allow other people to use it when writing superduperdb programs.

0 replies

rec · 2023-09-14T08:35:18Z

rec
Sep 14, 2023
Author

So let's look into what such an API might be like.

I see a logging or monitoring event as emitting a JSON record to a key, so you get a "timeseries" of JSON records for each key.

Sounds easy, but it opens a couple of cans of worms!

If I write a record to a key today, I would like at any point in the future to be able to find that key, and then understand that record.

This means keys have to be stable. It's probably not a good idea to have user-contributed information in a key, therefore.

This also means that it must be possible to identify the type of a record from a key.

Nearly always, keys are segmented, divided into multiple segments with semantic meaning like cpu.load.mean.

One or more segments will have to determine the type. (In other APIs, the "type" might be at the start, e.g. maiito:, or at the end, ...track-1-15.wav.)

Types of records need to be carefully controlled in order to maintain backward and forward compatibility.

The meetings of fields cannot change. If you got the design of a field wrong, you simply have to create a new field with a new name, and still accept the old one in your analysis code, which means that you also need to commit some code to port existing records to the new format if that has to happen - but you never rewrite old logs.

Finally, you need to have at least some logic associated with the log type. For example, you want "counter" types where just hitting them increments a thread-safe counter, but you want other types where you set a new value each time, so this means there will have to be a class associated with each log key (from a small number: "set", "counter", "mean" perhaps...)

So let's actually fix all keys at startup, as static variables - then we could write unit tests about them.

Let's sketch the design of just a counter that increments by one when it is hit, something that takes successive numerical samples, and a string variable.

But we can't actually instantiate a log until we are running, we don't want a huge number of global static variables everywhere that we need to patch to make our tests run!

So it goes like this:


#server.py...

import superduperdb as s
from enum import Enum


class Status(str, Enum):
    idle = 'idle'
    busy = 'busy'


class Log(s.log.Base):
    # This defines the log points just for this module.
    # Any module that did logging would have a definition like this

    __keys__ = ['server']

    elapsed = s.log.ElapsedTime()
    hits = s.log.Counter()
    status = s.log.Enumerate[Status]()


class Server:
    def __init__(self, context: s.log.Context):
        self.log = Log(context)

    def handle_request(self, ...):
        with self.log.elapsed:
            self.log.status.set(Server.busy)
            self.log.hits.count()
            ....
            self.log.status.set(Server.idle)

0 replies

rec · 2023-09-14T09:12:12Z

rec
Sep 14, 2023
Author

Using this strategy allows us to ensure some of the important properties above with unit tests.

We create a text "registry" of every single log field and type and we store it with the source (not even with the tests, it's part of our spec).

A test makes sure that the existing fields do not change incorrectly as follows.

Each class inheriting from log.Base is registered and all their fields are made into a big JSON document, which is compared with the existing text registry.

One of three outcomes might happen:

Nothing has changed
The types have changed in a backward compatible way (i.e. only new fields have been added and no types changed)
The types have changed in a backward incompatible way.

In case 1, the test passes.

In case 2, the test fails but with a message like this:

You have changed the log types in a backward compatible way. To update the registry, run

    SDDB_UPDATE_LOG_TYPE=t pytest -k test_log_types

In case 3, it fails with no automatic fix.

For simplicity, I omitted one important case in the "log types above", which is the structured type, but I know you're itching to see it, so let me list it now.

class Card:
    class Suit(str, Enum):
        ...

   suit: Suit
   value: Value


class Log(s.log.Base):
    __keys__ = ['CardTable']

    dealt = s.log.Data[Card]()  

class MonteCarlo:
    def play(self):
        self.log.dealt(self.deal_card())

         # or also
         self.log.dealt(suit=Card.Suit.spades, value=Card.Value.ace)

         # kwargs only, so NOT:
         # self.log.dealt(Card.Suit.spades, Card.Value.ace)

0 replies

rec · 2023-09-14T09:34:33Z

rec
Sep 14, 2023
Author

Frequently imagined questions!

1. How to we incorporate third-party and legacy text logs into this picture?

We receive a log record from torch, what happens to it?

Simple: each third party log gets its own key. If we can't at all parse it, we can just store it as a string of text, or we might be able to extract some structure from it.

2. Who's actually writing the logs and where?

Under the scenes we will use some other library, probably loguru, to do all the heavy lifting, including redirected legacy logging, and then other packages external to our program to move the logs around, compress, rotate etc.

3. Hasn't someone already done this?

Only the ideas in the above code snippets are new and will need coding. Most of this is other people's production packages

But you would think someone had done this little idea before, it seems both rigorous and convenient.

I'm going to look again after I press return, but I did a search for "Python structured logging" before and it showed me a lot of conventional packages, none of which used modern techniques of using class members to indicate intent, like SQLAlchemy, dataclasses or pydantic do.

The code will be advanced, but straightforward to write and not require many keystrokes to achieve. There won't be a lot of special cases or complex logic, either.

4. But I really really really want print statements/unstructured logs!

No problems! Just for you, we could create two keys named stdout and stderr which simply take a stream of unstructured text.

5. What about monitoring variables/metrics?

In this proposal, there is no difference in the API.

Behind the scenes, we can either route certain keys also to send to monitoring systems (active monitoring) or respond to a variable request from a monitor (passive monitoring), but conceptually these are all the same sort of thing, a key attached to a type that receives a series of values!

0 replies

rec · 2023-09-14T10:15:15Z

rec
Sep 14, 2023
Author

These guys seem to be the most prominent in Python purely structured logging: https://github.com/hynek/structlog and actually, we could avoid loguru entirely I think.

They integrate with a lot of things we would want to integrate with.

They don't have the idea of stable keys validated by tests like the above, which is I think essential if you really intend to read last year's logs next year, a goal which is part of the "repeatable calculation" effort.

They don't have integrated monitoring.

loguru is the leader in logging in general. They now have extensive structured logging. They seem to have more tools for log rotation and such than structlog. There is a notification system already there, which might do for monitoring but wouldn't get in the way.

They also seem to interface with a laundry list of other programs.

They don't have the idea of stable keys validated by tests.

So I tentatively think we should write to loguru, but it makes little difference - we're preparing dicts and passing them to one or the other - if we cared to we could even have a software switch between them.

We provide the simple framework designed above. I could write implementations of what's above in three days with little risk factor as it has few dependencies.

Behind the scenes it's just a bunch of keys, each receiving a time series of JSON-able values with backward compatible structures.

We send keys and data to loguru/structlog and also to whatever monitoring system we choose.

0 replies

rec · 2023-09-14T15:59:52Z

rec
Sep 14, 2023
Author

One final part since I'm here!

Let's estimate the software parts needed to implement it! I should always do this, takes a few minutes.

log.Base: the class that all Logs inherit from
log.Entry: the base class for entries in a log
log.Counter: an Entry which is just a single int that increments
log.Enumerate: a generic Entry accepting a choice from an enumerated type
log.Data: a generic Entry accepting a dataclass or pydantic class
log.ElapsedTime: a content Entry that measures the elapsed time in a scope
log.Exception: reports an exception
log.Simple: a generic type that holds simple JSONable variables like str, int, float, bool, None and lists and dicts recursively of these...
log.Context: contains either loguru or our test proxy.
The test to show compatibility and to update on new backward compatible changes - test_log_permanence.py

Base, Entry, Enumerate, Data and test_log_permanence.py are the non-trivial parts. I can't envisage any obvious issues with any of them...

0 replies

fazlulkarimweb · 2023-10-30T13:55:26Z

fazlulkarimweb
Oct 30, 2023
Collaborator

Can we evaluate SigNoz as our logging platform? It's open source with 15K Stars. It has a community edition as well. Big companies like Comcast are using it. It's cheaper than most other alternatives. Most importantly, it's open source.

Here's the pricing details.

To get an idea regarding pricing, see here:

Here's the Python example

On the other hand, we can also utilize Loki. It has 21K stars on GitHub.

0 replies

On logging #644

rec Aug 8, 2023

Why?

Sources of logging

Possible features

Likely non-goals and non-features

Notes on features

Import/create:

Import-only

A good reference

Replies: 13 comments · 3 replies

thejumpman2323 Aug 8, 2023

nenb Aug 9, 2023

blythed Aug 9, 2023 Maintainer

rec Aug 9, 2023 Author

rec Aug 9, 2023 Author

blythed Aug 9, 2023 Maintainer

rec Aug 9, 2023 Author

rec Aug 9, 2023 Author

fnikolai Sep 13, 2023 Maintainer

rec Sep 14, 2023 Author

rec Sep 14, 2023 Author

rec Sep 14, 2023 Author

rec Sep 14, 2023 Author

1. How to we incorporate third-party and legacy text logs into this picture?

2. Who's actually writing the logs and where?

3. Hasn't someone already done this?

4. But I really really really want print statements/unstructured logs!

5. What about monitoring variables/metrics?

rec Sep 14, 2023 Author

rec Sep 14, 2023 Author

fazlulkarimweb Oct 30, 2023 Collaborator

rec
Aug 8, 2023

Replies: 13 comments 3 replies

thejumpman2323
Aug 8, 2023

nenb
Aug 9, 2023

blythed
Aug 9, 2023
Maintainer

rec Aug 9, 2023
Author

rec
Aug 9, 2023
Author

blythed Aug 9, 2023
Maintainer

rec Aug 9, 2023
Author

rec
Aug 9, 2023
Author

fnikolai
Sep 13, 2023
Maintainer

rec
Sep 14, 2023
Author

rec
Sep 14, 2023
Author

rec
Sep 14, 2023
Author

rec
Sep 14, 2023
Author

rec
Sep 14, 2023
Author

rec
Sep 14, 2023
Author

fazlulkarimweb
Oct 30, 2023
Collaborator