-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial structured logging work with fire_event
#4137
Initial structured logging work with fire_event
#4137
Conversation
13c8c41
to
8f1a2d4
Compare
8f1a2d4
to
8655220
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really appreciate the readme and annotations. I'd be curious to hear if others have structural comments or suggestions. For my part, I think I have a decent sense of how we plan to use this, and how we can extend it beyond stdout/file logging.
Am I right to think that our next steps are:
- going through all the places in dbt where
logger
is called - adding typed events to
events/types
- replacing those
logger
calls withfire_event
for the new typed event
At the same time, we need to replicate the functionality currently available in logger
(log levels, JSON formatting, secret scrubbing, etc) with our new event-based system.
I'm especially interested in finding ways to safely split up and parallelize all this work, knowing that we've got precious time left ahead of planned release for v1.0.0-rc1
# (i.e. - mutating the event history, printing to stdout, logging | ||
# to files, etc.) | ||
def fire_event(e: Event) -> None: | ||
EVENT_HISTORY.append(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumb question, because I don't really understand how computers work: Do we need to flush/cycle EVENT_HISTORY
at some point? A dbt invocations can have hundreds of thousands of debug-level log-lines, is there a risk that this will grow to use a substantial amount of memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe! This is just a horribly naive approach to in-memory history. I figure we'll cross that bridge when we need to. There are a few tactics we could use to make this more robust, but I'm not totally sure which one to go with yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's another reason not to keep methods (i.e. - cli_msg
) on these datatypes. I'm pretty sure methods make the memory footprint bigger for objects in Python land. (Would need to double check this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did some rudimentary testing here, and I really can't get methods to increase the memory footprint of objects. So I'm going to try swapping this around to the OO way of things and see if I can get a similar level of safety there too.
@@ -0,0 +1,61 @@ | |||
|
|||
import dbt.logger as logger # type: ignore # TODO eventually remove dependency on this logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is to eventually replace with structlog
, when the output is CLI/file logging, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. This is just to show that the new structure works when you run dbt without introducing too much nonsense all at once. It should be a relatively easy swap out in a future PR.
That's exactly correct. |
Did some exploration on whether we should take the fp or oo approach to computing messages from event types. The goal here is to know at development time whether we have log messages defined for each event type or not without the diligence to write a test every time we add a log statement. The scenarios we want to avoid are: a release where some messages are not written, or a release that raises exceptions when an obscure log line is triggered. The FP WayThe way this PR presents a solution to this is to use Union types and type-level branching (i.e. The OO WayWe could instead use an abstract base class (ABC) to define the methods we wish to inherit with code like this: class CliEventABC(metaclass=ABCMeta):
@abstractmethod
def cli_msg(self) -> str:
raise Exception("cli_msg not implemented for event")
@dataclass(frozen=True)
class ParsingStart():
pass
CliEventABC.register(ParsingStart) However, this would require we drop support for v 3.6 because custom Additionally, I did some tinkering to figure out if adding methods to these classes increases their memory footprint (see this thread with @jtcohen6 about in-memory event history), but I keep coming up with local results that show that they don't. I don't know enough about Python's runtime model to know how this is possible, but I can't seem to blow out the memory footprint with methods alone. |
core/dbt/events/types.py
Outdated
] | ||
|
||
# top-level event type for all events that go to the CLI | ||
CliEvent = Union[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do you see CliEvent
and Event
differing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect there to be different destinations so Event
would encompass all events, but CliEvent
is only the events that go to the cli as opposed to an event stream, log file, or any other destination we want. This way we can keep the cli super crisp, while putting more details in the log file even at the same log level. That's because each event can have more than one destination, and would just have one message computed for each destination.
I just pushed the change to OO-style python. After pairing with @iknox-fa, it's very likely that in order to match the correctness guarantees of the FP-style code we will have to manually maintain a list of dummy class instantiations until mypy is run on every file. This is because mypy does not check that the concrete classes implement abstract methods unless the class itself is instantiated. If it's not, mypy considers the concrete class definition dead code and does not check it. That being said, this code is way nicer so I think we should go with this option since we plan on turning mypy on everywhere ala #3203 and #4089 |
So the tests pass even though I'm using dataclasses from python 3.7. I would expect 3.6 tests to fail, but they don't. Because I'm planning on making many many event types, I would really like to use dataclasses so we don't have to live in boilerplate city, but if we really need to I can deal with the bloat. Thoughts? (especially @jtcohen6 re: dropping support for 3.6 for ease of development here) |
# types to represent log levels | ||
|
||
# in preparation for #3977 | ||
class TestLevel(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Idea for future iterations on this: Implement log level number as well since it's a defined thing in python. This would let us more easily tie in with python tools that utilize log levels (I imagine structlog has some support for it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know! Yeah I imagine we can work these in once we hook structlog up.
raise Exception("cli_msg not implemented for cli event") | ||
|
||
|
||
class ParsingStart(InfoLevel, CliEventABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor detail: can we ditch the periods at the end of these returned strings? They aren't sentences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied the messages exactly the way they are printed today. I completely agree, however I think I want to do user-facing message improvements in their own PR so product can give them all a go. I might be being a bit too structured for something as silly as these periods though.
# we need to skirt around that by computing something it doesn't check statically. | ||
# | ||
# TODO remove these lines once we run mypy everywhere. | ||
if 1 == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit, I kinda love this hack...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one minor thing that's not going to bother anyone but me. :)
add event type modeling and fire_event calls
add event type modeling and fire_event calls
add event type modeling and fire_event calls
add event type modeling and fire_event calls
add event type modeling and fire_event calls automatic commit by git-black, original commits: f9ef9da
Description
This PR into the feature branch adds the first real bit of structured logging. The description of the module layout is in the README file. This is the best PR to ask for large structural changes to the general approach if there are any concerns with this way of doing things.
Checklist
CHANGELOG.md
and added information about my change