-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add node_info to relevant logs #4336
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a bunch of comments regarding some alternatives for the core type modeling of this. I think if we get that right, the rest will come easy.
core/dbt/events/functions.py
Outdated
@@ -178,6 +182,7 @@ def create_json_log_line(e: T_Event, msg_fn: Callable[[T_Event], str]) -> str: | |||
in e.__dataclass_fields__.items() # type: ignore[attr-defined] | |||
if type(y._field_type) == _FIELD_BASE | |||
} | |||
values['data']['type'] = 'dataclass_attr' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I don't think this is quite what we're looking for here. type
should communicate a schema i.e. what fields and types of values to expect, however this seems to communicate the lack of a schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really seeing what the schema here is since it is just any attributes that were passed in. I know we've defined them in the event classes but each one is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my guess is that "any attributes that were passed in" is a distributed type instead of a centralized type. The schema is written down, it's just written across the whole codebase not in one place. It exists even if it's annoying to pin down. I can help pin this down in the future. we may have to release with this kind of unstructured for now though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work pulling this together! I think we're heading in the right overall direction. I left some scattered thoughts, would be happy to discuss in more detail when I'm back online a bit later
core/dbt/events/base_types.py
Outdated
def get_node_info(self) -> Any: | ||
if self.report_node_data: | ||
return { | ||
"type": 'node_status', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not convinced we need this — I can't think of any events right now that should have node info, and aren't 'node_status'
in one way or another
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realized this is here for functional/plumbing reasons, not user-facing context
core/dbt/events/types.py
Outdated
@@ -2170,6 +2190,9 @@ def message(self) -> str: | |||
@dataclass | |||
class NodeFinished(DebugLevel, Cli, File): | |||
unique_id: str | |||
report_node_data: ParsedModelNode | |||
node_status: str = node_statuses['pass'] | |||
node_state: str = node_states['success'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it makes sense to include the full contracted RunResult
object in NodeFinished
as well — even if we don't use it to template out the CLI message or specific node_info
— that way, users can access its serialized representation from data
within structured logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're getting there! definitely some great improvements keeping these changes on track to use nominal typing. Let me know what you think of these suggestions and we'll keep tweaking from there.
core/dbt/events/base_types.py
Outdated
|
||
|
||
class NodeInfo(Event, metaclass=ABCMeta): | ||
report_node_data: Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm lets see if we can make this type annotation tighter than Any
. Looking at the call sites below, I think they're all subclasses of some parent Node type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for chatting over zoom! here's some comments.
core/dbt/events/base_types.py
Outdated
node_finished_at="TODO", | ||
node_status="TODO" # self.node_status # pull from contract | ||
) | ||
return vars(node_info) #TODO: can just return node_info after #4326 is merged in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes totally! I'll let you know when this PR is in.
core/dbt/events/base_types.py
Outdated
node_path=self.report_node_data.path, | ||
node_name=self.report_node_data.name, | ||
resource_type=self.report_node_data.resource_type, | ||
materialized=self.report_node_data.config.materialized, | ||
node_started_at="TODO", | ||
unique_id=self.report_node_data.unique_id, | ||
node_finished_at="TODO", | ||
node_status="TODO" # self.node_status # pull from contract | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the fields you're accessing on report_node_data
like path
, name
, resource_type
, config.materialized
etc. ARE the type that we've defined with the type hint Any
above. I suspect we can represent this with a union of node types instead.
200cfda
to
8c8bfdd
Compare
core/dbt/task/runnable.py
Outdated
@@ -206,17 +206,34 @@ def call_runner(self, runner): | |||
with RUNNING_STATE, uid_context: | |||
startctx = TimestampNamed('node_started_at') | |||
index = self.index_offset(runner.node_index) | |||
runner.node.config['started_at'] = datetime.utcnow().isoformat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any possibility that these could interfere with user-provided configs for started_at
+ finished_at
? If so, could we name these in such a way that collisions are extremely unlikely (e.g. dbt_internal__started_at
)?
f88f067
to
dffdaa3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing we can delete the file trash.py
. Added some questions.
del runner.node._event_status["dbt_internal__started_at"] | ||
del runner.node._event_status["dbt_internal__finished_at"] | ||
del runner.node._event_status["node_status"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is kind of awkward but I wouldn't be surprised if it's necessary. I think it's worth commenting though since it's odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeremy asked for this when we were putting these keys in the config, but we don't need it now that I changed it to put the keys in an event_status dict instead. So they names could be changed back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the deleting of all the keys that seems odd to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I was concerned that there might be situations where the nodes are reused and the status information is leftover from a previous run. I'm not sure if that can happen or not... Once we're really solid on what information we want to store there we could change it to a class and just reset it.
core/dbt/task/base.py
Outdated
@@ -286,6 +287,13 @@ def skip_result(self, node, message): | |||
def compile_and_execute(self, manifest, ctx): | |||
result = None | |||
with self.adapter.connection_for(self.node): | |||
ctx.node._event_status['node_status'] = str(RunningStatus.Compiling) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this string representation of this class name user visible? the conversion to user-visible strings should live in the events module if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the node_status can contain either RunningStatus or NodeStatus, so it's just more work to get that straight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is more work, but I don't want us to change a classname and break an app parsing our logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you're asking about here. When enums like NodeStatus are serialized, I think they're always strings.
core/dbt/events/functions.py
Outdated
@@ -178,6 +182,7 @@ def create_json_log_line(e: T_Event, msg_fn: Callable[[T_Event], str]) -> str: | |||
in e.__dataclass_fields__.items() # type: ignore[attr-defined] | |||
if type(y._field_type) == _FIELD_BASE | |||
} | |||
values['data']['type'] = 'dataclass_attr' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my guess is that "any attributes that were passed in" is a distributed type instead of a centralized type. The schema is written down, it's just written across the whole codebase not in one place. It exists even if it's annoying to pin down. I can help pin this down in the future. we may have to release with this kind of unstructured for now though.
core/dbt/events/functions.py
Outdated
'node_info': e.get_node_info() # TODO: update to just `node_info` and more calling this to | ||
# `event_to_serializable_dict` after #4326 gets merged in. | ||
# also remove refrence to node_info in Event class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
highighting this TODO since that PR is soon to be merged in.
def __post_serialize__(self, dct): | ||
if '_event_status' in dct: | ||
del dct['_event_status'] | ||
return dct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this and when is it called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gshank can you speak to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__post_serialize is a method that's called by the mashumaro serialization code. This removes the '_event_status' dictionary from the serialized form of the class (like manifest.json) because we should only need while running.
51d314b
to
61b8f67
Compare
07262e2
to
8ad4055
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with the intent that we'll do more cleanup after we cut the RC.
* WIP * fixed some merg issues * WIP * first pass with node_status logging * add node details to one more * another pass at node info * fixed failures * convert to classes * more tweaks to basic implementation * added in ststus, organized a bit * saving broken state * working state with lots of todos * formatting * add start/end tiemstamps * adding node_status logging to more events * adding node_status to more events * Add RunningStatus and set in node * Add NodeCompiling and NodeExecuting events, switch to _event_status dict * add _event_status to SourceDefinition * small tweaks to NodeInfo * fixed misnamed attr * small fix to validation * rename logging timestamps to minimize name collision * fixed flake failure * move str formatting to events * incorporate serialization changes * add node_status to event_to_serializable_dict * convert nodeInfo to dict with dataclass builtin * Try to fix failing unit, flake8, mypy tests (#4362) * fixed leftover merge conflict Co-authored-by: Gerda Shank <gerda@dbtlabs.com> automatic commit by git-black, original commits: caa6269
resolves #4324
Description
First pass - any feedback on approach welcome!
Checklist
CHANGELOG.md
and added information about my change