Refactor data access repositories for modular pipeline change #1938
Conversation
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
I've done an initial review of this part, but want to have another look after reviewing the other parts. I've left comments mostly related to readability of the code.
```python
    def add_inputs(self, modular_pipeline_id: str, inputs: Set[str]) -> None:
        """
        Add input datasets to the modular pipeline.
```
In the code there's also a special case for parameters. I think it makes sense to mention that in the docstring.
Yes, there is a special case for parameters, but that happens while adding the datasets as children. It is covered in the docstrings for `add_children`, `_add_children_to_parent_pipeline` and `_add_datasets_as_children`. The method here is `add_inputs` to the modular pipeline (i.e. the free inputs hanging outside the modular pipeline).
I still think it's a good idea to describe the following code in the docstring:

```python
if is_dataset_param(_input):
    self.parameters.add(hashed_input)
```
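For example (a sketch of mine, not the actual diff), the docstring could call the special case out along these lines:

```python
from typing import Set


class ModularPipelinesRepository:
    def add_inputs(self, modular_pipeline_id: str, inputs: Set[str]) -> None:
        """Add input datasets to the modular pipeline.

        Inputs that are parameters (as detected by is_dataset_param)
        are tracked in self.parameters.
        """
```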
```python
            self.tree[modular_pipeline_id].internal_outputs.add(output_node.id)
        else:
            self.tree[modular_pipeline_id].external_outputs.add(output_node.id)

    def _hash_input_output(self, item: str) -> str:
```
Why does this need to take `self`? As far as I understand, this method just hashes an input/output, so it doesn't need an instance of `ModularPipelinesRepository`.
Yes. On that note, where should helper functions like this be placed? I used to put all helper functions in utils as public, but there are probably conventions I'm not aware of. Could you please let me know? Thanks. The options I see are:

- As a static method on a class (public/private)
- Outside the class but in the same file (public/private)
- Inside utils (public/private)
This Stack Overflow thread, specifically the first two answers, describes quite well the general rule of thumb when it comes to static methods, class methods and module functions: https://stackoverflow.com/questions/11788195/module-function-vs-staticmethod-vs-classmethod-vs-no-decorators-which-idiom-is

In the Kedro codebase we generally apply the following:

- Inside a class as a static method: any method that is only used by that class. The difference with a module function is that these methods are specific to the class and don't have any meaning outside of it (e.g. https://github.com/kedro-org/kedro/blob/main/kedro/io/data_catalog.py#L336-L338).
- Outside of a class but in the same file: any helper that is only used in that specific file should be a module function at the top of the file. These are only used in that file/class but don't require any "knowledge" about the class: no need to access a class instance, and the method has meaning outside the context of the class (e.g. https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L33-L56).
- Inside utils: a helper used in multiple files and not specific to a certain class.

When it comes to making things public/private, it's a bit different for Viz vs. Kedro, but in the framework anything that's public is part of the public API, so it can (and will) be used by outside users directly accessing Kedro as a library. That means, unless a user should have access to the method, we make it private. My personal opinion is that for Viz we should make things public when they're accessed in multiple classes/files.
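To make the three options concrete, here is a small illustrative sketch (all names invented, shown in one file for brevity):

```python
from typing import List


# "Inside utils": a shared helper used by multiple files and not tied
# to any class -- this would live in utils.py.
def normalise_name(name: str) -> str:
    return name.strip().lower()


# "Module function": only used in this file, needs no class state, and
# still makes sense outside the class -- defined at the top of the file.
def _split_namespace(namespace: str) -> List[str]:
    return namespace.split(".")


class SomeRepository:
    # "Static method": only meaningful in the context of this class.
    @staticmethod
    def _make_key(item: str) -> str:
        return f"repo::{item}"
```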
Thank you for the explanation Merel 💯
```python
        self.add_outputs(modular_pipeline_id, free_outputs)
        self.add_children(modular_pipeline_id, sub_pipeline.nodes)

    def _explode_namespace(self, nested_namespace: str) -> List[str]:
```
Why does this need to take `self`? This doesn't need an instance of `ModularPipelinesRepository`, so it can just be a static method, or it can be defined outside the `ModularPipelinesRepository` class altogether.
Yes, converted to static method. Thank you
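For reference, a minimal sketch of the static-method version; the exact expansion behaviour ("a.b.c" expanding to ["a", "a.b", "a.b.c"]) is my assumption based on the method's description in the PR notes:

```python
from typing import List


class ModularPipelinesRepository:
    @staticmethod
    def _explode_namespace(nested_namespace: str) -> List[str]:
        """Expand a nested namespace into all of its ancestor namespaces,
        e.g. "uk.data_science.model" ->
        ["uk", "uk.data_science", "uk.data_science.model"].
        """
        parts = nested_namespace.split(".")
        return [".".join(parts[: i + 1]) for i in range(len(parts))]
```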
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
```python
        kedro_node: Node,
        modular_pipeline_inputs_outputs: Set[str],
    ):
        """Helper to add datasets (not parameters) related to task nodes as children.
```
I was thinking we could add datasets that have no modular pipeline ID associated with them (and thus belong to the root modular pipeline) in this function instead of in managers.py.
From a correctness point of view, this looks good as far as I can tell. There are a few things that can be done to make the style of the code a bit more readable. I've added some comments here and there with pointers. Happy to discuss on a call if you have questions.
```python
sub_pipeline = pipeline.only_nodes_with_namespace(modular_pipeline_id)
rest_of_the_pipeline = pipeline - sub_pipeline

free_inputs = sub_pipeline.inputs()
free_outputs = sub_pipeline.outputs() | (
    rest_of_the_pipeline.inputs() & sub_pipeline.all_outputs()
)

self._add_inputs(modular_pipeline_id, free_inputs)
self._add_outputs(modular_pipeline_id, free_outputs)
self._add_children(modular_pipeline_id, sub_pipeline.nodes)
```
Probably worth explaining this code with a couple of comments, so we don't reintroduce the bugs we had before by accidentally modifying it in the future. In short, the logic is as follows:

- We extract all nodes within a certain namespace as a subpipeline
- We consider all free inputs to be the inputs of the subpipeline
- We consider all free outputs to be the outputs of the subpipeline
- We also add as outputs all intermediary outputs that are consumed by nodes external to the subpipeline

It could probably be explained in fewer, shorter sentences.
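One possible commented version of the quoted snippet (same code as above, with comments added per this suggestion):

```python
# Extract all nodes within the namespace as a subpipeline.
sub_pipeline = pipeline.only_nodes_with_namespace(modular_pipeline_id)
rest_of_the_pipeline = pipeline - sub_pipeline

# The subpipeline's free inputs are the modular pipeline's inputs.
free_inputs = sub_pipeline.inputs()

# Its outputs are the free outputs, plus any intermediary outputs
# consumed by nodes external to the subpipeline.
free_outputs = sub_pipeline.outputs() | (
    rest_of_the_pipeline.inputs() & sub_pipeline.all_outputs()
)

self._add_inputs(modular_pipeline_id, free_inputs)
self._add_outputs(modular_pipeline_id, free_outputs)
self._add_children(modular_pipeline_id, sub_pipeline.nodes)
```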
```python
def _hash(value: str):
    return hashlib.sha1(value.encode("UTF-8")).hexdigest()[:8]


def _hash_input_output(item: str) -> str:
    """Hash the input/output dataset."""
    return (
        _hash(_strip_transcoding(item))
        if TRANSCODING_SEPARATOR in item
        else _hash(item)
    )
```
I would put these in `utils` and import them here, rather than the transcoding logic. Hide as much as possible behind a smaller API footprint.
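A sketch of that move, assuming a `kedro_viz/utils.py` module (the module path and the exact homes of `TRANSCODING_SEPARATOR` and `_strip_transcoding` are assumptions on my part):

```python
# kedro_viz/utils.py (hypothetical location)
import hashlib

TRANSCODING_SEPARATOR = "@"  # assumed value of the separator


def _hash(value: str) -> str:
    return hashlib.sha1(value.encode("UTF-8")).hexdigest()[:8]


def _strip_transcoding(element: str) -> str:
    """Drop a transcoding suffix, e.g. 'df@spark' -> 'df' (simplified sketch)."""
    return element.split(TRANSCODING_SEPARATOR)[0]


def _hash_input_output(item: str) -> str:
    """Hash the input/output dataset, ignoring any transcoding suffix."""
    return (
        _hash(_strip_transcoding(item))
        if TRANSCODING_SEPARATOR in item
        else _hash(item)
    )
```

The repository module would then only need `from kedro_viz.utils import _hash, _hash_input_output`, keeping the transcoding details out of it.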
```python
            self.tree[modular_pipeline_id].internal_outputs.add(output_node.id)
        else:
            self.tree[modular_pipeline_id].external_outputs.add(output_node.id)

    def _add_children(self, modular_pipeline_id: str, kedro_nodes: List[Node]):
```
This whole function is overly complicated: a very wide `if` condition and too much nesting. Could you try to refactor it into something more concise and easier to follow?
```python
for kedro_node in kedro_nodes:
    # add kedro node as a child to the modular pipeline
    if kedro_node.namespace == modular_pipeline_id:
        kedro_node_id = _hash(str(kedro_node))
        modular_pipeline.children.add(
            ModularPipelineChild(id=kedro_node_id, type=GraphNodeType.TASK)
        )
```
Instead of nesting this with two extra levels, you could do something like:

```python
children = [n for n in kedro_nodes if n.namespace == modular_pipeline_id]
modular_pipeline.children = [
    ModularPipelineChild(id=_hash(str(n)), type=GraphNodeType.TASK)
    for n in children
]
```

Much more readable and explainable:

- We get all the nodes matching exactly the pipeline namespace
- We add them as children to the pipeline (this could be `.extend` instead of `=` if there are other places that add to `children` before this one)

Two actions, two lines of code...
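One caveat with the sketch above: in the quoted code `children` is populated via `.add()`, which suggests it is a set, so the set equivalents would be `modular_pipeline.children.update(...)` or a set comprehension rather than a list assignment.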
```python
# The line `parent_modular_pipeline_id = ('.'.join(modular_pipeline_id.split('.')[:-1])
# if '.' in modular_pipeline_id else None)` is extracting the parent modular pipeline ID
# from the given modular pipeline ID.
parent_modular_pipeline_id = (
    ".".join(modular_pipeline_id.split(".")[:-1])
    if "." in modular_pipeline_id
    else None
)

# Add the node's registered pipelines to the modular pipeline's registered pipelines.
# Basically this means if the node belongs to the "__default__" pipeline, for example,
# so does the modular pipeline.
modular_pipeline.pipelines.update(node.pipelines)
if parent_modular_pipeline_id:
    parent_modular_pipeline = self.get_or_create_modular_pipeline(
        parent_modular_pipeline_id
    )
    parent_modular_pipeline.pipelines.update(modular_pipeline.pipelines)
    parent_modular_pipeline.tags.update(modular_pipeline.tags)

modular_pipeline.tags.update(node.tags)
# add current modular pipeline and input/output datasets
# of a modular pipeline as a child to the parent modular
# pipeline based on the rules
self._add_children_to_parent_pipeline(
    parent_modular_pipeline,
    modular_pipeline_id,
    modular_pipeline_inputs_outputs,
)
```
This is very unintuitive: the method is called `_add_children`, but in it we have the totally unrelated code for adding the modular pipeline itself to its parent's children. Could this be factored out as a separate method that is called right after `_add_children` instead?
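Based only on the quoted hunk above, the factored-out method might look roughly like this; the name `_add_pipeline_to_parent` and the exact split of responsibilities are my assumptions:

```python
def _add_pipeline_to_parent(
    self,
    modular_pipeline_id: str,
    modular_pipeline_inputs_outputs: Set[str],
) -> None:
    """Wire a modular pipeline into its parent pipeline, if it has one."""
    parent_id = (
        ".".join(modular_pipeline_id.split(".")[:-1])
        if "." in modular_pipeline_id
        else None
    )
    if not parent_id:
        return
    parent = self.get_or_create_modular_pipeline(parent_id)
    modular_pipeline = self.get_or_create_modular_pipeline(modular_pipeline_id)
    parent.pipelines.update(modular_pipeline.pipelines)
    parent.tags.update(modular_pipeline.tags)
    self._add_children_to_parent_pipeline(
        parent, modular_pipeline_id, modular_pipeline_inputs_outputs
    )
```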
```python
if dataset not in self.node_mod_pipeline_map:
    self.node_mod_pipeline_map[dataset] = set()
self.node_mod_pipeline_map[dataset].add(modular_pipeline_id)
```
`defaultdict` to the rescue here...
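A minimal sketch of the `defaultdict` version (the `__init__` placement and the `_register` helper are illustrative):

```python
from collections import defaultdict
from typing import DefaultDict, Set


class ModularPipelinesRepository:
    def __init__(self) -> None:
        # Missing keys start out as an empty set automatically.
        self.node_mod_pipeline_map: DefaultDict[str, Set[str]] = defaultdict(set)

    def _register(self, dataset: str, modular_pipeline_id: str) -> None:
        # No membership check or manual set() creation needed.
        self.node_mod_pipeline_map[dataset].add(modular_pipeline_id)
```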
```python
if io_id not in self.node_mod_pipeline_map:
    self.node_mod_pipeline_map[io_id] = set()
self.node_mod_pipeline_map[io_id].add(modular_pipeline.id)
```
`defaultdict`...
```python
if (
    io_id not in modular_pipeline_inputs_outputs
    and io_id not in self.parameters
):
```
This filter could have just been added to the previous `for`-comprehension, or done as a separate one on its own; then we don't need the extra nesting.
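For illustration, folding the filter into a comprehension might look like this (the surrounding comprehension over `hashed_io_ids` is assumed):

```python
io_ids = {
    io_id
    for io_id in hashed_io_ids
    if io_id not in modular_pipeline_inputs_outputs
    and io_id not in self.parameters
}
```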
""" | ||
# Determine the parent modular pipeline ID | ||
parent_modular_pipeline_id = ( |
I'd simply call this `parent_id`. Variables have context, and in this function the only thing that can be a parent is a modular pipeline, so that's implied and doesn't need to be part of the variable name.
```python
    modular_pipeline_inputs_outputs,
)
modular_pipeline.tags.update(node.tags)

hashed_io_ids = {
```
I would name this just `io_ids`.
Left some final minor comments, but nothing blocking!
```python
    def add_inputs(self, modular_pipeline_id: str, inputs: Set[str]) -> None:
        """
        Add input datasets to the modular pipeline.
```
I still think it's a good idea to describe the following code in the docstring:

```python
if is_dataset_param(_input):
    self.parameters.add(hashed_input)
```
```python
        modular_pipeline_inputs_outputs: Set[str],
    ):
        """
        Helper to add modular_pipeline children correctly to parent modular pipelines
```
Suggested change:

```diff
- Helper to add modular_pipeline children correctly to parent modular pipelines
+ Helper to add modular pipeline children correctly to parent modular pipelines
```
Description

Partially resolves #1899

Development notes

- `populate_tree` method which takes care of inputs/outputs and children for all the modular pipelines within a registered pipeline
- `explode_namespace` method which takes care of expanding a nested namespace
- `Set[str]` of inputs and outputs that are calculated in the `populate_tree` method
- `add_children` method which resolves the modular pipeline child nodes
- `_add_children_to_parent_pipeline` method which resolves adding children to the parent modular pipeline in case of nesting
- `_add_datasets_as_children` method which adds datasets as children to the modular pipeline

QA notes

Checklist

- Added a description of this change in the `RELEASE.md` file