Refactor data access manager for modular pipelines change #1939
Conversation
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Can you explain why it is now necessary to pass the …?
Hi @merelcht, every node in the flowchart model of Kedro-Viz has a two-way mapping, i.e., the modular pipeline repository contains a map:

```python
@field_validator("modular_pipelines")
@classmethod
def set_modular_pipelines(cls, _, info: ValidationInfo):
    return cls._expand_namespaces(info.data["kedro_obj"].namespace)
```
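For illustration, a standalone sketch of what that namespace expansion does (the real `_expand_namespaces` lives on the repository class in Kedro-Viz; this reimplementation is an assumption based on the behavior described here):

```python
def expand_namespaces(namespace):
    """Expand a dotted namespace into all of its ancestor modular pipeline ids,
    e.g. "a.b.c" -> ["a", "a.b", "a.b.c"]."""
    if not namespace:
        return []
    parts = namespace.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


print(expand_namespaces("data_science.modelling"))
# ['data_science', 'data_science.modelling']
print(expand_namespaces(None))
# []
```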
Thanks @ravi-kumar-pilla, this is clearer now. My main concern was whether the flow of data still adheres to the architecture (https://miro.com/app/board/uXjVOktl9TY=/ and https://user-images.githubusercontent.com/2032984/118201082-98dcfb80-b44e-11eb-8e63-d7cc7254d1d1.png), and as far as I can tell it does. From what I understand: before the refactoring, the modular pipelines were populated as part of a call to the REST API. This would in turn call the … Now instead, the modular pipelines are populated in the same way as the "regular" pipelines, the catalog, and the dataset stats, from inside …
Yes @merelcht, the data flow is not changed.

Yes.

Yes, the …
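For context, a minimal sketch of the two-way mapping described above (the attribute name `node_mod_pipeline_map` appears later in this PR's diff; the class shape and the `mod_pipeline_node_map` name are assumptions for illustration):

```python
from collections import defaultdict


class ModularPipelinesRepository:
    """Illustrative sketch: keeps a two-way node <-> modular-pipeline mapping."""

    def __init__(self):
        # modular pipeline id -> ids of the nodes it contains
        self.mod_pipeline_node_map = defaultdict(set)
        # node id -> ids of the modular pipelines it belongs to
        self.node_mod_pipeline_map = defaultdict(set)

    def link(self, node_id: str, mod_pipeline_id: str) -> None:
        # Keep both directions of the mapping in sync
        self.mod_pipeline_node_map[mod_pipeline_id].add(node_id)
        self.node_mod_pipeline_map[node_id].add(mod_pipeline_id)


repo = ModularPipelinesRepository()
repo.link("model_input_table", "data_processing")
print(repo.node_mod_pipeline_map["model_input_table"])
# {'data_processing'}
```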
```python
free_inputs = pipeline.inputs()

for node in pipeline.nodes:
    task_node = self.add_node(registered_pipeline_id, node)
    self.registered_pipelines.add_node(registered_pipeline_id, task_node.id)
```
Since there are two `add_node` functions, it does affect readability. Maybe we could add comments before the `add_node` functions explaining what happens?
```python
# update the node_mod_pipeline_map
if dataset_id not in modular_pipelines_repo_obj.node_mod_pipeline_map:
```
```python
if dataset_id not in modular_pipelines_repo_obj.node_mod_pipeline_map:
    modular_pipelines_repo_obj.node_mod_pipeline_map[dataset_id] = {ROOT_MODULAR_PIPELINE_ID}
```
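The same fallback can also be written with `dict.setdefault`; a small sketch assuming `node_mod_pipeline_map` is a plain dict and `ROOT_MODULAR_PIPELINE_ID` is the root sentinel (its value here is an assumption):

```python
ROOT_MODULAR_PIPELINE_ID = "__root__"  # assumed sentinel value

node_mod_pipeline_map = {"trained_model": {"data_science"}}

# A dataset not claimed by any modular pipeline falls back to the root
node_mod_pipeline_map.setdefault("companies", {ROOT_MODULAR_PIPELINE_ID})
# An existing entry is left untouched
node_mod_pipeline_map.setdefault("trained_model", {ROOT_MODULAR_PIPELINE_ID})

print(node_mod_pipeline_map["companies"])
# {'__root__'}
print(node_mod_pipeline_map["trained_model"])
# {'data_science'}
```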
I think this is nearly there. My main suggestion is to make the asserts in the tests more explicit, so instead of something like:

```python
assert list(pipelines) == create_list(pipelines)
```

you do:

```python
assert ["car_pipeline", "plane_pipeline"] == create_list(pipelines)
```

This explicitness makes the tests easier to read.
```python
dataset_name=dataset_name,
layer=layer,
tags=set(),
parameters=obj,
# [TODO: Confirm if parameters are not part of any modular_pipeline]
```
There's still a TODO here.
```diff
@@ -375,6 +471,348 @@ def test_add_pipelines(
         ]
     )

 def assert_expected_modular_pipeline_values_for_edge_cases(
```
It's better to put this at the top of the test class so it can't be mistaken for a test.
```python
for registered_pipeline_id, _pipeline in edge_case_example_pipelines.items():
    # This is needed to add modular pipeline nodes to the GraphNodesRepository
    data_access_manager.create_modular_pipelines_tree_for_registered_pipeline(
        registered_pipeline_id
    )
    all_node_names |= {node.name for node in _pipeline.nodes}

assert sorted(
    {
        f"{n.namespace}.{n.name}" if n.namespace else n.name
        for n in data_access_manager.nodes.as_list()
        if n.type == GraphNodeType.TASK
    }
) == sorted(list(all_node_names))
```
I find this test quite hard to follow because both sides of the assert are generated in the test. If you look at other tests in this file e.g. https://github.com/kedro-org/kedro-viz/blob/main/package/tests/test_data_access/test_managers.py#L346-L354 you see that one side of the assert is generated whereas the other is very explicit, which makes it easier to reason about IMO.
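A sketch of that suggested style, with hypothetical node names standing in for the explicit side:

```python
# Stand-in for the generated side of the assert (hypothetical names)
generated_task_node_names = {"split_data", "uk.data_science.train_model"}

# One side is explicit and hardcoded, so the expectation is visible in the test itself
assert sorted(generated_task_node_names) == [
    "split_data",
    "uk.data_science.train_model",
]
```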
Hi @merelcht, I understand the edge case tests are not straightforward to read in the assert statements, as you pointed out here. I tried to add additional comments to improve readability in the new commit. Since these tests draw the expected JSON from the `expected_tree_for_edge_cases`, it would be a lot of hardcoding of strings in the test file if I extracted the details from the JSON. I find this easier to manage in case there is a change in the modular pipeline tree. Please have a look at the changes in my new commit and let me know if you feel there could be a better way to handle this. Thank you!
```python
# Testing for task node names in NodesRepository

# Extract the task node names that are present in edge_case_example_pipelines
all_task_node_names_from_edge_case_example_pipelines = set()
```
Maybe we create a fixture for `all_task_node_names_from_edge_case_example_pipelines` and just call that here.
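One way to sketch that fixture is to extract the collection logic into a helper and wrap it with `@pytest.fixture` in the test module. The helper below and the tiny stand-in objects are illustrative assumptions, not the actual Kedro-Viz test code:

```python
from types import SimpleNamespace


def collect_task_node_names(pipelines):
    """Collect every task node name from a {registered_pipeline_id: pipeline} mapping
    (what the suggested fixture would return)."""
    names = set()
    for _pipeline in pipelines.values():
        names |= {node.name for node in _pipeline.nodes}
    return names


# In the test module this could back a fixture, e.g.:
#
#   @pytest.fixture
#   def all_task_node_names_from_edge_case_example_pipelines(edge_case_example_pipelines):
#       return collect_task_node_names(edge_case_example_pipelines)

# Minimal stand-ins for Kedro pipelines, just to demonstrate the helper
pipelines = {
    "pipe_a": SimpleNamespace(nodes=[SimpleNamespace(name="split"), SimpleNamespace(name="train")]),
    "pipe_b": SimpleNamespace(nodes=[SimpleNamespace(name="train")]),
}
print(sorted(collect_task_node_names(pipelines)))
# ['split', 'train']
```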
I think this is a step in the right direction. Personally, I think hardcoding to enhance test readability is better and forces you to check that the code is still correct when you make changes, because you'll have to manually update your tests. However, since I'm not working on this codebase full time, I leave the final decision on how much further to improve the tests up to you and the rest of the Viz team.
Thank you @merelcht for taking the time to review the PR. Rashida and I had a discussion and she suggested having one integration test for the edge cases to check the entire API response instead of checking individual properties. I have made the changes in this commit. @rashidakanchwala, I removed the individual tests here as we already test for …
Description

Partially resolves #1899

Development notes

- Updated the `add_pipeline` method to use `populate_tree` in the modular_pipelines repository to create a modular pipeline tree using the Kedro `Pipeline`
- Updated `add_node_input` and `add_node_output` to take the `modular_pipelines_repo_obj`, which contains the pre-populated modular pipelines tree (earlier we used to construct the tree at a later stage using the `expand_tree` method of `package/kedro_viz/services/modular_pipelines.py`)
- Updated `add_node` to take `modular_pipelines_repo_obj` as input
- Updated `add_dataset` to take `modular_pipelines_repo_obj` as input
- Removed the `expand_tree` service call from the `create_modular_pipelines_tree_for_registered_pipeline` method, as the tree is already constructed

Related issues:

- Fix modular pipelines breaking when collapsed. #1651
- While nesting namespace pipelines, intermediary datasets get exposed to top level of the viz #1814

QA notes
Checklist

- Updated the `RELEASE.md` file