
Refactor data access manager for modular pipelines change #1939

Conversation

ravi-kumar-pilla
Contributor

@ravi-kumar-pilla ravi-kumar-pilla commented Jun 10, 2024

Description

Partially resolves #1899

Development notes

  • Modify the add_pipeline method to use populate_tree in the modular_pipelines repository to create a modular pipeline tree from the KedroPipeline (a rough sketch of the reworked flow follows this list)
  • Modify add_node_input and add_node_output to take modular_pipelines_repo_obj, which contains the pre-populated modular pipelines tree (earlier we constructed the tree at a later stage using the expand_tree method in package/kedro_viz/services/modular_pipelines.py)
  • Modify add_node to take modular_pipelines_repo_obj as input
  • Modify add_dataset to take modular_pipelines_repo_obj as input
  • Remove the expand_tree service call from the create_modular_pipelines_tree_for_registered_pipeline method, as the tree is already constructed
  • Update tests to use the edge case scenarios mentioned in:
    - Fix modular pipelines breaking when collapsed.  #1651
    - While nesting namespace pipelines, intermediary datasets get exposed to top level of the viz #1814
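
A rough sketch of the reworked flow referenced in the first bullet; the signatures here are simplified and partly hypothetical, so refer to the diff for the real ones:

    # Sketch only: simplified from this PR, names may differ in the final code.
    def add_pipeline(self, registered_pipeline_id, pipeline):
        modular_pipelines_repo_obj = self.modular_pipelines[registered_pipeline_id]
        # Build the whole modular pipeline tree up front from the Kedro pipeline,
        # instead of expanding it later via the expand_tree service.
        modular_pipelines_repo_obj.populate_tree(pipeline)
        for node in pipeline.nodes:
            # The pre-populated repository travels down the call stack so nodes,
            # inputs and outputs can look up their modular pipelines.
            task_node = self.add_node(
                registered_pipeline_id, node, modular_pipelines_repo_obj
            )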

QA notes

  • This PR is part of a bigger refactor (Refactor Namespace Pipelines #1897) of modular pipelines and was split out to ease the review process.
  • The CI build might fail as this PR is not self-sufficient

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

@ravi-kumar-pilla ravi-kumar-pilla mentioned this pull request Jun 10, 2024
@merelcht
Member

Can you explain why it is now necessary to pass the modular_pipelines_repo_obj of type ModularPipelinesRepository in some of the methods here? With the refactor it seems like the ModularPipelinesRepository has a different function than the other repository classes/objects.

@ravi-kumar-pilla
Contributor Author

Can you explain why it is now necessary to pass the modular_pipelines_repo_obj of type ModularPipelinesRepository in some of the methods here? With the refactor it seems like the ModularPipelinesRepository has a different function than the other repository classes/objects.

Hi @merelcht, every node in the flowchart model of Kedro-Viz has a two-way mapping: the modular pipeline repository contains a map, node_mod_pipeline_map, and the node contains a field, modular_pipelines. Earlier, we used the logic below to gather the modular_pipelines. With the refactor, we construct the modular pipelines ahead of creating the nodes and build the node_mod_pipeline_map dict, which already contains the modular pipelines each node belongs to. So we pass the modular_pipelines_repo_obj while creating the nodes.

    @field_validator("modular_pipelines")
    @classmethod
    def set_modular_pipelines(cls, _, info: ValidationInfo):
        return cls._expand_namespaces(info.data["kedro_obj"].namespace)
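
For contrast, a minimal sketch of the new lookup-based approach (simplified; the exact names in the PR may differ):

    # With the tree pre-populated up front, a node's modular pipelines become
    # a plain dict lookup instead of namespace expansion at validation time.
    modular_pipelines = modular_pipelines_repo_obj.node_mod_pipeline_map.get(
        node_id, set()
    )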

@merelcht
Member

merelcht commented Jun 20, 2024

Thanks @ravi-kumar-pilla this is clearer now. My main concern was whether the flow of data still adheres to the architecture (https://miro.com/app/board/uXjVOktl9TY=/ and https://user-images.githubusercontent.com/2032984/118201082-98dcfb80-b44e-11eb-8e63-d7cc7254d1d1.png) and as far as I can tell it does.

From what I understand: before the refactoring, the modular pipelines were populated as part of a call to the REST API. This would in turn call the expand_tree() method in services/modular_pipelines.py.

Now, instead, the modular pipelines are populated in the same way as the "regular" pipelines, the catalog, and the dataset stats: from inside populate_data(), which happens on the server side when the server is started. In fact, this happens at the same time as the regular pipelines are populated. The ModularPipelinesRepository object is then passed as an argument to subsequent methods in the call stack because it's used to determine the nodes + inputs + outputs.

@ravi-kumar-pilla
Contributor Author

Thanks @ravi-kumar-pilla this is clearer now. My main concern was whether the flow of data still adheres to the architecture (https://miro.com/app/board/uXjVOktl9TY=/ and https://user-images.githubusercontent.com/2032984/118201082-98dcfb80-b44e-11eb-8e63-d7cc7254d1d1.png) and as far as I can tell it does.

Yes @merelcht , the data flow is not changed.

From what I understand: before the refactoring, the modular pipelines were populated as part of a call to the REST API. This would in turn call the expand_tree() method in services/modular_pipelines.py.

Yes

Now, instead, the modular pipelines are populated in the same way as the "regular" pipelines, the catalog, and the dataset stats: from inside populate_data(), which happens on the server side when the server is started. In fact, this happens at the same time as the regular pipelines are populated. The ModularPipelinesRepository object is then passed as an argument to subsequent methods in the call stack because it's used to determine the nodes + inputs + outputs.

Yes, the ModularPipelinesRepository object is now passed to the create_node methods to assign the modular_pipelines property on the GraphNode.

@ravi-kumar-pilla ravi-kumar-pilla marked this pull request as ready for review June 24, 2024 15:01
free_inputs = pipeline.inputs()

for node in pipeline.nodes:
    task_node = self.add_node(registered_pipeline_id, node)
    self.registered_pipelines.add_node(registered_pipeline_id, task_node.id)
    task_node = self.add_node(
Contributor

Since there are two add_node functions, it does affect readability. Maybe we could add comments before the add_node functions explaining what happens?
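
For instance, a sketch of the kind of comments that could help (wording illustrative, not from the PR):

    # DataAccessManager.add_node: creates the TaskNode itself and registers it
    # against the modular pipelines tree via modular_pipelines_repo_obj.
    task_node = self.add_node(registered_pipeline_id, node, modular_pipelines_repo_obj)

    # RegisteredPipelinesRepository.add_node: only records that an existing
    # node id belongs to the given registered pipeline.
    self.registered_pipelines.add_node(registered_pipeline_id, task_node.id)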

)

# update the node_mod_pipeline_map
if dataset_id not in modular_pipelines_repo_obj.node_mod_pipeline_map:
Contributor

if dataset_id not in modular_pipelines_repo_obj.node_mod_pipeline_map:
    modular_pipelines_repo_obj.node_mod_pipeline_map[dataset_id] = {ROOT_MODULAR_PIPELINE_ID}
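
If preferred, the same guard can be written with the standard dict setdefault idiom (just a style alternative, not something this PR requires):

    modular_pipelines_repo_obj.node_mod_pipeline_map.setdefault(
        dataset_id, {ROOT_MODULAR_PIPELINE_ID}
    )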

Member

@merelcht merelcht left a comment

I think this is nearly there. My main suggestion is to make the asserts in the tests more explicit so instead of something like:

assert list(pipelines) == create_list(pipelines)

you do:

assert ["car_pipeline", "plane_pipeline"] == create_list(pipelines)

This explicitness makes the tests easier to read.

dataset_name=dataset_name,
layer=layer,
tags=set(),
parameters=obj,
# [TODO: Confirm if parameters are not part of any modular_pipeline]
Member

There's still a TODO here.

@@ -375,6 +471,348 @@ def test_add_pipelines(
]
)

def assert_expected_modular_pipeline_values_for_edge_cases(
Member

It's better to put this at the top of the test class so it can't be mistaken for a test.

Comment on lines 517 to 530
for registered_pipeline_id, _pipeline in edge_case_example_pipelines.items():
    # This is needed to add modular pipeline nodes to the GraphNodesRepository
    data_access_manager.create_modular_pipelines_tree_for_registered_pipeline(
        registered_pipeline_id
    )
    all_node_names |= {node.name for node in _pipeline.nodes}

assert sorted(
    {
        f"{n.namespace}.{n.name}" if n.namespace else n.name
        for n in data_access_manager.nodes.as_list()
        if n.type == GraphNodeType.TASK
    }
) == sorted(list(all_node_names))
Member

I find this test quite hard to follow because both sides of the assert are generated in the test. If you look at other tests in this file, e.g. https://github.com/kedro-org/kedro-viz/blob/main/package/tests/test_data_access/test_managers.py#L346-L354, you see that one side of the assert is generated whereas the other is very explicit, which makes it easier to reason about IMO.

@ravi-kumar-pilla
Contributor Author

I think this is nearly there. My main suggestion is to make the asserts in the tests more explicit so instead of something like:

assert list(pipelines) == create_list(pipelines)

you do:

assert ["car_pipeline", "plane_pipeline"] == create_list(pipelines)

This explicitness makes the tests easier to read.

Hi @merelcht,

I understand the edge case tests are not straightforward to read in the assert statements, like you pointed out here. I tried to add additional comments to improve readability in the new commit.

Since these tests draw the expected JSON from expected_tree_for_edge_cases, extracting the details from the JSON would mean a lot of hardcoded strings in the test file.

I find the current approach easier to manage in case there is a change in the modular pipeline tree. Please have a look at the changes in my new commit and let me know if you feel there could be a better way to handle this.

Thank you

# Testing for task node names in NodesRepository

# Extract the task node names that are present in edge_case_example_pipelines
all_task_node_names_from_edge_case_example_pipelines = set()
Contributor

Maybe we could create a fixture for all_task_node_names_from_edge_case_example_pipelines and just call that here.
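
A minimal sketch of such a fixture, assuming the fixture and pipeline names already used in this test file:

    import pytest

    @pytest.fixture
    def all_task_node_names_from_edge_case_example_pipelines(
        edge_case_example_pipelines,
    ):
        # Collect every task node name from the edge-case example pipelines so
        # individual tests can reuse the same expected set.
        return {
            node.name
            for pipeline in edge_case_example_pipelines.values()
            for node in pipeline.nodes
        }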

Contributor Author

This already comes from a fixture that reads the JSON file. As @merelcht pointed out, if you feel it is hard to follow, I can hardcode the required part of the JSON that is used here for every test.

@merelcht
Member

merelcht commented Jul 1, 2024

I understand the edge case tests are not straightforward to read in the assert statements, like you pointed out here. I tried to add additional comments to improve readability in the new commit.

Since these tests draw the expected JSON from expected_tree_for_edge_cases, extracting the details from the JSON would mean a lot of hardcoded strings in the test file.

I find the current approach easier to manage in case there is a change in the modular pipeline tree. Please have a look at the changes in my new commit and let me know if you feel there could be a better way to handle this.

I think this is a step in the right direction. Personally, I think hardcoding to enhance test readability is better, and it forces you to check that the code is still correct when you make changes, because you'll have to manually update your tests. However, since I'm not working on this codebase full time, I'll leave the final decision on how much further to improve the tests to you and the rest of the Viz team.

@ravi-kumar-pilla
Contributor Author

I think this is a step in the right direction. Personally, I think hardcoding to enhance test readability is better, and it forces you to check that the code is still correct when you make changes, because you'll have to manually update your tests. However, since I'm not working on this codebase full time, I'll leave the final decision on how much further to improve the tests to you and the rest of the Viz team.

Thank you @merelcht for taking the time to review the PR. Rashida and I had a discussion, and she suggested having one integration test for the edge cases that checks the entire API response instead of checking individual properties. I have made the changes in this commit.

@rashidakanchwala, I removed the individual tests here as we already test example_pipelines in test_add_pipelines. Please have a look. Thank you
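
For reference, a rough sketch of what such an integration test could look like; the client fixture and the response shape are assumptions, not taken from this PR:

    def test_edge_case_pipelines_full_api_response(
        client, expected_tree_for_edge_cases
    ):
        # Hit the main API endpoint once and compare the full modular pipelines
        # tree against the expected JSON, instead of asserting per property.
        response = client.get("/api/main")
        assert response.status_code == 200
        assert response.json()["modular_pipelines"] == expected_tree_for_edge_cases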
