Refactor data access manager for modular pipelines change #1939
Conversation
Signed-off-by: ravi-kumar-pilla <ravi_kumar_pilla@mckinsey.com>
Can you explain why it is now necessary to pass the …?
Hi @merelcht, every node in the flowchart model of Kedro-Viz has a two-way mapping, i.e., the modular pipeline repository contains a map:

```python
@field_validator("modular_pipelines")
@classmethod
def set_modular_pipelines(cls, _, info: ValidationInfo):
    return cls._expand_namespaces(info.data["kedro_obj"].namespace)
```
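For illustration, a standalone sketch of what that namespace expansion does (the real `_expand_namespaces` lives on the repository class in Kedro-Viz; this reimplementation is an assumption based on the behavior described here):

```python
def expand_namespaces(namespace):
    """Expand a dotted namespace into all of its ancestor modular pipeline ids,
    e.g. "a.b.c" -> ["a", "a.b", "a.b.c"]."""
    if not namespace:
        return []
    parts = namespace.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


print(expand_namespaces("data_science.modelling"))
# ['data_science', 'data_science.modelling']
print(expand_namespaces(None))
# []
```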
Thanks @ravi-kumar-pilla, this is clearer now. My main concern was whether the flow of data still adheres to the architecture (https://miro.com/app/board/uXjVOktl9TY=/ and https://user-images.githubusercontent.com/2032984/118201082-98dcfb80-b44e-11eb-8e63-d7cc7254d1d1.png), and as far as I can tell it does. From what I understand: before the refactoring, the modular pipelines were populated as part of a call to the REST API. This would in turn call the … Now instead, the modular pipelines are populated in the same way as the "regular" pipelines, the catalog, and the dataset stats, from inside …
Yes @merelcht, the data flow is not changed.

Yes.

Yes, the …
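For context, a minimal sketch of the two-way mapping described above (the attribute name `node_mod_pipeline_map` appears later in this PR's diff; the class shape and the `mod_pipeline_node_map` name are assumptions for illustration):

```python
from collections import defaultdict


class ModularPipelinesRepository:
    """Illustrative sketch: keeps a two-way node <-> modular-pipeline mapping."""

    def __init__(self):
        # modular pipeline id -> ids of the nodes it contains
        self.mod_pipeline_node_map = defaultdict(set)
        # node id -> ids of the modular pipelines it belongs to
        self.node_mod_pipeline_map = defaultdict(set)

    def link(self, node_id: str, mod_pipeline_id: str) -> None:
        # Keep both directions of the mapping in sync
        self.mod_pipeline_node_map[mod_pipeline_id].add(node_id)
        self.node_mod_pipeline_map[node_id].add(mod_pipeline_id)


repo = ModularPipelinesRepository()
repo.link("model_input_table", "data_processing")
print(repo.node_mod_pipeline_map["model_input_table"])
# {'data_processing'}
```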
```python
free_inputs = pipeline.inputs()

for node in pipeline.nodes:
    task_node = self.add_node(registered_pipeline_id, node)
    self.registered_pipelines.add_node(registered_pipeline_id, task_node.id)
```
Since there are two `add_node` functions, it does affect readability. Maybe we could add comments before the `add_node` functions explaining what happens?
```python
# update the node_mod_pipeline_map
if dataset_id not in modular_pipelines_repo_obj.node_mod_pipeline_map:
```
```python
if dataset_id not in modular_pipelines_repo_obj.node_mod_pipeline_map:
    modular_pipelines_repo_obj.node_mod_pipeline_map[dataset_id] = {ROOT_MODULAR_PIPELINE_ID}
```
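The same fallback can also be written with `dict.setdefault`; a small sketch assuming `node_mod_pipeline_map` is a plain dict and `ROOT_MODULAR_PIPELINE_ID` is the root sentinel (its value here is an assumption):

```python
ROOT_MODULAR_PIPELINE_ID = "__root__"  # assumed sentinel value

node_mod_pipeline_map = {"trained_model": {"data_science"}}

# A dataset not claimed by any modular pipeline falls back to the root
node_mod_pipeline_map.setdefault("companies", {ROOT_MODULAR_PIPELINE_ID})
# An existing entry is left untouched
node_mod_pipeline_map.setdefault("trained_model", {ROOT_MODULAR_PIPELINE_ID})

print(node_mod_pipeline_map["companies"])
# {'__root__'}
print(node_mod_pipeline_map["trained_model"])
# {'data_science'}
```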
I think this is nearly there. My main suggestion is to make the asserts in the tests more explicit, so instead of something like:

```python
assert list(pipelines) == create_list(pipelines)
```

you do:

```python
assert ["car_pipeline", "plane_pipeline"] == create_list(pipelines)
```

This explicitness makes the tests easier to read.
```python
dataset_name=dataset_name,
layer=layer,
tags=set(),
parameters=obj,
# [TODO: Confirm if parameters are not part of any modular_pipeline]
```
There's still a TODO here.
```diff
@@ -375,6 +471,348 @@ def test_add_pipelines(
         ]
     )

 def assert_expected_modular_pipeline_values_for_edge_cases(
```
It's better to put this at the top of the test class so it can't be mistaken for a test.
```python
for registered_pipeline_id, _pipeline in edge_case_example_pipelines.items():
    # This is needed to add modular pipeline nodes to the GraphNodesRepository
    data_access_manager.create_modular_pipelines_tree_for_registered_pipeline(
        registered_pipeline_id
    )
    all_node_names |= {node.name for node in _pipeline.nodes}

assert sorted(
    {
        f"{n.namespace}.{n.name}" if n.namespace else n.name
        for n in data_access_manager.nodes.as_list()
        if n.type == GraphNodeType.TASK
    }
) == sorted(list(all_node_names))
```
I find this test quite hard to follow because both sides of the assert are generated in the test. If you look at other tests in this file e.g. https://github.com/kedro-org/kedro-viz/blob/main/package/tests/test_data_access/test_managers.py#L346-L354 you see that one side of the assert is generated whereas the other is very explicit, which makes it easier to reason about IMO.
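A sketch of that suggested style, with hypothetical node names standing in for the explicit side:

```python
# Stand-in for the generated side of the assert (hypothetical names)
generated_task_node_names = {"split_data", "uk.data_science.train_model"}

# One side is explicit and hardcoded, so the expectation is visible in the test itself
assert sorted(generated_task_node_names) == [
    "split_data",
    "uk.data_science.train_model",
]
```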
Hi @merelcht, I understand the edge case tests are not straightforward to read in the assert statements, as you pointed out here. I tried to add additional comments to improve readability in the new commit. Since these tests draw the expected JSON from the `expected_tree_for_edge_cases`, it would be a lot of hardcoding of strings in the test file if I extracted the details from the JSON. I find this easier to manage in case there is a change in the modular pipeline tree. Please have a look at the changes in my new commit and let me know if you feel there could be a better way to handle this. Thank you!
```python
# Testing for task node names in NodesRepository

# Extract the task node names that are present in edge_case_example_pipelines
all_task_node_names_from_edge_case_example_pipelines = set()
```
Maybe we create a fixture for `all_task_node_names_from_edge_case_example_pipelines` and just call that here.
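One way to sketch that fixture is to extract the collection logic into a helper and wrap it with `@pytest.fixture` in the test module. The helper below and the tiny stand-in objects are illustrative assumptions, not the actual Kedro-Viz test code:

```python
from types import SimpleNamespace


def collect_task_node_names(pipelines):
    """Collect every task node name from a {registered_pipeline_id: pipeline} mapping
    (what the suggested fixture would return)."""
    names = set()
    for _pipeline in pipelines.values():
        names |= {node.name for node in _pipeline.nodes}
    return names


# In the test module this could back a fixture, e.g.:
#
#   @pytest.fixture
#   def all_task_node_names_from_edge_case_example_pipelines(edge_case_example_pipelines):
#       return collect_task_node_names(edge_case_example_pipelines)

# Minimal stand-ins for Kedro pipelines, just to demonstrate the helper
pipelines = {
    "pipe_a": SimpleNamespace(nodes=[SimpleNamespace(name="split"), SimpleNamespace(name="train")]),
    "pipe_b": SimpleNamespace(nodes=[SimpleNamespace(name="train")]),
}
print(sorted(collect_task_node_names(pipelines)))
# ['split', 'train']
```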
I think this is a step in the right direction. Personally, I think hardcoding to enhance test readability is better and forces you to check that the code is still correct when you make changes, because you'll have to manually update your tests. However, since I'm not working on this codebase full time, I leave the final decision on how much further to improve the tests up to you and the rest of the Viz team.
Thank you @merelcht for taking the time to review the PR. Rashida and I had a discussion and she suggested having one integration test for the edge cases to check the entire API response instead of checking individual properties. I have made the changes in this commit. @rashidakanchwala, I removed the individual tests here as we already test for …
Description

Partially resolves #1899

Development notes

- Updated the `add_pipeline` method to use `populate_tree` in the modular_pipelines repository to create a modular pipeline tree using the Kedro `Pipeline`
- Updated `add_node_input` and `add_node_output` to take the `modular_pipelines_repo_obj`, which contains the pre-populated modular pipelines tree (earlier we used to construct the tree at a later stage using the `expand_tree` method of `package/kedro_viz/services/modular_pipelines.py`)
- Updated `add_node` to take `modular_pipelines_repo_obj` as input
- Updated `add_dataset` to take `modular_pipelines_repo_obj` as input
- Removed the `expand_tree` service call from the `create_modular_pipelines_tree_for_registered_pipeline` method, as the tree is already constructed

Related issues:

- Fix modular pipelines breaking when collapsed. #1651
- While nesting namespace pipelines, intermediary datasets get exposed to top level of the viz #1814

QA notes
Checklist

- Updated the `RELEASE.md` file