Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent node execution order by sorting node with Sequentialrunner #1604

Merged

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Jun 9, 2022

Signed-off-by: Nok Chan nok.lam.chan@quantumblack.com

Description

#1350

Development notes

The trickiest part of the PR is actually the test. It's not straightforward to reproduce the random behaviour. If you just create 10 pipelines in the same python session. It is actually deterministic, which is different from doing 10 kedro run.

After some investigation, I believe this is the whole story

  1. Set is not "random", but "arbitrary ordered", at least for CPython
  2. Node has a custom __hash__ function, which actually dictated how set order them.
  3. In the same python session, the hash will be identical, even if they are different instances because how the __hash__ is implemented. Therefore when
  def __hash__(self):
      return hash(self._unique_key)

The way to test this properly is likely just to do some mocking (I still need to think about how to do this)

_topologically_sorted is the main function that gets changed, it changes the order of the arbitrary nodes to a deterministic order. The node dependencies are still respected. It only has effects when nodes don't have dependencies on each other. The new implementation will give a deterministic order to these nodes, so the run's results will be consistent.

  def __lt__(self, other):
      if not isinstance(other, Node):
          return NotImplemented
      return self._unique_key < other._unique_key

How this is tested

In the end, I go with manual testing, as turns out it is very tricky to test this non-deterministic behavior due to hash.

For reference, this is the test that I have written originally to test out this behavior, which should help understanding what this PR is trying to do.

    def test_pipeline_consistent_nodes_order(self, mocker):
        """
            Pipeline that have multiple possible execution orders should have consistent
            solutions
            Possible Solutions:
            1. A -> B -> C -> D -> E -> F
            2. B -> A -> C -> D -> E -> F
            3 ... Any permutation as long as F is executed last.

            Although we are not sure which permutation it is, but it should always output
            the same permutation.

            A-- \
            B--- \
            C---- F
            D--- /
            E-- /
        """

        def multiconcat(*args):
            return sum(args)

        mock_hash = mocker.patch(f"{__name__}.Node.__hash__", lambda x: random.int())
        expected_sorted_nodes: List[List[Node]] = None

        # Repeat 10 times so we can be sure it is not purely by chance
        for _ in range(10):

            inverted_fork_dags = Pipeline(
                [
                    node(constant_output, None, "A"),  # 3 4 1 0
                    node(constant_output, None, "B"),  # 9 4 1 4
                    node(constant_output, None, "C"),  # 2
                    node(constant_output, None, "D"),  # 6
                    node(constant_output, None, "E"),
                    node(multiconcat, ["A", "B", "C", "D", "E"], "F"),
                ]
            )
            if not expected_sorted_nodes:
                expected_sorted_nodes = inverted_fork_dags.nodes

            else:

                assert expected_sorted_nodes == inverted_fork_dags.nodes

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
noklam and others added 5 commits June 14, 2022 11:01
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam marked this pull request as ready for review June 14, 2022 13:30
@noklam noklam requested a review from idanov as a code owner June 14, 2022 13:30
@noklam noklam requested review from antonymilne and merelcht and removed request for idanov June 14, 2022 13:30
Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice fix! I actually think the test here isn't doing what you think and needs to change, but also I'm not sure there's a good way to write a test for this at all. So might just be best overall to remove it.

RELEASE.md Outdated
@@ -25,9 +25,11 @@
* Reduced number of log lines by changing the logging level from `INFO` to `DEBUG` for low priority messages.
* Kedro's framework-side logging configuration no longer performs file-based logging. Hence superfluous `info.log`/`errors.log` files are no longer created in your project root, and running Kedro on read-only file systems such as Databricks Repos is now possible.
* The `root` logger is now set to the Python default level of `WARNING` rather than `INFO`. Kedro's logger is still set to emit `INFO` level messages.
* Kedro pipeline will have consistent execution order given the same set of nodes when using with `SequentialRunner`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Kedro pipeline will have consistent execution order given the same set of nodes when using with `SequentialRunner`.
* `SequentialRunner` now consistently runs nodes in the same order across multiple runs.

I still don't think this is a very clear explanation though. Maybe what you have is better 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @MerelTheisenQB has a better idea.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is any better to be honest 😅 "Added sorting of nodes for the SequentialRunner to facilitate consistent execution order across multiple runs. "

kedro/pipeline/pipeline.py Outdated Show resolved Hide resolved
tests/pipeline/test_pipeline.py Outdated Show resolved Hide resolved
Comment on lines 597 to 612
Pipeline that have multiple possible execution orders should have consistent
solutions
Possible Solutions:
1. A -> B -> C -> D -> E -> F
2. B -> A -> C -> D -> E -> F
3 ... Any permutation as long as F is executed last.

Although we are not sure which permutation it is, but it should always output
the same permutation.

A-- \
B--- \
C---- F
D--- /
E-- /
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice 👍


# Repeat 10 times so we can be sure it is not purely by chance
for _ in range(10):
mock_hash.return_value = random.randint(1, 1e20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is doing what you want it to do. This is currently fixing the hash of every Node instance in this pipeline to be the same. We don't want hash(node1) to be the same as hash(node2). What we should have is:

        n1 = node(constant_output, None, "A")
        n2 = node(constant_output, None, "B")
        n3 = node(constant_output, None, "C")
        n4 = node(constant_output, None, "D")
        n5 = node(constant_output, None, "E")
        n6 = node(multiconcat, ["A", "B", "C", "D", "E"], "F")

       # You don't have to nest these, you can put them all in one with block
       # But actually for Python < 3.10 it's super ugly formatting still: https://stackoverflow.com/questions/3024925/create-a-with-block-on-several-context-managers
        for _ in range(10):
            with mock.patch.object(n1, "__hash__", random.randint(1, 1e20)):
                with mock.patch.object(n2, "__hash__", random.randint(1, 1e20)):
                    with mock.patch.object(n3, "__hash__", random.randint(1, 1e20)):
                        with mock.patch.object(n4, "__hash__", random.randint(1, 1e20)):
                            with mock.patch.object(n5, "__hash__", random.randint(1, 1e20)):
                                with mock.patch.object(n6, "__hash__", random.randint(1, 1e20)):
                                    inverted_fork_dags = Pipeline([n1, n2, n3, n4, n5, n6])
                                    # use inverted_fork_dags.nodes

... but this is still not a great test because the current code in main doesn't fail it 😬

After spending a looooong time playing around with this, I think it might just not be worth writing a test for it at all all... So long as it works as it should in manual testing then I think we're fine.

Happy to explain more about what I discovered while playing around with the testing here if you want to hear. It's certainly a tricky one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's chat tomorrow!

@noklam noklam self-assigned this Jun 15, 2022
noklam and others added 4 commits June 16, 2022 06:40
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, and even though it would be ideal to have a test for this I recognise it isn't trivial to test this well. I'd be happy to merge this without a specific test if we're satisfied with the results from the manual test 👍

RELEASE.md Outdated
@@ -25,9 +25,11 @@
* Reduced number of log lines by changing the logging level from `INFO` to `DEBUG` for low priority messages.
* Kedro's framework-side logging configuration no longer performs file-based logging. Hence superfluous `info.log`/`errors.log` files are no longer created in your project root, and running Kedro on read-only file systems such as Databricks Repos is now possible.
* The `root` logger is now set to the Python default level of `WARNING` rather than `INFO`. Kedro's logger is still set to emit `INFO` level messages.
* Kedro pipeline will have consistent execution order given the same set of nodes when using with `SequentialRunner`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is any better to be honest 😅 "Added sorting of nodes for the SequentialRunner to facilitate consistent execution order across multiple runs. "

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
@noklam noklam merged commit 6768d7f into main Jun 16, 2022
@noklam noklam deleted the feat/consistent_node_execution_order_with_sequential_runner branch June 16, 2022 12:13
@noklam noklam restored the feat/consistent_node_execution_order_with_sequential_runner branch June 17, 2022 10:14
@noklam noklam deleted the feat/consistent_node_execution_order_with_sequential_runner branch June 20, 2022 12:59
@noklam noklam linked an issue Jul 8, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consistent nodes execution order with SequentialRunner
3 participants