
Add runner benchmark #4210

Merged
merged 41 commits into main
Oct 28, 2024

Conversation

noklam
Contributor

@noklam noklam commented Oct 7, 2024

Description

Closes #4127

Dev Notes:

  • Rename `benchmarks` due to a name clash with pyarrow: an unexpected `benchmarks` module gets installed with arrow (apache/arrow#44398)
  • Add runner benchmarks for peak memory, memory, and time
  • For time, each runner gets a specific workload (IO-bound or compute-bound)
  • `benchmarks` has been moved to `kedro_benchmarks`. Despite best efforts to avoid installing pyarrow, it still causes a strange `benchmarks` module-not-found error. I tried many different things, but ultimately found that it works as long as the folder has any name other than `benchmarks`. It's a bit strange, but not worth further investigation, so I ended up renaming the folder.
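The runner benchmarks described above follow asv's conventions, where any method prefixed `time_`, `mem_`, or `peakmem_` is measured and `params`/`param_names` parameterise it. A minimal sketch (the class, workload, and names below are illustrative, not the PR's actual code):

```python
# Illustrative asv-style benchmark suite (hypothetical names; asv discovers
# classes and measures any method prefixed time_/mem_/peakmem_, running it
# once per value in `params`, labelled by `param_names`).
class TimeRunners:
    params = ["SequentialRunner", "ThreadRunner", "ParallelRunner"]
    param_names = ["runner"]

    def setup(self, runner):
        # In the real suite this would build a Kedro pipeline and catalog
        # matched to the runner's workload (IO-bound or compute-bound).
        self.workload = sum(i * i for i in range(10_000))

    def time_runners(self, runner):
        # asv reports the wall-clock time of this method per runner.
        return self.workload
```

Running `asv run --quick` then executes each `time_` method once per parameter value instead of taking repeated samples.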

QA notes

```shell
pip install asv   # or: pip install -e ".[benchmark]"
asv run --quick
```

More notes

During testing I discovered that ThreadRunner occasionally throws an error like the one below; it's likely the same issue as #4191.

DatasetAlreadyExistsError: Dataset 'dummy_1' has already been registered

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
@noklam
Contributor Author

noklam commented Oct 9, 2024

```
============ Done ============
{'SequentialRunner': 7.8415021896362305, 'ThreadRunner': 7.56311297416687, 'ParallelRunner': 3.4262261390686035}
```

At the moment I am only running the test for a compute-bound workload; as expected, only ParallelRunner speeds things up. I sometimes run into a "dataset already registered" error, which I think is likely related to #4191.

@ElenaKhaustova any thoughts? Should I focus on testing only the KedroDataCatalog instead?

@noklam noklam requested a review from ankatiyar October 9, 2024 14:34
Comment on lines 1 to 2:

```toml
[project]
name = "kedro_benchmarks"
```
Contributor Author

The reason to do this is that multiprocessing breaks imports. Without this, I get a `benchmarks` module-not-found error.

To make sure it is always on `sys.path`, I made this part of the installation step. Other suggestions are welcome.

I have tried `PYTHONPATH` as well, but it doesn't seem to work, as the new process doesn't inherit it.
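A minimal sketch of the failure mode, assuming the default spawn start method (the missing module name below is made up): a spawned child starts a fresh interpreter and re-imports the target module, so a package only visible through the parent's mutated `sys.path` fails to import there, while an installed package resolves fine.

```python
import multiprocessing as mp

def probe(module_name):
    # Runs in the child process: attempt the same import a worker would do.
    try:
        __import__(module_name)
        return True
    except ModuleNotFoundError:
        return False

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # fresh interpreter per worker
    with ctx.Pool(1) as pool:
        # An installed package imports fine in the child...
        assert pool.apply(probe, ("json",))
        # ...but a module not installed (made-up name) does not.
        assert not pool.apply(probe, ("kedro_benchmarks_made_up_name",))
```

Installing the benchmarks as a package (`pip install -e`) puts it on site-packages, which every fresh interpreter sees, which is why making it part of the installation step works.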

@noklam noklam force-pushed the noklam/stress-testing-runners-4127 branch from a8883b9 to 66ad6c5 Compare October 16, 2024 11:23
@noklam noklam marked this pull request as ready for review October 17, 2024 14:53
@noklam noklam requested a review from merelcht as a code owner October 17, 2024 14:53
@noklam noklam requested a review from ankatiyar October 24, 2024 12:24
@merelcht merelcht (Member) left a comment
This looks good 👍

One question I have is why the SequentialRunner and ParallelRunner are tested with the compute bound pipeline and ThreadRunner with io bound?

kedro_benchmarks/README.md (outdated, resolved)
```python
with open(data_dir / "data.csv", "w") as f:
    f.write("col1,col2\n1,2\n")

def mem_runners(self, runner):
```
Contributor

Shall we also test memory usage on various catalogs and actually return some objects that need sufficient memory rather than a "dummy" string?

Contributor Author

Do you mean just changing what it returns? We can do that, but I don't know if it changes anything. Do I also need to change the structure of the pipeline?

Most data are not stored in the DataCatalog unless CacheDataset is used. From my understanding, they only get loaded within the node function and are recycled as soon as the node finishes.

Contributor

I mean varying the catalog and number of entries there.

Makes sense about returns, thank you.

Contributor Author

Do you mean also adding the KedroDataCatalog? I am not sure what you want to test by varying the number of entries; can you elaborate?

Contributor

In general, I mean running tests for different catalog configurations—the number of datasets in the catalog, with/without patterns—so we vary something that actually affects memory usage. We can also vary the pipeline size and the number of input arguments for nodes.

In the current setup, we use the same pipeline and catalog configuration and we try to test memory usage by running nodes with heavy compute load that should not grow the memory. So it feels like we might add other scenarios that grow it and we're able to compare (memory grows as expected) runs on them.

Adding KedroDataCatalog would be nice as well, just to compare them. But we can do it in a separate PR if easier.
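The suggested extension could look something like this hypothetical parameterised suite (the dataset counts, entry names, and config shape are illustrative assumptions, not the PR's code; in the real suite the config would be fed to DataCatalog or KedroDataCatalog rather than kept as a dict):

```python
# Hypothetical asv-style suite varying catalog size and pattern usage, so
# the thing being measured (memory) actually grows with the parameters.
class MemCatalog:
    params = ([10, 100, 1000], [True, False])
    param_names = ["num_datasets", "use_patterns"]

    def setup(self, num_datasets, use_patterns):
        # Build a catalog config with `num_datasets` in-memory entries.
        self.config = {
            f"ds_{i}": {"type": "MemoryDataset"} for i in range(num_datasets)
        }
        if use_patterns:
            # Optionally add a dataset factory pattern entry.
            self.config["{name}_csv"] = {
                "type": "pandas.CSVDataset",
                "filepath": "data/{name}.csv",
            }

    def mem_catalog(self, num_datasets, use_patterns):
        # asv's mem_ prefix reports the size of the returned object.
        return self.config
```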


Contributor

Yeah, but we didn't do memory profiling. Maybe worth adding it on the catalog side. Anyway, that's just an idea, so I don't want to block the PR with it.

@ElenaKhaustova
Contributor

This looks good 👍

One question I have is why the SequentialRunner and ParallelRunner are tested with the compute bound pipeline and ThreadRunner with io bound?

Maybe it makes sense to add time tests for each runner and each scenario (compute bound, io bound) so we can compare not only catalogs (when changes are applied) but also runners (in general, between each other).

@noklam
Copy link
Contributor Author

noklam commented Oct 28, 2024

This looks good 👍
One question I have is why the SequentialRunner and ParallelRunner are tested with the compute bound pipeline and ThreadRunner with io bound?

Maybe it makes sense to add time tests for each runner and each scenario (compute bound, io bound) so we can compare not only catalogs (when changes are applied) but also runners (in general, between each other).

We can do that. I didn't do it because I don't think we gain much from running ThreadRunner on a compute-bound task (maybe things will change with Python 3.13, but that flag also needs to be enabled explicitly). These tests are more costly, as they take some time to run.

I can add these tests, and if they start taking too long, I will suggest removing the less useful ones.
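A toy illustration of that reasoning (these functions are illustrative, not the benchmark's actual workloads): threads only help when the work releases the GIL, e.g. while blocked on IO, whereas pure-Python computation holds the GIL, so only process-based parallelism (ParallelRunner) speeds it up on CPython builds without free-threading.

```python
import time

def io_bound():
    # Blocking "IO": the GIL is released while sleeping, so a ThreadRunner
    # can overlap many of these calls.
    time.sleep(0.01)
    return "data"

def compute_bound(n=10_000):
    # Pure-Python CPU work: the GIL is held throughout, so threads cannot
    # overlap it; only separate processes run it in parallel.
    return sum(i * i for i in range(n))
```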

@ElenaKhaustova ElenaKhaustova (Contributor) left a comment

Thank you, @noklam, LGTM!

There's a suggestion on a possible extension of the tests for memory profiling, but that can be done in a separate PR if we decide to do it.

@noklam noklam merged commit 1ecb0c8 into main Oct 28, 2024
34 checks passed
@noklam noklam deleted the noklam/stress-testing-runners-4127 branch October 28, 2024 18:33
Development

Successfully merging this pull request may close these issues.

[Stress Testing] - Runners
4 participants