
Add runner benchmark #4210

Merged
merged 41 commits into main
Oct 28, 2024

Conversation

noklam
Contributor

@noklam noklam commented Oct 7, 2024

Description

Closes #4127

Dev Notes:

  • Rename `benchmarks` due to a name clash with pyarrow: an unexpected `benchmarks` module gets installed with arrow (apache/arrow#44398)
  • Add runner benchmarks for peak memory, memory, and time
  • For time, each runner gets a specific workload (IO-bound or compute-bound)
  • `benchmarks` has been moved to `kedro_benchmarks`. Despite best efforts to avoid installing pyarrow, it still causes a strange `benchmarks` module-not-found error. I tried many different things, but ultimately found that it works as long as the folder has any name other than `benchmarks`. It's a bit strange, but not worth further investigation, so I ended up renaming the folder.
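The runner benchmarks described above follow asv's conventions, where any method prefixed `time_`, `mem_`, or `peakmem_` is measured and `params`/`param_names` parameterise it. A minimal sketch (the class, workload, and names below are illustrative, not the PR's actual code):

```python
# Illustrative asv-style benchmark suite (hypothetical names; asv discovers
# classes and measures any method prefixed time_/mem_/peakmem_, running it
# once per value in `params`, labelled by `param_names`).
class TimeRunners:
    params = ["SequentialRunner", "ThreadRunner", "ParallelRunner"]
    param_names = ["runner"]

    def setup(self, runner):
        # In the real suite this would build a Kedro pipeline and catalog
        # matched to the runner's workload (IO-bound or compute-bound).
        self.workload = sum(i * i for i in range(10_000))

    def time_runners(self, runner):
        # asv reports the wall-clock time of this method per runner.
        return self.workload
```

Running `asv run --quick` then executes each `time_` method once per parameter value instead of taking repeated samples.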

QA notes

```shell
pip install asv   # or: pip install -e ".[benchmark]"
asv run --quick
```

More notes

During testing I discovered that ThreadRunner occasionally throws an error like the one below; it's likely the same issue as #4191.

DatasetAlreadyExistsError: Dataset 'dummy_1' has already been registered

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
@noklam
Contributor Author

noklam commented Oct 9, 2024

```
============ Done ============
{'SequentialRunner': 7.8415021896362305, 'ThreadRunner': 7.56311297416687, 'ParallelRunner': 3.4262261390686035}
```

At the moment I am only running the test for a compute-bound workload; as expected, only ParallelRunner speeds things up. I sometimes run into a "dataset already registered" error, which I think is likely related to #4191.

@ElenaKhaustova any thoughts? Should I focus on testing only the KedroDataCatalog instead?

@noklam noklam requested a review from ankatiyar October 9, 2024 14:34
Comment on lines 1 to 2:

```toml
[project]
name = "kedro_benchmarks"
```
Contributor Author

The reason to do this is that multiprocessing breaks imports. Without this, I get a `benchmarks` module-not-found error.

To make sure it is always on `sys.path`, I made this part of the installation step. Other suggestions are welcome.

I have tried `PYTHONPATH` as well, but it doesn't seem to work, as the new process doesn't inherit it.
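A minimal sketch of the failure mode, assuming the default spawn start method (the missing module name below is made up): a spawned child starts a fresh interpreter and re-imports the target module, so a package only visible through the parent's mutated `sys.path` fails to import there, while an installed package resolves fine.

```python
import multiprocessing as mp

def probe(module_name):
    # Runs in the child process: attempt the same import a worker would do.
    try:
        __import__(module_name)
        return True
    except ModuleNotFoundError:
        return False

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # fresh interpreter per worker
    with ctx.Pool(1) as pool:
        # An installed package imports fine in the child...
        assert pool.apply(probe, ("json",))
        # ...but a module not installed (made-up name) does not.
        assert not pool.apply(probe, ("kedro_benchmarks_made_up_name",))
```

Installing the benchmarks as a package (`pip install -e`) puts it on site-packages, which every fresh interpreter sees, which is why making it part of the installation step works.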

@noklam noklam force-pushed the noklam/stress-testing-runners-4127 branch from a8883b9 to 66ad6c5 Compare October 16, 2024 11:23
@noklam noklam marked this pull request as ready for review October 17, 2024 14:53
@noklam noklam requested a review from merelcht as a code owner October 17, 2024 14:53
@noklam noklam requested a review from ankatiyar October 24, 2024 12:24
@merelcht merelcht (Member) left a comment
This looks good 👍

One question I have is why the SequentialRunner and ParallelRunner are tested with the compute bound pipeline and ThreadRunner with io bound?

kedro_benchmarks/README.md (outdated, resolved)
```python
with open(data_dir / "data.csv", "w") as f:
    f.write("col1,col2\n1,2\n")

def mem_runners(self, runner):
```
Contributor

Shall we also test memory usage on various catalogs and actually return some objects that need sufficient memory rather than a "dummy" string?

Contributor Author

Do you mean just changing what it returns? We can do that, but I don't know if it changes anything. Do I also need to change the structure of the pipeline?

Most data are not stored in the DataCatalog unless CacheDataset is used. From my understanding, they only get loaded within the node function and are recycled as soon as the node finishes.

Contributor

I mean varying the catalog and number of entries there.

Makes sense about returns, thank you.

Contributor Author

Do you mean also adding the KedroDataCatalog? I am not sure what you want to test by varying the number of entries; can you elaborate?

Contributor

In general, I mean running tests for different catalog configurations—the number of datasets in the catalog, with/without patterns—so we vary something that actually affects memory usage. We can also vary the pipeline size and the number of input arguments for nodes.

In the current setup, we use the same pipeline and catalog configuration and we try to test memory usage by running nodes with heavy compute load that should not grow the memory. So it feels like we might add other scenarios that grow it and we're able to compare (memory grows as expected) runs on them.

Adding KedroDataCatalog would be nice as well, just to compare them. But we can do it in a separate PR if easier.
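The suggested extension could look something like this hypothetical parameterised suite (the dataset counts, entry names, and config shape are illustrative assumptions, not the PR's code; in the real suite the config would be fed to DataCatalog or KedroDataCatalog rather than kept as a dict):

```python
# Hypothetical asv-style suite varying catalog size and pattern usage, so
# the thing being measured (memory) actually grows with the parameters.
class MemCatalog:
    params = ([10, 100, 1000], [True, False])
    param_names = ["num_datasets", "use_patterns"]

    def setup(self, num_datasets, use_patterns):
        # Build a catalog config with `num_datasets` in-memory entries.
        self.config = {
            f"ds_{i}": {"type": "MemoryDataset"} for i in range(num_datasets)
        }
        if use_patterns:
            # Optionally add a dataset factory pattern entry.
            self.config["{name}_csv"] = {
                "type": "pandas.CSVDataset",
                "filepath": "data/{name}.csv",
            }

    def mem_catalog(self, num_datasets, use_patterns):
        # asv's mem_ prefix reports the size of the returned object.
        return self.config
```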


Contributor

Yeah, but we didn't do memory profiling. Maybe worth adding it on the catalog side. Anyway, that's just an idea, so I don't want to block the PR with it.

@ElenaKhaustova
Contributor

This looks good 👍

One question I have is why the SequentialRunner and ParallelRunner are tested with the compute bound pipeline and ThreadRunner with io bound?

Maybe it makes sense to add time tests for each runner and each scenario (compute bound, io bound) so we can compare not only catalogs (when changes are applied) but also runners (in general, between each other).

@noklam
Copy link
Contributor Author

noklam commented Oct 28, 2024

This looks good 👍
One question I have is why the SequentialRunner and ParallelRunner are tested with the compute bound pipeline and ThreadRunner with io bound?

Maybe it makes sense to add time tests for each runner and each scenario (compute bound, io bound) so we can compare not only catalogs (when changes are applied) but also runners (in general, between each other).

We can do that. I didn't do it because I don't think we gain much from running ThreadRunner on a compute-bound task (maybe things will change with Python 3.13, but that flag also needs to be enabled explicitly). These tests are more costly, as they take some time to run.

I can add these tests, and if they start taking too long, I will suggest removing the less useful ones.
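A toy illustration of that reasoning (these functions are illustrative, not the benchmark's actual workloads): threads only help when the work releases the GIL, e.g. while blocked on IO, whereas pure-Python computation holds the GIL, so only process-based parallelism (ParallelRunner) speeds it up on CPython builds without free-threading.

```python
import time

def io_bound():
    # Blocking "IO": the GIL is released while sleeping, so a ThreadRunner
    # can overlap many of these calls.
    time.sleep(0.01)
    return "data"

def compute_bound(n=10_000):
    # Pure-Python CPU work: the GIL is held throughout, so threads cannot
    # overlap it; only separate processes run it in parallel.
    return sum(i * i for i in range(n))
```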

@ElenaKhaustova ElenaKhaustova (Contributor) left a comment

Thank you, @noklam, LGTM!

There's a suggestion on a possible extension of the tests for memory profiling, but that can be done in a separate PR if we decide to do it.

@noklam noklam merged commit 1ecb0c8 into main Oct 28, 2024
34 checks passed
@noklam noklam deleted the noklam/stress-testing-runners-4127 branch October 28, 2024 18:33
Development

Successfully merging this pull request may close these issues.

[Stress Testing] - Runners
4 participants