Add runner benchmark #4210
Conversation
At the moment I am only running tests for a compute-bound workload, and the result is as expected. @ElenaKhaustova, any thoughts? Should I focus on testing only the
kedro_benchmarks/pyproject.toml

```toml
[project]
name = "kedro_benchmarks"
```
The reason to do this is that multiprocessing breaks imports: without it I get a `benchmarks` module not found error. To make sure this is always on `sys.path`, I added it as part of the installation step. Other suggestions are welcome. I tried `PYTHONPATH` as well, but it doesn't seem to work, as the new process doesn't inherit it.
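A minimal sketch of the workaround being described, assuming a hypothetical package root: put the benchmark folder on `sys.path` explicitly rather than relying on `PYTHONPATH`, which the spawned worker processes were not picking up.

```python
# Sketch only: "kedro_benchmarks" below is an illustrative path, not a
# guaranteed layout. Installing the package (pip install -e .) achieves the
# same effect more robustly, which is what the PR ultimately does.
import sys
from pathlib import Path

def ensure_on_path(pkg_root):
    """Prepend pkg_root to sys.path if it is not already there."""
    root = str(Path(pkg_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Calling this is idempotent, so it is safe to run in every process entry point.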
This looks good 👍
One question I have: why are the `SequentialRunner` and `ParallelRunner` tested with the compute-bound pipeline, but the `ThreadRunner` with the IO-bound one?
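For context on the distinction being asked about, here is a hypothetical sketch of the two workload styles (illustrative functions, not the PR's actual benchmark code): pure-Python compute holds the GIL, while IO-like waits release it, which is why thread-based runners mainly benefit the latter.

```python
import time

def compute_bound(x):
    # Pure-Python arithmetic holds the GIL, so a thread-based runner
    # gains nothing from running several of these concurrently.
    total = 0
    for i in range(10**6):
        total += i
    return total

def io_bound(x):
    # Sleeping (like real I/O) releases the GIL, so threads can overlap it.
    time.sleep(0.01)
    return x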
```python
with open(data_dir / "data.csv", "w") as f:
    f.write("col1,col2\n1,2\n")

def mem_runners(self, runner):
```
Shall we also test memory usage on various catalogs, and actually return some objects that need sufficient memory rather than a "dummy" string?
Do you mean just changing what it returns? We can do that, but I don't know if it changes anything. Do I also need to change the structure of the pipeline?
Most data are not stored in the `DataCatalog` unless `CacheDataset` is used. From my understanding, they only get loaded within the node function and get recycled as soon as the node is finished.
I mean varying the catalog and the number of entries in it.
Makes sense about the returns, thank you.
Do you mean also adding the `KedroDataCatalog`? I am not sure what you want to test by varying the number of entries; can you elaborate?
In general, I mean running tests for different catalog configurations (the number of datasets in the catalog, with/without patterns) so that we vary something that actually affects memory usage. We can also vary the pipeline size and the number of input arguments for nodes.
In the current setup, we use the same pipeline and catalog configuration and try to test memory usage by running nodes with a heavy compute load, which should not grow memory. So it feels like we should add other scenarios that do grow it, so we can verify that memory grows as expected.
Adding `KedroDataCatalog` would be nice as well, just to compare them. But we can do it in a separate PR if that's easier.
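A hedged sketch of what such a parametrised memory benchmark could look like in asv's conventions (`params`/`param_names`, `peakmem_` prefix); the class, entry names, and config shape are illustrative assumptions, not Kedro's API.

```python
# Illustrative asv-style benchmark: vary the number of catalog entries so
# the measured quantity (peak memory) actually depends on the parameter.
class MemCatalogEntries:
    params = [10, 100, 1000]          # number of dataset entries
    param_names = ["n_entries"]

    def setup(self, n_entries):
        # Stand-in for a catalog config with n_entries datasets.
        self.config = {f"ds_{i}": {"type": "MemoryDataset"} for i in range(n_entries)}

    def peakmem_build_catalog(self, n_entries):
        # asv records the peak memory of this call.
        return dict(self.config)
```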
@ElenaKhaustova I see: https://github.com/kedro-org/kedro/blob/main/benchmarks/benchmark_kedrodatacatalog.py. I think this is covered by the catalog test already?
Yeah, but we didn't do memory profiling. Maybe it's worth adding it on the catalog side. Anyway, that's just an idea, so I don't want to block the PR with it.
Maybe it makes sense to add time tests for each runner and each scenario (compute bound, IO bound) so we can compare not only catalogs (when changes are applied) but also runners (in general, against each other).
We can do that. I didn't do it because I think we don't get much from running them. I can add these tests, and if they start taking too long I will suggest removing the less useful ones.
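The suggested runner-by-scenario matrix could be sketched with asv's multi-parameter support (a tuple of parameter lists produces the cross product); the runner names and workloads below are placeholders, not Kedro classes.

```python
import time

# Illustrative asv-style time benchmark: asv runs time_run once per
# combination of the two parameter lists, so each runner is timed on
# each scenario and the results are directly comparable.
class TimeRunners:
    params = (["sequential", "thread", "parallel"], ["compute", "io"])
    param_names = ["runner", "scenario"]

    def time_run(self, runner, scenario):
        if scenario == "compute":
            sum(i * i for i in range(10**5))   # GIL-holding work
        else:
            time.sleep(0.001)                  # stand-in for I/O
```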
Thank you, @noklam, LGTM!
There's a suggestion on a possible extension of the tests for memory profiling, but that can be done in a separate PR if we decide to do it.
Description
Closes #4127
Dev Notes:
- `benchmarks` has been moved to `kedro_benchmarks` due to a name clash with pyarrow: an unexpected `benchmark` module gets installed with arrow (apache/arrow#44398). Despite best efforts to avoid installing `pyarrow`, it still caused a weird `benchmarks` module not found error. I tried many different things, but ultimately found that it works as long as the folder has any name other than `benchmarks`. It's a bit strange, but also not worth the effort to investigate further, so I ended up renaming the folder.

QA notes
```shell
pip install asv            # or: pip install -e ".[benchmark]"
asv run --quick
```
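Related to the module-name clash described in the dev notes, a quick stdlib-only way to check which file a module name actually resolves to (and so whether a stray package is shadowing yours) is `importlib.util.find_spec`:

```python
import importlib.util

def resolve_module(name):
    """Return the file a top-level module name resolves to, or None if absent."""
    spec = importlib.util.find_spec(name)
    return getattr(spec, "origin", None) if spec else None
```

Running this on the clashing name in both the parent and a worker process shows which copy each interpreter picks up.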
More notes

During testing I discovered that `ThreadRunner` occasionally throws an error like this; it's likely the same issue as #4191.

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
- Added a description of this change in the `RELEASE.md` file