Speed-up data processing #682

Open
joakimbits opened this issue Sep 19, 2024 · 9 comments · Fixed by #683


joakimbits commented Sep 19, 2024

Support pytest-profiling of data tests to find and fix performance bottlenecks

The data may need some flattening to numeric dtypes before processing.

Definition of Done: Can run pytest and plot a heat map of time spent in functions.

Not urgent, but a good introduction for Joppe on the data processing pipelines.

Contact Joppe for questions/discussions/suggestions

In addition to the Definition of Done, the following always apply:

  • Tests, building, and linting pass
  • No new warnings are introduced
  • User experience is not reduced
  • Code is well formatted and readable
    See doc/contributing.md.

joakimbits commented:

Tested on macOS in a local branch where pytest-profiling is added to the requirements:

brew install graphviz
py.test tests --profile-svg && open prof/combined.svg

[image: combined.svg profile graph]
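
For a quick look at a single function without the full pytest-profiling run, the standard-library profiler can produce a similar ranking in text form. This is only a minimal sketch: `profile_call` and `slow_sum` are hypothetical names, and `slow_sum` stands in for a real data-loading function such as the ones tested here.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile; return (result, text report of the costliest calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so functions that dominate the run come first
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Hypothetical stand-in for a slow data-loading function
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(slow_sum, 100_000)
    print(report)
```

The text report lists the same cumulative times that the SVG graph visualizes, which may be enough to spot a single dominating call.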

joakimbits commented:

read_excel in get_smhi_data dominates everything else.

joakimbits commented:

According to https://hakibenita.com/fast-excel-python we get 10x faster Excel reads with python-calamine.

joakimbits commented:

After adding the python-calamine package to the requirements, we can use it from pandas: pd.read_excel("path_to_file.xlsb", engine="calamine")
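
Since calamine is an optional extra dependency, one way to use it is to fall back to another engine when it is not installed. The sketch below is an assumption, not the code in this branch: `pick_excel_engine` and `read_excel_fast` are hypothetical helpers, and `engine="calamine"` needs a pandas version that supports it (2.2 or later).

```python
import importlib.util


def pick_excel_engine(preferred="calamine", fallback="openpyxl"):
    """Return the first engine whose backing package is importable, else None."""
    # The python-calamine package installs the module `python_calamine`
    modules = {"calamine": "python_calamine", "openpyxl": "openpyxl"}
    for engine in (preferred, fallback):
        if importlib.util.find_spec(modules.get(engine, engine)) is not None:
            return engine
    return None  # None lets pandas choose its own default engine


def read_excel_fast(path, **kwargs):
    """Hypothetical wrapper around pd.read_excel using the engine picked above."""
    import pandas as pd  # imported lazily so pick_excel_engine works without pandas

    return pd.read_excel(path, engine=pick_excel_engine(), **kwargs)
```

With this shape, removing python-calamine from the requirements later only changes which engine is picked, not the calling code.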

joakimbits commented:

Changing read_excel to engine='calamine' in get_smhi_data cut the test suite run time by 6 seconds, from 15 to 9 seconds. But it still dominates the test time.
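
To compare engines in a repeatable way, the reads can be timed directly instead of timing the whole suite. A minimal sketch, where `best_of` and `speedup` are hypothetical helper names and the callables passed in would be the actual read_excel calls:

```python
import time


def best_of(func, repeats=3):
    """Return the shortest wall-clock time in seconds over `repeats` calls of func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Fraction of the baseline time saved by the candidate (0.4 means 40% less time)."""
    return (baseline_seconds - candidate_seconds) / baseline_seconds
```

Applied to the suite numbers above, `speedup(15, 9)` gives 0.4, that is, 40% less total test time.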

joakimbits commented:

Decorated get_smhi_data with a file cache that reads from the cache if the cached file is from the current year. Now the whole suite completes in less than a second!
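
The idea of such a decorator can be sketched as follows. This is not the cache_df implementation from the branch: `cache_current_year` is a hypothetical name, pickle is swapped in for the feather format so the sketch runs with the standard library only, and the cache is keyed by the file path alone (function arguments are ignored).

```python
import functools
import pickle
from datetime import datetime
from pathlib import Path


def _pickle_load(path):
    with path.open("rb") as f:
        return pickle.load(f)


def _pickle_save(obj, path):
    with path.open("wb") as f:
        pickle.dump(obj, f)


def cache_current_year(path, load=_pickle_load, save=_pickle_save):
    """Hypothetical decorator: reuse a cached result if it was written this year."""
    cache_file = Path(path)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if cache_file.exists():
                written = datetime.fromtimestamp(cache_file.stat().st_mtime)
                if written.year == datetime.now().year:
                    return load(cache_file)  # cache hit: skip the slow call
            result = func(*args, **kwargs)
            save(result, cache_file)  # refresh the cache for later calls
            return result

        return wrapper

    return decorator
```

For DataFrames, `load` and `save` could presumably be replaced by thin wrappers around pandas `read_feather` and `DataFrame.to_feather`, which is closer to the feather file cache mentioned in the commits.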

joakimbits commented:

Actually just 0.2 seconds - so fast now that the profiler has nothing to show - just an empty .csv file.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024
Generate a CPU intensity graph of the functions used during pytest.

* Add a pytest project file
* Add pytest-profiling
* Add graphviz
* Document how to profile performance
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024
This shaved off 6 seconds from the test.

* Use calamine when importing SMHI data
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024
Speed up repeat usage of SMHI data to a fraction.

* Add feather-format
* Use a feather file cache for the imported SMHI data.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 22, 2024
Refactor-out the cache_df decorator.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 22, 2024
feather is part of pyarrow which is already in requirements.

* Remove feather-format

joakimbits commented Sep 22, 2024

#683

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 22, 2024
Do not mention excel in cache_df.

Also clarify why column names are cached separately.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 24, 2024
* More explicit hint on functions supported.
* Clarify options.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 24, 2024
* Make default path a valid string.
* Remove obsolete error handling for no path.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 24, 2024
* Hint a return type that is the same as the decorated function.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024
Fixes a problem when using unittest rather than pytest.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024
Use openpyxl to read SMHI data.

Note: calamine is only 15% faster which is not worth an additional dependency.
joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024
Use openpyxl to read SMHI data.

Note: calamine is only 15% faster which is not worth an additional dependency.
elvbom pushed a commit that referenced this issue Oct 23, 2024
* Profile data tests #682

Generate a CPU intensity graph of the functions used during pytest.

* Add a pytest project file
* Add pytest-profiling
* Add graphviz
* Document how to profile performance

* Speedup read_excel for SMHI data #682

This shaved off 6 seconds from the test.

* Use calamine when importing SMHI data

* Cache SMHI data #682

Speed up repeat usage of SMHI data to a fraction.

* Add feather-format
* Use a feather file cache for the imported SMHI data.

* Make default excel file path and df cache period configurable #682

Generalize the df cache decorator.

* Add a cache_utilities module #682

Refactor-out the cache_df decorator.

* Remove duplicate requirement #682

feather is part of pyarrow which is already in requirements.

* Remove feather-format

* cache_df is not just for read_excel #682

Do not mention excel in cache_df.

Also clarify why column names are cached separately.

* Improve documentation of cache_df #682

* More explicit hint on functions supported.
* Clarify options.

* Make cache_df work without path #682

* Make default path a valid string.
* Remove obsolete error handling for no path.

* Make cache_df decoration more apparent #682

* Hint a return type that is the same as the decorated function.

* Use relative import of cache_df #682

Fixes a problem when using unittest rather than pytest.

* Revert "Speedup read_excel for SMHI data #682"

This reverts commit 216339e

* Speedup read_excel for SMHI data using openpyxl #682

Use openpyxl to read SMHI data.

Note: calamine is only 15% faster which is not worth an additional dependency.