Speed-up data processing #682

joakimbits · 2024-09-19T10:43:37Z

Support pytest-profiling of data tests to find and fix performance bottlenecks

The data may need some flattening to numeric dtypes before processing.

Definition of Done: Can run pytest and plot a heat map of time spent in functions.

Not urgent, but a good introduction for Joppe on the data processing pipelines.

Contact Joppe for questions/discussions/suggestions

In addition to the Definition of Done, the following always apply:

Tests, building, and linting pass
No new warnings are introduced
User experience is not reduced
Code is well formatted and readable
See doc/contributing.md.

joakimbits · 2024-09-19T11:58:15Z

Tested on macOS in a local branch with pytest-profiling added to requirements:

brew install graphviz
py.test tests --profile-svg && open prof/combined.svg

joakimbits · 2024-09-19T12:01:49Z

read_excel in get_smhi_data dominates everything else.
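
The same ranking can be read as text instead of from the SVG, by sorting a profile on cumulative time. A minimal sketch using only the standard library; `slow_reader`, `fast_step` and `pipeline` are hypothetical stand-ins for get_smhi_data and the rest of the processing, not code from this repository.

```python
import cProfile
import io
import pstats
import time


def slow_reader():
    """Hypothetical stand-in for an expensive read such as get_smhi_data."""
    time.sleep(0.05)
    return list(range(10))


def fast_step(rows):
    """Hypothetical stand-in for cheap downstream processing."""
    return [r * 2 for r in rows]


def pipeline():
    return fast_step(slow_reader())


def top_functions(func, limit=5):
    """Profile func() and return a report of the calls with the highest cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(top_functions(pipeline))
```

As far as I know, pytest-profiling also leaves a combined .prof file under prof/, which pstats.Stats should be able to load in the same way.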

joakimbits · 2024-09-19T13:45:10Z

According to https://hakibenita.com/fast-excel-python, python-calamine gives 10x faster Excel reads.

joakimbits · 2024-09-19T13:50:35Z

With the python-calamine package added to requirements, pandas can use it: pd.read_excel("path_to_file.xlsb", engine="calamine")

joakimbits · 2024-09-19T14:01:09Z

Switching read_excel to engine='calamine' in get_smhi_data cut the test suite by 6 seconds, from 15 to 9 seconds, but the read still dominates the test time.
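
To compare engines on the read alone rather than on the whole suite, something like the sketch below could be used. `time_engines` is a hypothetical helper, not part of the repository; it assumes a pandas version that accepts engine="calamine" (2.2 or later, as far as I know) and that python-calamine and openpyxl are installed.

```python
import time


def time_engines(path, engines=("openpyxl", "calamine"), reader=None, repeat=3):
    """Return {engine: best wall-clock seconds} for reading `path` with each engine."""
    if reader is None:
        # Imported lazily so the helper can also be exercised with a stub reader.
        import pandas as pd

        reader = pd.read_excel
    timings = {}
    for engine in engines:
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            reader(path, engine=engine)
            best = min(best, time.perf_counter() - start)
        timings[engine] = best
    return timings


# Example call (the file name is a placeholder):
#   print(time_engines("path_to_file.xlsx"))
```

Taking the best of several runs reduces noise from disk caching and other processes.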

joakimbits · 2024-09-19T14:10:19Z

According to https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d we will get an order of magnitude faster loading time if we also cache it. https://miro.medium.com/v2/resize:fit:720/format:webp/1*-QoJbusw3MUYdms0lbmd4Q.png

joakimbits · 2024-09-19T16:06:11Z

Decorated get_smhi_data with a file cache that reads from cache if it is from the current year. Now the whole suite completes in less than a second!
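
The idea can be sketched roughly as below. This is not the repository's decorator: `cache_current_year`, its `load`/`save` parameters and the file name in the usage comment are assumptions made for illustration, and the sketch keeps one cache file per decorated function regardless of arguments.

```python
import datetime
import functools
import os


def cache_current_year(path, load, save):
    """Cache the decorated function's result in `path`; reuse it while the file dates from this year."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            this_year = datetime.date.today().year
            if os.path.exists(path):
                modified = datetime.date.fromtimestamp(os.path.getmtime(path))
                if modified.year == this_year:
                    return load(path)  # cache hit: skip the expensive call
            result = func(*args, **kwargs)  # cache miss or stale file: recompute
            save(result, path)
            return result

        return wrapper

    return decorator


# Possible use with pandas and Feather (needs pyarrow); the file name is a placeholder:
#
#   @cache_current_year("smhi.feather",
#                       load=pd.read_feather,
#                       save=lambda df, p: df.to_feather(p))
#   def get_smhi_data(): ...
```

Using the file's modification time keeps the cache self-describing: no separate metadata is needed to decide whether it is stale.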

joakimbits · 2024-09-19T16:42:38Z

Actually just 0.2 seconds - so fast now that the profiler has nothing to show - just an empty .csv file.

joakimbits · 2024-09-22T00:21:52Z

#683

* Profile data tests #682
  Generate an cpu intensity graph of the functions used during pytest.
  * Add a pytest project file
  * Add pytest-profiling
  * Add graphviz
  * Document how to profile performance

* Speedup read_excel for SMHI data #682
  This shaved off 6 seconds from the test.
  * Use calamine when importing SMHI data

* Cache SMHI data #682
  Speed up repeat usage of SMHI data to a fraction.
  * Add feather-format
  * Use a feather file cache for the imported SMI data.

* Make default excel file path and df cache period configurable #682
  Generalize the df cache decorator.

* Add a cache_utilities module #682
  Refactor-out the cache_df decorator.

* Remove duplicate requirement #682
  feather is part of pyarrow which is already in requirements.
  * Remove feather-format

* cache_df is not just for read_excel #682
  Do not mention excel in cache_df. Also clarify why column names are cached separately.

* Improve documentation of cache_df #682
  * More explicit hint on functions supported.
  * Clarify options.

* Make cache_df work without path #682
  * Make default path a valid string.
  * Remove obsolete error handling for no path.

* Make cache_df decoration more apparent #682
  * Hint a return type that is the same as the decorated function.

* Use relative import of cache_df #682
  Fixes a problem when using unittest rather than pytest.

* Revert "Speedup read_excel for SMHI data #682"
  This reverts commit 216339e

* Speedup read_excel for SMHI data using openpyxl #682
  Use openpyxl to read SMHI data. Note: calamine is only 15% faster which is not worth an additional dependency.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024

Speedup read_excel for SMHI data Klimatbyran#682

216339e

This shaved off 6 seconds from the test. * Use calamine when importing SMHI data

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024

Cache SMHI data Klimatbyran#682

53eeb3f

Speed up repeat usage of SMHI data to a fraction. * Add feather-format * Use a feather file cache for the imported SMI data.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024

Make default excel file path and df cache period configurable Klimatb…

34ab729

…yran#682 Generalize the df cache decorator.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 21, 2024

Make default excel file path and df cache period configurable Klimatb…

b829564

…yran#682 Generalize the df cache decorator.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 22, 2024

Add a cache_utilities module Klimatbyran#682

d2e9590

Refactor-out the cache_df decorator.

joakimbits mentioned this issue Sep 22, 2024

Speed up data processing #683

Merged

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 22, 2024

Remove duplicate requirement Klimatbyran#682

e1b86bc

feather is part of pyarrow which is already in requirements. * Remove feather-format

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 22, 2024

cache_df is not just for read_excel Klimatbyran#682

7516b67

Do not mention excel in cache_df. Also clarify why column names are cached separately.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 24, 2024

Improve documentation of cache_df Klimatbyran#682

06e1102

* More explicit hint on functions supported. * Clarify options.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 24, 2024

Make cache_df work without path Klimatbyran#682

5627a2d

* Make default path a valid string. * Remove obsolete error handling for no path.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Sep 24, 2024

Make cache_df decoration more apparent Klimatbyran#682

9dac37d

* Hint a return type that is the same as the decorated function.
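
One common way to hint that a decorator returns the same type as the function it decorates is a TypeVar bound to Callable. The sketch below is an assumption about how that could look, not the actual cache_df code; `cached`, `get_numbers` and the in-memory dict standing in for the cache file are all hypothetical.

```python
import functools
from typing import Any, Callable, TypeVar, cast

# F stands for "whatever callable was decorated", so type checkers keep its signature.
F = TypeVar("F", bound=Callable[..., Any])


def cached(path: str = "cache") -> Callable[[F], F]:
    """Hypothetical decorator factory; an in-memory dict stands in for the file at `path`."""
    store = {}

    def decorator(func: F) -> F:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if path not in store:
                store[path] = func(*args, **kwargs)
            return store[path]

        # cast tells the type checker that the wrapper has the same type as func.
        return cast(F, wrapper)

    return decorator


@cached("smhi")
def get_numbers(scale: int = 1) -> list:
    return [scale, 2 * scale, 3 * scale]
```

With this hint, editors and type checkers show the signature and return type of `get_numbers` itself rather than a generic wrapper.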

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024

Use relative import of cache_df Klimatbyran#682

1e5e324

Fixes a problem when using unittest rather than pytest.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024

Revert "Speedup read_excel for SMHI data Klimatbyran#682"

b39530e

This reverts commit 216339e

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024

Speedup read_excel for SMHI data Klimatbyran#682 Klimatbyran#682

5449161

Use openpyxl to read SMHI data. Note: calamine is only 15% faster which is not worth an additional dependency.

joakimbits added a commit to joakimbits/klimatkollen that referenced this issue Oct 15, 2024

Speedup read_excel for SMHI data using openpyxl Klimatbyran#682

0d2575f

Use openpyxl to read SMHI data. Note: calamine is only 15% faster which is not worth an additional dependency.
