Speed-up data processing #682
read_excel in get_smhi_data dominates everything else.
According to https://hakibenita.com/fast-excel-python, python-calamine gives roughly 10x faster Excel reads.
After adding python-calamine to requirements, pandas can use it directly.
Changing read_excel to engine='calamine' in get_smhi_data cut the test suite by 6 seconds, from 15 to 9 seconds. But it still dominates the test time.
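Under the hood this is a one-line engine swap; a minimal sketch, where the file path and function signature are illustrative guesses rather than the repository's actual code:

```python
import pandas as pd

# Illustrative stand-in for get_smhi_data; only the engine argument matters.
def get_smhi_data(path="smhi_data.xlsx", engine="calamine"):
    # engine="calamine" needs the python-calamine package (pandas >= 2.2);
    # pass engine=None to fall back to the default .xlsx reader (openpyxl).
    return pd.read_excel(path, engine=engine)
```

The real get_smhi_data presumably does more post-processing; the point is that only the engine keyword changes.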
According to https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d we get an order of magnitude faster loading if we also cache the data: https://miro.medium.com/v2/resize:fit:720/format:webp/1*-QoJbusw3MUYdms0lbmd4Q.png
Decorated get_smhi_data with a file cache that is reused if it was written in the current year. The whole suite now completes in under a second!
Actually just 0.2 seconds - so fast that the profiler has nothing to show, just an empty .csv file.
Generate a CPU intensity graph of the functions used during pytest. * Add a pytest project file * Add pytest-profiling * Add graphviz * Document how to profile performance
This shaved off 6 seconds from the test. * Use calamine when importing SMHI data
Speed up repeat usage of SMHI data to a fraction. * Add feather-format * Use a feather file cache for the imported SMHI data.
…yran#682 Generalize the df cache decorator.
Refactor out the cache_df decorator.
feather is part of pyarrow which is already in requirements. * Remove feather-format
Do not mention excel in cache_df. Also clarify why column names are cached separately.
* More explicit hint on functions supported. * Clarify options.
* Make default path a valid string. * Remove obsolete error handling for no path.
* Hint a return type that is the same as the decorated function.
Fixes a problem when using unittest rather than pytest.
This reverts commit 216339e
Use openpyxl to read SMHI data. Note: calamine is only 15% faster which is not worth an additional dependency.
Support pytest-profiling of data tests to find and fix performance bottlenecks
The data may need some flattening to numeric dtypes before processing.
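For instance (the column name and values are invented for illustration), pandas can coerce object columns to numeric dtypes like this:

```python
import pandas as pd

# Invented example of a mixed/object column as it might arrive from Excel.
raw = pd.DataFrame({"temperature": ["1.5", "2.0", "n/a"]})
# errors="coerce" turns unparseable entries into NaN instead of raising.
raw["temperature"] = pd.to_numeric(raw["temperature"], errors="coerce")
print(raw["temperature"].dtype)  # float64
```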
Definition of Done: Can run pytest and plot a heat map of time spent in functions.
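pytest-profiling is a thin layer over the standard-library cProfile module; a standalone sketch of the kind of per-function timing it collects (slow() is just a dummy workload):

```python
import cProfile
import io
import pstats

def slow():
    # Dummy workload standing in for a data-processing function.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
slow()
profiler.disable()

# Print the five most expensive calls by cumulative time; pytest-profiling
# renders the same stats as an SVG call graph via graphviz.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```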
Not urgent, but a good introduction for Joppe on the data processing pipelines.
Contact Joppe for questions/discussions/suggestions
In addition to the Definition of Done, the following always apply:
See doc/contributing.md.