[ENH] Add lazy imports to speed up the time taken to load pyjanitor (part 2) #1180

Merged
merged 4 commits into from
Oct 29, 2022

Conversation

asmirnov69
Contributor

PR Description

This PR resolves #1059

The code changes are taken from the original PR #1060 and applied on top of the current dev branch head.

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch.
  2. If you're not on the contributors list, add yourself to AUTHORS.md.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

@ericmjl

@codecov

codecov bot commented Oct 18, 2022

Codecov Report

Merging #1180 (6610448) into dev (352977c) will decrease coverage by 0.40%.
The diff coverage is 79.41%.

@@            Coverage Diff             @@
##              dev    #1180      +/-   ##
==========================================
- Coverage   97.99%   97.58%   -0.41%     
==========================================
  Files          76       76              
  Lines        3387     3397      +10     
==========================================
- Hits         3319     3315       -4     
- Misses         68       82      +14     

@ericmjl
Member

ericmjl commented Oct 19, 2022

Hi @asmirnov69, thanks for taking the initiative on this PR!

Can I ask, when you benchmark import times on your own machine, do you observe a speedup in import times?

I'm asking b/c on a fresh Codespace, running the import benchmark gets me:

image

However, a subsequent run gets me:

image

As you probably can tell, the absolute first clean import on a machine will take ~6 seconds, and subsequent imports will be fast. I'm not sure whether this will be desired behaviour or not, though. What are your thoughts here?

@asmirnov69
Contributor Author

asmirnov69 commented Oct 19, 2022

Hi @ericmjl

It looks like the first run is busy caching bytecode into __pycache__ dirs, but in your case it is far too slow. The difference and the timings you observe are much bigger than what I see on my Ubuntu 22.04 machine (4 CPUs, 16 GB) in AWS. All my tests finish in under 1 sec, including the test with no cached bytecode.
I will run one more test on a fresh instance and let you know if I observe a significant slowdown on first use, but I think something is happening on your system.

A few immediate observations:

  • tuna.png doesn't pass pre-commit: the file is ~800k, which is bigger than the allowed 500k
  • Simple timing shows a ~30% improvement in import time. Removing the __pycache__ dirs slows both tests down by ~8%. Using the -B CLI option to prevent bytecode caching gives the same ~8% slowdown, as expected:
time python -c 'import janitor' # real time ~0.7 secs
time python -B -c 'import janitor' # same, without writing bytecode
python -X importtime -c 'import janitor' |& tail # look at the last line
python -B -X importtime -c 'import janitor' |& tail # same, without writing bytecode
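
The shell timings above could also be scripted so that runs are easier to compare. This is only a sketch: the helper name import_time_us is hypothetical, and it assumes the last "import time:" line on stderr belongs to the requested module, which holds only if that module is not already loaded at interpreter startup.

```python
import subprocess
import sys


def import_time_us(module, no_bytecode=False):
    """Cumulative import time of `module` in microseconds, in a fresh interpreter."""
    cmd = [sys.executable]
    if no_bytecode:
        cmd.append("-B")  # do not write .pyc files into __pycache__
    cmd += ["-X", "importtime", "-c", f"import {module}"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)

    # -X importtime reports on stderr, one line per imported module:
    #   import time: <self us> | <cumulative us> | <module name>
    # The top-level import finishes last, so it is the last report line.
    report = [ln for ln in result.stderr.splitlines() if ln.startswith("import time:")]
    _, cumulative, _ = (part.strip() for part in report[-1].split("|"))
    return int(cumulative)
```

Calling import_time_us("janitor") and import_time_us("janitor", no_bytecode=True) on each branch would give numbers comparable to the last line of the tail output.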

A 30% improvement is a smaller gain than you've seen (~50%). But it looks like it makes sense to use lazy imports in pyjanitor, since the gain is there, and if anything goes wrong in future development it should be easy to change the import statements.
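
For readers unfamiliar with the technique, here is a minimal sketch of a lazy import using only the standard library's importlib.util.LazyLoader. This PR uses the third-party lazy_loader package (lazy.load) instead; the helper name lazy_import below is hypothetical and is not pyjanitor's code.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return a module object whose code runs only on first attribute access."""
    if name in sys.modules:
        # Already imported (eagerly or lazily): reuse it.
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up laziness; does not run the module yet
    return module
```

The cost of importing a heavy dependency is then paid at first use rather than at import time of the package, which is why the first attribute access can be noticeably slower than later ones.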


In general, slow startup of a script due to heavy package dependencies can become a problem elsewhere too. For example, with ordinary executables, shared objects are sometimes loaded before program start using LD_PRELOAD (also used if you have circular dependencies). Maybe the Python 3 team will add something like this to Python itself in the future.

But the real solution could be a proper split of pyjanitor into pluggable modules (as mentioned in #826). pyjanitor's list of mandatory dependencies is too long.

In the long term I am in favor of packaging pyjanitor with optional dependencies, which I think is the proper way to implement the ask of #826:

Is the following possible?
pip install pyjanitor[base]
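
If dependencies were split into extras like this, submodules would need to fail clearly when an optional package is missing. One possible guard is sketched below; the helper name import_optional and the extra name used in the message are hypothetical, since this thread does not say which extras pyjanitor would define.

```python
import importlib


def import_optional(module_name, extra):
    """Import an optional dependency, or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Chain the original error so the real cause stays visible.
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'pyjanitor[{extra}]'"
        ) from err
```

A submodule could call import_optional at the top of a function that needs the dependency, so that a base install still imports cleanly.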

BTW, I can explore what kind of visualization support could be added to my proposal #1176 to start addressing the package dependency problem. The idea could be to collect import dependencies in RDF triple form and then produce a useful diagram from that, e.g. a dependency graph of pyjanitor submodules.
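
A rough sketch of the collection step, assuming plain (subject, predicate, object) tuples stand in for RDF triples. The function name import_triples and the "imports" predicate are hypothetical and are not taken from proposal #1176.

```python
import ast


def import_triples(module_name, source):
    """Collect (module, "imports", dependency) triples from Python source text."""
    triples = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a, b as c  ->  one triple per imported name
            for alias in node.names:
                triples.append((module_name, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # from pkg import x  ->  record pkg; relative imports are skipped
            triples.append((module_name, "imports", node.module))
    return triples
```

Running this over each file in the package would yield an edge list that a graph tool could render as a submodule dependency diagram.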

@asmirnov69
Contributor Author

What should be done about the codecov check failures? Is it a mandatory precondition that all checks be green for a contribution to be accepted?

@ericmjl
Member

ericmjl commented Oct 20, 2022

@asmirnov69 thanks for the detailed response, and for putting in the time and effort to do profiling on your end.

Considering what you've written, I think we're good with the PR.

With respect to codecov, the big ones to look out for are that all tests pass (primarily), and that code coverage on the project doesn't decrease significantly. In this case, a decrease by 0.41% is acceptable.

Member

@ericmjl ericmjl left a comment

Approving the PR. Let's see how this works out on dev (i.e. make sure all tests still pass).

@asmirnov69
Contributor Author

@ericmjl do I need to do anything at this point? Who is going to complete this PR with a merge commit to dev?

@ericmjl
Member

ericmjl commented Oct 24, 2022

We'll need one more review from the team. I just asked for @samukweku, @thatlittleboy, and @Zeroto521. As a rule of thumb, we try to go for 2 reviews (but time-bound to ~1-1.5 weeks of having a PR open.)

Member

@Zeroto521 Zeroto521 left a comment

Some small patches.

Since lazy_loader is imported in janitor/__init__.py, it's a required package,
so it needs to be added to .requirements/base.in.

Comment on lines +4 to 7
import lazy_loader as lazy
import numpy as np
import pandas_flavor as pf
import pandas as pd
Member

I see that all packages in janitor/math.py are wrapped with lazy.load,
but janitor/functions/impute.py only wraps scipy.stats.

Contributor Author

I didn't try to go beyond the original @ericmjl code changes.
It could be true that some newer additions on dev require changes to imports. I assume we can address that in subsequent PRs.

Member

Good point, @Zeroto521. We can continue to improve coverage of lazy loading, so no worries here. Plus we might also benefit from Python having lazy loading by default! (I'm thinking of PEP 690 here.) Let's stick with what we have for now.

@ericmjl
Member

ericmjl commented Oct 29, 2022

Three approvers! I am going to hit the merge button 😄.

@ericmjl ericmjl merged commit 36de148 into pyjanitor-devs:dev Oct 29, 2022
Development

Successfully merging this pull request may close these issues.

Enable lazy importing for external dependencies
4 participants