Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build_zipmanifest results should be memoized #154

Closed
ghost opened this issue Feb 25, 2014 · 15 comments
Closed

build_zipmanifest results should be memoized #154

ghost opened this issue Feb 25, 2014 · 15 comments

Comments

@ghost
Copy link

ghost commented Feb 25, 2014

Originally reported by: wickman (Bitbucket: wickman, GitHub: wickman)


as far as I can tell, build_zipmanifest is not cached.

from a recent profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.942    4.942 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/pex.py:124(_execute_internal)
        1    0.000    0.000    3.830    3.830 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:114(activate)
        1    0.001    0.001    3.823    3.823 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:121(_activate)
        1    0.000    0.000    3.737    3.737 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:105(update_candidate_distributions)
       56    0.003    0.000    3.731    0.067 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:84(load_internal_cache)
        1    0.032    0.032    3.728    3.728 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:63(write_zipped_internal_cache)
       55    0.002    0.000    2.947    0.054 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/util.py:46(distribution_from_path)
       57    0.005    0.000    2.945    0.052 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/pkg_resources.py:1703(__init__)
       57    0.183    0.003    2.938    0.052 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/pkg_resources.py:1452(build_zipmanifest)

This is the profile for starting up a PEX file (zipped python environment, see https://mail.python.org/pipermail/distutils-sig/2014-January/023727.html ) with a number of exploded eggs inside. build_zipmanifest is called with the same archive every time we construct a Distribution via EggMetadata:

    def __init__(self, module):
        EggProvider.__init__(self,module)
        self.zipinfo = build_zipmanifest(self.loader.archive)
        self.zip_pre = self.loader.archive+os.sep

It's not an unreasonable assumption that each time you construct a new Distribution, it will either be on disk or part of its own zip archive, meaning these would not be duplicated calls.

In our case, all eggs are in a single zip. This means 57 50ms calls instead of 1 50ms call in order to run this Python application which has 57 egg dependencies.

The proposal is to cache calls to zipmanifest (perhaps invalidating should os.stat/mtime change.)


@ghost
Copy link
Author

ghost commented Mar 29, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


Not a bad idea. Should be easy enough.

It looks like code that I had worked on due to a issue with pypy. So just adding some additional information for anyone that might beat me to the punch.

Originally distribute used zipimports._zip_directory_cache Which cached this, except the private interface is inconsistent between pypy/cpython/win32. https://bitbucket.org/tarek/distribute/issue/27

@ghost
Copy link
Author

ghost commented Apr 16, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


Here is a file you can test, if interested, I believe it should be caching now:

https://bitbucket.org/philip_thiem/setuptools/downloads#download-348570

Alternatively, I'd like a good test file if possible. :)

@ghost
Copy link
Author

ghost commented May 31, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


Pending Pull Request #58

@ghost
Copy link
Author

ghost commented Jun 14, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


caching the zip manifests Fixes #154

@ghost
Copy link
Author

ghost commented Jun 14, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


This work has been merged, but I have one concern about performance. We're making a speed vs. memory tradeoff here. This change will necessarily increase the memory usage of setuptools, potentially substantially if a lot of zip packages are used, and that memory won't be reclaimed for the life of the process. I'm now starting to question whether this functionality should be enabled by default.

As a result, I've bookmarked the proposed commits with a 'issue154' bookmark, leaving 'master' in a separate ancestry.

@ghost
Copy link
Author

ghost commented Jun 14, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


The more I think about it, the speed vs. memory tradeoff is a hard one to make at a library like this, but I think the default position should be to favor memory usage to speed. What if instead each process was able to opt-in to enable caching?

@ghost
Copy link
Author

ghost commented Jun 21, 2014

Original comment by philip_thiem (Bitbucket: philip_thiem, GitHub: Unknown):


Looks like the response that I sent didn't get posted. I think opt in is fine. I can update my branch there to make the cache optional. I could certainly pull that information from a command line or env, I will have to refresh my memory where configuration is stored in memory.

@ghost
Copy link
Author

ghost commented Jul 1, 2014

Original comment by orsenthil (Bitbucket: orsenthil, GitHub: orsenthil):


I could not find this in master/default/tip. (Plus no reference to this change in NEWS file). Will this change be available part of setuptools release?

#!python

[localhost setuptools]$ hg grep -a 'ZipManifests' -r tip
[localhost setuptools]$ hg grep -a 'ZipManifests'
pkg_resources.py:2884:class ZipManifests(dict):
pkg_resources.py:2884:zip_manifests = ZipManifests()

@ghost
Copy link
Author

ghost commented Jul 1, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


You're right. Although the changes were merged into the master repo, they have not been merged into the master branch from which releases are made. I'm still pondering how opt in should work for this feature.

@ghost
Copy link
Author

ghost commented Jul 5, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


I just ran an experiment on my Python 3.4 Windows 64-bit environment where I have 54 zip eggs and found that enabling this feature adds 1.4Mb (+6.6%) to the heap usage versus the status quo, but I will see almost no speed benefit from the feature as my environment has an approximately one-to-one mapping of module to zip egg.

That's a pretty substantial cost to impose on nearly all processes, especially when the benefits will only be had is specially-crafted scenarios, such as the OP presents.

@ghost
Copy link
Author

ghost commented Jul 5, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Make memoized zip manifests opt-in using the PKG_RESOURCES_CACHE_ZIP_MANIFESTS environment variable. Ref #154.

@ghost
Copy link
Author

ghost commented Jul 5, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Split MemoizedZipManifests from ZipManifests. Ref #154.

@ghost
Copy link
Author

ghost commented Jul 5, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


This feature has been released in Setuptools 5.4. Please try it out and let me know what you think. As you can see above, it's an opt-in only feature, at least for now. A pretty compelling argument would be needed to justify enabling this behavior by default.

@ghost
Copy link
Author

ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Issue #240 was marked as a duplicate of this issue.

@ghost
Copy link
Author

ghost commented Aug 15, 2014

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


I have 54 zip eggs and found that enabling this feature adds 1.4Mb (+6.6%) to the heap usage versus the status quo

I may have been mistaken here. Analyzing the code, I would have expected 5.3 to also be storing the zip manifests for every zip file (in the zipinfo property). I suspect when I said "status quo" here, I was simply disabling the caching, not actually measuring against 5.3 (the actual status quo). Since I didn't actually log my experiment, I can't say for sure at this point, but I'm going to assume that was the case.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

0 participants