Description
What's the problem this feature will solve?
Workflows that require frequent re-creation of virtual envs from scratch are very slow: a large amount of time is spent collecting wheels even though all the artefacts are already in the local cache.
To demonstrate, let's consider installing from this (pip-compiled) requirements.txt:
requirements.txt
#
# This file is autogenerated by pip-compile with python 3.8
# To update, run:
#
# pip-compile --no-emit-index-url requirements.in
#
ansible==4.5.0
# via -r requirements.in
ansible-core==2.11.4
# via ansible
asgiref==3.4.1
# via django
attrs==21.2.0
# via pytest
backcall==0.2.0
# via ipython
black==21.8b0
# via -r requirements.in
bokeh==2.3.3
# via -r requirements.in
brotli==1.0.9
# via flask-compress
cashier==1.3
# via -r requirements.in
certifi==2021.5.30
# via requests
cffi==1.14.6
# via cryptography
charset-normalizer==2.0.4
# via requests
click==8.0.1
# via
# -r requirements.in
# black
# flask
cloudpickle==2.0.0
# via dask
cryptography==3.4.8
# via ansible-core
cycler==0.10.0
# via matplotlib
dash==2.0.0
# via -r requirements.in
dash-core-components==2.0.0
# via dash
dash-html-components==2.0.0
# via dash
dash-table==5.0.0
# via dash
dask==2021.9.0
# via -r requirements.in
decorator==5.1.0
# via ipython
django==3.2.7
# via -r requirements.in
docker==5.0.2
# via -r requirements.in
flake8==3.9.2
# via -r requirements.in
flask==2.0.1
# via
# -r requirements.in
# dash
# flask-compress
flask-compress==1.10.1
# via dash
fsspec==2021.8.1
# via dask
idna==3.2
# via requests
importlib-resources==5.2.2
# via yachalk
iniconfig==1.1.1
# via pytest
ipython==7.27.0
# via -r requirements.in
itsdangerous==2.0.1
# via flask
jedi==0.18.0
# via ipython
jinja2==3.0.1
# via
# -r requirements.in
# ansible-core
# bokeh
# flask
joblib==1.0.1
# via scikit-learn
kiwisolver==1.3.2
# via matplotlib
locket==0.2.1
# via partd
markupsafe==2.0.1
# via jinja2
matplotlib==3.4.3
# via -r requirements.in
matplotlib-inline==0.1.3
# via ipython
mccabe==0.6.1
# via flake8
mypy==0.910
# via -r requirements.in
mypy-extensions==0.4.3
# via
# black
# mypy
numpy==1.21.2
# via
# -r requirements.in
# bokeh
# matplotlib
# pandas
# pyarrow
# scikit-learn
# scipy
packaging==21.0
# via
# ansible-core
# bokeh
# dask
# pytest
pandas==1.3.3
# via -r requirements.in
parso==0.8.2
# via jedi
partd==1.2.0
# via dask
pathspec==0.9.0
# via black
pexpect==4.8.0
# via ipython
pickleshare==0.7.5
# via ipython
pillow==8.3.2
# via
# bokeh
# matplotlib
platformdirs==2.3.0
# via black
plotly==5.3.1
# via dash
pluggy==1.0.0
# via pytest
prompt-toolkit==3.0.20
# via ipython
ptyprocess==0.7.0
# via pexpect
py==1.10.0
# via pytest
pyarrow==5.0.0
# via -r requirements.in
pycodestyle==2.7.0
# via flake8
pycparser==2.20
# via cffi
pyflakes==2.3.1
# via flake8
pygments==2.10.0
# via ipython
pyparsing==2.4.7
# via
# matplotlib
# packaging
pytest==6.2.5
# via -r requirements.in
python-dateutil==2.8.2
# via
# bokeh
# matplotlib
# pandas
pytz==2021.1
# via
# django
# pandas
pyyaml==5.4.1
# via
# -r requirements.in
# ansible-core
# bokeh
# dask
regex==2021.8.28
# via black
requests==2.26.0
# via
# -r requirements.in
# docker
resolvelib==0.5.5
# via ansible-core
scikit-learn==0.24.2
# via sklearn
scipy==1.7.1
# via
# -r requirements.in
# scikit-learn
six==1.16.0
# via
# cycler
# plotly
# python-dateutil
sklearn==0.0
# via -r requirements.in
sqlitedict==1.7.0
# via -r requirements.in
sqlparse==0.4.2
# via django
tenacity==8.0.1
# via plotly
threadpoolctl==2.2.0
# via scikit-learn
toml==0.10.2
# via
# mypy
# pytest
tomli==1.2.1
# via black
toolz==0.11.1
# via
# dask
# partd
tornado==6.1
# via bokeh
traitlets==5.1.0
# via
# ipython
# matplotlib-inline
typing-extensions==3.10.0.2
# via
# black
# bokeh
# mypy
urllib3==1.26.6
# via requests
wcwidth==0.2.5
# via prompt-toolkit
websocket-client==1.2.1
# via docker
werkzeug==2.0.1
# via flask
yachalk==0.1.4
# via -r requirements.in
zipp==3.5.0
# via importlib-resources
Benchmark script for reproduction:
# clean up tmp venv if it exists
rm -rf ./tmp_venv
# re-create tmp venv
virtualenv ./tmp_venv
# activate tmp venv
. ./tmp_venv/bin/activate
# install requirements
time pip install -r requirements.txt --no-deps -i https://pypi.python.org/simple
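To look at the collection cost in isolation, a rough variant of the same benchmark can time only the download/collect step (a sketch; it assumes pip download with --no-deps and -d exercises the same cache lookup as the install above):
# clean up wheel dir if it exists
rm -rf ./tmp_wheels
# collect all wheels into ./tmp_wheels without installing them
time pip download -r requirements.txt --no-deps -d ./tmp_wheels -i https://pypi.python.org/simple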
Starting with the second execution of the script, pip can rely entirely on its local cache. But even with 100% cache hits, running all the Collecting ... Using cached ... steps takes ~42 seconds. The total run time is ~92 seconds, so roughly 45% of the time is spent just collecting from the cache. This seems excessive: the disk is a fast SSD and the total amount of data to be collected should be < 1 GB, so in terms of I/O it should be possible to collect the artefacts from an SSD-based cache much faster.
In terms of raw I/O, loading the wheels from an SSD-based cache should take on the order of a few seconds. Bringing the collection time down accordingly could speed up venv creation by almost a factor of 2 in many cases. This could, for example, significantly speed up CI pipelines that need to create multiple similar venvs (in fact, venv creation is becoming an increasing bottleneck in our more complex CI pipelines).
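For reference, a crude lower bound for the raw read cost can be measured by streaming the whole cache back from disk (a sketch; pip cache dir, available since pip 20.1, prints the actual cache location):
# locate the cache (default is ~/.cache/pip on Linux)
CACHE_DIR="$(pip cache dir)"
# total size of the cache
du -sh "$CACHE_DIR"
# time a full read of every cached file
time find "$CACHE_DIR" -type f -exec cat {} + > /dev/null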
Used pip version: 21.2.4
Describe the solution you'd like
Perhaps it is possible to revisit why collecting artefacts from the local cache is so slow and whether it can be brought closer to raw I/O speed.
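If it helps with the investigation, a profile of the warm-cache install can be captured roughly like this (a sketch, assuming Python 3.7+ so that cProfile accepts -m; pip_install.prof is just an arbitrary output file name):
# profile a warm-cache install of the same requirements inside the activated venv
python -m cProfile -o pip_install.prof -m pip install -r requirements.txt --no-deps -i https://pypi.python.org/simple
# show the 30 biggest contributors by cumulative time
python -c "import pstats; pstats.Stats('pip_install.prof').sort_stats('cumulative').print_stats(30)"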
Alternative Solutions
Alternative solutions probably don't apply in the case of a performance improvement.
Additional context
All relevant information is given above.
Code of Conduct
- I agree to follow the PSF Code of Conduct.