Simplifying dependencies #23115

Closed · datapythonista opened this issue Oct 12, 2018 · 17 comments

Labels: CI (Continuous Integration) · Dependencies (Required and optional dependencies) · Needs Discussion (Requires discussion from core team before further action)

@datapythonista (Member)

I guess this has been discussed before, but I'd personally appreciate a better understanding of the dependencies, to see whether they can be simplified. I think they often generate confusion (e.g. not installing the optional ones) and some errors (e.g. editing files that are automatically generated, or forgetting to run the script).

As far as I know, pandas has three dependencies: numpy, dateutil and pytz. Those live in setup.py, so they are required when packaging. No question about this part.

Then, for the development environment, I think ideally we would like to have an environment.yml file in the root of the project, so that setting up a pandas environment is as easy as conda env create, and maintaining the list of dependencies is as easy as updating that file.

The questions then are:

  • What is the reason for splitting into dev and optional? Is it enough to justify the increased complexity?
  • Is it an option to provide the dependencies only for conda? If people are interested in using pip for a development environment, is it an option to provide the script that generates requirements.txt (i.e. convert_deps.sh) instead of the file itself?
  • For the CI dependencies, we've got 14 independent files. Is there something we could do to make things simpler? Would generating those files with a script simplify things, or make them more complex? Or at least, would it make sense to keep all those files in a separate directory (e.g. ci/requirements/)?

CC @jreback, @pandas-dev/pandas-core

@datapythonista added the CI (Continuous Integration) and Dependencies (Required and optional dependencies) labels on Oct 12, 2018
@WillAyd (Member) commented Oct 12, 2018

Not sure of the history, but I do wonder if it wouldn't make sense to combine dev and optional into just optional. To answer your second question about pip: we should maybe start taking pipenv into consideration as well, which AFAICT only allows for required and dev dependencies (not an expert there, so I could be wrong).

@datapythonista added the Needs Discussion (Requires discussion from core team before further action) label on Oct 13, 2018
@datapythonista (Member, Author)

@jreback can you help me understand the difference between dev dependencies and optional dependencies? Thanks!

@TomAugspurger (Contributor)

> What is the reason for splitting into dev and optional? And are they enough to justify the increased complexity?

Probably not worth the complexity.

> Is it an option to provide only the dependencies for conda?

I have a slight preference for providing both, but auto-generating is fine as well. Really, I think we should just have a code check that runs convert_deps.py and ensures that there's no change in the dependencies (to ensure we don't add dependencies directly to the auto-generated file).
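
A minimal sketch of such a check (the script path and output file name are assumptions for illustration):

import subprocess
import sys

# Regenerate the pip requirements from the conda environment file.
subprocess.run([sys.executable, "scripts/convert_deps.py"], check=True)

# If the regenerated file differs from the committed one, fail the build.
result = subprocess.run(["git", "diff", "--exit-code", "requirements-dev.txt"])
if result.returncode != 0:
    sys.exit("requirements-dev.txt is out of sync with environment.yml: "
             "rerun scripts/convert_deps.py and commit the result")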

> For the CI dependencies...

I value having a single environment.yaml per job; it's sometimes enough to reproduce an error locally. I'd be OK with consolidating those somehow, as long as we retain a way to get the exact environment.yaml passed to conda on the CI run. Having a single file that generates the environment.yamls may make it easier to see exactly what is tested, what's pinned, etc.

@jreback (Contributor) commented Nov 2, 2018

The CI is already way simpler than before Tom refactored it.

I am -1 on changing this any more: we purposely test many scenarios and have adequate coverage now.

@datapythonista (Member, Author)

Thanks for the feedback.

Besides generating the dependency files automatically, which I trust from your experience may not be a good idea, it may still be worth cleaning things up a bit: maybe separating the yaml files from the scripts, unifying the scripts in ci and scripts (a split which to me seems somewhat arbitrary), and making it simpler and more standard for users to set up pandas.

I'll try to open a PR, so we don't need to discuss things in an abstract way.

@datapythonista (Member, Author) commented Nov 3, 2018

I've been taking a look at the dependencies, and I generated a spreadsheet with what we have in each file.

I see a couple of things that I'm not sure are intentional:

  • The azure-37-locale file is using Python 3.6.
  • In most cases we use pytz=2013b for Python 2.7, but not in azure-windows-27 (and in travis-27-locale we have it twice, with and without the version). Same for xlsxwriter.
  • I think python-dateutil should always be in the requirements. It's missing in a few files, and in azure-windows-27 the package name is dateutil instead of python-dateutil (I guess this last one is correct, and the Windows version for 2.7 is maintained under that other package name in conda, but we should probably have a comment).
  • cython is pinned to 0.28.2 for Python 2.7, except on Windows.
  • mock is only used in one of the Python 2.7 builds (travis-27), so I guess we are not using mock anymore, and it could also be removed.
  • hypothesis is sometimes installed using conda, and sometimes using pip.
  • For sqlalchemy, we have the version pinned for Python 2.7, but a different one in each case. I'm not sure if we want to test all of them, or it just happened to end up like this.

Then, for me it's very difficult to tell what we install in every case (and why). After taking a look at all the files, I'm more strongly in favor of generating them:

  • First, to avoid errors and inconsistencies.
  • Second, to have some sort of documentation on why the packages are installed (code like if python_version == 2: pytz_version = '2013b' would be very helpful to understand what we are doing).

@TomAugspurger @jreback @jorisvandenbossche thoughts?

I attach the file I generated, in case you want to take a look:

pandas_deps.xlsx

@jorisvandenbossche (Member) commented Nov 5, 2018

I think we should split the discussion into two parts: dependencies for CI vs dependencies for contributors (to set up their local dev environment).

The question you raised above, "What is the reason for splitting into dev and optional? And are they enough to justify the increased complexity?", is as far as I know only about the contributor dependencies (and not CI)?
Also, those environment-dev.yaml and requirements-optional.txt files live in a directory with a lot of other files, which can be quite overwhelming for a contributor looking at this. Therefore, it might be a good idea to either move those somewhere else, or move the actual CI requirement/environment files one folder down (either all in a /requirements directory, or one directory per platform: /travis, /azure, ...).

Your last exploration and your proposal to autogenerate those files is then only about the CI dependencies, I think?
Do you have a concrete idea of how to autogenerate them? How would you encode all the information for the different environments (package versions, which packages to install, ...) in a single overview? I am not sure that encoding all of this in a Python if/else structure in a .py file will be very clear. We could maybe have an actual table/spreadsheet of packages (rows) vs environments (columns), with an indication of whether each package is installed and its version (values)?
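
A minimal sketch of how such a table could drive the generation (the CSV layout and all file names here are hypothetical): packages as rows, environments as columns, and a cell holding a version pin, "any" for unpinned, or nothing when the package is not installed.

import csv

with open("ci/deps.csv") as f:
    rows = list(csv.reader(f))

env_names = rows[0][1:]  # header row: package, env1, env2, ...
for col, env in enumerate(env_names, start=1):
    lines = ["name: pandas-" + env, "dependencies:"]
    for row in rows[1:]:
        package, cell = row[0], row[col]
        if not cell:
            continue  # package not installed in this environment
        lines.append("  - " + (package if cell == "any"
                               else "{}={}".format(package, cell)))
    with open("ci/{}.yaml".format(env), "w") as f:
        f.write("\n".join(lines) + "\n")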

@jorisvandenbossche (Member)

> We could maybe have an actual table/spreadsheet of packages (rows) vs environments (columns), with an indication of whether each package is installed and its version (values)?

OK, I didn't see your second unnamed sheet .. ;) That is doing exactly this. You did it as an exploration for now, but would you consider something cleaned-up like that as a basis for autogeneration?

@datapythonista (Member, Author) commented Nov 5, 2018

I agree with your points. I'm discussing (at least) two things here.

For the contributor dependencies, I don't know of any case of users who install dev but not optional (and optional often creates confusion with new contributors). I'd respect the conda standard and simply provide an environment.yml with all of them in the root of the project (so the environment can be created by simply running conda env create).

I'd personally not provide dependencies for pip. But if we do, I'd generate requirements.txt in the root automatically from environment.yml (that's the standard for pip, afaik), and add a check to the CI to make sure that the two versions always match.
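
A minimal sketch of that conversion (assuming PyYAML; packages whose conda and pip names differ, like pytables vs tables, would need a translation table on top of this):

import yaml

with open("environment.yml") as f:
    env = yaml.safe_load(f)

with open("requirements.txt", "w") as f:
    for dep in env["dependencies"]:
        # Nested entries (the "pip:" subsection) are skipped in this sketch.
        if isinstance(dep, str):
            f.write(dep + "\n")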

This could be environment.yml (we can divide the dependencies into sections, and if a user really needs to install just a subset, they can do so by editing the file and commenting out the parts they are not interested in):

name: pandas-dev

channels:
  - defaults
  - conda-forge

dependencies:
  # required
  - numpy
  - python-dateutil>=2.5.0
  - pytz

  # development
  - cython>=0.28.2
  - flake8
  - flake8-comprehensions
  - hypothesis>=3.58.0
  - isort
  - moto
  - pytest>=3.6
  - python=3
  - setuptools>=24.2.0
  - sphinx
  - sphinxcontrib-spelling

  # optional
  - beautifulsoup4>=4.2.1
  - blosc
  - bottleneck>=1.2.0
  - fastparquet
  - gcsfs
  - html5lib
  - ipython>=5.6.0
  - ipykernel
  - jinja2
  - lxml
  - matplotlib>=2.0.0
  - nbsphinx
  - numexpr>=2.6.1
  - openpyxl
  - pyarrow>=0.4.1
  - pymysql
  - pytables>=3.4.2  # pip: tables>=3.4.2
  - pytest-cov
  - pytest-xdist
  - s3fs
  - scipy>=0.18.1
  - seaborn
  - sqlalchemy
  - statsmodels
  - xarray
  - xlrd
  - xlsxwriter
  - xlwt

Regarding the dependencies for CI: I think it's impossible to maintain what we have without incurring errors and inconsistencies. I need some more info to have an informed opinion and make a proposal, but something like this sounds like a much better option:

deps = ['numpy', 'pytz', 'python-dateutil', 'pytest']

if docs:
    deps += ['sphinx',
             'nbsphinx',  # used for the `.. ipython:: python` directive
             ...]
if slow:
    deps += ['pymysql', ...]

if py2:
    deps.remove('pytz')
    deps += ['pytz=2013b', ...]

This would make it very clear what we need in each case; we could have comments documenting why a dependency is needed; changes (upgrading a version, for example) would be made in just one place, simplifying work and avoiding errors... And we wouldn't need 10 or more files adding clutter to the ci/ directory.
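
As a hypothetical continuation of that sketch (and assuming the ... placeholders above are filled in), the computed list could then be dumped straight to the per-job environment file; the function and file names here are illustrative:

def write_env_file(path, deps, name="pandas-ci"):
    # Emit a conda environment file from the computed list of specs.
    with open(path, "w") as f:
        f.write("name: {}\n".format(name))
        f.write("dependencies:\n")
        for dep in deps:
            f.write("  - {}\n".format(dep))

write_env_file("ci/travis-36.yaml", deps)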

Maybe I'm missing something, but it seems a much, much better approach to me.

@jreback (Contributor) commented Nov 5, 2018

+1 on changing to a single file for users. The reason we had multiple files was to make it 'faster' to install, but I suspect it doesn't make much difference at the end of the day, and a single file avoids problems with deps not being installed.

+0 on consolidating the CI dep files a little. It's already in the /ci directory, so a user shouldn't even care about this if they actually read the documentation.

-1 on trying to change the CI .yaml files even more. These are quite straightforward atm. They are meant to test very specific things; consolidating them has in the past caused us not to test what we want, IOW the oldest possible versions of things, and/or to separate the CI runs into logical groups (e.g. slow tests on older versions of Python, and so on).

@datapythonista (Member, Author)

I fail to see how unifying/automating the generation of the CI yaml files a bit should cause more confusion about what is being tested than maintaining 14 independent files. But I don't have the experience you have with the CI, so if you and @TomAugspurger are sure about that, let's forget about it.

Btw, did you check the points I mentioned in #23115 (comment)? I guess at least the first one is an error; I'm not sure about the others.

Regarding the single environment.yml file, if no one opposes simplifying that, we just need to decide:

  1. We provide just the conda file and not the pip ones
  2. We provide both, and we maintain them manually
  3. We provide both, we generate the pip one with a script, and we validate in the CI that they match

I'd start with 1, and would implement 3 only if users complain. :)

@TomAugspurger (Contributor)

I have a slight preference for many environment.yaml files, rather than generating them, since it makes the CI stuff a bit more explicit (smaller chance of us accidentally not testing something because of a bug in whatever we would use to generate them).

I think seeing a matrix of exactly which deps we test where would be extremely valuable. I started to write a script to do this; IIRC I stopped because conda env create doesn't have the equivalent of a --dry-run. But maybe just doing things statically would be enough? Something like the sketch below.
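
A rough static version of that idea, as a sketch (assuming PyYAML and that the per-job files live in ci/*.yaml):

import glob
import yaml

matrix = {}  # package name -> {env file: full spec}
for path in sorted(glob.glob("ci/*.yaml")):
    with open(path) as f:
        env = yaml.safe_load(f)
    for dep in env.get("dependencies", []):
        if not isinstance(dep, str):
            continue  # skip the nested "pip:" section
        # Strip any version constraint to get the bare package name.
        name = dep.split("=")[0].split(">")[0].split("<")[0]
        matrix.setdefault(name, {})[path] = dep

for name in sorted(matrix):
    print(name, matrix[name])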

I have a slight preference for providing both an environment.yaml and a requirements-dev.txt with all the deps. Not having an obvious requirements-dev.txt may make us look like bad Python citizens.

@jbrockmendel (Member) commented Nov 5, 2018

+1 on a ci/env_files/ directory or something.

+1 on joris's suggestion to split the discussion along CI vs user-facing lines.

> Then, for me it's very difficult to tell what we install in every case (and why)

If this is not obvious to @datapythonista, that is prima facie evidence that things could be better documented or otherwise clarified.

@jorisvandenbossche (Member)

CI deps:

> I have a slight preference for many environment.yaml files, rather than generating them, since it makes the CI stuff a bit more explicit

We could still check in the autogenerated yaml files; then you wouldn't have this concern?


Contributor deps:

> I have a slight preference for providing both an environment.yaml and a requirements-dev.txt with all the deps. Not having an obvious requirements-dev.txt may make us look like bad Python citizens.

+1

> We provide both, we generate the pip one with a script

Isn't that what we already do? The pip requirements files are autogenerated.

BTW, I am fine with providing a single environment file instead of the dev + optional files for contributors. But I think it is still useful to distinguish those two "kinds" of dependencies (e.g. in the docs): the dev dependencies are the ones needed to build pandas and run the test suite. With just those, the test suite should pass, as we should skip all tests that use optional dependencies (whereas the dev dependencies themselves are required). I think this notion can be useful for people who want to do a minimal install of pandas but still want to test their installation. See the sketch below.
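
For illustration, this distinction is what pytest.importorskip gives us in the test suite: with a dev-only install, a test that needs an optional dependency is skipped rather than failed (a minimal, hypothetical test):

import pytest

def test_needs_scipy():
    # Skipped (not failed) when the optional dependency is missing.
    scipy = pytest.importorskip("scipy")
    assert scipy.__name__ == "scipy"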

@jorisvandenbossche (Member)

OK, I see that there are already PRs; I will check those :)

@datapythonista (Member, Author)

In case it's useful for anyone else, I'm uploading a summary of the builds we have.

pandas_builds.xlsx

@datapythonista (Member, Author)

I think this was mainly addressed and no further action is expected, closing.
