
Python @ GCCcore #7463

Closed
Micket opened this issue Jan 21, 2019 · 3 comments


Micket commented Jan 21, 2019

Background

Compiling Python together with the SciPy bundle at the toolchain level has disadvantages:

  1. Unnecessarily duplicated Python interpreters.
  2. Any module with a direct or indirect runtime dependency on Python is forced to the toolchain level as well.

Having libpython available as a runtime dependency at GCCcore will significantly reduce the number of modules, as well as ensure that all GCCcore-based toolchains automatically have access to a wider base of modules.

Some affected configs that are currently at the toolchain level (due to Python) are:

  • Mesa, Qt, PyQt, Tkinter, GTK, PyGTK, GObject-Introspection, PyGObject, PyCairo, PyOpenGL, wxPython, CGAL, GDAL,
  • Meson, pkgconfig, SCons, SWIG, Mako, ANTLR, wheel,
  • Boost, PyYAML, lxml, Pillow, PostgreSQL, nd2reader, PIMS, xarray, configparser, future, cftime, Greenlet, sympy

(The above list is just from browsing modules from past toolchains; not all of them may be applicable for moving to GCCcore, and the list is not exhaustive.)

Limitations

There have been ongoing discussions about how we handle PYTHONPATH in EasyBuild. That is a completely separate issue which applies here regardless.

The list of packages in the Python easyconfigs that can't/shouldn't move to GCCcore are:

  • numpy
  • scipy
  • pandas
  • mpi4py
  • deap (unsure about this one)

Anything that in turn depends on these packages specifically would stay at the toolchain level.

Suggestion 1: Just split existing package bundle

Keeping SciPy and friends separate would let the current Python config, and related dependees, be moved to GCCcore.

  • Positive: Change in EasyBuild is trivial (just change a few easyconfigs).
  • Positive: No need for new names (like bare, base, core suffixes). Just Python-x.x.x-GCCcore-x.x.x.eb.
  • Positive/Negative: A separate SciPy module lets users search for it among the modules (no need to explain that it's part of Python). On the other hand, we will have to explain to existing users that, from now on, it is no longer part of Python.
  • Negative: The exp function may be slower in the GCCcore version compared to an Intel-compiled one. It is uncertain whether this has an impact on any real code, as it doesn't affect vectorized operations in NumPy.

The scipy, numpy, pandas, mpi4py and deap packages can either be packaged together as a "SciPy" bundle, or as individual packages.
There are (old) individual easyconfigs for pandas and numpy, so it might be consistent to just keep them individual?
Individual packages are also easier to debug than bundles when something goes wrong in the install.
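For illustration, a separate bundle could look roughly like the sketch below. This is a hypothetical easyconfig: the bundle name, versions and extension list are my assumptions, not the easyconfigs that were eventually merged.

```python
# Hypothetical easyconfig sketch for a separate SciPy bundle at the
# toolchain level; all names and versions here are illustrative guesses.
easyblock = 'PythonBundle'

name = 'SciPy-bundle'
version = '2019.03'

homepage = 'https://www.scipy.org'
description = "Bundle of Python packages for scientific computing"

toolchain = {'name': 'foss', 'version': '2019a'}

# Python itself lives at GCCcore and is pulled in as a regular dependency
dependencies = [('Python', '2.7.15')]

# the packages that have to stay at the toolchain level (see list above)
exts_list = [
    ('numpy', '1.16.2'),
    ('scipy', '1.2.1'),
    ('pandas', '0.24.2'),
    ('mpi4py', '3.0.1'),
    ('deap', '1.2.2'),
]
```

The point of the sketch is only the split: Python moves down to GCCcore, while the compiler- and MPI-sensitive packages stay together in one toolchain-level bundle.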

Suggestion 2: Introduce a Python-core config

Ref: #6537

Similar to suggestion 1, but by introducing a new name, Python-core, we can keep a toolchain-level Python.

  • Negative: Requires fixes in the easybuild framework to recognize Python-core as a Python package.
  • Positive: Modules look basically the same for users.
  • Negative: Have to keep explaining to new users that numpy is part of Python.
  • Negative: same exp/trig function issue as with suggestion 1.
  • Negative: Requires slightly uglier easyconfigs
  • Negative: Issues with configs/code that use EBROOTPYTHON
  • Negative: Issues with configs that use get_software_root('Python')

Suggestion 3: Shadowing libpython

Suggested by Jack Perdue
Building Python(base) at GCCcore as well as building Python at the toolchain level.

  • Negative: Shadowing may introduce ABI problems
  • Negative: Leaves modules broken, requiring users to manually pick a Python version to load before they can be used.
  • Negative: Requires building more Pythons.
  • Positive/Negative: Python gets built with icc (probably avoids the exp-function issue?)

Suggestion 4: Using MKL with GCCcore without toolchain

This approach, as done locally by Damian(?), makes very significant changes to the whole toolchain level.

  • Positive: Has other great benefits, but will only save a few extra modules.
  • Negative: Requires changes that are very unlikely to make it into the next toolchains.
  • Positive/Negative: Numpy is just compiled with GCC, shared between all toolchains.

Suggestion 5: Only as a build dep

Ref: #5072

  • Negative: Doesn't tackle the runtime dependencies, so it won't have much impact?

Suggestion 6: Use a hidden python module and shadow it

Ref: #4962 #5075

  • Negative: ABI compatibility to consider
  • Negative: Leaves modules broken; requires users to pick a Python to load before dependees (like PyGTK) work.

Known ABI differences between Intel- and GCC-compiled versions can be checked, for example:

objdump -t libpython2.7.so.1.0 | grep _PyUnicode
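One way to act on that objdump output is to diff the exported symbols of two builds. The sketch below is a hypothetical helper, with fabricated sample dump lines for illustration; it extracts the symbol name (the last field of each objdump-style line) and reports symbols present in one build but not the other:

```python
# Hypothetical helper: diff symbols between two `objdump -t`-style dumps.
# The dump strings below are fabricated examples, not real objdump output.

def symbol_diff(dump_a, dump_b, prefix="_PyUnicode"):
    """Return (only-in-a, only-in-b) symbol names matching `prefix`.

    Assumes the symbol name is the last whitespace-separated field,
    as in objdump's symbol table output.
    """
    def names(dump):
        return {line.split()[-1]
                for line in dump.splitlines()
                if line.split() and line.split()[-1].startswith(prefix)}
    a, b = names(dump_a), names(dump_b)
    return sorted(a - b), sorted(b - a)

# fabricated sample lines in the shape of `objdump -t` output
gcc_dump = """\
0000000000123450 g F .text 0000000000000040 _PyUnicode_FromString
0000000000123490 g F .text 0000000000000040 PyLong_FromLong"""
icc_dump = """\
0000000000123450 g F .text 0000000000000040 _PyUnicode_FromWideChar
0000000000123490 g F .text 0000000000000040 PyLong_FromLong"""

only_gcc, only_icc = symbol_diff(gcc_dump, icc_dump)
# only_gcc == ['_PyUnicode_FromString'], only_icc == ['_PyUnicode_FromWideChar']
```

In practice one would feed it the output of `objdump -t` on the two libpython builds; any asymmetric `_PyUnicode*` symbols are candidates for the ABI problems that shadowing could trigger.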

Performance

Only a single performance issue has been identified so far: libimf from Intel has faster exp, log, sin and cos functions. As libimf shadows libm, this also affects NumPy. The difference is only noticeable in arrays longer than ~10 elements. Other math functions didn't show much difference.

Possible fixes:

  • LD_PRELOAD=libimf.so python ...
  • Linking with libimf (from mkl) when building in GCCcore

Note: I have gotten much better performance from numexpr, using VML.
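As a rough way to see why only larger arrays are affected, one can time the exp call itself. The stdlib-only sketch below is illustrative: it uses pure Python, so the absolute numbers say nothing about libimf vs libm; it only shows that per-call overhead dominates for tiny inputs while the exp evaluations dominate for large ones.

```python
# Minimal timing sketch using only the standard library; numbers are
# illustrative and not comparable to NumPy or libimf measurements.
import math
import timeit

def time_exp(n, number=20, repeat=3):
    """Best-of-`repeat` time for applying math.exp to n values."""
    values = [0.001 * i for i in range(n)]
    return min(timeit.repeat(lambda: [math.exp(v) for v in values],
                             number=number, repeat=repeat))

tiny = time_exp(10)       # dominated by call/loop overhead
large = time_exp(10_000)  # dominated by the exp evaluations themselves
```

A real comparison would run the same NumPy workload once linked against libm and once with `LD_PRELOAD=libimf.so`, as suggested above.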

Conclusion:

I prefer suggestion 1. I don't think the changes for users are going to be that problematic, and I don't think the libimf function issue is likely to affect much real code. Suggestion 2 was tried in production and, while doable, is a hassle with no benefit.

@boegel boegel added this to the 3.9.0 milestone Jan 22, 2019
@boegel boegel added the change label Jan 22, 2019

boegel commented Jan 22, 2019

Thanks a lot for summarizing this so nicely @Micket!

W.r.t. suggestion 1:

  • We can look into still having a Python-2.7.15-foss-2019a.eb that simply installs an alias for SciPy-<datestamp?>-foss-2019a-Python-2.7.15.eb, much like we do with Java now, see https://easybuild.readthedocs.io/en/develop/Wrapping_dependencies.html. That would avoid most of the impact on users, since they can keep using module load Python/2.7.15-foss-2019a like they're used to. The wrapping support needs some work for that though, since now module wrappers are only supported between versions of one particular software name (e.g. Java), not between two different software names (here SciPy and Python).

  • Can we find a good set of (real world) benchmarks to quantify the impact of always building the Python interpreter with GCC? E.g. using the ones posted at https://software.intel.com/en-us/distribution-for-python/benchmarks, which seems to correspond to https://github.com/IntelPython/ibench?

  • I would prefer using a SciPy bundle rather than having individual easyconfigs/modules for the Python packages that are installed at the toolchain level. Not only does the latter result in a larger amount of modules/easyconfigs (and thus also a longer $PYTHONPATH which has an impact on startup performance), it also prevents sites from using the proposed wrapper approach to make this change easier to digest for end users...
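The Java-style wrapper mentioned above uses the ModuleRC generic easyblock. A sketch of what a cross-name wrapper could look like is below; note that, as stated in the comment, wrapping between two different software names is exactly what is not supported yet, so everything here (names, versions, suffixes) is a hypothetical assumption about a possible future form, not working EasyBuild input:

```python
# Hypothetical wrapper easyconfig sketch, modelled on the existing Java
# wrappers (ModuleRC easyblock). NOTE: wrapping across two different
# software names (Python -> SciPy) is what is NOT currently supported,
# so this is only an assumption about a possible future form.
easyblock = 'ModuleRC'

name = 'Python'
version = '2.7.15'

homepage = 'https://www.python.org'
description = "Wrapper so that 'module load Python/2.7.15-foss-2019a' resolves to the SciPy bundle"

toolchain = {'name': 'foss', 'version': '2019a'}

# the single dependency the wrapper would resolve to (name/version guessed)
dependencies = [('SciPy', '2019.03', '-Python-2.7.15')]
```

The attraction is backwards compatibility: existing `module load Python/...` invocations keep working while the real installation moves elsewhere.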


Micket commented Jan 22, 2019

  • If the alias doesn't confuse/conflict, then I see no reason not to add one. If so, then I think this part is completely positive; backwards compatible and a searchable SciPy module.

  • libimf vs libm and numpy.exp is a bit of a special case, so I wouldn't expect more exotic math to be affected. I ran the benchmark you linked (excluding sklearn, as I didn't have it built).
    I ran: Cholesky, Det, Dot, Fft, Inv, Lu, Eig, Svd, Blacksch, Rng.
    Unfortunately, I had to compare intel/2018a (with Python built by icc) against intel/2018b (with Python built by GCC).
    Except for Blacksch, all tests were equal or a bit better.
    Blacksch went from 26 s to 39 s (the same as the foss version).
    Looking at this particular benchmark, it just comes down to calling log, exp and a little other math on a large NumPy array.

  • I could really go either way on this, but just to play devil's advocate: couldn't the wrapper have multiple dependencies? Also, we could solve the PYTHONPATH problem with that other approach (I forget the details). And do deap and mpi4py also go into this SciPy bundle?


Micket commented Apr 26, 2019

We do this now.

Micket closed this as completed on Apr 26, 2019.