Question about Debian packaging #254

Closed
cdluminate opened this issue Sep 23, 2018 · 33 comments

Comments

@cdluminate
Contributor

I see Nico is working on BLIS packaging. I'm interested in packaging BLIS for Debian based on Nico's work. However I have some questions before doing the actual work:

  1. Is BLIS mature enough to be used as a drop-in replacement of libblas.so.3 ?
  2. Is there any benchmark that compares BLIS with other BLAS implementations such as OpenBLAS?
@fgvanzee
Member

@cdluminate Thank you for your interest in BLIS. We have been eagerly waiting for the right person from the Debian community to come along and "sponsor" our project in the Debian/Ubuntu universe. :)

  1. Is BLIS mature enough to be used as a drop-in replacement of libblas.so.3 ?

My first instinct is to answer with an emphatic "yes." However, the framing of your question is ultimately subjective, and also dependent on the actual implementation referred to by libblas.so.3, as my understanding is that libblas.so.3 is merely a generic symlink to the current BLAS library. The actual shared library to which it links would be some specific implementation, whether it is the Fortran-77 reference library from netlib, OpenBLAS, or something else.

The lead developers of the BLIS project exercise great care in their approach to software development. One of our guiding principles is to try to "get it right" the first time so we are much less likely to have to revisit/fix it in the future, and we believe this methodical approach pays dividends in the long run. We still make mistakes from time to time, but I think the public record here on GitHub shows that we are quite responsive to the community's feedback, especially with bug reports. We sometimes fix issues within hours of them being reported. BLIS also has multiple tools for checking correctness at our disposal, including a comprehensive BLIS testsuite, a C translation of the netlib BLAS test drivers, and integration with Travis CI that uses the former two mechanisms to test multiple hardware configurations via Intel's software development emulator (SDE).

So, to conclude, the answer to your question is "probably," but it depends on what you expect of libblas.so.3. BLIS is still under active development, even if the core functionality exists in a mostly stable state. Whether BLIS belongs in Debian as an officially-supported package may also depend on how frequently our sponsor will be willing to provide updated packages. The more frequent, the better. (I would hope we would have the opportunity to push our latest code to Debian at least monthly, even if we don't need to exercise each opportunity.)

  2. Is there any benchmark that compares BLIS with other BLAS implementations such as OpenBLAS?

Yes. We have performed many performance experiments that compare BLIS against OpenBLAS. Last week, at our annual BLIS Retreat--a workshop here at UT-Austin centered around BLIS-related topics--Devangi Parikh (@dnparikh) presented performance results for multiple level-3 BLAS operations, floating-point datatypes, and problem sizes, and she did so for Intel Haswell/Broadwell, Intel SkylakeX, and Cavium ThunderX2 (ARMv8). The overall story of the performance results is that BLIS is remarkably competitive and consistent in its performance.

Devangi: could you link our guest to PDFs of your graphs from the Retreat?

Also calling out to @nschloe so he can comment about BLIS in general, if he likes.

@nschloe
Contributor

nschloe commented Sep 23, 2018

Also calling out to @nschloe so he can comment about BLIS in general, if he likes.

I found BLIS as I was looking for BLAS operations on C-ordered arrays for NumPy. BLIS has that, but even better is the fact that it's developed in the open using a more modern language than Fortran.

The overall story of the performance results is that BLIS is remarkably competitive and consistent in its performance.

Plots about that should definitely go into the main README.

@fgvanzee
Member

@nschloe Thanks for your comments, Nico. I agree that the time has come for us to include some basic plots in the source distribution.
@dnparikh Perhaps we should skip straight to Nico's idea instead, and then @cdluminate can view them through github or a git clone.

@rvdg
Collaborator

rvdg commented Sep 23, 2018

Comments:

Robert

@cdluminate
Contributor Author

@fgvanzee This is the background for my first question:

Debian/Ubuntu have an alternatives system, by which the user can switch the BLAS implementation smoothly without recompiling any software, e.g.

$ sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
There are 4 choices for the alternative libblas.so.3-x86_64-linux-gnu (providing /usr/lib/x86_64-linux-gnu/libblas.so.3).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
  0            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3   40        auto mode
  1            /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3      35        manual mode
  2            /usr/lib/x86_64-linux-gnu/blas/libblas.so.3       10        manual mode
* 3            /usr/lib/x86_64-linux-gnu/libmkl_rt.so            1         manual mode
  4            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3   40        manual mode

Press <enter> to keep the current choice[*], or type selection number: ⏎       

All the alternative candidates for the libblas.so.3 shared object provide at least the standard set of BLAS API/ABI. In my question, libblas.so.3 means "standard BLAS ABI/API".

If BLIS's CBLAS implementation is mature enough, it could be used as a drop-in replacement of libblas.so.3, and it should be added as another alternative for libblas.so.3.

BLIS is still under active development, even if the core functionality exists in a mostly stable state.

That's good to hear.

Whether BLIS belongs in Debian as an officially-supported package may also depend on how frequently our sponsor will be willing to provide updated packages. The more frequent, the better.

I have sufficient permissions to upload packages to Debian. However, when Debian is about to release a new version, e.g. 10.0 Buster, the whole archive is frozen and nothing can be updated unless a package has a severe bug. I can only update packages regularly in Debian testing/unstable/experimental. So it makes sense to ask upstream before uploading, to make sure the package to be uploaded isn't too buggy.

(I would hope we would have the opportunity to push our latest code to Debian at least monthly, even if we don't need to exercise each opportunity.)

It won't be hard for me to update a Debian package as long as there is no significant ABI/API change and no significant change to the build system.

Plots about that should definitely go into the main README.

+1

And thanks @rvdg for the plot.

@cdluminate
Contributor Author

cdluminate commented Sep 24, 2018

One more question about packaging:

As shown in #255, BLIS has the best performance with openmp threading, as long as BLIS_NUM_THREADS is properly configured. This means the default configuration for the package should be

--enable-threading=openmp x86_64

Is this correct?

And does it make sense to provide another BLIS build with --disable-threading at the same time, to avoid threading library clashes under certain conditions?

@fgvanzee
Member

fgvanzee commented Sep 24, 2018

--enable-threading=openmp x86_64

Seems right, yes. You may want to do a thorough review of all configure options. For example, the BLAS integer size option is important to some people.

And does it make sense to provide another --disable-threading to avoid threading library clash under some certain conditions?

Could you clarify this question?

@cdluminate
Contributor Author

cdluminate commented Sep 24, 2018

@fgvanzee I spotted a significant performance drop when a program used iomp and gomp at the same time.

Besides, Intel MKL provides a sequential (single-threaded) module.
https://software.intel.com/en-us/mkl-linux-developer-guide-calling-intel-mkl-functions-from-multi-threaded-applications

Usage model: disable Intel MKL internal threading for the whole application

When used: Intel MKL internal threading interferes with application's own threading or may slow down the application.

@fgvanzee
Member

fgvanzee commented Sep 24, 2018

@cdluminate Sounds like you are trying to link your gcc-compiled/linked libblis.so to your application with Intel's icc?

@devinamatthews @jeffhammond You guys have more experience with using icc. Could you comment here? I don't quite understand the issue.

EDIT: BTW everyone, I'm taking the day off today. :) Thanks for your patience, @cdluminate . Hopefully others can step in and help us figure out your issues.

@devinamatthews
Member

@cdluminate I would guess that when using both iomp5 and gomp at the same time you are ending up with N^2 threads. What happens when you run with OMP_NUM_THREADS=N and BLIS_NUM_THREADS=1 (assuming there is meaningful multithreading in the calling program)?

But, even if it "works" right now, mixing two different OpenMP runtimes is a recipe for disaster. In the context of a Debian package, I would think that gomp would make a sensible default. I guess that pthreads would be even better but as noted above we don't have a thread pool implementation and so performance suffers.
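
The N^2 estimate above can be sketched as simple arithmetic. This is only an illustration under the assumption that each of the caller's threads invokes a library that spawns its own full team; `total_threads` is a hypothetical helper, not part of BLIS or any OpenMP runtime.

```c
#include <stddef.h>

/* Hypothetical helper (not a BLIS or OpenMP function): estimates how many
 * threads are live at once when every thread of an outer parallel region
 * calls into a library that creates its own inner team of threads. */
static size_t total_threads(size_t outer_threads, size_t inner_threads)
{
    return outer_threads * inner_threads;
}
```

With N = 8 in both layers this gives 64 threads, while setting the inner count to 1 (the BLIS_NUM_THREADS=1 experiment suggested above) keeps the total at 8.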

@jeffhammond
Member

Intel OpenMP runtime defines the GOMP API so if you link it into a GCC program, you should not end up linking against libgomp.so. Please run ldd on your binary to see what libraries are being linked.

@cdluminate
Contributor Author

@fgvanzee @devinamatthews @jeffhammond Thanks for the pointers. I'm sorry, but that observation came from some old C++ code of mine that doesn't use BLIS: using different threading libraries at the same time resulted in the creation of too many threads. GCC uses gomp by default but clang doesn't ...

Anyway, I think the Debian package should ship with the OpenMP version first.

@cdluminate
Contributor Author

Just a NOTE: I registered libblis as an alternative candidate for libblas.so.3 and libblas.so, and assigned BLIS priority 37. I would be happy to raise the priority of BLIS if there is strong evidence that BLIS is better than OpenBLAS in terms of cblas_* performance.

The default priority chain for Debian and Ubuntu will look like this:

OpenBLAS > BLIS > Atlas > Netlib
Justification:
OpenBLAS > BLIS: if we don't touch any environment variable at all
BLIS > Atlas: BLIS in single thread (according to my test) is still faster than generic Atlas
~/D/b/blis ❯❯❯ sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
There are 5 choices for the alternative libblas.so.3-x86_64-linux-gnu (providing /usr/lib/x86_64-linux-gnu/libblas.so.3).

  Selection    Path                                             Priority   Status
------------------------------------------------------------
  0            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3   40        auto mode
  1            /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3      35        manual mode
  2            /usr/lib/x86_64-linux-gnu/blas/libblas.so.3       10        manual mode
  3            /usr/lib/x86_64-linux-gnu/libblis.so.1            37        manual mode
* 4            /usr/lib/x86_64-linux-gnu/libmkl_rt.so            1         manual mode
  5            /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3   40        manual mode

Press <enter> to keep the current choice[*], or type selection number: ^C⏎

~/D/b/blis ❯❯❯ sudo update-alternatives --config libblas.so-x86_64-linux-gnu
There are 4 choices for the alternative libblas.so-x86_64-linux-gnu (providing /usr/lib/x86_64-linux-gnu/libblas.so).

  Selection    Path                                           Priority   Status
------------------------------------------------------------
  0            /usr/lib/x86_64-linux-gnu/openblas/libblas.so   40        auto mode
  1            /usr/lib/x86_64-linux-gnu/blas/libblas.so       10        manual mode
  2            /usr/lib/x86_64-linux-gnu/libblis.so            37        manual mode
* 3            /usr/lib/x86_64-linux-gnu/libmkl_rt.so          1         manual mode
  4            /usr/lib/x86_64-linux-gnu/openblas/libblas.so   40        manual mode

Press <enter> to keep the current choice[*], or type selection number: ^C⏎        

Please ignore the lowest priority 1 assigned to MKL.
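
To illustrate why the priority value matters: in auto mode the alternatives system selects the candidate with the highest priority. The sketch below models only that rule, using the priorities listed above; the struct and function names are hypothetical, and tie-breaking here (first entry wins) is an assumption rather than a description of update-alternatives internals.

```c
#include <stddef.h>

/* One registered candidate: the library path and its priority. */
struct alternative {
    const char *path;
    int priority;
};

/* Returns the index of the highest-priority candidate, or -1 if the list
 * is empty. On a tie the earliest entry is kept (an assumption made for
 * this sketch only). */
static int pick_auto(const struct alternative *alts, size_t n)
{
    if (n == 0)
        return -1;

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (alts[i].priority > alts[best].priority)
            best = i;
    }
    return (int)best;
}
```

With OpenBLAS at 40, BLIS at 37, ATLAS at 35, and netlib at 10, auto mode picks OpenBLAS; if OpenBLAS is not installed, BLIS becomes the automatic choice.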

@fgvanzee
Member

fgvanzee commented Sep 27, 2018

I registered libblis as an alternative candidate to libblas.so.3 and libblas.so, and assigned blis with priority 37.

@cdluminate That's great news. We appreciate your willingness to include BLIS in the Debian universe. I'm sure many of our users will be pleased to be able to enable BLIS via the distribution's internal package management tools.

I would be happy to increase the priority of BLIS to higher value if there are strong proof that suggests BLIS is better than OpenBLAS, in terms of cblas_* performance.

It all depends on your hardware, operation, datatype, problem size, and how much threading you need, and maybe other factors as well. That said, Robert has pointed you to evidence that we gathered very recently on ThunderX2 and SkylakeX--two architectures designed for high performance. (Devangi didn't include OpenBLAS on the Haswell graphs because we couldn't get it working in multithreaded mode, despite intense efforts by deeply experienced individuals.) For now, OpenBLAS outperforms BLIS for small problems (less than about 100), but that is one of the only use cases for which OpenBLAS outpaces BLIS. The evidence suggests that, for almost all larger problems, almost all operations of almost all datatypes yield better performance when executed via BLIS than with OpenBLAS.

Also consider that there are other metrics by which to measure "goodness" or "betterness" than raw performance. We believe strongly that BLIS stacks up well against OpenBLAS on virtually all of these other measures of software quality. Most important among these metrics: BLIS provides BLAS (and CBLAS) APIs, but unlike most other BLAS libraries, BLIS provides much more than just BLAS APIs. (The BLAS APIs are quite limiting for many individuals and applications, and BLIS contains APIs that attempt to break free of those limitations and expand the space of parameterization and storage formats.) For a full list of features that BLIS provides, I invite you to read our main github page, whose content is stored in the README.md file.

@jeffhammond
Member

@cdluminate Clang uses the LLVM OpenMP runtime, which is the Intel OpenMP runtime and thus contains the GOMP symbols so it too should not lead to O(n^2) threads when layered with GCC. Pthreads+OpenMP can definitely hit this, however, and it is the expected behavior.

@jeffhammond
Member

For interested parties, the GOMP symbols in KMP (Intel/LLVM OpenMP runtime) can be observed here: https://github.com/llvm-mirror/openmp/blob/master/runtime/src/kmp_gsupport.cpp

@cdluminate
Contributor Author

Another question: What's the correct configuration parameter to compile the 64-bit-index version of BLIS (or in MKL's term, ILP64 interface)? Is it -i 64 -b 64?

@fgvanzee
Member

It depends on what you want. If you only care about the integer size in the BLAS API, then you only need -b 64. If you want to ensure that the native BLIS APIs use 64-bit integers, you should use -i 64. (And if you want both, use both.)
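
As a rough picture of what two independent settings mean, the sketch below selects two integer typedefs from two compile-time macros. All names here (`DEMO_BLAS_INT_BITS`, `DEMO_NATIVE_INT_BITS`, `demo_blas_int`, `demo_native_int`) are hypothetical and chosen for illustration; this is not how BLIS's headers are actually written. The defaults mimic configuring with both -b 64 and -i 64.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the two configure-time choices. */
#ifndef DEMO_BLAS_INT_BITS
#define DEMO_BLAS_INT_BITS 64   /* integer size seen by BLAS/CBLAS callers */
#endif
#ifndef DEMO_NATIVE_INT_BITS
#define DEMO_NATIVE_INT_BITS 64 /* integer size used by the native API */
#endif

/* Integer type used in the BLAS compatibility layer. */
#if DEMO_BLAS_INT_BITS == 64
typedef int64_t demo_blas_int;
#else
typedef int32_t demo_blas_int;
#endif

/* Integer type used in the native API. */
#if DEMO_NATIVE_INT_BITS == 64
typedef int64_t demo_native_int;
#else
typedef int32_t demo_native_int;
#endif
```

Because the two macros are independent, nothing in this scheme prevents a 64-bit BLAS integer from being paired with a 32-bit native integer.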

@devinamatthews
Member

One could argue that -b 64 -i 32 is a disaster waiting to happen, but that would only happen with just -b 64 specified on 32-bit architectures. Maybe in this case we should imply -i 64 as well?

@fgvanzee
Member

Could you explain the precise circumstances under which -b 64 -i 32 would be a disaster and why? I don't see it.

@jeffhammond
Member

jeffhammond commented Oct 26, 2018

@fgvanzee Do you support the case where a 64-bit BLAS integer exceeds INT_MAX when the BLIS API uses 32-bit C int?

(e.g. dot product on a vector of 17 GB of floats)

@fgvanzee
Member

@jeffhammond No. I don't do anything fancy--just regular typecasts. I assume the developer/user knows what he's doing when he assigns, implicitly or explicitly, the integer size of both BLAS and BLIS integers.
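
The concern can be made concrete with a small range check. This is a sketch, not BLIS code: both helper names are hypothetical, and it assumes the 17 GB figure means 17 * 10^9 bytes. Note that in C, converting an out-of-range 64-bit value to a 32-bit signed integer with a plain cast gives an implementation-defined result, so the value that reaches the inner API cannot be relied on.

```c
#include <stdint.h>

/* Hypothetical helper: number of float elements in a buffer of `bytes` bytes. */
static int64_t float_count(int64_t bytes)
{
    return bytes / (int64_t)sizeof(float);
}

/* Hypothetical helper: returns 1 if a 64-bit dimension can be passed
 * through a 32-bit signed integer without changing its value, else 0. */
static int fits_in_int32(int64_t n)
{
    return n >= INT32_MIN && n <= INT32_MAX;
}
```

With 4-byte floats, a 17 GB vector holds about 4.25 billion elements, which is above INT32_MAX (2147483647), so `fits_in_int32(float_count(17000000000LL))` reports that the length would not survive the cast.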

@jeffhammond
Member

@fgvanzee I encourage you to take a random sample of computational scientists you encounter in ICES or online to see if that assumption is valid.

There is a use case for dangerous truncation, but it is a dubious one. NWChem always uses 64-bit integers because these are used as offsets into distributed arrays (i.e. Global Arrays) and it is easy to have 1D arrays that contain more than INT_MAX elements yet never pass a local dimension greater than INT_MAX. However, there isn't a good reason to allow the BLIS API to be built with 32-bit integers for this case, since the additional cost of 64-bit integer arithmetic over 32-bit arithmetic in outer loops of BLIS algorithms should not be noticeable.

A good compromise here is to disallow 64b BLAS + 32b BLIS by default but have an override (e.g. allow_truncation) for folks who want to live dangerously.
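
A minimal sketch of that compromise, written as a standalone predicate. The function name is hypothetical and the `allow_truncation` parameter simply mirrors the override name floated above; nothing here reflects an existing BLIS configure option.

```c
#include <stdbool.h>

/* Hypothetical configure-time check: reject a BLAS integer that is wider
 * than the native integer unless the user explicitly opts in. */
static bool int_sizes_allowed(int blas_bits, int native_bits,
                              bool allow_truncation)
{
    if (blas_bits > native_bits)
        return allow_truncation; /* narrowing casts would be needed */
    return true;                 /* equal or widening: always safe */
}
```

Under this rule, 64-bit BLAS with 32-bit native integers is refused by default, while 32/32, 64/64, and 32-bit BLAS with 64-bit native integers pass without any override.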

@fgvanzee
Member

Thanks Jeff. I've opened issue #274 to track this.

@cdluminate
Contributor Author

update: blis 0.5.0 was built for six architectures on Ubuntu disco.
https://launchpad.net/~lumin0/+archive/ubuntu/ppa/+sourcepub/9660195/+listing-archive-extra

The packaging has been significantly changed. In short, the source package yields the following binary packages:

    libblis-dev BLAS-like Library Instantiation Software Framework
    libblis-openmp-dev BLAS-like Library Instantiation Software Framework
    libblis-pthread-dev BLAS-like Library Instantiation Software Framework
    libblis-serial-dev BLAS-like Library Instantiation Software Framework
    libblis1 BLAS-like Library Instantiation Software Framework - shared library
    libblis1-openmp BLAS-like Library Instantiation Software Framework - shared library
    libblis1-pthread BLAS-like Library Instantiation Software Framework - shared library
    libblis1-serial BLAS-like Library Instantiation Software Framework - shared library

    libblis64-1 BLAS-like Library Instantiation Software Framework - shared library
    libblis64-1-openmp BLAS-like Library Instantiation Software Framework - shared library
    libblis64-1-pthread BLAS-like Library Instantiation Software Framework - shared library
    libblis64-1-serial BLAS-like Library Instantiation Software Framework - shared library
    libblis64-dev BLAS-like Library Instantiation Software Framework
    libblis64-openmp-dev BLAS-like Library Instantiation Software Framework
    libblis64-pthread-dev BLAS-like Library Instantiation Software Framework
    libblis64-serial-dev BLAS-like Library Instantiation Software Framework

BLIS has been compiled in 6 different configurations as above. libblis1 is a meta package that depends on libblis1-openmp, libblis1-pthread, or libblis1-serial. libblis-dev is also a meta package, which depends on libblis1 and additionally on the development package corresponding to the underlying library. Note that no two of the three variants can be installed at the same time.

The 64-bit variants are similar to those with 32-bit indices. Their soname has been changed to libblis64.so.X and registered as a candidate for libblas64.so.3 in the alternatives system. The 64-bit version can coexist with one 32-bit version.

If this looks good to you, I'll upload the package to Debian unstable shortly after your ack.

@fgvanzee
Member

@cdluminate This all sounds great. Thanks so much for your contributions.

If I were able to put together a bugfix release (0.5.1) in short order (24 hours?), would it be better for me to do that before you move forward so we can get all the latest commits into the Debian package? (Hopefully slipping in a new version prior to upload won't be too disruptive to you.)

@cdluminate
Contributor Author

@fgvanzee Just take your time. I'm fine with uploading 0.5.1 after 24 hours. My only concern is that if 0.5.0 has a severe regression or something similar, please tell me to stop uploading it.

@fgvanzee
Member

@cdluminate Thanks. I don't think there are any really bad bugs in 0.5.0--mostly they are more benign improvements--but I'd have to check during my commit review (which is when I write the ReleaseNotes entry) to have a better sense of what bugs were fixed.

@cdluminate
Contributor Author

https://ftp-master.debian.org/new/blis_0.5.0-1.html Uploaded and pending review by the FTP team.

@fgvanzee
Member

@cdluminate Sounds good, thanks! I'm working towards 0.5.1 today, tomorrow at the latest.

@cdluminate
Contributor Author

FYI: BLIS was added to the Gentoo main repository: gentoo/gentoo@5c3ae58

@fgvanzee
Member

@cdluminate Very cool. Thanks for letting us know!
