Handling various special compilation optimizations/architectures #49
In reply to @msarahan's comment (cc @jjhelmus @aebrahim).
I assume you mean for optimizations. I agree. IMHO we should stick to keeping packages here and just add new package variants for optimizations (though @pelson may disagree :) ). I am leaning towards having the variants be totally separate packages, as was done with…
Absolutely. Though for packages like OpenBLAS that allow runtime selection, I think we should just build all the options and let it make the right choice at runtime. We should take this into consideration with MKL too. Right now, I am leaning towards providing OpenBLAS as the default as it has some nice properties, though we could discuss other options like BLIS in the future. ATLAS is a bit of a tricky one to ensure it is well optimized, but its stability is an asset. Plus, it seems the larger Python community likes this as a default (cc @ogrisel). It will be hard to build things like…
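For reference, a `DYNAMIC_ARCH=1` build of OpenBLAS lets you ask at runtime which kernel set it dispatched to. A minimal sketch (assuming such a build is installed and linked with `-lopenblas`; the two query functions are part of OpenBLAS's public API):

```c
/* query_openblas.c - ask a DYNAMIC_ARCH OpenBLAS which kernels it picked.
   Compile with: cc query_openblas.c -lopenblas */
#include <stdio.h>

/* Declared in the OpenBLAS headers; repeated here to stay self-contained. */
extern char *openblas_get_config(void);
extern char *openblas_get_corename(void);

int main(void) {
    printf("build config:  %s\n", openblas_get_config());   /* e.g. "... DYNAMIC_ARCH ..." */
    printf("selected core: %s\n", openblas_get_corename()); /* e.g. "Haswell" */
    return 0;
}
```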
Cool, it would be great to chat with them. We should probably move the optimization discussion to this issue (#49), as it sounds like we all agree with the current build here and making these adjustments will come into play later anyway.
Just a quick note that Intel does offer a "Community"-licensed MKL. From what I recall, Intel has offered to provide the NumPy core developers access to this library in the past to create binary releases, but this option was rejected due to the licensing terms. Specifically, the binary would no longer be a BSD-only licensed product, could cause issues if combined with GPL or similarly licensed software, and the distributor (us) must indemnify Intel against any lawsuits.
To be clear, you are proposing adding variants within existing feedstocks, right? If so, I agree.
Agreed.
At this moment in time, we have no option to build against MKL, so that can be ruled out for now (though we can look again at this once we have bottomed out on OpenBLAS). From the outset we should be labelling all packages which use OpenBLAS with an appropriate feature name. It won't be difficult in the future to turn on an extra build matrix dimension which builds OpenBLAS/MKL/other.
Hmm - so if I understand this correctly, we are proposing to have different packages (package-sse2, package-sse3, package-sse4, package-avx) for all the intrinsics? That seems like it could be a real burden, particularly if you have a fairly complicated build and then need to update it and test it 4 times, once for each of the intrinsic types. Most build systems just have flags for turning intrinsics on or off - wouldn't this be very similar to features? Or are we suggesting that we also have separate packages for features?
Sadly, I don't think there's any way around this. The build system may be able to set flags based on some feature (@ukoethe's proposal at conda/conda#1959 would probably be the right way to do this) - but we'd still need one package build at each feature level (with notable exceptions for packages like OpenBLAS that do runtime dispatch).
Hmm - that's fair enough from a building standpoint, I suppose - short of kicking off multiple build profiles per feature? It still seems like it could be the source of many future bugs if very complicated build scripts must be duplicated and updated 5+ times when the majority of the script will likely be identical.
Yeah, @ukoethe's proposal would basically turn that into a one-line Jinja thing. The only other missing part that I'm aware of is that conda-build-all would need to add this support, and we'd also need to make it so that packages without these features don't get needlessly rebuilt (not sure how many of these there might be).
Yes, that would be the way to go for sure. We need to be very clever to keep this simple and maintainable. @ukoethe's proposal is probably the ticket.
This is the easiest case really. 😄 We just need to make sure we are building all the cases into the package so it really has the full parameter range to choose from.
I think we need to decide which ones are actually worth supporting.
That's why Jinja templates will be essential to avoid such duplication.
As a first step towards better build customization, I implemented some ideas in PR conda/conda-build#848. Please have a look!
We checked that FFTW also does runtime dispatch of SSE and AVX. In fact, any self-respecting numerics library should be able to do it. For those that don't, I agree that features are the most promising solution. However, it needs to be discussed whether it is better to append the feature tag to the package name or to the version number.
> (package-sse2, package-sse3, package-sse4, package-avx) for all the intrinsics? That seems like it could be a real burden. Particularly if you have a…

I'm pretty sure when this was discussed for NumPy that SSE2 is pretty much… Also, that it really only matters for LAPACK and BLAS (and maybe FFT). So… And hopefully we use a math lib that does it for us anyway. Oh, and I thought conda-forge was building on Continuum's work anyway, so…

-CHB
@ChrisBarker-NOAA @ukoethe Runtime switching of intrinsics is true of large numerical libraries such as FFTW/BLAS, but sadly, in my experience, segfaults are more the norm. I agree that SSE2 is probably a fairly safe bet, though I'm willing to bet that eventually someone will crop up with an issue with it! Unfortunately, I have a number of packages I use/want to submit that are significantly improved by increased intrinsics levels, so some way of supporting a matrix of intrinsics would be very, very useful for me. This may not be the case for NumPy - but that is likely because NumPy does not make any use of the instructions added by SSE3/SSE4, rather than SSE3/SSE4 not making a significant improvement for particular kinds of data inputs.
To control SSE and AVX capabilities by features, one needs meta-packages that install a…
Agreed on the features point. Disagree on magic added into the pre-link step. Some OSes use these instructions for special things that can collide with other programs in unexpected ways. We shouldn't be assuming what works best for a user's system. They should be able to choose this. Of course, by having sane defaults, we should hopefully not have to worry in the base case. Users confident enough to want to enable these special instructions should know what they are doing IMHO and have to make a conscious choice.
I didn't make myself clear. Of course, users should be able to choose AVX acceleration. Assuming there is a meta-package…
The…
How would you check this? Try to compile a simple program with the appropriate flags? |
I'm not an expert on AVX, but wouldn't a little utility program (whose output gives the desired platform information) be the simplest solution? Ah, this must go into the post-link script, but you get the idea. It's the same technique I use in the… Trying to compile is not an option because one cannot rely on a compiler to exist.
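To make the idea concrete, here is a minimal sketch of such a probe (a hypothetical utility, not an existing conda tool, assuming it is compiled ahead of time with GCC or Clang on x86 and shipped precompiled, so no compiler is needed on the user's machine):

```c
/* cpu_features.c - tiny probe a post-link script could ship precompiled.
   __builtin_cpu_supports consults CPUID (and, for the AVX family, whether
   the OS saves the ymm state), so the answers reflect what is actually
   usable at runtime. */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();  /* populate GCC's runtime CPU model data */
    printf("sse2:   %s\n", __builtin_cpu_supports("sse2")   ? "yes" : "no");
    printf("sse3:   %s\n", __builtin_cpu_supports("sse3")   ? "yes" : "no");
    printf("sse4.1: %s\n", __builtin_cpu_supports("sse4.1") ? "yes" : "no");
    printf("sse4.2: %s\n", __builtin_cpu_supports("sse4.2") ? "yes" : "no");
    printf("avx:    %s\n", __builtin_cpu_supports("avx")    ? "yes" : "no");
    printf("avx2:   %s\n", __builtin_cpu_supports("avx2")   ? "yes" : "no");
    return 0;
}
```

A post-link script could parse this output to decide whether an intrinsics-enabled variant is safe to install.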
Agreed. That's what I was getting at. 😉
The simplest ends up being to use the compiler, TBMK. Though that is already out. Possibly some compiled program could be used here. Certainly OpenBLAS does this, so that would be a place to look, but it would be good to find a simpler example. It would be nice if it had a Python interface to make it easier to use with conda or other things here.
Because of installing the program to determine the configuration?
Maybe there is something usable from PeachPy.
Exactly. BTW, someone in our Lab is working on AVX for VIGRA, so he knows how to check this. |
Maybe I'll let you guys take a crack at it first then. 😄 |
Windows has a utility that tells you processor information. On OS X you can use the command… On Linux you can use the command `cat /proc/cpuinfo`.
This doesn't make sense. If the OS tells userspace that it will save the xmm/ymm registers during context switches (the XCR0 register), and if the correct flags are set in CPUID, then a userspace program is free to use AVX (or SSE, for that matter) however it likes. The correct solution is to do runtime dispatch like OpenBLAS or FFTW do it. `/proc/cpuinfo` only parses CPUID, which will however still be enough unless you have a very strangely configured system.

See https://github.com/svenpeter42/fastfilters/blob/master/src/avx.cxx, which requires https://github.com/svenpeter42/fastfilters/blob/master/src/xgetbv.hxx and https://github.com/svenpeter42/fastfilters/blob/master/src/cpuid.hxx, for a really simple check. GCC even supports `__builtin_cpu_supports` and function multiversioning.
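For readers who don't want to follow the links, here is a condensed paraphrase of the check those files implement (my own sketch, assuming GCC or Clang on x86; MSVC would use `__cpuid` and `_xgetbv` instead):

```c
/* AVX is usable only if CPUID reports both AVX and OSXSAVE, and the XCR0
   register shows the OS saves xmm (bit 1) and ymm (bit 2) state on
   context switches. */
#include <cpuid.h>
#include <stdio.h>

static int avx_usable(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!((ecx >> 27) & 1) || !((ecx >> 28) & 1))  /* OSXSAVE, AVX bits */
        return 0;
    unsigned lo, hi;
    __asm__ ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0)); /* read XCR0 */
    return (lo & 0x6) == 0x6;  /* xmm and ymm state both enabled */
}

int main(void) {
    printf("AVX usable: %s\n", avx_usable() ? "yes" : "no");
    return 0;
}
```

As I understand it, GCC's `__builtin_cpu_supports("avx")` performs the same OSXSAVE/XCR0 dance internally, which is why it is safe to rely on at runtime.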
@svenpeter42 Totally agree with everything you've said. However, we have no control over these projects as they are third party - so runtime dispatching is out of the question. We are just trying to do the best we can at providing scientific software with intrinsics enabled, and unfortunately extremely large projects like OpenCV do not perform runtime dispatching. So, trying to protect users from installing libraries that don't work on their system is the best we can do. If CPUID is lying to you, then I suspect all bets are off. Finally, it's great that GCC has those checks - but we can't use them on Windows, which further complicates things. Especially since your processor may support AVX2 - but if you run on Python 2.x and thus build with VS 2008, then those intrinsics are not supported (AVX2, for example, is first supported in VS 2013).
OpenCV also does runtime dispatching as far as I know :-) And CPUID doesn't necessarily lie - it merely tells you what the processor is capable of supporting. You need OS support on top of that because the kernel needs to save and restore the xmm/ymm registers when a context switch happens.

The code I linked above works fine on Windows as well, FWIW. All you need to do to perform the check is to compile it with something that allows you to query CPUID and XCR0. Then you can run it on whatever outdated software you want to. Here's my suggestion:…
I'm struggling to find documentation here, but I'm willing to go out on a bit of a limb. Posts like this suggest that the widespread use of the… Your CPUID method looks really useful though - I agree that it would be great to have a tiny Python package that just provides the ability to tell you what intrinsics are supported at runtime. However, I agree with @jakirkham that we should not automatically install software with intrinsics enabled by default, as we should allow users to choose whether they want them or not. We should merely prevent them from installing something compiled with an intrinsic set their hardware/OS does not support. How this is implemented is up for discussion in here, I guess - since, as mentioned by @ukoethe, features are not hierarchical, and so it could require some changes to the conda infrastructure.
@patricksnape I just checked some OpenCV code - I think you are right. The codebase is too convoluted to tell for sure, though. Not automatically installing binaries which rely on AVX seems reasonable - I could imagine some users using the same conda environment on different machines (e.g. on a network mount) which may support different features. FWIW, everything that supports 64-bit automatically supports SSE and SSE2, and at least GCC enables this by default.
I can't remember the particular instance that caused problems ATM, but if I do I will link you to it. |
Just another point: we would want to be able to select GPU implementations (or none) in some cases. So, I think this again fits in this feature category where we will want to select at run time which one is used.
I wanted to check in on the status of two questions:…
SSE3 for Anaconda's new compilers with default options. I think that means the same for conda-forge soon. Until then, I think SSE2 is a safe assumption. Features in the build/features sense are not recommended. Instead, use the new CB3 variant mechanism, and create metapackages to install particular matching sets of packages. @jjhelmus made a good mockup, but I can't remember where it is posted. |
I bet this is what you're talking about: https://github.com/jjhelmus/selectable_pkg_variants_example |
Yep, that's exactly it. Hopefully that might work. There's a lot we hope to do to make that whole scheme simpler, but Jonathan's example is the best I've seen so far. |
As the Conda compilers do a pretty good job optimizing for high-end and low-end CPUs, we have mostly addressed this issue in a broad way. For more specific optimizations, Conda does have an…
@jakirkham Should we open an issue for documentation? |
I thought we had one but idk where it is. It is not ready. |
If there are additional features needed from Conda or Conda-Build, I would raise issues on those repos.
Is there now some maintainer-facing documentation on how to publish package variants for multiple microarchitectures? If not, would that be an interesting topic for a sprint at SciPy 2022? (I will be there) |
Related #27
Building some low-level packages benefits significantly from special compiler options, like enabling SSE and/or AVX instructions. These options may or may not exist for different target architectures. Also, in some cases, these features may end up being leveraged by the OS, so smart decisions must be made to make sure we don't incur a penalty. We should really think about how we want to approach this, as it will have an effect on things like BLAS, NumPy, SciPy, and other low-level libraries that do significant computation.