ARM wheels for macosx and manylinux #81
I created a PR for macOS ARM wheels. That approach creates universal2 versions of libomp and libgraphblas so they can be linked against both x86 and ARM code. cibuildwheel has proper cross-compilation support for everything else (Cython, extensions, etc.). It's a bit hacky, but once GitHub Actions has M1 runners (promised for Q4 2023, github/roadmap#528), the hack can go away.
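For context, the cibuildwheel side of that approach boils down to requesting both macOS architectures from the x86_64 runner. A minimal sketch, with the action version and step layout being illustrative (the universal2 builds of libomp and libgraphblas happen in an earlier step that isn't shown):

```yaml
# Sketch only: build x86_64 and cross-compiled arm64 wheels on a macOS x86_64 runner.
# Assumes the native libraries were already built as universal2 in a previous step.
- name: Build macOS wheels
  uses: pypa/cibuildwheel@v2.12.3
  env:
    CIBW_ARCHS_MACOS: "x86_64 arm64"
```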
For Linux aarch64, cibuildwheel has simple instructions on how to do it with emulation. See https://cibuildwheel.readthedocs.io/en/stable/faq/#emulation if someone wants to take a stab at it.
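That recipe amounts to registering QEMU on the runner and then asking cibuildwheel for aarch64. A rough sketch, with action versions being assumptions:

```yaml
# Sketch only: emulated aarch64 builds on a regular x86_64 Linux runner.
- name: Set up QEMU
  uses: docker/setup-qemu-action@v2
  with:
    platforms: arm64
- name: Build aarch64 wheels
  uses: pypa/cibuildwheel@v2.12.3
  env:
    CIBW_ARCHS_LINUX: aarch64   # run the manylinux aarch64 image under QEMU
```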
#84 is merged, which effectively closes this issue, but there are a couple of points to discuss:

Build time. The aarch64 wheel job failed the first time and I had to retry. Building SuiteSparse:GraphBLAS took about 5.5 hours, which didn't leave enough time to compile and build the Python package. The retried job finished in 5 hours and 14 minutes. Let's keep an eye on how often we need to retry and see whether we can speed it up somehow (or break it into multiple jobs, or run a step in parallel).

MacOS wheel sizes. I think the way we currently build SuiteSparse:GraphBLAS on osx for arm64 and x86_64 at the same time makes both wheels twice the size they need to be. I think this uses about 400 MB extra for each release, which is non-trivial. @alugowski how difficult would it be to split up the builds for osx arm64 and x86_64? https://pypi.org/project/suitesparse-graphblas/7.4.4.1a0/#files
Ouch, I hoped 45 minutes was enough margin. I'd suggest taking out one more optimized type for aarch64. Of the four types left optimized, I'd take out INT8. Do it for your own sanity; you don't want to be stuck with an unreliable workflow. I tried the same types as x86 (reached 70%), just one type (about 2.5 hours, IIRC), and the four types you see in the PR. Three should be reliable.

I know I'm a broken record on this, but build times do matter, and at some point optimizing everything isn't practical. Given how long that build takes, each iteration is basically a day.

QEMU emulates one virtual core, so there is no parallelism to be had. You probably can do a regular cross-compile instead. That would make the build take the same amount of time as the native one. It's not trivial like it is on macOS, but it should be doable: install a cross-compiling GCC with https://packages.ubuntu.com/jammy/g++-aarch64-linux-gnu and then make a CMake toolchain file. You'd have to run that before cibw, only on aarch64. It'll be a lot of fiddling that I don't have the stomach for given the build time, which is why I abandoned this approach. But it is an option.
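For reference, a rough sketch of what that pre-cibuildwheel step could look like. The step name, the matrix.arch condition, and the toolchain file name are all hypothetical, and actually pointing the SuiteSparse:GraphBLAS CMake invocation at the file (via -DCMAKE_TOOLCHAIN_FILE=...) is left out:

```yaml
# Sketch only: install an aarch64 cross toolchain and write a CMake toolchain file
# on the host before invoking cibuildwheel, for the aarch64 job only.
- name: Install aarch64 cross toolchain
  if: matrix.arch == 'aarch64'   # hypothetical matrix variable
  run: |
    sudo apt-get update
    sudo apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
    cat > aarch64-toolchain.cmake <<'EOF'
    set(CMAKE_SYSTEM_NAME Linux)
    set(CMAKE_SYSTEM_PROCESSOR aarch64)
    set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
    set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
    # Look for libraries/headers in the target environment, but programs on the host.
    set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
    set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
    set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
    EOF
```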
Is it a problem, though? It is temporary: moving to a native ARM runner when they're available will make this easy to fix. PyPI accepts the files and the build runtimes are the same, so does the extra size actually impact anything? It does use a bit more space for the end user, but I don't think anyone will even notice. And if it is a problem for someone, that one person can link against their own build.

Note that when I followed the SuiteSparse instructions on how to build SuiteSparse:GraphBLAS it ended up using 1 GB. The previous "official" method of running the Linux builds with Docker uses several GB. So I'm not too bothered about wasting ~120 MB, which is the uncompressed size of a single-architecture libgraphblas.dylib.

We could take the aarch64 approach and use a second runner, building only one architecture on each. That way the wheel sizes would not get inflated. The benefit is that the workflow would be nearly the same as if ARM runners were available, so that move will be even easier later. The downside is that it's impossible to automatically test ARM wheels on GitHub, and the more this workflow diverges from x86 the more likely you'll ship broken wheels at some point.
All these issues go away with GraphBLAS 8's JIT. One FYI on that: from Tim's email it seems like it uses the system compiler. We took a similar JIT approach with KDT about a decade ago, and the problem was that some systems didn't have a compiler. Even in the HPC world, the compilers would live only on the head node and the JIT would get triggered on the compute nodes. AFAIK neither Windows nor macOS has a compiler by default, and many Linux Docker images don't include one either. It's a one-line install on macOS and Linux, but that's something that needs to be documented.
Sounds good to me. I suspect we could replace
Good point about it being temporary. That's reason enough for me.
I'm happy to defer to you :) We (i.e. @jim22k) can (probably) manually test macos arm wheels if necessary.
Thanks for the info. After SuiteSparse:GraphBLAS is built, I also wonder how sane it would be to build the wheels for different Python versions (3.8, 3.9, etc.) in parallel tasks. I don't think this would be easy.
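To make that concrete, one hypothetical shape for it would be a job matrix driving CIBW_BUILD, roughly as below (the Python list and action versions are illustrative). Note that each job would still rebuild SuiteSparse:GraphBLAS from scratch, so under emulation this only parallelizes the fast part:

```yaml
# Sketch only: one CPython version per job via a matrix and CIBW_BUILD.
strategy:
  matrix:
    python: ["cp38", "cp39", "cp310", "cp311"]
steps:
  - uses: actions/checkout@v3
  - name: Build wheels
    uses: pypa/cibuildwheel@v2.12.3
    env:
      CIBW_BUILD: "${{ matrix.python }}-*"
      CIBW_ARCHS_LINUX: aarch64   # QEMU setup step omitted for brevity
```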
Heh, love the enthusiasm (since the JIT is pretty cool), but I would say some of these issues go away some of the time. I don't want to assume users will always have suitable compilers available for the JIT. Users of Python and their environments tend to be very diverse.

Thanks again for all your work here @alugowski. I have some idea of how heavy a lift all this was, and wheel building is now in a very good state.
Don't drop FP32. pygraphblas examples use it, so anyone who learns from those tutorials or uses them as the basis for their own code will use it too. Hot take: BOOL is mostly used by people who don't know iso matrices exist.

If parallel builds were low-hanging fruit then cibuildwheel would be doing it. GitHub has a massive monetary incentive to help them, too. I believe cibw throws everything for one architecture into a container (or QEMU VM) then loops Python versions in the container. So even if you added parallelism to that loop you'd still have one QEMU core. I can't think of a way to do it besides using more runners. Linux runners have 2 cores (macOS has 3), and you'd be parallelizing only the fast part, so there is little potential upside anyway.
I worry the big benefits of the JIT will be missed by taking easy half measures. Assuming it works, Tim should support JIT or emulation only. If performance is important to the user, it's not a big ask to require a compiler. For compatibility uses, the emulated approach will be sufficient. Otherwise you get people clinging to the optimized kernels, and you're not only stuck supporting two massive code paths but you're also not going to do the really cool things only possible with the JIT. UDTs and user-defined semirings being second-class citizens is a big miss.

Yeah, the compiler issue is a support problem. But Shoaib figured it out pretty well with SEJITS, and things have only improved in the last decade. Python has a good warnings package, and others like PyTorch use it for exactly this kind of thing. If a kernel is unoptimized, emit a warning with instructions on what the user needs to do. If they don't care they can easily silence or ignore the warning.
I disagree about BOOL. We use BOOL regularly in […]. I suppose some benchmarks may prefer to use […]. Replaced […]
I wrote up what I think split macOS runners would look like: https://github.com/alugowski/python-suitesparse-graphblas/tree/splitarm

We can test the artifact when the build finishes: https://github.com/alugowski/python-suitesparse-graphblas/actions/runs/4886558542/jobs/8722055829
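The gist of the split is one architecture per job. A minimal sketch of the shape of it, assuming a matrix like the following (runner labels, versions, and variable names are illustrative, not the actual branch contents):

```yaml
# Sketch only: build each macOS architecture in its own job.
strategy:
  matrix:
    include:
      - os: macos-12
        cibw_archs: x86_64
      - os: macos-12        # still an x86_64 runner; arm64 wheels are cross-compiled
        cibw_archs: arm64
runs-on: ${{ matrix.os }}
steps:
  - uses: actions/checkout@v3
  - name: Build macOS wheels
    uses: pypa/cibuildwheel@v2.12.3
    env:
      CIBW_ARCHS_MACOS: ${{ matrix.cibw_archs }}
```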
Like what? By planning to continue using pre-compiled operations as is done today? I don't want to hobble the current experience by expecting everyone to use a JIT.

This isn't either/or--we can support both. If users want to leverage the SuiteSparse:GraphBLAS JIT, we should totally make that as nice an experience as we can. We can do this without impacting users who can't use the JIT for whatever reason (at the cost of increased wheel size and build time). Users can choose whether they want to go all-in with SuiteSparse's JIT, use only vanilla, or support both. I don't want
This is where we disagree. I think this is a big ask for some users, and I want to support those users. Downstream libraries are free to decide differently and require a compiler. Perhaps we could have a strategy where we also build tiny wheels that entirely rely on the JIT (or the generic fallbacks).
Nice! Looks reasonable to me :)
Ran another job with only the arm64 build: https://github.com/alugowski/python-suitesparse-graphblas/actions/runs/4887294770

It works on my machine, so I made a PR. Looking at the diff, it doesn't look as risky as I imagined. It doesn't expand the "broken arm64 wheel" potential much at all compared to the universal ones.
I guess our disagreement is that I suspect the intersection of "absolutely can't install a compiler" and "fallback is too slow" is effectively non-existent. Sounds like you may know of important users in that intersection, and that's fine.
(Assuming, of course, that the JIT offers a comparable experience to today.)

Just pointing out that there are costs to supporting both: tests have to verify both, there's a 6-hour build step, and the codebase is very large. Though that's largely not my problem (if I can pull myself away from trying to solve these build issues :P), so I'll keep my advocacy to a minimum.

My worry is that with two available paths, parts of the ecosystem will only support the easier one (pre-compiled) even if they could support the JIT too. (That includes us; figuring out available compilers may be tricky.) That would make the JIT an option with a chicken-and-egg problem. If you write library code, you now can't assume JIT performance or features. That means potentially more complex code that only uses the supplied types/semirings, or you have to write the same operation twice.

Example: Aydin and I had a lot of generic graph traversal code use what we called a "select 2nd" "semiring". Not a true mathematical semiring, but it meant we could do BFS and other algorithms that only cared about the existence of edges and worked on any datatype, numeric or UDT. It also optimized well because it didn't read one of the arguments. If I wrote that in graphblas today the code would need the JIT. Users without the JIT would think I'm lying about my performance claims.
Oh god how true that is :)
I agree with much of what you say, but I don't want this library to be a "forcing function" to move everybody to the JIT. We should let people who build off of this library make their own decisions and have a pleasant experience regardless of their choices. I also agree that UDTs and UDFs are incredibly important parts of GraphBLAS, and the faster they can be, the better.

I'm okay with having long build times for python-suitesparse-graphblas. I am not yet comfortable being the one responsible for making sure (or forcing) users have a JIT-enabled environment.

I appreciate your perspective and the spirited discussion :)
Btw, build time using BOOL instead of INT8 is way down (4h15m for aarch64), and wheel sizes are down too. 🎉
Everything looks great as far as we can tell. Closing 🚀
Wow, that's a big difference with BOOL :) Gotta say, this feels very luxurious:

(venv) notchie ~/temp (:|✔) % pip install suitesparse-graphblas==7.4.4.1a1
Collecting suitesparse-graphblas==7.4.4.1a1
Using cached suitesparse_graphblas-7.4.4.1a1-cp311-cp311-macosx_11_0_arm64.whl (45.2 MB)
Collecting cffi>=1.11 (from suitesparse-graphblas==7.4.4.1a1)
Using cached cffi-1.15.1-cp311-cp311-macosx_11_0_arm64.whl (174 kB)
Collecting numpy>=1.19 (from suitesparse-graphblas==7.4.4.1a1)
Using cached numpy-1.24.3-cp311-cp311-macosx_11_0_arm64.whl (13.8 MB)
Collecting pycparser (from cffi>=1.11->suitesparse-graphblas==7.4.4.1a1)
Using cached pycparser-2.21-py2.py3-none-any.whl (118 kB)
Installing collected packages: pycparser, numpy, cffi, suitesparse-graphblas
Successfully installed cffi-1.15.1 numpy-1.24.3 pycparser-2.21 suitesparse-graphblas-7.4.4.1a1
As a follow-up to #66 and #73, we still need to add ARM support to our automated CI for building wheels via cibuildwheel. Specifically, I would like suitesparse-graphblas to have wheels for every 64-bit platform that the latest NumPy has wheels for. Right now, this means adding the following wheels:

CC @alugowski and @jim22k, who added wheel building via cibuildwheel and may want to follow this issue.