ARM wheels for macosx and manylinux #81

Closed · eriknw opened this issue Apr 4, 2023 · 18 comments

@eriknw (Member) commented Apr 4, 2023

As a follow-up to #66 and #73, we still need to add ARM support to our automated CI for building wheels via cibuildwheel. Specifically, I would like suitesparse-graphblas to have wheels for every 64-bit platform that the latest NumPy has wheels for. Right now, this means adding the following wheels:

cp38-cp38-macosx_11_0_arm64.whl
cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
cp39-cp39-macosx_11_0_arm64.whl
cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
cp310-cp310-macosx_11_0_arm64.whl
cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
cp311-cp311-macosx_11_0_arm64.whl
cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl

CC @alugowski and @jim22k, who added wheel building via cibuildwheel and may want to follow this issue.

@alugowski (Collaborator)

I created a PR for macOS ARM wheels.

That approach creates universal2 versions of libomp and libgraphblas so they can be linked against both x86 and ARM code. cibuildwheel has proper cross-compilation support for everything else (Cython, extensions, etc.).
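
For reference, the merge is essentially Apple's lipo tool run over the two single-architecture builds. A minimal sketch as a workflow step (the dylib paths here are illustrative, not the exact ones the PR uses):

    - name: Merge per-arch builds into a universal2 dylib
      run: |
        mkdir -p universal2
        # Combine the x86_64 and arm64 builds into one fat binary
        lipo -create build-x86_64/libgraphblas.dylib build-arm64/libgraphblas.dylib \
             -output universal2/libgraphblas.dylib
        # Confirm both architectures are present
        lipo -info universal2/libgraphblas.dylib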

It's a bit hacky, but once GitHub Actions has M1 runners (promised for Q4 2023, github/roadmap#528), the hack can go away.

@alugowski (Collaborator)

For Linux aarch64, cibuildwheel has simple instructions on how to do it with emulation. See https://cibuildwheel.readthedocs.io/en/stable/faq/#emulation if someone wants to take a stab at it.
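
For concreteness, the FAQ's recipe boils down to two workflow steps, roughly like this (a sketch; pypa/cibuildwheel pinned to a version current at the time):

    - name: Set up QEMU
      uses: docker/setup-qemu-action@v2
      with:
        platforms: all
    - name: Build aarch64 wheels under emulation
      uses: pypa/cibuildwheel@v2.12.1
      env:
        CIBW_ARCHS_LINUX: aarch64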

@eriknw (Member, Author) commented May 4, 2023

#84 is merged, which effectively closes this issue, but there are a couple of points to discuss:

Build time. The aarch64 wheel job failed the first time and I had to retry. Building SuiteSparse:GraphBLAS took about 5.5 hours, which didn't leave enough time within the 6-hour job limit to compile and build the Python package. The retried job finished in 5 hours and 14 minutes. Let's keep an eye on how often we need to retry and see if we can speed it up somehow (or break it into multiple jobs, or do a step in parallel).

https://github.com/GraphBLAS/python-suitesparse-graphblas/actions/runs/4875944788/jobs/8703792669

macOS wheel sizes. I think the way we currently build SuiteSparse:GraphBLAS on macOS for arm64 and x86_64 at the same time makes both wheels twice the size they need to be. I think this uses about 400 MB extra for each release, which is non-trivial. @alugowski how difficult would it be to split up builds for macOS arm64 and x86_64?

https://pypi.org/project/suitesparse-graphblas/7.4.4.1a0/#files

@alugowski (Collaborator)

> Build time. The aarch64 wheel job failed the first time and I had to retry. Building SuiteSparse:GraphBLAS took about 5.5 hours, which didn't leave enough time within the 6-hour job limit to compile and build the Python package. The retried job finished in 5 hours and 14 minutes. Let's keep an eye on how often we need to retry and see if we can speed it up somehow (or break it into multiple jobs, or do a step in parallel).

Ouch, I hoped 45 minutes was enough margin. I'd suggest taking out one more optimized type for aarch64; of the four types still optimized, I'd take out INT8. Do it for your own sanity: you don't want to be stuck with an unreliable workflow. I tried the same set of types as x86 (it reached 70% before timing out), just one type (about 2.5 hours, IIRC), and the four types you see in the PR. Three should be reliable. I know I'm a broken record on this, but build times do matter, and at some point optimizing everything isn't practical. Given how long that build takes, each iteration is basically a day.

QEMU emulates one virtual core, so there is no parallelism to be had.

You could probably do a regular cross-compile. That would make the build take the same amount of time as the native one. It's not trivial like it is on macOS, but it should be doable. Install a cross-compiling GCC with https://packages.ubuntu.com/jammy/g++-aarch64-linux-gnu and then make a CMake toolchain file (sketched below). You'd have to run that before cibw, only on aarch64. It'll be a lot of fiddling that I don't have the stomach for given the build time, which is why I abandoned this approach. But it is an option.
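
A rough, untested sketch of those pieces as workflow steps (wiring them into the existing build scripts is the fiddly part):

    - name: Install aarch64 cross-compilers
      run: sudo apt-get update && sudo apt-get install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
    - name: Write a CMake toolchain file for aarch64
      run: |
        cat > aarch64-toolchain.cmake <<'EOF'
        set(CMAKE_SYSTEM_NAME Linux)
        set(CMAKE_SYSTEM_PROCESSOR aarch64)
        set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
        set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)
        EOF
        # The SuiteSparse:GraphBLAS CMake invocation would then add:
        #   -DCMAKE_TOOLCHAIN_FILE=$PWD/aarch64-toolchain.cmake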

@alugowski (Collaborator) commented May 4, 2023

> macOS wheel sizes. I think the way we currently build SuiteSparse:GraphBLAS on macOS for arm64 and x86_64 at the same time makes both wheels twice the size they need to be. I think this uses about 400 MB extra for each release, which is non-trivial. @alugowski how difficult would it be to split up builds for macOS arm64 and x86_64?
>
> https://pypi.org/project/suitesparse-graphblas/7.4.4.1a0/#files

Is it a problem, though?

It is temporary, as moving to native ARM runners when they're available will make this easy to fix. PyPI accepts the files and the build runtimes are the same, so does the extra size actually impact anything? It does use a bit more space for the end user, but I don't think anyone will even notice. And if it is a problem for someone, that one person can link against their own build. Note that when I followed the SuiteSparse instructions on how to build SuiteSparse:GraphBLAS, it ended up using 1 GB. The previous "official" method of running the Linux builds with Docker uses several GB. So I'm not too bothered about wasting ~120 MB, which is the uncompressed single-architecture libgraphblas.dylib size.

We could take the aarch64 approach and use a second runner, building only one architecture on each; that way the wheel sizes won't get inflated. The benefit is that the workflow will be nearly the same as if ARM runners were available, so that move will be even easier later. The downside is that it's impossible to automatically test ARM wheels on GitHub, and the more this workflow diverges from x86, the more likely you'll ship broken wheels at some point.

@alugowski (Collaborator)

All these issues go away with GraphBLAS 8's JIT.

One FYI on that: from Tim's email, it seems the JIT uses the system compiler. We took a similar JIT approach with KDT about a decade ago, and the problem was that some systems didn't have a compiler. Even in the HPC world, the compilers would live only on the head node, and the JIT would get triggered on the compute nodes. AFAIK neither Windows nor macOS ships a compiler by default, and many Linux Docker images don't include one either. It's a one-line install on macOS and Linux, but that's something that needs to be documented.

@eriknw (Member, Author) commented May 4, 2023

> Ouch, I hoped 45 minutes was enough margin. I'd suggest taking out one more optimized type for aarch64; of the four types still optimized, I'd take out INT8. Do it for your own sanity: you don't want to be stuck with an unreliable workflow.

Sounds good to me. I suspect we could replace INT8 or FP32 with BOOL (presuming BOOL would build faster, because there are fewer operations defined for BOOL).

> It is temporary, as moving to native ARM runners when they're available will make this easy to fix. PyPI accepts the files and the build runtimes are the same, so does the extra size actually impact anything?

Good point about it being temporary. That's reason enough for me.

> We could take the aarch64 approach and use a second runner, building only one architecture on each; that way the wheel sizes won't get inflated. The benefit is that the workflow will be nearly the same as if ARM runners were available, so that move will be even easier later. The downside is that it's impossible to automatically test ARM wheels on GitHub, and the more this workflow diverges from x86, the more likely you'll ship broken wheels at some point.

I'm happy to defer to you :) We (i.e. @jim22k) can (probably) manually test macOS ARM wheels if necessary.

> QEMU emulates one virtual core, so there is no parallelism to be had.

Thanks for the info. After SuiteSparse:GraphBLAS is built, I also wonder how sane it would be to build the wheels for different Python versions (3.8, 3.9, etc.) in parallel tasks. I don't think this would be easy.

> All these issues go away with GraphBLAS 8's JIT.

Heh, love the enthusiasm (since the JIT is pretty cool), but I would say some of these issues go away some of the time. I don't want to assume users will always have suitable compilers available for the JIT. Users of Python and their environments tend to be very diverse.

Thanks again for all your work here @alugowski. I have some idea how heavy of a lift all this was, and wheel building is now in a very good state.

@alugowski (Collaborator)

Don't drop FP32. pygraphblas examples use it, so anyone who learns from those tutorials or uses them as the basis for their own code will use it too. Hot take: BOOL is mostly used by people who don't know iso matrices exist.

If parallel builds were low-hanging fruit, then cibuildwheel would be doing it. GitHub has a massive monetary incentive to help them, too. I believe cibw throws everything for one architecture into a container (or QEMU VM), then loops over Python versions in the container. So even if you added parallelism to that loop, you'd still have only one QEMU core. I can't think of a way to do it besides using more runners (see the sketch below).

Linux runners have 2 cores (macOS has 3), and you'd be parallelizing only the fast part, so there is little potential upside anyway.
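
For completeness, the more-runners version would be a matrix over CIBW_BUILD patterns, one Python version per job. A sketch only; note each job would still repeat the slow SuiteSparse:GraphBLAS build unless that step were cached or prebuilt:

    jobs:
      build_aarch64_wheels:
        runs-on: ubuntu-latest
        strategy:
          matrix:
            build: ["cp38-*", "cp39-*", "cp310-*", "cp311-*"]
        steps:
          - uses: actions/checkout@v3
          - uses: docker/setup-qemu-action@v2
            with:
              platforms: all
          - uses: pypa/cibuildwheel@v2.12.1
            env:
              CIBW_BUILD: ${{ matrix.build }}  # one CPython version per runner
              CIBW_ARCHS_LINUX: aarch64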

@alugowski (Collaborator)

I worry the big benefits of the JIT will be missed by taking easy half measures. Assuming it works, Tim should support only the JIT or emulation. If performance is important to the user, it's not a big ask to require a compiler. For compatibility use cases, the emulated approach will be sufficient.

Otherwise you get people clinging to the optimized kernels, and then you're not only stuck supporting two massive code paths, you also won't get to do the really cool things only possible with the JIT. UDTs and user-defined semirings being second-class citizens is a big miss.

Yeah, the compiler issue is a support problem. But Shoaib figured it out pretty well with SEJITS, and things have only improved in the last decade. Python has a good warnings package, and libraries like PyTorch use it for exactly this kind of thing. If a kernel is unoptimized, emit a warning with instructions on what the user needs to do. If they don't care, they can easily silence or ignore it.

@eriknw (Member, Author) commented May 4, 2023

I disagree about BOOL. We use BOOL regularly in graphblas-algorithms for, you know, doing boolean logic.

I suppose some benchmarks may prefer to use FP32 (and benchmarks may want to use smaller integer types too).

Replaced INT8 with BOOL in #86.

@alugowski (Collaborator)

> We (i.e. @jim22k) can (probably) manually test macOS ARM wheels if necessary.

I wrote up what I think split macOS runners would look like: https://github.com/alugowski/python-suitesparse-graphblas/tree/splitarm

We can test the artifact when the build finishes: https://github.com/alugowski/python-suitesparse-graphblas/actions/runs/4886558542/jobs/8722055829
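
The core of the branch is roughly the following (a sketch, not the exact workflow; see the links above for the real thing): one matrix entry per architecture, with cibuildwheel told which arch to target on each runner.

    jobs:
      build_macos_wheels:
        runs-on: macos-12
        strategy:
          matrix:
            arch: [x86_64, arm64]
        steps:
          - uses: actions/checkout@v3
          - uses: pypa/cibuildwheel@v2.12.1
            env:
              CIBW_ARCHS_MACOS: ${{ matrix.arch }}  # build only this architecture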

@eriknw (Member, Author) commented May 4, 2023

> I worry the big benefits of the JIT will be missed by taking easy half measures.

Like what? By planning to continue shipping pre-compiled operations as is done today? I don't want to hobble the current experience by expecting everyone to use a JIT. This isn't either/or; we can support both.

If users want to leverage the SuiteSparse:GraphBLAS JIT, we should totally make that as nice an experience as we can. We can do this without impacting users who can't use the JIT for whatever reason (at the cost of increased wheel size and build time). Users can choose whether they want to go all-in with SuiteSparse's JIT, use only vanilla GraphBLAS, or support both. I don't want python-suitesparse-graphblas to make that choice for them. I'm really glad progress has been made on the compiler issue, but being "solved" somewhere is not the same as being "solved" everywhere, and Python is very diverse.

> If performance is important to the user, it's not a big ask to require a compiler.

This is where we disagree. I think this is a big ask for some users, and I want to support those users. Downstream libraries are free to decide differently and require a compiler.

Perhaps we could have a strategy where we also build tiny wheels that entirely rely on the JIT (or the generic fallbacks).

> I wrote up what I think split macOS runners would look like: https://github.com/alugowski/python-suitesparse-graphblas/tree/splitarm

Nice! Looks reasonable to me :)

@alugowski (Collaborator) commented May 4, 2023

Ran another job with only the arm64 build: https://github.com/alugowski/python-suitesparse-graphblas/actions/runs/4887294770

It works on my machine, so I made a PR.

Looking at the diff, it doesn't look as risky as I imagined. It doesn't expand the "broken arm64 wheel" potential much at all compared to the universal2 builds.

@alugowski (Collaborator)

I guess our disagreement is that I suspect the intersection of "absolutely can't install a compiler" and "fallback is too slow" is effectively non-existent. Sounds like you may know of important users in that intersection, and that's fine.

> > I worry the big benefits of the JIT will be missed by taking easy half measures.
>
> Like what? By planning to continue shipping pre-compiled operations as is done today? I don't want to hobble the current experience by expecting everyone to use a JIT. This isn't either/or; we can support both.

(Assuming, of course, that the JIT offers a comparable experience to today.)

Just pointing out that there are costs to supporting both: tests have to verify both, there's a 6-hour build step, and the codebase gets very large. Though that's largely not my problem (if I can pull myself away from trying to solve these build issues :P), so I'll keep my advocacy to a minimum.

My worry is that with two available paths, parts of the ecosystem will only support the easier one (pre-compiled), even if they could support the JIT too. (That includes us; figuring out available compilers may be tricky.)

That would make the JIT an option with a chicken-and-egg problem. If you write library code, you now can't assume JIT performance or features. That means either potentially more complex code that only uses the supplied types/semirings, or writing the same operation twice.

Example: Aydin and I had a lot of generic graph traversal code that used what we called a "select 2nd" "semiring". It's not a true mathematical semiring, but it meant we could do BFS and other algorithms that only cared about the existence of edges and worked on any datatype, numeric or UDT. It also optimized well because it didn't read one of the arguments. If I wrote that in GraphBLAS today, the code would need the JIT. Users without the JIT would think I'm lying about my performance claims.

> Python is very diverse

Oh god how true that is :)

@eriknw (Member, Author) commented May 5, 2023

I agree with much of what you say, but I don't want this library to be a "forcing function" to move everybody to the JIT. We should let people who build off of this library make their own decisions and have a pleasant experience regardless of their choices.

I also agree that UDTs and UDFs are incredibly important parts of GraphBLAS, and the faster they can be, the better. However, I also think the inherent complexity of UDFs is high, and it's okay to have multiple code paths (vanilla GraphBLAS, SuiteSparse w/o JIT, SuiteSparse w/ JIT, SuiteSparse on GPU, MLIR GraphBLAS, Dask-ified GraphBLAS, other implementations, etc.) when appropriate, to match some of that complexity. I'm good at managing and testing variations.

I'm okay with having long build times for python-suitesparse-graphblas. I am not yet comfortable being the one responsible for making sure (or forcing) users have a JIT-enabled environment.

I appreciate your perspective and the spirited discussion :)

@eriknw (Member, Author) commented May 5, 2023

Btw, build time using BOOL instead of INT8 is way down (4h15m for aarch64):
https://github.com/GraphBLAS/python-suitesparse-graphblas/actions/runs/4889210571/usage

and wheel sizes are down:
https://pypi.org/project/suitesparse-graphblas/7.4.4.1a1/#files

🎉

@eriknw (Member, Author) commented May 5, 2023

Everything looks great as far as we can tell. Closing 🚀

eriknw closed this as completed May 5, 2023

@alugowski (Collaborator)

Wow, that's a big difference with BOOL :)

Gotta say, this feels very luxurious:

(venv) notchie ~/temp (:|✔) % pip install suitesparse-graphblas==7.4.4.1a1
Collecting suitesparse-graphblas==7.4.4.1a1
  Using cached suitesparse_graphblas-7.4.4.1a1-cp311-cp311-macosx_11_0_arm64.whl (45.2 MB)
Collecting cffi>=1.11 (from suitesparse-graphblas==7.4.4.1a1)
  Using cached cffi-1.15.1-cp311-cp311-macosx_11_0_arm64.whl (174 kB)
Collecting numpy>=1.19 (from suitesparse-graphblas==7.4.4.1a1)
  Using cached numpy-1.24.3-cp311-cp311-macosx_11_0_arm64.whl (13.8 MB)
Collecting pycparser (from cffi>=1.11->suitesparse-graphblas==7.4.4.1a1)
  Using cached pycparser-2.21-py2.py3-none-any.whl (118 kB)
Installing collected packages: pycparser, numpy, cffi, suitesparse-graphblas
Successfully installed cffi-1.15.1 numpy-1.24.3 pycparser-2.21 suitesparse-graphblas-7.4.4.1a1
