Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Cross-Platform Refactor: Build System and Binary Distribution #1032

Closed
Titus-von-Koeller opened this issue Feb 4, 2024 · 11 comments
Closed

Comments

@Titus-von-Koeller
Copy link
Collaborator

Titus-von-Koeller commented Feb 4, 2024

We're confirming the choice of migrating to CMake for cross-platform builds and are in the process of moving everything to Github workflows.

The goal of this issue is to have a central place to discuss, agree on work still needed and track progress on the following topics:

  • the build + distribution of binaries
  • security implications
  • discuss the distribution of binaries in light of PyPI's size restrictions, considering alternatives like splitting binaries for each platform or using an extra index, akin to PyTorch's approach.

Strategies for Binary Distribution and Size Constraints: Discuss the options for distributing binaries, considering PyPI's size restrictions and the potential for using separate packages for each platform or an additional index. We will likely ask for an exemption to the PyPi size constraint for the time being, but we need a new way to go forward. I am aware of useful discussions around these topics in PRs and issues in the last week, but since I'll be away on vacation and need to leave now, I couldn't yet integrate everything into this page.

However, the idea is that this is the central hub for such discussions and I'll integrate any feedback into this description here in order to have an up-to-date summary of decisions, challenges, tasks, etc.

Distribution of binaries

Right now we're already hitting PyPI's 100MB size restriction and could therefore not add new functionality yet, as it would increase binary size.

There are two approaches that could solve this:

  1. Split out binaries for each platform into a separate binary distribution PyPi package and use PyPi.
  2. Akin to PyTorch, use an extra index (we would like to avoid this).

@albanD had a few valuable comments about that (feel free to engage in this discussion as well :) ):

AFAIK the main reason we use this is because of the binary size limit on PyPi. If you are not hitting these limits, you should use PyPi directly for sure.

There were security concerns with --extra-index-url but not --index-url. The main downside is that we must vendor on our page all the packages we depend on.

The main thing you want to ensure if you're suggesting extra-index-url is that you control the package name both in pypi AND your url to make sure you control all versions.

[... otherwise someone can] create a package with the same name on PyPi and [take] precedence over ours because they bumped the version. (could also [happen] without the version bump btw.

Security implications

Also from @albanD:

The [risks] that are more critical I think are about people poluting our binaries or getting users to download malicious binaries from our usual channels.
The blogpost (https://johnstawinski.com/2024/01/11/playing-with-fire-how-we-executed-a-critical-supply-chain-attack-on-pytorch/) is definitely one of these. There are a few variants of this depending on your CI system. But in general, how do you protect any secret that you use in your CI/CD (in particular, the ones with write access to pypi, github release, custom website where you ask people to download from, etc).

@akx
Copy link
Contributor

akx commented Feb 5, 2024

Hmm... looking at https://github.com/TimDettmers/bitsandbytes/pull/949/checks, the wheels that get built are only up to 22 megabytes?

Screenshot 2024-02-05 at 11 05 27

@matthewdouglas
Copy link
Member

matthewdouglas commented Feb 5, 2024

Hmm... looking at https://github.com/TimDettmers/bitsandbytes/pull/949/checks, the wheels that get built are only up to 22 megabytes?

These artifacts only have the libraries built for one CUDA toolkit version. Side note: I think the code still expects libbitsandbytes.dll to be named libbitsandbytes_cpu.dll.

image

The current Linux x86-64 wheels on PyPI have binaries for 11.0, 11.1, 11.4, 11.5, 11.7, 11.8, 12.0, 12.1, 12.2, and 12.3, so it adds up quickly. The wheel is 105MB and when unpacked it's 3x that size.

If I recall correctly, VS2022 support came with CUDA 11.6, and I don't think we should bother with anything older than that for Windows builds. On Windows we don't really have to worry about a BC break but it's something to consider for Linux. But with that said, my opinion is that the default binary wheels for both should narrow the list down and instruct to build from source for some of the older versions. Alternatively, build separate wheels for those older ones?

The torch releases look like this:

Date PyTorch CUDA Python
2022-03-10 1.11 11.3 3.7-3.10
2022-06-28 1.12 11.3, 11.6 3.7-3.10
2022-10-28 1.13 11.6, 11.7 3.7-3.10
2023-03-15 2.0 11.7, 11.8 3.8-3.11
2023-10-04 2.1 11.8, 12.1 3.8-3.11
2024-01-30 2.2 11.8, 12.1 3.8-3.11

Some of the NVIDIA NGC containers from the past ~2 years:

Release PyTorch CUDA Python
22.01 1.11.0a0 11.6.0 3.8
22.03 1.12.0a0 11.6.1 3.8
22.05 1.12.0a0 11.7.0 3.8
22.08 1.13.0a0 11.7.1 3.8
22.09 1.13.0a0 11.8.0 3.8
22.12 1.14.0a0 11.8.0 3.8
23.01 1.14.0a0 12.0.1 3.8
23.03 2.0.0a0 12.1.0 3.8
23.04 2.1.0a0 12.1.0 3.8
23.05 2.0.0 12.1.1 3.10
23.06 2.1.0a0 12.1.1 3.10
23.08 2.1.0a0 12.2.1 3.10
23.09 2.1.0a0 12.2.1 3.10
23.10 2.1.0a0 12.2.2 3.10
23.11 2.2.0a0 12.3.0 3.10
24.01 2.2.0a0 12.3.2 3.10

My opinion here is maybe to ship with at most five versions by default: cu117, cu118, cu121, cu122, cu123. This would mean dropping these five that are in the current wheels: cu110, cu111, cu114, cu115, cu120. The idea: cover the support matrix from the last 4 torch releases.

@rickardp
Copy link
Contributor

rickardp commented Feb 7, 2024

Akin to PyTorch, use an extra index (we would like to avoid this).

IMHO, this should be avoided if at all possible. I would consider 1) in this case, even if it is a breaking change (maybe bump major version?)

Does CUDA follow semantic versioning? I.e. will CUDA 12.2.1-compiled binaries also run on CUDA 12.2.2 ? I hope at least this much is true, then we can at least do what @matthewdouglas suggests.

Additionally, could we rely on minor version compatibility?
This page seems to indicate so

From CUDA 11 onwards, applications compiled with a CUDA Toolkit release from within a CUDA major release family can run, with limited feature-set, on systems having at least the minimum required driver version as indicated below. This minimum required driver can be different from the driver packaged with the CUDA Toolkit but should belong to the same major release.

If so, we can reduce the build matrix significantly, basically compiling only for CUDA 11 and CUDA 12. We would have to ve careful when upgrading though, as each new minor version drops all the older ones.

Note that this document speaks about ABIs in CUDA in general. Not sure if this applies to CUDA and toolkit, i.e. if they follow standard Linux semantic .so rules. I guess we are bound to Torch for hte CUDA toolkit version to use, otherwise we could also look at compiling statically in this case?

@matthewdouglas
Copy link
Member

I guess we are bound to Torch for hte CUDA toolkit version to use, otherwise we could also look at compiling statically in this case?

My understanding is that nvcc statically links to the CUDA runtime library by default. The flags here are -cudart and -cudadevrt and they don't appear to be used to deviate from that default in the old Makefile or in the new CMake build.

For what it's worth, I've had success running most of the tests on Windows with torch==2.2.0+cu121 and bitsandbytes compiled with CUDA toolkits 12.0.1, 12.1.1, 12.2.2, and 12.3.2. These have cudart versions of 12.0.146, 12.1.105, 12.2.140, and 12.3.101 respectively. The torch wheel came with cudart64_12.dll = 12.1.105.

I haven't tried <=11.8 yet but will at some point.

I have some thoughts here on how things can be simplified and will try to put together some PRs soon. One of the other things I'm thinking is we should be able to do away with the separate nomatmul builds and do runtime only checks, always linking to cuBLASLt.

@Titus-von-Koeller
Copy link
Collaborator Author

Windows binaries

Hey @matthewdouglas @wkpark @rickardp @akx,

Just checking in about making the Windows build pip installable. I currently don't have a Windows machine to test stuff and I saw that some of you seemed to have already played around with the new wheels.

I didn't have time to investigate this, but it seems to me Windows packaging is pretty much solved (thanks for this amazing work ❤️ ). Are you aware of anything that's still missing in regard to fully supporting pip install bitsandbytes on Windows with the next release?

My idea would be to publish to testPypi this week and make sure everything is as we expect it to be before going ahead with the official release afterwards.

@matthewdouglas
Copy link
Member

@Titus-von-Koeller I've been able to successfully run most of the unit tests, but experience a system crash when testing the 32bit optimizers. I'm not sure if it's just because I have an underwhelming amount of vRAM (6GB) or if there's more to it. I have a memory dump from this but have not looked into it yet.

Apart from that I think we just need to make sure to include the binaries for at least both cu121 and cu118 in the wheels.

@akx
Copy link
Contributor

akx commented Mar 5, 2024

I gave bdist_wheel_windows-latest_x86_64_3.11 from https://github.com/TimDettmers/bitsandbytes/actions/runs/8082336850/artifacts/1282938239 a shot on my Windows machine (GTX 1070, so not exactly brand new) and at least python -m bitsandbytes (which runs a small self-test) and some pytest tests run successfully.

That said, the wheel only contains CPU and CUDA 12 (not CUDA 11 and 12), so that needs to be rectified:

$ zipinfo bitsandbytes-0.43.0.dev0-cp311-cp311-win_amd64.whl | grep dll
-rw-rw-rw-  2.0 fat    36352 b- defN 24-Feb-28 16:04 bitsandbytes/libbitsandbytes_cpu.dll
-rw-rw-rw-  2.0 fat 36000768 b- defN 24-Feb-28 16:04 bitsandbytes/libbitsandbytes_cuda121.dll
-rw-rw-rw-  2.0 fat 35986944 b- defN 24-Feb-28 16:04 bitsandbytes/libbitsandbytes_cuda121_nocublaslt.dll

@Titus-von-Koeller
Copy link
Collaborator Author

Ok, so Tim gave me the following statement:

we need to ensure that all GPUs are supported by the CUDA versions we have.
We do not want a GPU that requires a lower CUDA version than we have.
Since we no longer support 10.2 that has been resolved I think.
So we can support just the pytorch binaries

He thinks that we should simplify the CUDA versions we're supporting and reduce to only the ones PyTorch supports.

Tim concurs that the versions from the last 4 releases should be enough going forward:

  • this would mean keeping cu117, cu118, cu121, cu122, cu123
  • and dropping cu110, cu111, cu114, cu115, cu120

Thanks a lot @matthewdouglas for your thorough analysis on the topic, this was extremely helpful ❤️ !

Tim also thinks that "CUDA 12.2.1-compiled binaries [should] also run on CUDA 12.2.2". That would potentially really simplify things then, as @rickardp explained, right?

If so, we can reduce the build matrix significantly, basically compiling only for CUDA 11 and CUDA 12.

@rickardp Could you please clarify what you meant with the below sentence, especially the second part?

We would have to be careful when upgrading though, as each new minor version drops all the older ones.

@rickardp
Copy link
Contributor

rickardp commented Mar 5, 2024

Sounds good!

That would potentially really simplify things then, as @rickardp explained, right?

Yes, I just wanted confirmation that I understood this correctly but this is also my understanding

Could you please clarify what you meant with the below sentence, especially the second part?

I meant exactly this. If we upgraded to 12.2.2, it cannot (typically) run in 12.2.1 ABI. Or so I would assume, based on the above.

@matthewdouglas
Copy link
Member

matthewdouglas commented Mar 5, 2024

I definitely would expect compiling with 12.2.1 to run on 12.2.2 also. Patch version compatibility shouldn't be much of an issue.

But based on the docs, after 11.1, I'd also expect to see some level of minor version compatibility too - i.e. compile with 12.3, run with any 12.x at runtime assuming a new enough driver is installed.

As far as device coverage goes, here's the main things I think to note:

  • CUDA 11.1 is the minimum required for sm_86 (Ampere)
  • In 11.1.0, release notes state "Enhanced CUDA compatibility across minor releases of CUDA will enable CUDA applications to be compatible with all versions of a particular CUDA major release."
  • CUDA 11.8.0 is the minimum for sm_89 (Ada) and sm_90 (Hopper)
  • CUDA 11.x is required for Kepler (sm_35, sm_37) and we're not expecting to include binaries
  • After (cmake) Fix cuda arch selection #1091 we start to build with an extra PTX target for the highest compute capability selected, which gives some forward compatibility. In theory this should mean that we could build with CUDA <=11.7 with a max target of compute_86,sm_86 and still run on Ada/Hopper, for example.

Aside:
The 12.4 release is rather imminent - I've found the release notes up for it here: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Edit: CUDA 12.4.0 download page is up now.

Edit 2: Attaching a Windows binary built with CUDA 12.4.0 in case anyone wants to try. Note: only built for Pascal and newer.
libbitsandbytes_cuda124.zip

@Titus-von-Koeller
Copy link
Collaborator Author

Archiving this, because it's out of date and we ended up favoring other modes of interaction to coordinate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants