Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed macOS system libcxxabi? #148

Conversation

conda-forge-admin
Copy link
Contributor

Hi! This is the friendly automated conda-forge-webservice.

I've started rerendering the recipe as instructed in #147.

If I find any needed changes to the recipe, I'll push them to this PR shortly. Thank you for waiting!

Here's a checklist to do before merging.

  • Bump the build number if needed.

Fixes #147

@conda-forge-webservices
Copy link

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

conda-forge-webservices[bot] and others added 2 commits May 26, 2024 11:55
@h-vetinari h-vetinari changed the title MNT: rerender Changed macOS system libcxxabi? May 26, 2024
@h-vetinari
Copy link
Member

h-vetinari commented May 26, 2024

@conda-forge/libcxx @conda-forge/core, this is an interesting situation. It seems that the macOS system libcxxabi changed ABI somewhere between 10.9 and 10.13, which is manifesting as segfaults in the boost test.

In more detail: #144 is not ready (at all), but made me look at libcxx-testing due to failures. As it turns out, a mere rerender there (conda-forge/libcxx-testing-feedstock#8) causes the feedstock to fail, where the only real change is moving the deployment target from 10.9 to 10.13. So the immediate question becomes: how could #131 have passed then (which was obviously already using 10.13 at the time, and runs libcxx-testing as a downstream test)? I don't yet have an answer.

Where it gets interesting is that - on a wild hunch - I undid the patch here where @isuruf brought some struct in alignment with the old system libcxxabi (AFAIU), and lo and behold, the segfaults are gone (see last and second-to-last runs here).

I think I've mostly understood why we need to rely on the system libcxxabi, but it seems strange to me that the ABI there seems to have changed? I mean, not that strange, since AFAIU macOS eventually mirrors what LLVM does, but still: it would appear that we cannot actually build against something that would work for both 10.9 & 10.13. Presumably, something about this must already have been broken (and it's just surfacing now), and I have no idea how often similar breaks might have happened in the libcxxabi's shipped by newer deployment targets.

Would appreciate your thoughts.

@beckermr
Copy link
Member

beckermr commented May 26, 2024

It's very odd we haven't seen reports of segfaults from this in the wild.

Also, given the name of the library, an ABI break here is ironic to say the least.

I still don't fully understand why the packages I build need this in the first place. It could be time to push on that more.

@h-vetinari
Copy link
Member

I still don't fully understand why the packages I build need this in the first place. It could be time to push on that more.

libcxx depends on symbols from libcxxabi, but AFAICT we cannot ship our own libcxxabi, because some packages (Python?) always prefer the system one. Here's the intro from pybind11_exception_rtti_test:

The backstory is that when we also built libcxxabi on osx, we saw issues where the RTTI was not making it between different compilations for exceptions when using pybind11. We (really @isuruf) figured out that by having only one copy of the libcxxabi lib (i.e., just the system one and not the one on conda), it fixed this issue. (On OSX the system one always gets pulled in first by python). Thus we switched to linking libcxx against the system libcxxabi on osx.

However, we might be stretching what can be done with the system ABI to the breaking point. Maybe we need to go back to shipping libcxxabi, and patching python to ensure we prefer our libcxxabi...?

@beckermr
Copy link
Member

Yeah I wrote that readme file IIRC. 😀

Yes, ideally we'd bake in to our binaries to load our libcxxabi first. It would need to be any binary, not just python. A compile-time equivalent to ld preload if you will.

I know of no such option. :/

@h-vetinari
Copy link
Member

So the immediate question becomes: how could #131 have passed then (which was obviously already using 10.13 at the time, and runs libcxx-testing as a downstream test)? I don't yet have an answer.

I think it's possible that it matters which libcxxabi other relevant packages in that test (python, pybind11, etc.) have been built against. Currently, python and pybind11 have not been rebuilt against 10.13 yet, so if this hypothesis is true, it has to be something else.

Yes, ideally we'd bake in to our binaries to load our libcxxabi first.

Could we patch out the default lookup being done in /usr/lib/libc++abi.dylib for affected projects?

@beckermr
Copy link
Member

Idk but that would have to be done case by case. The issue iirc is that when projects link against the few libs we do not ship on osx, it is those libs that pull in the system libcxxabi. Idk how, eg Python itself, would link directly.

@h-vetinari
Copy link
Member

The issue iirc is that when projects link against the few libs we do not ship on osx, it is those libs that pull in the system libcxxabi.

Isn't that just a question of what's on the (default or project-specific) library search path that the linker uses to go look for symbols?

@beckermr
Copy link
Member

Btw this is why i think it is all binaries. Nearly everything will link against the system libs. If it is the system libs that pull in libcxxabi, then there is not a lot we can do about it besides some sort of ld preload type adjustment to the linking.

@beckermr
Copy link
Member

I dont follow the question. IIRC, the system libcxxabi is pulled in first despite the presence of other rpath entries in the binary. We can't exclude the system linking paths since we need the system libs for some things so we're stuck. The only fix we found was to ld preload the conda libcxxabi. Thus we stopped using the conda one on osx since the system one appeared to be pulled in first by the system libs no matter what. The two are apparently not always interchangeable and we can't adjust the system libs, so we're stuck.

@beckermr
Copy link
Member

The tests in the testing feedstock used to fail if we used the conda libcxxabi. You should be able to reproduce by adding that back locally and running the builds. Maybe you can find a way to fix things that is better.

@beckermr
Copy link
Member

Otoh this is a bit of a red herring. Even if we used the conda one all the time, apparently there is an Abi break upstream and that would apply to the conda libs too, right?

@h-vetinari
Copy link
Member

@beckermr: Even if we used the conda one all the time, apparently there is an Abi break upstream and that would apply to the conda libs too, right?

yup, it's the same problem of (apparently) no long-term stable ABI that we have with the system libcxxabi. When Isuru introduced that patch, upstream said (more context; separated by mailing-list-to-discourse transition):

[libcxx maintainer]: Generally speaking, we make the assumption that there’s a single copy of libc++ in use inside a final linked image. Or if you want to use libc++ as an implementation detail and link it statically, ensure it doesn’t export any symbols and has absolutely no ABI surface outside of the executable where it’s being used. Chrome does that.

Anything else is very brittle, and I’m reluctant to add workarounds that make it look like it works, because you may run into other issues that we won’t be able to patch so easily.

[someone else]: I had thought libc++ was more immune to this kind of problem given Apple’s usual support to deploy on older systems.

[libcxx maintainer]: It can’t really be immune to those problems since they are byproducts of how linking works at a pretty fundamental level. Back-deploying to older platforms works nicely as long as you use the libc++ shipped with the system (we make sure of that), or that you “embed” libc++ into your application by linking it as a static archive and disregard the system libc++ altogether (Chrome does that). When you do something in-between, that’s when things start failing.

Isuru had said:

I’ll try to make sure we are using only one version.

That has worked well for libcxx, but not sure if this is feasible w.r.t. relying on system libcxxabi.

@beckermr: The tests in the testing feedstock used to fail if we used the conda libcxxabi. You should be able to reproduce by adding that back locally and running the builds. Maybe you can find a way to fix things that is better.

I tried going back to shipping our own libcxxabi in #149, and the downstream tests pass without any further changes (in contrast to this PR). However, they fail when we disable patch 0002 (which passes here).

So it really seems like an either-or choice. Again, there maybe other packages currently compiled against the "wrong" libcxxabi already in place, but it seems we can't have our cake and eat it too.

@h-vetinari
Copy link
Member

Like we discussed in the core call, publishing a rebased version of this PR to libcxx_dev: #150

(separate to keep context/discussion of this PR)

@h-vetinari
Copy link
Member

So I had one hypothesis that this could have had something to do with the ABI that python itself was built against. But 3.12.4 was built against 10.13 and the tests here (with the patch to match the old ABI) still fail.

Could you share your thoughts @isuruf please - as you authored that patch, I assume you understand much better what's going on.

@isuruf
Copy link
Member

isuruf commented Jun 19, 2024

You are running into the same issue as #77 (comment).
The analysis is there if you want to read it.

@h-vetinari
Copy link
Member

You are running into the same issue as #77 (comment). The analysis is there if you want to read it.

Thank you, that clearly points as the macOS 12 image bump as the culprit. The longer explanation there then goes into detail of how pybind11 might end up using a cached system libcxx, but AFAIU, enabling that was exactly the use-case for you adding that as a test to conda-forge/libcxx-testing-feedstock#1, and patching (d84a6e1) the ABI here.

If we wanted to match the system ABI today, a "fix" would be to just undo that ABI-patch, but I was asking for your input (same at the second-to-last core meeting) whether there's a way to fix this more sustainably.

AFAIU we're a bit between a rock and a hard place, as in: the system ABI will keep changing subtly, so we can either keep building for one ABI (the one in the CI images...? since that's what we test against here) and accept the failing corner cases, or figure out a way how to force loading our own libcxx over the system one. I was hoping that there's a better way that I'm not seeing.

@isuruf
Copy link
Member

isuruf commented Jun 20, 2024

We can build two variants. One with (__osx <12) and one without the patch (__osx >=12)

@h-vetinari
Copy link
Member

h-vetinari commented Jun 20, 2024

We can build two variants. One with (__osx <12) and one without the patch (__osx >=12)

I agree that this is an obvious solution for this case. I was worried how many such ABI breaks are still lurking in newer macOS versions that we're blind to (and whether we'd then end up building for each major macOS version). But I guess we can consider these cases as rare and treat them on a case-by-case basis...

Would you be alright with not testing the __osx <12 variant, at least once the deprecated macos-11 image isn't available anymore? Or is there a way you can imagine for us to fake/overwrite the system libcxx so that we can test the old ABI as well?

@h-vetinari
Copy link
Member

We can build two variants. One with (__osx <12) and one without the patch (__osx >=12)

I've implemented this in #154, could you (@isuruf; or anyone else interested) PTAL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

@conda-forge-admin, please rerender
4 participants