Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pluggable output backends #2943

Open
adetaylor opened this issue Sep 27, 2024 · 4 comments
Open

Pluggable output backends #2943

adetaylor opened this issue Sep 27, 2024 · 4 comments

Comments

@adetaylor
Copy link
Contributor

I was chatting with @pvdrz at the Rust for Linux conference about some bold ideas for bindgen and he thought I should write them up, so here goes.

The problem space

cxx is a pretty popular C++ interop tool these days. Its biggest advantage is that Rust code manipulates C++ data using friendly vocabulary types such as cxx::UniquePtr<T> and cxx::CxxString. These make FFI much safer in practice than bindgen-style bindings, because for example there's no need to track object lifetimes, nor string lengths, etc. Irrespective of the specifics of the policy relating to the unsafe keyword, cxx-style interop is much less error prone in practice than bindgen-style FFI.

However, cxx can't generate such bindings from pre-existing C++ headers. The full language boundary needs to be declared in an IDL (which looks a lot like Rust code but actually isn't quite).

For years now, the Chromium project has wanted to combine the ergonomic advantages of cxx with bindgen-style generation from headers:

The idea

The idea is: bindgen can output different "flavors" of bindings. Perhaps this is pluggable, perhaps there are multiple flavors built in.

  • Basic flavor: current bindgen output. Lots of pointers.
  • cxx-like flavor: detects things like std::unique_ptr and substitutes them with cxx::UniquePtr, generating both C++ and Rust side code to make this possible. This is basically what autocxx does but it's really pretty complicated since it's a post-processor, a lot of the complexity would drop away if it were built into bindgen itself.
  • Future flavors could use a CppRef<T> type for C++ references, etc.

What this would require

It would require:

  • bindgen gains the ability to generate C++ side shim functions as well as Rust side code.
  • Some amount of modularity so that these output backends are not very invasive into the bindgen code. To give an indication of what this takes, autocxx has had to fork bindgen primarily to add extra information about the generated items. The fork is here, with a partial list of the reasons here.

Simplest first step

All this sounds vague. A hypothetical first step could be a --emit-cxx-crate-strings which:

  • Spots all use of const &std::string. If a C++ function takes such a parameter,
  • A synthetic Rust-side function is created which takes a &cxx::CxxString and then calls the real function

This is just a small step, but users of both cxx and bindgen could immediately benefit because they can create such strings in Rust.

Simplest second step

The next step after that would be to support std::string by value, which would require either cunning to store self-referential types on the Rust stack or, more likely, wrapping the std::string in a std::unique_ptr. This would require a small C++ shim function to be generated to unwrap it then pass it into the original C++ API.

@pvdrz
Copy link
Contributor

pvdrz commented Sep 27, 2024

Hey 👋

I support this proposal for several reasons. The first comes from a "human-cost" perspective and its based around the idea that we should join efforts so we can have more eyes in fewer tools instead of having more tools being maintained in isolation.

The second one is that even though this sounds like a pretty big change as a whole, most of the bold things stated here are already done by bindgen in one way or another:

  • We already generate C code for the --wrap-static-fns, so generating C++ code wouldn't be that much of a leap.
  • We already generate Rust code using third party crates such as libloading.

The general idea of having gated outputs sounds reasonable for me, specially since it doesn't affect the default behavior of bindgen and it won't break other people's code. Relying on cxx types doesn't worry me as the crate itself is stable so we shouldn't expect any breaking changes in the future.

My only remaining question here is how do we start? Would anyone be willing to submit some patches with this? or do you consider we should have some discussions before to iron out some kinks?

@emilio
Copy link
Contributor

emilio commented Sep 27, 2024

Yeah, not opposed to something like this, but FWIW you can already do stuff like "replace std::string with my custom type" (same for unique_ptr and pretty much anything).

E.g., for a header like:

// t.hpp
#include <string>

struct Something {
  std::string m_member;
};

And a bindgen invocation like:

$ ./target/debug/bindgen t.hpp \
    --enable-cxx-namespaces \
    --allowlist-item "Something" \
    --blocklist-item "std::.*" --blocklist-item "__gnu_cxx::.*" \
    --module-raw-line "root::std" "pub use cxx::CxxString as string;" 

You get this (cleaned up a little bit):

/* automatically generated by rust-bindgen 0.70.1 */

#[allow(non_snake_case, non_camel_case_types, non_upper_case_globals)]
pub mod root {
    #[allow(unused_imports)]
    use self::super::root;
    pub mod std {
        #[allow(unused_imports)]
        use self::super::super::root;
        pub use cxx::CxxString as string;
    }
    #[repr(C)]
    pub struct Something {
        pub m_member: root::std::string,
    }
    #[allow(clippy::unnecessary_operation, clippy::identity_op)]
    const _: () = {
        ["Size of Something"][::std::mem::size_of::<Something>() - 32usize];
        ["Alignment of Something"]
            [::std::mem::align_of::<Something>() - 8usize];
        ["Offset of field: Something::m_member"]
            [::std::mem::offset_of!(Something, m_member) - 0usize];
    };
}

We use that extensively in Firefox FWIW.

@adetaylor
Copy link
Contributor Author

Thanks for the positive response!

@emilio yep! The prelude and command-line which autocxx feeds to bindgen is quite long :) However cxx is also opinionated about what it does not support, and one of those cases is std::string as a member variable, since it can't be safely moved by Rust. More generally, most C++ types are represented in cxx as zero sized types so there can't be overlapping reference UB. (Alternative approaches are to wrap everything in UnsafeCell and MaybeUninitialized or, as noted before, to never allow Rust references to C++ types).

So that's why I'm first broaching this as a kind of "philosophy of bindgen" discussion: these hypothetical pluggable backends might make quite different decisions to what bindgen does now:

  • don't necessarily attempt to represent everything, but be opinionated about rejecting things which can't be used ergonomically (ergonomically = basically, without pointers being involved)
  • ban some things in some contexts
  • do not use raw pointers
  • ensure all C++ types can never be accessed by value
  • always call through to C++ for any field access
    etc.

@pvdrz

My only remaining question here is how do we start? Would anyone be willing to submit some patches with this? or do you consider we should have some discussions before to iron out some kinks?

I guess one way forward would be for me to write a document or something along the lines of "if autocxx+cxx had been implemented as a bindgen plugin, what would it look like?" and then y'all can see how revolted you are :) Does that sound like a plan? I am unlikely to have time to actually reimplement autocxx (it's 44 KLOC, and I'm not sure that's the right move anyway) but perhaps such a document can move the discussion forward? Like you, I'm really interested in joining efforts to reduce duplication.

@pvdrz
Copy link
Contributor

pvdrz commented Oct 2, 2024

Thanks for the positive response!

@emilio yep! The prelude and command-line which autocxx feeds to bindgen is quite long :) However cxx is also opinionated about what it does not support, and one of those cases is std::string as a member variable, since it can't be safely moved by Rust. More generally, most C++ types are represented in cxx as zero sized types so there can't be overlapping reference UB. (Alternative approaches are to wrap everything in UnsafeCell and MaybeUninitialized or, as noted before, to never allow Rust references to C++ types).

So that's why I'm first broaching this as a kind of "philosophy of bindgen" discussion: these hypothetical pluggable backends might make quite different decisions to what bindgen does now:

* don't necessarily attempt to represent everything, but be opinionated about rejecting things which can't be used ergonomically (ergonomically = basically, without pointers being involved)

* ban some things in some contexts

* do not use raw pointers

* ensure all C++ types can never be accessed by value

* always call through to C++ for any field access
  etc.

I think this is fine and it might be an actual advantage as having a smaller surface of code that's actually accepted might make the whole task more approachable.

I'm just shooting at the air here but maybe we could expose a limited subset of bindgen internals so we can write these backends as separate crates, where the existing backend is just one of these crates.

@pvdrz

My only remaining question here is how do we start? Would anyone be willing to submit some patches with this? or do you consider we should have some discussions before to iron out some kinks?

I guess one way forward would be for me to write a document or something along the lines of "if autocxx+cxx had been implemented as a bindgen plugin, what would it look like?" and then y'all can see how revolted you are :) Does that sound like a plan? I am unlikely to have time to actually reimplement autocxx (it's 44 KLOC, and I'm not sure that's the right move anyway) but perhaps such a document can move the discussion forward? Like you, I'm really interested in joining efforts to reduce duplication.

That sounds like a good way to kickstart the discussion 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants