Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unfork bindgen #124

Open
8 tasks
adetaylor opened this issue Nov 24, 2020 · 8 comments
Open
8 tasks

Unfork bindgen #124

adetaylor opened this issue Nov 24, 2020 · 8 comments
Labels
architectural Architectural improvement

Comments

@adetaylor
Copy link
Collaborator

adetaylor commented Nov 24, 2020

bindgen doesn't pass on quite all the information we need in order to be able to generate autocxx bindings. Options in the future might be to fork bindgen, ask very nicely if they mind us upstreaming patches to add metadata, or switch away from bindgen and use llvm directly.

In any case this bug will be a live tracker of all known cases where we don't have all the information that we require about the underlying C++.

@martinboehme
Copy link
Collaborator

  • Classes vs structs. Both present as structs in the bindgen output. When we then generate extra C++ we have to plump for one or the other. In some ABIs this will result in a binary incompatibility or failure to compile.

Not sure I understand -- I always thought a struct was exactly identical to a class, just that it has public visibility by default? And I believe it's legal for forward declarations of a struct to use class and vice versa (see this StackOverflow answer). I'm pretty sure there isn't an ABI difference either. Or am I misunderstanding what you're referring to here?

@adetaylor
Copy link
Collaborator Author

This is #54 - let's discuss over there.

@martinboehme
Copy link
Collaborator

Thanks for the clarification -- interesting!

@adetaylor
Copy link
Collaborator Author

For #414 and #102 many types have 'annotations' attached (the RustTy type in our fork of bindgen). Such annotations are currently discarded using an ignore_annotations method. Each time that happens, it's probably a bug. For example this means we would try (and fail) to generate POD struct fields using types with template parameters which bindgen has dropped. Once a plan presents itself here, we should ensure we never drop information like this.

@martinboehme
Copy link
Collaborator

Note from #426: The Rust function signature for copy constructors and move constructors is the same, so bindgen needs to add an annotation that allows us to distinguish between them.

@bsilver8192
Copy link
Contributor

#799 is also on this list, and I'm pretty sure #815 is too.

@adetaylor
Copy link
Collaborator Author

I've renamed this issue to track the hopes of unforking autocxx-bindgen and using plain old bindgen here.

I intend to work on this somewhat over the next couple of weeks to see how far I can get. Even if I can reduce the scale of the fork, that's still beneficial. Right now, the need to roll bindgen is the major ongoing maintenance headache for autocxx, and we need to zap that.

Ways forward:

  • bindgen's ParseCallbacks now receive quite a bit more information than they used to. Specifically, new_item_found probably enables us to eliminate a few of the bindgen_original_name annotations already, and I hope bindgen will accept PRs to expand the callbacks here — it seems that this callbacks mechanism is exactly intended for Fancy Postprocessing Use-Cases like ours.
  • A fair amount of the forking is attempting to identify unused template parameters. Perhaps bindgen could add a mode where it, err, doesn't — it just ensures all template parameters are used, if necessary by adding in extra PhantomData. I don't know if this is actually possible but this would greatly increase the percentage of C++ types which can be handled by autocxx. I'm trying this in Attempts to include all type params #1425.
  • I had a chat with one of the bindgen maintainers, and it resulted in this issue - they were open in principle to the idea of "pluggable output backends". Maybe that's not necessary given the callback mechanism.

adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 14, 2025
This is a major change to autocxx-bindgen which aims to improve maintainability
by reducing divergence from upstream bindgen. Specifically, it:

* Significantly reduces the textual diffence
* Makes the remaining changes less invasive, such that merge conflicts should
  be easier to resolve
* Redoes the changes such that they're more attuned to the current
  evolution of upstream bindgen, so it may be possible to upstream some or
  ideally all of these changes. (The ultimate goal is to unfork bindgen!)

See google/autocxx#124 and
rust-lang#2943 for the background here.

Specifically this change:
* Removes the #[bindgen_semantic_attributes] which were added to all
  sorts of items. Instead,
* Much more is communicated via the existing ParseCallbacks mechansim.
* In some cases, it's still necessary to annotate individual types -
  in this case we generate a newtype wrapper instead of attributes.

This commit also re-enables the bindgen test suite. It does not yet add
tests for all the above new functionality; that's yet to come.
adetaylor added a commit that referenced this issue Feb 14, 2025
autocxx-bindgen has just been changed substantially to have less dramatic
forking from bindgen.  This PR makes the corresponding changes in autocxx. It's
not quite perfect yet (e.g. some O(n^2) behavior) but should be good enough to
run tests.

Part of #124.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 16, 2025
ParseCallbacks previously reported structs but not enums. Enhance it to do so.
At the moment, little information is provided about enums - but bindgen doesn't
handle (rare) anonymous enums so this seems the right amount of information to
report. At the moment, effectively this just provides a mapping between name
and DiscoveredItemId.

One of a number of PRs I'll be raising for google/autocxx#124.

In future PRs I'll be hoping to add further callbacks which report more
information based on DiscoveredItemId, so having the DiscoveredItemId
for each enum is an important pre-requisite.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 16, 2025
ParseCallbacks previously reported structs but not enums. Enhance it to do so.
At the moment, little information is provided about enums - but bindgen doesn't
handle (rare) anonymous enums so this seems the right amount of information to
report. At the moment, effectively this just provides a mapping between name
and DiscoveredItemId.

One of a number of PRs I'll be raising for google/autocxx#124.

In future PRs I'll be hoping to add further callbacks which report more
information based on DiscoveredItemId, so having the DiscoveredItemId
for each enum is an important pre-requisite.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 16, 2025
These files aspects of bindgen behavior which may not be generally useful to
most consumers but are more important to downstream postprocessors such as
autocxx.

One of them tests enums embedded within classes, and the other tests various
types of C++ constructor.

Part of google/autocxx#124.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 16, 2025
With a new command-line option, this ensures that char16_t
is distinct from uint16_t in generated bindings. On some platforms
these are distinct types, so it can be important for downstream
post processors to spot the difference.

See the documentation on the new command-line option for
expected behavior and usage here.

Part of google/autocxx#124.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 20, 2025
No functional change - just deduplicating the logic which calls this callback,
which will make it easier to make further changes in future.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 20, 2025
This adds more information to ParseCallbacks which indicates the location in
the original source code at which a given item was found.

This has proven to be useful in downstream code generators in providing
diagnostics to explain why a given item can't be represented in Rust. (There
are lots of reasons why this might not be the case - autocxx has around 100
which can be found here -
/~https://github.com/google/autocxx/blob/d85eac76c9b3089d0d86249e857ff0e8c36b988f/engine/src/conversion/convert_error.rs#L39
- but irrespective of the specific reasons, it's useful to be able to point to
the original location when emitting diagnostics).

Should we make this a new callback or include this information within
the existing callback?

Pros of making it a new callback:
* No compatibility breakage.

Pros of including it in this existing callback:
* No need to specify and test a policy about whether such callbacks
  always happen together, or may arrive individually
* Easier for recipients (including bindgen's own test suite) to
  keep track of the source code location received.
* Because we add new items to the DiscoveryItem enum anyway,
  we seem to have accepted it's OK to break compatibility in this
  callback (for now at least).

Therefore I'm adding it as a parameter to the existing callback. If it's
deemed acceptable to break compatibility in this way, I will follow the
same thought process for some other changes too.

Part of google/autocxx#124.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 20, 2025
In its default behavior, bindgen does not emit information for any
C++ operatorXYZ function (unless operatorXYZ is a valid identifier,
which it isn't for C++ overloaded operators).

This PR introduces a new command line option to represent all operators.
This is not useful for those who directly consume the output of bindgen,
but is useful for post-processors. Specifically, consumers will need to
implement the callback to rename these items to something more useful.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 20, 2025
For virtual functions, bindgen has traditionally emitted a void* pointer
because that's the least bad option for Rust code directly consuming the
bindings. For downstream postprocessors and code generators, this throws away
useful information about the actual type of the receiver. Provide a
command-line option to put that back.

Part of google/autocxx#124
github-merge-queue bot pushed a commit to rust-lang/rust-bindgen that referenced this issue Feb 20, 2025
For virtual functions, bindgen has traditionally emitted a void* pointer
because that's the least bad option for Rust code directly consuming the
bindings. For downstream postprocessors and code generators, this throws away
useful information about the actual type of the receiver. Provide a
command-line option to put that back.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 20, 2025
This PR reports the memory layout information (size, alignment, packed-ness)
for each struct into the ParseCallbacks.

The use-case for this is fairly complex. Postprocessors such as autocxx
currently do not necessarily pass all bindgen's output through: autocxx is
opinionated in deciding what C++ types are "reasonable" to expose to Rust. For
example types containing C++ strings aren't represented to Rust
by-value. Instead, these types are represented as "opaque" types in the cxx
sense (https://cxx.rs/extern-c++.html#opaque-c-types). As far as Rust is
concerned, these types have no fields and can only be poked at using methods.

In order to make these allocable on the stack, autocxx _does_ need to know
the layout of these structs, so it can make an "opaque" fake struct in lieu
of the real one. That's what is enabled by this callback. autocxx and other
tools can post-process the bindgen output to replace bindgen's generated
type with an opaque data-only struct like this:

  #[repr(align(16),packed)]
  pub struct TheType {
    _data: UnsafeCell<MaybeUninit<[u8;4]>> // etc.
  } // the actual struct is quite a bit more complex

The extra information provided in the callback within this PR allows that.

For completeness, there are multiple ways that the whole stack could
be rearchitected to avoid the need for this.

1. We could teach bindgen all the rules for when a struct should be
   "opaque", then bindgen could generate it that way in the first place.
   This suggests major extensibility of bindgen to allow highly
   opinionated backends, along the lines of rust-lang#2943.

2. That decision could be delegated to the ParseCallbacks with a new
      fn make_struct_opaque(&self, some_struct_info) -> bool
   call. Although this sounds simple, the decisions required here are
   very complex and depend (for instance) upon all the different
   constructors, ability to represent template params, etc., so
   in practice this would require post-processors to run bindgen twice,
   once to gather all that information and then again to give
   truthful answers to that callback.

3. Post-processors such as autocxx could instead generate opaque types
   like this:
     #[repr(transparent)]
     pub struct TheType {
       _hidden_field:
         UnsafeCell<MaybeUninit<TheTypeAsGeneratedByBindgen>>
     } // it's more complex in reality
   or similar. This turns out to be very complex because
   TheTypeAsGeneratedByBindgen might include complex STL structures
   such as std::string which autocxx currently instructs bindgen to
   replace with a simpler type, precisely so it can avoid all this
   complexity.

Any of these options are months of work either in bindgen or autocxx
or both, so I'm hoping that it's OK to simply emit the layout
information which bindgen already knows.

Compatibility: this is a further compatibility break to the
DiscoveredItem enum. However, it seems to be accepted that we're going
to grow that enum over time. If the compatibility break is a concern
this information could be reported via a separate callback.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 21, 2025
This makes two complementary improvements to the ParseCallbacks.
The first is that Mods are now announced, as a new type of
DiscoveredItem. The second is that the parentage of each item is
announced. The parent of an item is often a mod (i.e. a
C++ namespace) but not necessarily - it might be a struct within
a struct, or similar.

The reported information here is dependent on two pre-existing
bindgen options:
* whether to report C++ namespaces at all
* whether to report inline namespaces conservatively.

For that reason, the test suite gains two new tests.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 21, 2025
This makes two complementary improvements to the ParseCallbacks.
The first is that Mods are now announced, as a new type of
DiscoveredItem. The second is that the parentage of each item is
announced. The parent of an item is often a mod (i.e. a
C++ namespace) but not necessarily - it might be a struct within
a struct, or similar.

The reported information here is dependent on two pre-existing
bindgen options:
* whether to report C++ namespaces at all
* whether to report inline namespaces conservatively.

For that reason, the test suite gains two new tests.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 21, 2025
This change reports extra C++ information about items:

* Whether methods are virtual or pure virtual or neither
* Whether a method is a "special member", e.g. a move constructor
* Whether a method is defaulted or deleted
* C++ visibility (for structs, enums, unions and methods)

It builds on top of rust-lang#3145.

This PR is not yet ready for merge, since we need to merge rust-lang#3139
and then enhance the tests to cover those cases too.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 21, 2025
This change reports extra C++ information about items:

* Whether methods are virtual or pure virtual or neither
* Whether a method is a "special member", e.g. a move constructor
* Whether a method is defaulted or deleted
* C++ visibility (for structs, enums, unions and methods)

It builds on top of rust-lang#3145.

A follow up PR should enhance the tests once rust-lang#3139 is merged.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 22, 2025
Downstream postprocessors such as autocxx may want to use bindgen's
representation of types to generate additional C++ code. bindgen is remarkably
faithful at passing through enough information to make this possible - but for
some C++ types, bindgen chooses to elide some template parameters. It's not
safe for additional C++ code to be generated which references that type.

This adds a callback by which such tools can recognize types where template
parameters have been elided like this, so that extra C++ is avoided.

This information could be provided in the existing
`new_item_found` callback, but it's very niche and unlikely to be used by
the majority of consumers, so a new callback is added instead.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 22, 2025
Downstream postprocessors such as autocxx may want to use bindgen's
representation of types to generate additional C++ code. bindgen is remarkably
faithful at passing through enough information to make this possible - but for
some C++ types, bindgen chooses to elide some template parameters. It's not
safe for additional C++ code to be generated which references that type.

This adds a callback by which such tools can recognize types where template
parameters have been elided like this, so that extra C++ is avoided.

This information could be provided in the existing
`new_item_found` callback, but it's very niche and unlikely to be used by
the majority of consumers, so a new callback is added instead.

An alternative approach here is to provide a mode to bindgen where it
*always* uses all template params, by adding additional PhantomData
fields to structs where those params are not currently used.
This is being prototyped in google/autocxx#1425
but is unlikely to be successful, on the assumption that lots of the
templated types can't actually be properly represented by bindgen/Rust,
so the current strategy of discarding them is more likely to work
in the broad strokes.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 24, 2025
As standard, bindgen treats C++ pointers and references the same in its
output. Downstream postprocessors might want to treat these things
differently; for example to use a CppRef<T> type for C++ references
to encode the fact that they're not (usually) null.

This PR emits references wrapped in a newtype wrapper called
  bindgen_marker_Reference<T>
This behavior is enabled by an option flag; it isn't default.
This type isn't actually defined in bindgen; users are expected
to define or replace this during postprocessing (e.g. by using
syn to reinterpret bindgen's output, or perhaps there are ways
to make this useful using --raw-line or other means to inject
a transparent newtype wrapper).

The same approach is taken to types which bindgen chooses to make
opaque. This is done in various circumstances but the most common
case is for non-type template parameters.

Alternative designs considered:
* Apply an attribute to the function taking or returning
  such a param, e.g.
    #[bindgen_marker_takes_reference_arg(1)]
    fn takes_reference(a: u32, b: *const Foo)
  This avoids the marker type, but the problem here is a
  more invasive change to bindgen because type information
  can no longer be contained in a syn::Type; type metadata
  needs to be communicated around inside bindgen.

* Make a ParseCallbacks call each time a type is opaque
  or a reference. This would work for standalone opaque types,
  e.g.
    pub struct Bar {
     pub _bindgen_opaque_blob: u8
    }
  but the main case where we need these is where bindgen is
  using an opaque or reference type in the signature of some
  function, and there's no real opportunity to describe this
  in any kind of callback, since the callback would have to
  describe where exactly in the function signature the
  opaque or reference type has been used (and then we run into
  the same problems of passing this metadata around in the
  innards of bindgen).

In order to maintain the current simple design where any C++
type is represented as a simple syn::Type, the best approach
seems to be to do this wrapping in a fake marker type.

Another design decision here was to represent an RValue
reference as:
  TypeKind::Reference(_, ReferenceKind::RValue)
rather than a new variant:
  TypeKind::RValueReference(_)
In the majority of cases references are treated the same way
whether they're rvalue or lvalue, so this was less invasive,
but either is fine.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 24, 2025
As standard, bindgen treats C++ pointers and references the same in its
output. Downstream postprocessors might want to treat these things
differently; for example to use a CppRef<T> type for C++ references
to encode the fact that they're not (usually) null.

This PR emits references wrapped in a newtype wrapper called
  bindgen_marker_Reference<T>
This behavior is enabled by an option flag; it isn't default.
This type isn't actually defined in bindgen; users are expected
to define or replace this during postprocessing (e.g. by using
syn to reinterpret bindgen's output, or perhaps there are ways
to make this useful using --raw-line or other means to inject
a transparent newtype wrapper).

The same approach is taken to types which bindgen chooses to make
opaque. This is done in various circumstances but the most common
case is for non-type template parameters.

Alternative designs considered:
* Apply an attribute to the function taking or returning
  such a param, e.g.
    #[bindgen_marker_takes_reference_arg(1)]
    fn takes_reference(a: u32, b: *const Foo)
  This avoids the marker type, but the problem here is a
  more invasive change to bindgen because type information
  can no longer be contained in a syn::Type; type metadata
  needs to be communicated around inside bindgen.

* Make a ParseCallbacks call each time a type is opaque
  or a reference. This would work for standalone opaque types,
  e.g.
    pub struct Bar {
     pub _bindgen_opaque_blob: u8
    }
  but the main case where we need these is where bindgen is
  using an opaque or reference type in the signature of some
  function, and there's no real opportunity to describe this
  in any kind of callback, since the callback would have to
  describe where exactly in the function signature the
  opaque or reference type has been used (and then we run into
  the same problems of passing this metadata around in the
  innards of bindgen).

In order to maintain the current simple design where any C++
type is represented as a simple syn::Type, the best approach
seems to be to do this wrapping in a fake marker type.

Another design decision here was to represent an RValue
reference as:
  TypeKind::Reference(_, ReferenceKind::RValue)
rather than a new variant:
  TypeKind::RValueReference(_)
In the majority of cases references are treated the same way
whether they're rvalue or lvalue, so this was less invasive,
but either is fine.

Part of google/autocxx#124
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 24, 2025
As standard, bindgen treats C++ pointers and references the same in its
output. Downstream postprocessors might want to treat these things
differently; for example to use a CppRef<T> type for C++ references
to encode the fact that they're not (usually) null.

This PR emits references wrapped in a newtype wrapper called
  bindgen_marker_Reference<T>
This behavior is enabled by an option flag; it isn't default.
This type isn't actually defined in bindgen; users are expected
to define or replace this during postprocessing (e.g. by using
syn to reinterpret bindgen's output, or perhaps there are ways
to make this useful using --raw-line or other means to inject
a transparent newtype wrapper).

The same approach is taken to types which bindgen chooses to make
opaque. This is done in various circumstances but the most common
case is for non-type template parameters.

Alternative designs considered:
* Apply an attribute to the function taking or returning
  such a param, e.g.
    #[bindgen_marker_takes_reference_arg(1)]
    fn takes_reference(a: u32, b: *const Foo)
  This avoids the marker type, but the problem here is a
  more invasive change to bindgen because type information
  can no longer be contained in a syn::Type; type metadata
  needs to be communicated around inside bindgen.

* Make a ParseCallbacks call each time a type is opaque
  or a reference. This would work for standalone opaque types,
  e.g.
    pub struct Bar {
     pub _bindgen_opaque_blob: u8
    }
  but the main case where we need these is where bindgen is
  using an opaque or reference type in the signature of some
  function, and there's no real opportunity to describe this
  in any kind of callback, since the callback would have to
  describe where exactly in the function signature the
  opaque or reference type has been used (and then we run into
  the same problems of passing this metadata around in the
  innards of bindgen).

In order to maintain the current simple design where any C++
type is represented as a simple syn::Type, the best approach
seems to be to do this wrapping in a fake marker type.

Another design decision here was to represent an RValue
reference as:
  TypeKind::Reference(_, ReferenceKind::RValue)
rather than a new variant:
  TypeKind::RValueReference(_)
In the majority of cases references are treated the same way
whether they're rvalue or lvalue, so this was less invasive,
but either is fine.

Part of google/autocxx#124
@adetaylor adetaylor added the architectural Architectural improvement label Feb 24, 2025
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 26, 2025
Provide an option to add a line to every mod generated by bindgen.

Part of google/autocxx#124, though
this is only in fact necessary if we make the changes given in
google/autocxx#1456.
adetaylor added a commit to adetaylor/rust-bindgen that referenced this issue Feb 26, 2025
This uses an existing callback to allow overriding of constructor name.

Part of google/autocxx#124, though only
necessary if we also do google/autocxx#1456.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
architectural Architectural improvement
Projects
None yet
Development

No branches or pull requests

3 participants