Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

teach codegen rules to accomodate compiled languages #14258

Closed
tdyas opened this issue Jan 26, 2022 · 10 comments · Fixed by #14707
Closed

teach codegen rules to accomodate compiled languages #14258

tdyas opened this issue Jan 26, 2022 · 10 comments · Fixed by #14707
Assignees

Comments

@tdyas
Copy link
Contributor

tdyas commented Jan 26, 2022

Problem

The existing codegen rules do not work for compiled languages because those core rules treat generated sources as loose files and provide no way to wrap that generated code in the target types needed by the backends for compiled languages to actually build that code.

This is not a problem for Python (and other scripting languages) because mere existence of the generated sources is enough for the Python interpreter to consume those files. Contrast that with the Go backend which needs to build those sources into a package archive that is then linked into Go binaries. The Go backend tracks each package by using the go_package and go_third_party_package target types. (The go_third_party_package targets are generated automatically by processing the applicable go.mod file.) Loose source files cannot be processed by the Go backend.

The same applies for the Java and Scala backends.

Potential Solutions

Pants will need either a modification of the core codegen rules or a new codegen API to support compiled languages. The new API could use target generation to generate language-specific target types to wrap generated code. Further discussion in the comments.

@tdyas
Copy link
Contributor Author

tdyas commented Jan 26, 2022

Using Scala and Protobuf as an example, one solution could leverage target generation as follows:

  1. Define a _scala_protobuf_sources target type with a dependency on its protobuf_sources target. (This would be used internally to wrap generated Scala sources, and would not be exposed to the user.)
  2. Modify the protobuf_sources target type to make it a target generator by implementing GenerateTargetsRequest for it. The applicable target generation rule should look for a generate field on the parent protobuf_sources target to define which wrapper targets to generate. Specifically, the rule should generate the _scala_protobuf_sources target if scala is present in the values for the generate field; for example, generate=["scala"].
  • Open question: Individual language backends should probably register using a UnionRule with each particular codegen technology. Thus, there could be a "GenerateProtobufCodegenTargetsRequest` union that the Scala, Java, and Go codegen backends would register for. The protobuf codegen rule would then request each language-specific backend to make its wrapper target.
  1. Modify the rules for HydrateSourcesRequest / SourceFilesRequest to recognize when a wrapper target for codegen is having its sources requested and run the applicable codegen tool at that point. That will generate a Snapshot which is what upstream users need to process the sources.
  • Open question: Maybe use a mix-in on the target type or a marker field?

More open questions:

  1. The new codegen rules should probably run dependency inference and injection (InferDependenciesRequest and InjectDependenciesRequest respectively) on the generated sources. This may require eagerly running the codegen tool during target generation instead of later during the actual build. This solution will need to think through how and when to do that.
  • Motivation: The Go backend relies heavily on dependency inference to infer dependencies from the import paths of Go packages. Generated Go code for protobuf has a non-trivial number of imports which would be annoying for developers to have to resolve manually. (And given the wrapper targets are not intended to be exposed to users, they may not be able to do so.) The Java and Scala backends also rely on dependency inference to infer which sources must be compiled together.
  1. Do users need to ever manually set dependencies? I.e., should the wrapper targets be accessible to users? (This would mean exposing LANG_TECH_sources target types for each language and codegen technology, e.g. scala_protobuf_sources, go_protobuf_package, etc.)

@tdyas
Copy link
Contributor Author

tdyas commented Jan 26, 2022

The choice of how to specify what backends to run impacts #14041.

@stuhood
Copy link
Member

stuhood commented Feb 3, 2022

For the purposes of the JVM at least, I think that this can be accomplished by making

@staticmethod
def for_targets(
union_membership: UnionMembership,
component: CoarsenedTarget,
resolve: CoursierResolveKey,
*,
root: bool = False,
) -> ClasspathEntryRequest:
"""Constructs a subclass compatible with the members of the CoarsenedTarget.
If the CoarsenedTarget is a root of a compile graph, pass `root=True` to allow usage of
request types which are marked `root_only`.
"""
... codegen-aware.

To do that, I think that that method should call into some non-rule code extracted from the existing GenerateSourcesRequest-consuming codegen logic to additionally account for the fact that a codegen impl can generate the relevant inputs.

This would allow for inner nodes in the graph (i.e., things which are depended on by non-codegen'd sources) to be codegen'd, which is 95% of what we need. The last 5% would be allowing codegen targets to be roots in, say, check, but I don't think that that blocks anything.

@Eric-Arellano : Does the above sound right?

tdyas pushed a commit that referenced this issue Feb 4, 2022
…release (#14351)

As described in #14258, codegen backends for compiled languages do not currently work since the core codegen rules only supporting generating sources, but not generating targets to allow the backends for compiled languages to actually compile the sources.

For now, remove the registration code for those backends so users do not try to use them.
@tdyas tdyas self-assigned this Feb 16, 2022
@tdyas
Copy link
Contributor Author

tdyas commented Feb 17, 2022

As explained below, I propose we model codegen for compiled languages by introducing target types for each language and codegen technology; for example, scala_protobuf_sources, java_thrift_sources, and go_protobuf_sources. A motivating factor is that allows us to easily deal with fields like resolve, which are necessary for JVM languages in the current Pants design while alternate solutions make it infeasible to deal with that field. Moreover, it provides a way for other Pants features like dependency inference to operate as intended with compiled languages.

Let's consider alternate solutions:

  1. If we were to put the JVM's resolve directly field on protobuf_sources or thrift_sources like python_source_root already is, there will be a conflict between the JVM resolve field and the Python resolve field since it would probably make sense to include the Python resolve field as well. And while we could "rename" the field by adding jvm_resolve and python_resolve fields, I submit this will unreasonably complicate the developer experience.

  2. Over time, we invariably will probably have to add more language-specific fields to protobuf_sources and thrift_sources, which will increase the complexity of those target types. We would have to rename fields to have the language-specific prefix to reduce confusion, we might as well just have separate target types at that point.

  3. The proposed JVM-only solution would not solve codegen issues for the Go backend. We already have explicit user interest in Go protobuf support. The model for codegen going forward should be consistent for all language backends; that will be easier to teach to users.

  4. Regarding a generator field on protobuf_sources and thrift_sources to specify which languages to generate, this has the same complications as in (2): It packs the different languages confusingly into one target which conflicts with how we teach targets as "independent units of the build." The generated code arguably has its own identity separate from the original sources used to generate it. The generated code is its own "independent unit of the build" and thus a target separate from the original sources target. This will allow dependency inference to operate normally for the JVM languages and Go, which is necessary to be able to build those languages in our current modeling.

Proposed solution:

  1. Introduce language-specific codegen targets. Naming will be LANG_TECH_sources. This will be called a "codegen target type" with the protobuf/thrift target will be the "original sources target type." Codgen targets will eagerly trigger code generation so that dependency inference and injection can run on the generated sources.

  2. A dependency between the language-specific codegen target (e.g., scala_protobuf_sources and any single original sources target (e.g., protobuf_sources) in a directory will automatically be inferred.

    • Open question: Should this be via dependencies field or a new generate_from field?
    • It should be an error for Pants to be unable to infer the original sources target for a codegen target.
  3. Deprecate existing plugin fields on the original sources target types, e.g. python_source_root, and move them to the language-specific codegen target type.

  4. Open question: How should tailor automatically detect use of the language-specific targets? Could "dependency redirection" work? For example, if a scala_sources target had a dependency on a protobuf_sources target, then Pants could infer a dependency from the scala_sources target to the scala_protobuf_sources target in the same directory (maybe using the dependency inference in (2) above). If tailor sees the original dependency on the protobuf_sources target, then it knows to generate the scala_protobuf_sources target.

    • Python dependency inference does not need explicit deps. Needs some thought on how tailor could detect this case.

@stuhood
Copy link
Member

stuhood commented Feb 17, 2022

I'm not quite convinced of the need for per-language target types, as plugin fields have already been demonstrated to work for python_source_root. While needing to rename the resolve field to prefix with a language is slightly awkward it seems less awkward than duplicated target types.

At a fundamental level, the target API is "field" centric... the extra target type to encapsulate slightly different fields doesn't seem like it adds much here.

Additionally, we had a "multiple target types" implementation in v1, and the impact of that was a proliferation of macros to avoid the boilerplate involved in having X copies of a target. At Twitter it was create_thrift_libraries, which created X language-specific targets, which was necessary given the API, but mostly just ended up obscuring which target types were being created, and with which arguments.

  1. Regarding a generator field on protobuf_sources and thrift_sources to specify which languages to generate, this has the same complications as in (2): It packs the different languages confusingly into one target which conflicts with how we teach targets as "independent units of the build." The generated code arguably has its own identity separate from the original sources used to generate it. The generated code is its own "independent unit of the build" and thus a target separate from the original sources target. This will allow dependency inference to operate normally for the JVM languages and Go, which is necessary to be able to build those languages in our current modeling.

Adding a "generator" field should only actually be necessary to remove ambiguity in the case where multiple conflicting instances are installed. I expect that if/when we need to add a field like this, the solution can be lazer focused within this method:

if request.enable_codegen and len(relevant_generate_request_types) > 1:
raise AmbiguousCodegenImplementationsException(
relevant_generate_request_types, for_sources_types=request.for_sources_types
)
("when there is more than one matching implementation, consult the generator field value").

When a generator field type becomes necessary, it will also be compatible with parametrize, in case you do actually want more than one.

Codegen targets will eagerly trigger code generation so that dependency inference and injection can run on the generated sources.

This is the fundamental issue that needs to be resolved by this ticket, so it should probably have more than a footnote. It's not clear how having per-language targets makes this any easier (the API is field centric rather than target-type centric, after all).

@tdyas
Copy link
Contributor Author

tdyas commented Feb 17, 2022

This is the fundamental issue that needs to be resolved by this ticket, so it should probably have more than a footnote. It's not clear how having per-language targets makes this any easier (the API is field centric rather than target-type centric, after all).

Yes and is actually one of the fundamental points that I am still stuck on. Even if protobuf_sources is the only target for protobuf codegen, my initial work this morning points at needing to do dependency inference / target generation in two phases, once to get the protobuf-to-protobuf dependencies and then again after doing code generation to get the, for example, Scala-to-Scala dependencies.

@stuhood
Copy link
Member

stuhood commented Feb 17, 2022

my initial work this morning points at needing to do dependency inference / target generation in two phases, once to get the protobuf-to-protobuf dependencies and then again after doing code generation to get the, for example, Scala-to-Scala dependencies.

In practice, codegenerated code should never actually need dependency inference, because it cannot depend on non-thirdparty code which hasn't itself been generated. So it should always (?) be possible to declare the dependencies of one of these targets before generation.

@tdyas
Copy link
Contributor Author

tdyas commented Feb 17, 2022

In practice, codegenerated code should never actually need dependency inference, because it cannot depend on non-thirdparty code which hasn't itself been generated. So it should always (?) be possible to declare the dependencies of one of these targets before generation.

For Go, some form of analysis will be necessary (whether that is the existing "dependency inference" rules is open still for this design, but running our Go package analyzer will be necessary regardless). The generated code contains import statements to specific third-party runtime libraries and (this is the annoying part) to other generated code included in the file being generated. See https://github.com/golang/protobuf/blob/5d5e8c018a13017f9d5b8bf4fad64aaa42a87308/protoc-gen-go/generator/generator.go#L1268. Although for the latter, those dependencies should be just the "generated" form of the dependency already inferred for the protobuf source file.

@jsirois
Copy link
Contributor

jsirois commented Feb 17, 2022

A note for the future:

In practice, codegenerated code should never actually need dependency inference, because it cannot depend on non-thirdparty code which hasn't itself been generated.

This assumes a 3rdparty code generator I think. A bespoke in-tree code-generator can certainly depend on 1st party code for a runtime support library. I think one of the motivating cases that stressed the v0 / v1 was 4sq's bespoke in-tree code generator of which I cannot recall the name.

@stuhood
Copy link
Member

stuhood commented Feb 23, 2022

In practice, codegenerated code should never actually need dependency inference, because it cannot depend on non-thirdparty code which hasn't itself been generated.

This assumes a 3rdparty code generator I think. A bespoke in-tree code-generator can certainly depend on 1st party code for a runtime support library. I think one of the motivating cases that stressed the v0 / v1 was 4sq's bespoke in-tree code generator of which I cannot recall the name.

Yep, makes sense. Discussed with @tdyas a bit offline, but: the core @rules of Pants using a single target type does not preclude a private/external plugin from declaring their own target types if they want to infer/inject additional dependencies.

@stuhood stuhood assigned stuhood and unassigned Eric-Arellano Feb 28, 2022
Eric-Arellano pushed a commit that referenced this issue Mar 2, 2022
Add support for generating Go from protobuf sources using the [protoc-gen-go](https://pkg.go.dev/google.golang.org/protobuf/cmd/protoc-gen-go) and [protoc-gen-go-grpc](https://pkg.go.dev/google.golang.org/grpc/cmd/protoc-gen-go-grpc) plugins to `protoc`.

This is not actually wired up yet to Go because we need to solve #14258, but it does give us the technology to generate the `.go` files.

Note that this adds a new technique for us to install a Go tool deterministically. For now, the `go.mod` and `go.sum` are hardcoded, but we can choose to expose this through the options system in the future if need be.

[ci skip-rust]
Eric-Arellano added a commit that referenced this issue Mar 4, 2022
Closes #14258. As described there, codegen for compiled languages is more complex because the generated code must be _compiled_, unlike Python where the code only needs to be present.

We still use the `GenerateSourcesRequest` plugin hook to generate the raw `.go` files and so that integrations like `export-codegen` goal still work. But that alone is not powerful enough to know how to compile the Go code. 

So, we add a new Go-specific plugin hook. Plugin implementations return back the standardized type `BuildGoPackageRequest`, which is all the information needed to compile a particular package, including by compiling its transitive dependencies. That allows for complex codegen modeling such as Protobuf needing the Protobuf third-party Go package compiled first, or Protobufs depending on other Protobufs.

Rule authors can then directly tell Pants to compile that codegen (#14705), or it can be loaded via a `dependency` on a normal `go_package`.

[ci skip-rust]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants