- Author(s): gnossen
- Approver: lidizheng
- Status: Implemented
- Implemented in: Python
- Last updated: April 9, 2020
- Discussion at: https://groups.google.com/g/grpc-io/c/fuvDt-jZL2k
Generated protobuf code is hard to manage for those without monorepos equipped with a hermetic build system. This has been echoed by the maintainers of libraries wrapping gRPC as well as direct users of gRPC. In this document, we aim to lay out a more Pythonic way of dealing with Protocol Buffer-backed modules.
gRPC and protocol buffers evolved in a Google ecosystem supported by a monorepo and hermetic build system. When engineers check a `.proto` file into source control using that toolchain, the resultant code generated by the protocol compiler does not end up in source control. Instead, the code is generated on demand during builds and cached in a distributed build artifact store. While such sophisticated machinery has started to become available in the open source community (e.g. Bazel), it has not yet gained much traction. Instead, small Git repos and non-hermetic, language-specific build tools are the norm. As a result, code generated by the protocol compiler is often checked into source control alongside the code that uses it.
At the very least, this results in surprises when an update to a `.proto` file does not result in an update to the behavior of the user's code. Worse, when client and server code lives in separate repos, this can result in aliasing, where one repository houses generated code from an earlier version of the protocol buffer than the other.
Open source users are aware of this gap in the ecosystem and are actively looking for ways to fill it. Many have settled on protocol buffer monorepos as a solution, wherein all `.proto` files for an organization are placed in a single source code repository and included by all other repositories as a submodule. But even this is not a complete solution: some mechanism must still be put in place for the repositories housing the client and server code to retrieve the desired protocol buffers and generate code for the target language.
The protocol compiler paired with Google's build system also means that the average engineer never has to manually invoke `protoc`. Instead, when an update is made to a `.proto` file and some code file references it, the code for the protocol buffer is regenerated without any manual intervention on the part of the engineer. Compare that to today's workflow for gRPC users:
- Update the `.proto` file.
- Manually regenerate the code (remembering how to use all of the CLI flags).
- Make the necessary corresponding updates to code using the protocol buffer.
- Rerun.
It's easy for several of those steps to slip one's mind while developing. Moreover, figuring out how to invoke `protoc` in a way that meshes with your imports can be quite difficult. Python developers in particular are unused to build-time steps such as these; it is much more common to perform them at runtime.
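For concreteness, the manual regeneration step today typically involves an invocation along the following lines, using the `grpc_tools.protoc` module shipped with `grpcio-tools`; the file name `foo.proto` and the output directories are placeholders:

```python
from grpc_tools import protoc

# Roughly equivalent to:
#   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. foo.proto
protoc.main([
    'grpc_tools.protoc',     # conventional program-name argument
    '-I.',                   # proto include path; must line up with your imports
    '--python_out=.',        # where foo_pb2.py is written
    '--grpc_python_out=.',   # where foo_pb2_grpc.py is written
    'foo.proto',
])
```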
Today, users import protobuf code by invoking `protoc` to generate files such as `foo_pb2.py` and `foo_pb2_grpc.py` and then including the following lines of code:

```python
import foo_pb2
import foo_pb2_grpc
```
Under this proposal, they will also have the option of skipping the manual `protoc` invocation entirely and instead writing:

```python
protos = grpc.protos('foo.proto')
services = grpc.services('foo.proto')
```
These two new functions return the same module objects as the `import foo_pb2` and `import foo_pb2_grpc` statements. In order to maintain interoperability with any proto-backed modules loaded in the same process, after these functions are invoked, `import foo_pb2` and `import foo_pb2_grpc` will be no-ops. That is, a side effect of calling `grpc.protos` and `grpc.services` is the insertion of the returned modules into the per-process module cache. This ensures that, regardless of whether the application calls `grpc.protos('foo.proto')` or `import foo_pb2`, and regardless of the order in which it does so, only a single version of the module will ever be loaded into the process. This avoids situations in which interoperability breaks because two modules expect the same protobuf-level message type but two different Python-level `Message` classes, one backed by a `_pb2.py` file and one backed by a `.proto` file.
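As a minimal sketch of this caching behavior (assuming a `foo.proto` resolvable on `sys.path`), the two import mechanisms should yield the very same module object:

```python
import grpc

# Load the module from the .proto file at runtime.
protos = grpc.protos('foo.proto')

# Because the returned module was inserted into the per-process module
# cache, a subsequent conventional import is a no-op and yields the same
# object rather than a second, incompatible copy.
import foo_pb2
assert protos is foo_pb2
```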
The wrapper function around the import serves several purposes here. First, it puts the user in control of naming the module (in a manner similar to JavaScript), meaning the user never has to concern themself with the confusing `_pb2` suffix. Second, the function provides a wrapping layer through which the library can provide guidance in the case of failed imports.
To be precise, we propose the introduction of three new functions with the following signatures:
```python
def protos(proto_file: Text,
           runtime_generated: Optional[bool] = True) -> types.ModuleType:
    pass


def services(proto_file: Text,
             runtime_generated: Optional[bool] = True) -> types.ModuleType:
    pass


def protos_and_services(proto_file: Text,
                        runtime_generated: Optional[bool] = True) -> Tuple[types.ModuleType, types.ModuleType]:
    pass
```
The final function, `protos_and_services`, is a simple convenience allowing the user to import protos and services in a single function call. All three of these functions will be idempotent. That is, like the Python built-in `import` statement, after an initial call, subsequent invocations will not result in a reload of the `.proto` file from disk.
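A brief usage sketch of the convenience function and the idempotency guarantee (again assuming a resolvable `foo.proto`):

```python
import grpc

# Import message and service modules in a single call.
protos, services = grpc.protos_and_services('foo.proto')

# Idempotency: a second call does not re-read foo.proto from disk and
# returns the already-instantiated module objects.
assert grpc.protos('foo.proto') is protos
assert grpc.services('foo.proto') is services
```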
The change will be entirely backward compatible. Users manually invoking `protoc` today will not be required to change their build process.
These functions will behave like normal `import` statements. `sys.path` will be used to search for `.proto` files. The path under which each particular `.proto` file was found will be passed to the protobuf parser as the root of the tree (equivalent to the `-I` flag of `protoc`). This means that a file located at `${SYS_PATH_ENTRY}/foo/bar/baz.proto` will result in the instantiation of a module with the fully qualified name `foo.bar.baz_pb2`. Users are expected to have a directory structure mirroring their desired import structure.
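For example, using the hypothetical file from the preceding paragraph, and assuming the module is registered under its fully qualified name as described:

```python
import grpc

# ${SYS_PATH_ENTRY}/foo/bar/baz.proto is found by searching sys.path;
# ${SYS_PATH_ENTRY} is treated as the proto root (like protoc's -I flag).
baz_protos = grpc.protos('foo/bar/baz.proto')

# The instantiated module carries the fully qualified name described above.
assert baz_protos.__name__ == 'foo.bar.baz_pb2'
```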
Users have reported that getting Protobuf paths to match Python import paths is quite tricky today using `protoc`. In the case of a failure to import, the library will print troubleshooting information alongside the error.
In general, our recommendation will be for users to align their Python import path, Python fully qualified module name, Protobuf import path, and Protobuf package name. This will ensure that only one entry will ever be required on the `PYTHONPATH`. As an example, suppose you had the following file at `src/protos/foo/bar.proto`:
syntax = "proto3";
package foo;
message BarMessage {
...
}
And the following at `src/protos/foo/baz.proto`:
```protobuf
syntax = "proto3";

package foo;

import "foo/bar.proto";

...
```
Then, ensuring that the `src/protos/` directory is on `sys.path`, either by running from that directory or by specifying it with the `PYTHONPATH` environment variable, you will be able to import as follows:
```python
bar_protos = grpc.protos('foo/bar.proto')
baz_protos = grpc.protos('foo/baz.proto')
```
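The message classes defined in those files are then available as attributes of the returned modules, for example (continuing with the `BarMessage` defined above):

```python
import grpc

bar_protos = grpc.protos('foo/bar.proto')

# Construct a message from the runtime-loaded module, exactly as one would
# with a module generated ahead of time by protoc.
message = bar_protos.BarMessage()
assert message.DESCRIPTOR.full_name == 'foo.BarMessage'
```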
The critical bit here is that all `import` statements within `.proto` files must be resolvable along a path on `sys.path`. Suppose instead that `baz.proto` had imported `bar.proto` with `import "src/protos/foo/bar.proto"`. Now, in order to resolve the import, at least two paths would have to be on `sys.path`: the repo root and `src/protos/`. For simplicity's sake, the root used both for calls to `grpc.protos` and for the protobuf `import` statements should be unified.
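If running from the repository root rather than from `src/protos/`, a small amount of `sys.path` manipulation keeps a single, unified proto root. This is purely an illustrative sketch of the recommendation above:

```python
import os
import sys

import grpc

# Make src/protos/ the single root for both the grpc.protos() lookups and
# the protobuf-level import statements.
sys.path.insert(0, os.path.abspath('src/protos'))

bar_protos = grpc.protos('foo/bar.proto')
baz_protos = grpc.protos('foo/baz.proto')  # its `import "foo/bar.proto"` now resolves
```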
In order to avoid naming clashes between protos you've authored yourself and any other protos pulled in by third-party dependencies, the root directory of your proto source tree should have a universally unique name. Ideally, this uniqueness should be guaranteed by your organization having ownership of the package by that name on PyPI. See PEP 423 for a deeper discussion of how package name uniqueness is handled in the Python ecosystem.
It should be noted that, in practice, it is possible to use these new functions to load totally arbitrary protos. Suppose you wrote a server that took `.proto` files as inputs from clients, instantiated modules from them, and returned some data about each file, for example, the number of message types it contains. This could become problematic as new syntax features are added to the Protobuf specification; in the worst case, it would require a redeploy of the server with a sufficiently up-to-date version of `grpcio-tools`. Regardless, we claim no support for this use case. The intent of these functions is to enable import of fixed `.proto` files known at build time.
gRPC makes a point not to incur a direct dependency on protocol buffers, and it is not the intent of this feature to change that. Instead, the implementations of these new functions will live in the `grpcio-tools` package, which necessarily already has a hard dependency on `protobuf`. If the `grpcio` package finds that `grpc_tools` is importable, it will import it and use the implementations found there to back the `protos` and `services` functions. Otherwise, it will raise a `NotImplementedError`.
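A minimal sketch of how this dispatch might look inside `grpcio` is below. The helper name `_protos` inside `grpc_tools` is an assumption used purely for illustration, not a documented API, and error handling is simplified:

```python
import types
from typing import Optional, Text


def protos(proto_file: Text,
           runtime_generated: Optional[bool] = True) -> types.ModuleType:
    """Sketch of the delegation described above, not the actual implementation."""
    try:
        # The implementation (and the hard protobuf dependency) lives in grpcio-tools.
        from grpc_tools import protoc as _protoc
    except ImportError:
        raise NotImplementedError(
            'grpc.protos() requires the grpcio-tools package to be installed.')
    # _protos is a hypothetical helper name; the real entry point may differ.
    # Handling of runtime_generated is omitted for brevity.
    return _protoc._protos(proto_file)
```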
In order to take advantage of newer language features, support will only be added for Python versions 3.6+. All other versions will result in a `NotImplementedError`.
This proposal has been implemented here. This implementation uses the C++ protobuf runtime already built into the `grpcio-tools` C extension to parse the protocol buffers and generate textual Python code in memory. This code is then used to instantiate the modules to be provided to the calling application.
Consideration was given to implementing the functionality of what is here presented as `grpc.protos` in the `protobuf` Python package. After a thorough investigation, we found that this feature could not be implemented in a way that satisfied the compatibility requirements of both the gRPC library and the Protobuf library. However, we are able to provide all desired functionality with an implementation entirely in the `grpcio` package. If at some point in the future this is no longer the case and an implementation of the required functionality in the `protobuf` repo is feasible, our `grpc.protos` function can simply proxy to it.