For now, the only way to obtain documentation is to generate it. You can only generate it once the project is installed (Sphinx loads Python modules in order to generate documentation, and unless the project is installed, the imports will not work).

The way to do this is:

```sh
virtualenv ./.venv
. ./.venv/bin/activate
./setup.py install
./setup.py build_sphinx
```
This project is an attempt to re-implement a Protocol Buffers parser (for both source and binary formats) as a C extension for Python.

Only version 3 of Protocol Buffers is and will be supported. Only Python 3 (and, possibly, future Python versions) is supported.
Unfortunately, `protoc`, the official implementation of Protocol Buffers, accepts a lot more than the specification allows; in particular, it accepts virtually any Protobuf v2 syntax in Protobuf v3 source files. This implementation tries to match this behavior.
Imports are resolved differently in this implementation than in `protoc`: this implementation allows parsing source files in parallel, which resolves certain edge cases of import statements differently from how `protoc` does. This is intended.
Protobuf was initially written for C++, where run-time code generation is problematic. When the C++ approach was copied into other languages, it may or may not have made sense there. In particular, in Python, having generated Python code for parsing messages doesn't make much sense. This is why this implementation doesn't create descriptors or Python sources that do the deserialization. All you need to do is point the parser at the Protobuf source files.
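For illustration, here is a minimal sketch of that workflow. It reuses the `BinParser` API from the substitution example further down; the file name, message name and payload are placeholders, and the import path of `BinParser` is an assumption:

```python
# A minimal sketch, assuming BinParser lives in protopy.parser
# (the class itself appears in the examples later in this document).
from protopy.parser import BinParser

# '.' is the list of directories used to resolve imports in .proto files.
parser = BinParser(['.'])

# Parse a binary payload according to the 'Wrapper' message from test.proto.
# Here 'content' is a placeholder for bytes received over the wire.
content = b''
result = parser.parse('test.proto', 'Wrapper', content)
```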
- Options are parsed but ignored. Options are needed for `protoc` plugins to do pre- or post-processing on generated sources. `protopy` works in a very different way, so none of the `protoc` plugins would even make sense here.
- Options on fields (such as `[default=true]` on `bool` fields) are ignored. The behavior of fields that weren't sent over the wire is different: they will always be set to `None`. There is no concept of default values. This is intended (see the sketch after this list).
- Fields sent over the wire that are not in the description are silently ignored by `protoc`-generated code, but result in errors in `protopy`. In general, I find the latter to be the saner approach. In the future, I might consider making this configurable.
- `protopy` doesn't attempt to find errors in your definitions. It may allow things that aren't valid for `protoc`. If you want better error reporting, `protoc` is still the better tool. One particular difference here is how definitions get resolved to their full names. It is technically possible to create an ambiguous name using the Protobuf IDL, because it is not possible to tell which part of the name represents the package and which part represents the local name. Thus:

  ```
  foo.bar.baz
      ^--- local name: "bar.baz"
  ^------- package: "foo"
  ```

  or

  ```
  foo.bar.baz
          ^-- local name: "baz"
  ^---------- package: "foo.bar"
  ```

  `protoc` will actually alert you when you have this kind of conflict. `protopy` will use whichever definition it encountered last.
- Zeroth enumeration values: my reading of the documentation is that

  ```protobuf
  enum Foo { BAR = 0; BAZ = 1; }
  ```

  shouldn't even be valid. But `protoc` accepts it. Actually, it requires enums to have at least one member, and at least one member whose value is zero. `protopy` makes no distinction between the zeroth member of an enum and the others, but since `protoc` assigns a special meaning to the zeroth member (it will be used as the default if no value is given for the field), you should always give every enumeration a zeroth member to stay compatible.
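To make the missing-field behavior from the list above concrete, here is a hedged sketch. It assumes a hypothetical `test.proto` defining a message `Foo` with a `bool` field `flag`, that `BinParser` can be imported as in the earlier sketch, and that parsed messages expose their fields as attributes (as the substitution example further down suggests):

```python
from protopy.parser import BinParser  # import path assumed

parser = BinParser(['.'])

# An empty payload is a valid wire encoding of a message with no fields set.
result = parser.parse('test.proto', 'Foo', b'')

# protoc-generated code would report the proto3 default (False) for flag;
# protopy reports None, because the field was never sent over the wire.
assert result.flag is None
```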
Descriptors in Protobuf deserve a separate paragraph.
Imagine you are a C++ programmer in the late '90s or early 2000s. Now imagine the insurmountable task of having to generate C++ code that creates static data structures describing the layout of your data. Well, you are in a pickle. You only have literals to describe numerical values, and C-style arrays. Initializer lists don't really work. I mean, they do, but nesting is horribly broken.
This is how the creative and crafty authors of Protobuf invented descriptors. Here's the basic idea: instead of encoding arbitrarily complex structures that describe Protobuf messages in the C++ source code, one would embed a serialized description of those same structures. One would only need to hand-code the meta-structure, the `Descriptor`. Thus, a descriptor is an actual Protobuf message, which you can find here. The code generated by `protoc` creates a description of every Protobuf message, serializes it using this message definition, and then embeds it in the generated code in binary form.
On top of information pertaining to the message itself, generated descriptors contain auxiliary information about certain features of the serialized message, such as the location of the original source file containing the message definition, and even a scheme for translating Protobuf message field names into JSON. The latter deserves an especially warm place in hell, but we'll deal with it later.
Well, now you are a Python programmer in the late second decade of the 21st century… Do you have a problem encoding complex data structures such as lists and hash tables using Python syntax? Obviously not. But the generator code for Protobuf was already written to use descriptors, and major changes are hard. With time, however, the developers of `protoc` realized that descriptors don't capture all the information they need in order to create Protobuf types, and this is why, when you open a generated `*_pb2.py` file (misnamed, by the way, because Protobuf v3 also generates files with this extension), you see masses of copy-and-pasted, extremely poorly named and equally poorly formatted code.
So, you have this hodgepodge mess of descriptors and badly generated Python code. But generated code has a particular aura of magic around it: all those abbreviated, vague terms with lots of underscores thrown in make you think that something very complex and important is going on there, something you'd better not even try to understand because it's so far beyond your mere-mortal abilities.
However, armed with some experience, you may be far too quick to dismiss this not-so-valuable feature (which is what I did). Here’s the payback.
There is nothing that says field names in Protobuf should be written in snake case, but it must have been the convention accepted at Google at the time the format was created. Similarly, there is no particular reason why JSON field names should use camel case, but it's kind of a tradition in JavaScript, so maybe in JSON too… but not really, no.
However, in their eternal wisdom, the `protoc` authors decided to provide a "standardized" way to convert between Protobuf and JSON. The idea is very similar to how the JSON parser and serializer are implemented in Go: they translate JSON names into Go names using a predefined scheme, but one can also augment Go structs with "tags", which may completely alter the way field names are translated.
Protobuf supports this through a special option you can attach to message fields: `json_name`. You don't need to know this information: in the overwhelming majority of cases you can correctly predict the desired JSON field name given only the Protobuf field name. Unfortunately, this will not work all the time.
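To illustrate how predictable the default mapping usually is, here is a small sketch of the common snake-case to camel-case conversion. This is an approximation, not `protoc`'s exact algorithm, and a `json_name` option can override the result entirely, which is precisely what makes the name impossible to predict from the field name alone:

```python
def default_json_name(field_name: str) -> str:
    """Guess the JSON name of a Protobuf field by capitalizing every
    underscore-separated segment after the first and joining them."""
    head, *rest = field_name.split('_')
    return head + ''.join(part.capitalize() for part in rest)


assert default_json_name('wrapped_type') == 'wrappedType'

# Nothing, however, prevents a definition such as:
#     string wrapped_type = 1 [json_name = "WRAPPED-TYPE"];
# in which case the guess above is simply wrong.
```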
As you might have figured out by now, I think that `json_name` was a horrible idea, something that increased parsing complexity with no real benefit to the user. This is why I don't and will not support it in this implementation.
I have no plans to support this feature: `service` and `rpc` definitions are accepted by the parser but result in empty nodes.
Performance is in the gutter… C++ rocks, Python sucks. I have finally more or less finished the generator and the performance tester. Here's what I've learned:
I assumed there would be no difference between using Python data structures and C++ data structures to store the contents of the messages. I've never been so wrong.
Here's how Google's Protobuf for Python does it: it parses the messages into C++ structs and exposes accessors to Python, so whenever the Python side of the code wants to know a value from that struct, the value is converted on demand.
`protopy` parses everything into Python tuples, dictionaries, enums and lists. So, what's the bottom line? `protopy` is 3 to 8 times slower than `protobuf`.

It is still about 10 times faster than the pure Python implementation of `protobuf`. The speed will also depend a lot on the structure of your messages. Smaller messages with fewer nested messages will be parsed faster by `protopy` than by `protobuf`. Messages with a lot of nesting, long lists and especially hash tables will tend to parse faster with `protobuf`.
I’m considering rewriting the parser to use a C struct for storing message data rather than using Python’s data-structures, but it will be a long process.
You can find the benchmark project in `test/performance`.
Well, the not-so-bad news: on Python 3.7, `protopy` runs at only about half the speed of the C++ code.
The approach taken by this library differs from how the `protobuf` module works. `protobuf` is designed in a way that is, perhaps, suitable for C++ or Java, but it is foreign to Python on a conceptual level. Python's actual type system is structural at heart. This means that types in Python are defined in terms of the operations, such as addition and multiplication, supported by their basic constituents. Thus, given:
```python
from collections import namedtuple


class Foo:
    def __init__(self, x, y):
        self.x = x
        self.y = y


Bar = namedtuple('Bar', 'x, y')


def baz(arg):
    print('arg: {} {}'.format(arg.x, arg.y))
```

`Foo(1, True)` is of the same type as `Bar(1, True)` from the standpoint of `baz`.
Java's type system, by contrast, is nominative at heart. Types are almost never considered the same unless they are intentionally defined to be interchangeable in the source of the program.
Since serialization operates across program boundaries, there is no real way for environments like Java or C++ to ensure type correctness of serialized data. And this is why, I believe, the approach taken by this library is better. `protopy.Serializer` doesn't try to establish that the serialized object belongs to a specific class. The requirement for an object to be serialized as a particular message is that it responds to very broadly defined protocols. In particular, in order for an object to be serialized as a message, it needs to be iterable, and the iteration must provide a mapping between field values and their desired keys.
Similarly, if a field is encoded as `repeated`, its value must implement the sequence protocol. If it maps to a `map` field, it must implement the mapping protocol. Atomic values must implement the corresponding protocols too: if a field is to be encoded as `int32`, then its value must be an integer, an instance of a class inheriting from `int`, or an object that defines an `__int__()` method. If it is meant to be encoded as a string, it must define `__str__()`, and so on.
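As a hedged illustration of these protocols, here is what a plain domain object might look like; the classes, their fields, and the assumption that iteration yields `(field name, value)` pairs are all mine, not taken from `protopy`'s documentation:

```python
class Point:
    """A plain domain object that follows the message protocol:
    iterating over it yields (field name, value) pairs."""

    def __init__(self, x, y, label):
        self.x = x
        self.y = y
        self.label = label

    def __iter__(self):
        # The names are expected to match the fields of the target message.
        yield 'x', self.x
        yield 'y', self.y
        yield 'label', self.label


class Celsius:
    """Usable wherever an int32 field is expected, via __int__()."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __int__(self):
        return int(self.degrees)
```

An object shaped like this can, in principle, be handed to the serializer directly; there is no need to first copy it into a dedicated message class.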
This also means that, unlike in the `protobuf` package, there are no special message classes implementing `MessageToString()` or `MessageFromString()` methods and the like. I believe that this is a bad way to go about serialization: it encourages boilerplate code that translates domain objects into objects whose only purpose is to be accepted by the serializer before being discarded. This is just wasteful.
Unlike the `protobuf` package, `protopy` can serialize atomic values such as integers and strings. This removes the need for proprietary extensions such as `google.protobuf.wrappers.StringValue`.
Sometimes you may encounter Protobuf messages with special serialization behavior. For example, the `StringValue` mentioned above is intended to map to `str` instances in Python. `protopy` doesn't do this translation automatically. Instead, it allows you to substitute known message types with your own constructors.
For example, you could do this:
```python
content = ...

parser = BinParser(['.'])
result = parser.parse('test.proto', 'Wrapper', content)

original = parser.def_parser.find_definition(b'Wrapper')


class Replacement:
    def __init__(self, original):
        self.original = original

    def replacement(self, *args):
        return {'replaced': self.original(*args)}


rp = Replacement(original)
parser.def_parser.update_definition(b'Wrapper', rp.replacement)

result = parser.parse('test.proto', 'Wrapper', content)
assert result['replaced'].wrapped_type == 'Wrapped'
```
However, you should be extremely careful when doing this. The above example will prevent you from serializing the resulting message back into its binary form.
Python enumerations are very complex and slow, yet enumerations are one of the basic building blocks of Protobuf definitions. `DefParser` allows you to customize the way enumerations are instantiated when a binary payload is read. In particular, there is a helper procedure, `protopy.parser.simple_enum`, that you can substitute in place of the default enumeration factory in order to speed up parsing. Note that this function will make the resulting enumeration values impossible to serialize back.
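A hedged sketch of what that substitution might look like, assuming it goes through the same `update_definition` mechanism as the wrapper example above (the actual hook `DefParser` exposes for enumeration factories may well be different), with a hypothetical enum `Color` and message `Palette` defined in `test.proto`:

```python
from protopy.parser import BinParser, simple_enum  # import paths assumed

parser = BinParser(['.'])

# Substitute the faster helper for the default enumeration factory.
# Enum values produced this way cannot be serialized back.
parser.def_parser.update_definition(b'Color', simple_enum)

payload = b''  # placeholder for bytes received over the wire
result = parser.parse('test.proto', 'Palette', payload)
```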
I've tried some other projects claiming to implement some degree of the Protobuf format. Here's what I found:
- pyrobuf: Doesn't support maps. I wasn't able to verify their performance claims. Documentation does not exist.
- protocyt: Doesn't support block comments; I couldn't really get past this error. Documentation does not exist.
- protobuf-gen: Actually just a tool for fixing a few problems with the Python definitions generated by Google's compiler.
- mypy-protobuf: Actually a plugin for `protoc`; I'm not sure about its purpose.
- protobuf-py3: An obsolete fork of Google's Protobuf.
- pystream-protobuf: A wrapper for Google's Protobuf.
- python3-protobuf: Just another name for the `protobuf` package; they point to the same source code, but this one doesn't have a lot of Google's proprietary extensions.
- protol: A tool for automating Python stub generation using `protoc`.
Many parts of the code are still of prototype quality; however, the interface is more or less stable.
- [X] Memory (de)allocation needs to:
  - be done from APR pools,
  - use more fine-grained pools.
- [ ] Naming needs work; some names use inconsistent conventions.
- [ ] Const correctness: a lot of code lacks it.
- [X] Revise how arguments are supplied to message constructors; maybe we can shave some fat there by creating a tuple right away rather than collecting the arguments into a hash table and then into a tuple.
- [X] Rewrite `setup.py` so that it also builds the lexer and the parser (maybe conditionally), then get rid of `main.c` and a few more junk files in `lib`.
- [X] A few more exotic types need testing: very long varints and floats; I think they don't parse correctly.
- [X] `defparser` is kind of a mess; it can be reorganized and cleaned up a bit.
- [X] ints in `list` could be encoded into pointers instead of allocating extra memory.
- [X] `cons` may have an alternative version which doesn't allocate more memory but uses the `value` as is. Irrelevant since using APR pools.
- [X] Some code in `protopy.y` never releases memory / could allocate less. Irrelevant since using APR pools.
- [ ] The serializer needs work; a lot of the functionality there repeats and could be consolidated.
- [ ] There seems to be a bug with the scheduling of file parsing: somehow very few threads get scheduled when reading files in bulk.
Keep this number low:

```sh
find ./protopy \( -name '*.py' -or -name '*.[chyl]' \) -exec wc -l {} +
```

Right now it's 11315; I would like to get it under 10K.
Finally, I was able to put this project under a free license.

This project is licensed under the LGPL v3. It relies on the Apache Portable Runtime library, which is licensed under the Apache 2.0 license (the license text can be found under the `lib` directory).