
proto: add streaming APIs for Unmarshal/Marshal #507

Closed
ghost opened this issue Feb 2, 2018 · 13 comments

@ghost

ghost commented Feb 2, 2018

Hi, I'm not a Go expert, so please excuse me if there is another way to accomplish my objective.

I'm mainly looking for a way to use the Unmarshal method with an io.Reader as input, specifically the io.ReadCloser wrapped in the Response type of the net/http library.

My problem is that the current Unmarshal method of the proto library only receives bytes, so I have to use ioutil.ReadAll(...) to get them, which allocates unnecessary memory; this happens because there is no way to know the body content length for sure. Even if I get the content length from net/http, the API warns that it can't be relied on at runtime.

I already know there are HTTP libraries based on slices that avoid this memory allocation, like fasthttp, but they don't implement the net/http standard, so I can't use them.

Could the proto library implement Unmarshal/Marshal methods that receive an io.Reader instead of bytes, with the same, or better, performance than the Unmarshal methods of encoding/json or encoding/xml?

Thanks for your patience and time.

@ghost ghost changed the title Generate Unmarshall/Marshall from Reader for Messages Create Unmarshall/Marshall from Reader for Messages Feb 2, 2018
@dsnet
Member

dsnet commented Feb 2, 2018

Hi,

Could the proto library implement Unmarshal/Marshal methods that receive an io.Reader

It is certainly possible to make the proto API support io.Reader and io.Writer, but it's not actually going to help performance, since the question of when some underlying []byte is allocated is simply shifted into the proto package. For this reason, when it comes to performance, it is generally more flexible for the API to deal directly with []byte.

In order to achieve better performance, ioutil.ReadAll is not a great choice as you mentioned since it always allocates. I recommend that you create a cached pool of buffers on the server side to read the stream into. You could use sync.Pool or a buffered chan []byte to act as the pool. If you do choose to do buffer caching, you should make sure there is an upper limit on the size of buffers that may be returned to the pool (see golang/go#23199).
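
A minimal sketch of such a pool (illustrative only, assuming the github.com/golang/protobuf/proto package; bufPool, maxPooledSize, and readMessage are made-up names, not part of any library):

import (
    "bytes"
    "io"
    "sync"

    "github.com/golang/protobuf/proto"
)

const maxPooledSize = 1 << 20 // don't return buffers larger than 1 MiB to the pool

var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

// readMessage reads the whole body into a pooled buffer and unmarshals it,
// so the backing array is reused across calls instead of being reallocated.
func readMessage(body io.Reader, m proto.Message) error {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer func() {
        if buf.Cap() <= maxPooledSize { // cap pooled buffer sizes (see golang/go#23199)
            bufPool.Put(buf)
        }
    }()
    if _, err := buf.ReadFrom(body); err != nil {
        return err
    }
    return proto.Unmarshal(buf.Bytes(), m)
}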

@dsnet dsnet closed this as completed Feb 2, 2018
@dsnet dsnet added the question label Feb 2, 2018
@dsnet dsnet changed the title Create Unmarshall/Marshall from Reader for Messages proto: add streaming APIs for Unmarshal/Marshal Feb 2, 2018
@joscarsson

@dsnet: I'm curious about your reasoning here - if I understand you correctly, the performance you're talking about concerns small (kilobytes) protobuf messages where the number of malloc equivalent calls is the bottleneck? And the cached pool of buffers would eliminate the allocation/deallocation of the memory?

I'm thinking that for applications that handle larger protobuf messages, API support for io.Reader and io.Writer would provide some benefit. The problem right now is that if I have a .pb file with, say, 100 MB of data, it first needs to be read into memory as a []byte, and then that byte slice is read and parsed by protobuf to create the structs. Before the initial []byte is deallocated, we'd be using at least 2 * 100 MB of memory.

If we could stream data from the file in smaller chunks, we would only need at most 100 MB + X kB of memory allocated, where X is the chunk size. Whether this is better for performance depends on how you define performance, I guess, but it'd at least use less memory.

Perhaps large protobuf messages on file aren't a very common use case, and that's why this API doesn't exist/isn't asked for? Another option is that I have misunderstood something, and that's why it doesn't exist 🙂

@joscarsson

And right after I wrote the above I found https://developers.google.com/protocol-buffers/docs/techniques (read the section "Large data sets"), which states:

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

I guess io.Reader/io.Writer could still be convenient to stream through gzip and such though?

@dsnet
Member

dsnet commented May 22, 2018

An unserialized proto message uses memory proportional to the encoded wire data. Thus, even if the input was an io.Reader, the proto package is still going to read that all into memory in order to deserialize. If we wanted to avoid allocations and re-use buffers, something similar to #609 seems to be more of a step in the right direction.

I guess io.Reader/io.Writer could still be convenient to stream through gzip and such though?

Let's suppose you have a stream of proto messages of type Foo, where all the individual Foo messages together are really large, but each one is fairly small. You could encode and decode the data in such a way that it is compatible with a proto message like this:

message FooList {
    repeated Foo foo = 1;
}

When encoding, this would be written as:

var ms []*foopb.Foo = ...
var b bytes.Buffer // or some other io.Writer
for _, m := range ms {
    b.WriteByte(1<<3 | 2) // write a tag for field 1 with wire type 2 (length-delimited)
    b2, err := proto.Marshal(m)
    if err != nil {
        ... // handle the error
    }
    var b3 [binary.MaxVarintLen64]byte
    b.Write(b3[:binary.PutUvarint(b3[:], uint64(len(b2)))]) // write the length prefix
    b.Write(b2) // write the message data itself
}

Encoding in this form has the benefit that the concatenated []*foopb.Foo is wire-equivalent to the FooList shown above. That being said, you're welcome to use whatever framing format you desire, even if it's as simple as prefixing each message with a fixed uint64.

See https://developers.google.com/protocol-buffers/docs/encoding.
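
For completeness, a minimal sketch of the matching decode loop under the same framing assumption (readFoos is a made-up name; foopb is the placeholder package from the example above):

import (
    "bufio"
    "encoding/binary"
    "io"

    "github.com/golang/protobuf/proto"
)

// readFoos reads tag-and-length-prefixed Foo messages from r until EOF.
func readFoos(r io.Reader) ([]*foopb.Foo, error) {
    br := bufio.NewReader(r)
    var ms []*foopb.Foo
    for {
        // Consume the tag byte (1<<3 | 2) written by the encoder.
        if _, err := br.ReadByte(); err == io.EOF {
            return ms, nil
        } else if err != nil {
            return nil, err
        }
        n, err := binary.ReadUvarint(br) // read the length prefix
        if err != nil {
            return nil, err
        }
        buf := make([]byte, int(n))
        if _, err := io.ReadFull(br, buf); err != nil {
            return nil, err
        }
        m := new(foopb.Foo)
        if err := proto.Unmarshal(buf, m); err != nil {
            return nil, err
        }
        ms = append(ms, m)
    }
}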

@joscarsson

@dsnet: Thanks for taking the time to answer!

Regarding memory allocation, I understand protobuf needs to read everything into memory (after all, it deserializes into structs). I was only reasoning about the intermediate memory used when deserializing a huge message - we first need to read the full message into []byte before even sending it to protobuf, which then allocates the same amount of memory again for the actual structs. If we could deserialize in chunks, i.e.:

  1. Read 50 kB from file.
  2. Pass into protobuf which deserializes only that chunk and creates whatever structs possible from the chunked data.
  3. Read another 50 kB from file.
  4. Again pass 50 kB into protobuf.
  5. ... and so on.

This would mean we wouldn't need to first allocate a []byte for the full message before passing it into protobuf. The only extra memory we would need is the temporary 50 kB chunk. Given the constraint mentioned under "Large data sets", I can see why this isn't a concern though - proto messages are not meant to be several hundred MBs of data.

Thanks for the example regarding repeated vs just encoding the []*Foo array!

@mimoo

mimoo commented Oct 9, 2018

what about reading a stream and knowing when to stop for the unmarshalling to work?

@neild
Contributor

neild commented Oct 11, 2018

The protobuf binary encoding doesn't include any end-of-record marker, so any framing needs to be added externally.

That is, there's no way for the proto decoder to know when to stop. You need to tell it.

@mimoo

mimoo commented Oct 11, 2018

alright that's what I figured, does gRPC prepend a length field or something?

@neild
Contributor

neild commented Oct 11, 2018

I'm not personally familiar with the gRPC protocol, but probably.

@mimoo

mimoo commented Oct 11, 2018

indeed. Thanks!

@jkopczyn

I think this should be reopened. It's a fairly straightforward use-case and would improve usability substantially.

@neild
Contributor

neild commented Mar 16, 2020

I think this should be reopened. It's a fairly straightforward use-case and would improve usability substantially.

See #912.

This is a thing I'd like to do, but getting the design right is tricky and it's a substantial amount of work.

@jkopczyn

jkopczyn commented Mar 16, 2020 via email

@golang golang locked as resolved and limited conversation to collaborators Jun 25, 2020