
proto: add streaming APIs for Unmarshal/Marshal #507

Closed
ghost opened this issue Feb 2, 2018 · 13 comments

@ghost

ghost commented Feb 2, 2018

Hi, I'm not a Go expert, so please excuse me if there is another way to accomplish my objective.

I'm mainly looking for a way to use the Unmarshal method with an io.Reader as input, specifically the io.ReadCloser wrapped in the Response type of the net/http library.

My problem is that the current Unmarshal method of the proto library only receives bytes, so I have to use ioutil.ReadAll(...) to get them, which allocates unnecessary memory; this happens because there is no way to know the body content length for sure. Even if I get the content length from net/http, the API warns that it can't be relied on at runtime.

I already know there are HTTP libraries based on slices that avoid this memory allocation, like fasthttp, but they don't implement the net/http standard, so I can't use them.

Could the proto library implement Unmarshal/Marshal methods that receive an io.Reader instead of bytes, with the same, or better, performance than the Unmarshal methods of encoding/json or encoding/xml?

Thanks for your patience and time.

@ghost ghost changed the title Generate Unmarshall/Marshall from Reader for Messages Create Unmarshall/Marshall from Reader for Messages Feb 2, 2018
@dsnet
Member

dsnet commented Feb 2, 2018

Hi,

Could the proto library implement Unmarshal/Marshal methods that receive an io.Reader

It is certainly possible to make the proto API support io.Reader and io.Writer, but it's not actually going to help performance, since the question of when some underlying []byte is allocated is simply shifted into the proto package. For this reason, when it comes to performance, it is generally more flexible for the API to deal directly with []byte.

In order to achieve better performance, ioutil.ReadAll is not a great choice as you mentioned since it always allocates. I recommend that you create a cached pool of buffers on the server side to read the stream into. You could use sync.Pool or a buffered chan []byte to act as the pool. If you do choose to do buffer caching, you should make sure there is an upper limit on the size of buffers that may be returned to the pool (see golang/go#23199).
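
A minimal sketch of such a pool (illustrative only, assuming the github.com/golang/protobuf/proto package; bufPool, maxPooledSize, and readMessage are made-up names, not part of any library):

import (
    "bytes"
    "io"
    "sync"

    "github.com/golang/protobuf/proto"
)

const maxPooledSize = 1 << 20 // don't return buffers larger than 1 MiB to the pool

var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

// readMessage reads the whole body into a pooled buffer and unmarshals it,
// so the backing array is reused across calls instead of being reallocated.
func readMessage(body io.Reader, m proto.Message) error {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer func() {
        if buf.Cap() <= maxPooledSize { // cap pooled buffer sizes (see golang/go#23199)
            bufPool.Put(buf)
        }
    }()
    if _, err := buf.ReadFrom(body); err != nil {
        return err
    }
    return proto.Unmarshal(buf.Bytes(), m)
}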

@dsnet dsnet closed this as completed Feb 2, 2018
@dsnet dsnet added the question label Feb 2, 2018
@dsnet dsnet changed the title Create Unmarshall/Marshall from Reader for Messages proto: add streaming APIs for Unmarshal/Marshal Feb 2, 2018
@joscarsson

@dsnet: I'm curious about your reasoning here - if I understand you correctly, the performance you're talking about concerns small (kilobytes) protobuf messages where the number of malloc equivalent calls is the bottleneck? And the cached pool of buffers would eliminate the allocation/deallocation of the memory?

I'm thinking that for applications that handle larger protobuf messages, API support for io.Reader and io.Writer would provide some benefit. The problem right now is that if I have a .pb file with, say, 100 MB of data, it first needs to be read into memory as a []byte, and then that byte slice is read and parsed by protobuf to create the structs. Before the initial []byte is deallocated, we'd be using at least 2 * 100 MB of memory.

If we could stream data from the file in smaller chunks, we would only need at most 100 MB + X kB of memory allocated, where X is the chunk size. Whether this is better for performance depends on how you define performance, I guess, but it'd at least use less memory.

Perhaps large protobuf messages on file aren't a very common use case, and that's why this API doesn't exist/isn't asked for? Another option is that I have misunderstood something, and that's why it doesn't exist 🙂

@joscarsson

And right after I wrote the above I found https://developers.google.com/protocol-buffers/docs/techniques (read the section "Large data sets"), which states:

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

I guess io.Reader/io.Writer could still be convenient to stream through gzip and such though?

@dsnet
Member

dsnet commented May 22, 2018

An unserialized proto message uses memory proportional to the encoded wire data. Thus, even if the input was an io.Reader, the proto package is still going to read that all into memory in order to deserialize. If we wanted to avoid allocations and re-use buffers, something similar to #609 seems to be more of a step in the right direction.

I guess io.Reader/io.Writer could still be convenient to stream through gzip and such though?

Let's suppose you have a stream of proto messages of type Foo, where all the individual Foo messages together are really large, but each one is fairly small. You could encode and decode the data in such a way that it is compatible with a proto message like this:

message FooList {
    repeated Foo foo = 1;
}

When encoding, this would be written as:

var ms []*foopb.Foo = ...
var b bytes.Buffer // or some other io.Writer
for _, m := range ms {
    b.WriteByte(1<<3 | 2) // write a tag for field 1 with wire type 2 (length-delimited)
    b2, err := proto.Marshal(m)
    if err != nil {
        ... // handle the error
    }
    var b3 [binary.MaxVarintLen64]byte
    b.Write(b3[:binary.PutUvarint(b3[:], uint64(len(b2)))]) // write the length prefix
    b.Write(b2) // write the message data itself
}

Encoding in this form has the benefit that the concatenated []*foopb.Foo is wire-equivalent to the FooList shown above. That being said, you're welcome to use whatever framing format you desire, even if it's as simple as prefixing each message with a fixed uint64.

See https://developers.google.com/protocol-buffers/docs/encoding.
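
For completeness, a minimal sketch of the matching decode loop under the same framing assumption (readFoos is a made-up name; foopb is the placeholder package from the example above):

import (
    "bufio"
    "encoding/binary"
    "io"

    "github.com/golang/protobuf/proto"
)

// readFoos reads tag-and-length-prefixed Foo messages from r until EOF.
func readFoos(r io.Reader) ([]*foopb.Foo, error) {
    br := bufio.NewReader(r)
    var ms []*foopb.Foo
    for {
        // Consume the tag byte (1<<3 | 2) written by the encoder.
        if _, err := br.ReadByte(); err == io.EOF {
            return ms, nil
        } else if err != nil {
            return nil, err
        }
        n, err := binary.ReadUvarint(br) // read the length prefix
        if err != nil {
            return nil, err
        }
        buf := make([]byte, int(n))
        if _, err := io.ReadFull(br, buf); err != nil {
            return nil, err
        }
        m := new(foopb.Foo)
        if err := proto.Unmarshal(buf, m); err != nil {
            return nil, err
        }
        ms = append(ms, m)
    }
}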

@joscarsson

@dsnet: Thanks for taking the time to answer!

Regarding memory allocation, I understand protobuf needs to read everything into memory (after all, it deserializes into structs). I was only reasoning about the intermediate memory used when deserializing a huge message - we first need to read the full message into []byte before even sending it to protobuf, which then allocates the same amount of memory again for the actual structs. If we could deserialize in chunks, i.e.:

  1. Read 50 kB from file.
  2. Pass into protobuf which deserializes only that chunk and creates whatever structs possible from the chunked data.
  3. Read another 50 kB from file.
  4. Again pass 50 kB into protobuf.
  5. ... and so on.

This would mean we wouldn't need to first allocate a []byte for the full message before passing it into protobuf. The only extra memory we would need is the temporary 50 kB chunk. Given the constraint mentioned under "Large data sets", I can see why this isn't a concern though - proto messages are not meant to be several hundred MBs of data.

Thanks for the example regarding repeated vs just encoding the []*Foo array!

@mimoo

mimoo commented Oct 9, 2018

what about reading a stream and knowing when to stop for the unmarshalling to work?

@neild
Contributor

neild commented Oct 11, 2018

The protobuf binary encoding doesn't include any end-of-record marker, so any framing needs to be added externally.

That is, there's no way for the proto decoder to know when to stop. You need to tell it.

@mimoo

mimoo commented Oct 11, 2018

alright that's what I figured, does gRPC prepend a length field or something?

@neild
Contributor

neild commented Oct 11, 2018

I'm not personally familiar with the gRPC protocol, but probably.

@mimoo

mimoo commented Oct 11, 2018

indeed. Thanks!

@jkopczyn

I think this should be reopened. It's a fairly straightforward use-case and would improve usability substantially.

@neild
Contributor

neild commented Mar 16, 2020

I think this should be reopened. It's a fairly straightforward use-case and would improve usability substantially.

See #912.

This is a thing I'd like to do, but getting the design right is tricky and it's a substantial amount of work.

@jkopczyn

jkopczyn commented Mar 16, 2020 via email

@golang golang locked as resolved and limited conversation to collaborators Jun 25, 2020