Replace use of InputStream in transport APIs #99

herbyderby · 2015-02-20T02:39:10Z

The current use of InputStream has a few problems:

it is non-idiomatic to split the responsibility for opening and closing a stream; it also prevents features such as try-with-resources
it essentially forces extra byte copying in order to consume the data

I haven't looked at the implementation, but since the comments say that the InputStream is non-blocking, I assume that it is used instead of byte[] simply to prevent modifications of the data when it is passed to multiple listeners (a reasonable concern).

Since there already appears to be a Guava dependency, the obvious alternative would be to pass a (inherently read-only) ByteSource instead.

(ByteString from protobuf would be another option, but for various reasons ByteSource is probably a better choice)

ejona86 · 2015-02-20T07:40:46Z

Thanks for the suggestion! We tried to get rid of InputStream multiple times, but we couldn't find a better alternative.

TL;DR: although InputStream is not without its downfalls, InputStream turns out to perform at least as well as its opponents.

The InputStream wraps the native byte buffer implementation of the underlying transport, which is generally a collection of memory blocks virtually concatenated that are zero-copy from the network. Because of details such as not being guaranteed to be able to get a byte[] from the block of memory, we can't pass a simple byte[] or byte[][], and may need a destination byte[] to get data out of the buffers.

Our core does not depend on any particular protobuf implementation (the proto package does, and it is in core, but we just haven't moved it out yet). Protobuf's CodedInputStream can consume byte[], ByteBuffer, or InputStream (no public method for ByteString!). Nano proto and Wire only can consume byte[]. So those are really the destination types that matter for us.

byte[] has the disadvantage of requiring a large, contiguous copy of the data from our buffers, and there is really no real magic available to make it perform better. I'll note here that when compression is being used, we don't know the uncompressed length of the message and it is decompressed on-the-fly. ByteBuffer is no better than byte[], because 1) we don't actually have ByteBuffers and 2) because ByteBuffers can't be concatenated as ByteBuffer is a final class without such support.

The current InputStream is read-once; once bytes have been consumed, we release the backing bytes. Read-once does not seem to be a problem. We also ensure that if the InputStream was not closed when the call returned, we close it. If the InputStream is fully consumed, then closing has no effect.

ByteSource is actually fairly reasonable, and has been in Guava long enough to not cause too many problems. The only interesting method I see on ByteSource though for our use case is read() (returning byte[]), as we could optimize the method to size the array correctly for uncompressed data. But, actually, you can do the same optimization using InputStream.available(). However, for compressed data, it will require decompressing into byte[]s, and then copying into a new, perfectly-sized byte[] at the end. A user of InputStream would not need a perfectly-sized byte[], so read() doesn't actually look like a win.

Our byte buffers are generally pooled, so we can't rely on garbage collection to free them. Thus, we need an interface or API that can communicate some sort of life cycle. For this reason alone ByteSource is a weak choice.

Initially, Guava made a ton of sense, but as the project has matured, we are using less and less of it naturally (creating your own Buffer type means you won't find a library to help you, for instance). There also seems to be a steady push to reduce/remove our Guava dependency from outside the core team. At this point, the only API that will have Guava types is the Future Stub (for ListenableFuture and similar). Guava is still used internally many places, but having it as part of an API would need a convincing reason.

You made two points noting the issues with InputStream:

it is non-idiomatic to split the responsibility for opening and closing a stream; it also prevents features such as try-with-resources
it essentially forces extra byte copying in order to consume the data

I agree with both of these points. However, ByteSource skirts around #1 by not closing our resource (the backing bytes), which doesn't really help us much. Also, in practice, although a bit icky, #1 does not seem like it will be a problem. ByteSource is no better for #2, as the copies still have to occur.

maniksurtani · 2015-02-20T10:22:34Z

I agree with not leaking Guava into the API. Preferably implementation as
well, but as you say, that's less important right now.

On 20 February 2015 at 07:40, Eric Anderson notifications@github.com
wrote:

Thanks for the suggestion! We tried to get rid of InputStream multiple
times, but we couldn't find a better alternative.

TL;DR: although InputStream is not without its downfalls, InputStream
turns out to perform at least as well as its opponents.

The InputStream wraps the native byte buffer implementation of the
underlying transport, which is generally a collection of memory blocks
virtually concatenated that are zero-copy from the network. Because of
details such as not being guaranteed to be able to get a byte[] from the
block of memory
http://docs.oracle.com/javase/7/docs/api/java/nio/ByteBuffer.html#hasArray(),
we can't pass a simple byte[] or byte[][], and may need a destination
byte[] to get data out of the buffers.

Our core does not depend on any particular protobuf implementation (the
proto package does, and it is in core, but we just haven't moved it out
yet). Protobuf's CodedInputStream can consume byte[], ByteBuffer, or
InputStream (no public method for ByteString!). Nano proto and Wire only
can consume byte[]. So those are really the destination types that matter
for us.

byte[] has the disadvantage of requiring a large, contiguous copy of the
data from our buffers, and there is really no real magic available to make
it perform better. I'll note here that when compression is being used, we
don't know the uncompressed length of the message and it is decompressed
on-the-fly. ByteBuffer is no better than byte[], because 1) we don't
actually have ByteBuffers and 2) because ByteBuffers can't be concatenated
as ByteBuffer is a final class without such support.

The current InputStream is read-once; once bytes have been consumed, we
release the backing bytes. Read-once does not seem to be a problem. We also
ensure that if the InputStream was not closed when the call returned, we
close it. If the InputStream is fully consumed, then closing has no effect.

ByteSource is actually fairly reasonable, and has been in Guava long
enough to not cause too many problems. The only interesting method I see on
ByteSource though for our use case is read() (returning byte[]), as we
could optimize the method to size the array correctly for uncompressed
data. But, actually, you can do the same optimization using InputStream.
However, for compressed data, it will require decompressing into byte[]s,
and then copying into a new, perfectly-sized byte[] at the end. A user of
InputStream would not need a perfectly-sized byte[], so read() doesn't
actually look like a win.

Our byte buffers are generally pooled, so we can't rely on garbage
collection to free them. Thus, we need an interface or API that can
communicate some sort of life cycle. For this reason alone ByteSource is a
weak choice.

Initially, Guava made a ton of sense, but as the project has matured,
we are using less and less of it naturally (creating your own Buffer type
means you won't find a library to help you, for instance). There also seems
to be a steady push to reduce/remove our Guava dependency from outside the
core team. At this point, the only API that will have Guava types is the
Future Stub (for ListenableFuture and similar). Guava is still used
internally many places, but having it as part of an API would need a
convincing reason.

You made two points noting the issues with InputStream:

it is non-idiomatic to split the responsibility for opening and
closing a stream; it also prevents features such as try-with-resources

it essentially forces extra byte copying in order to consume the
data

I agree with both of these points. However, ByteSource skirts around #1
#1 by not closing our resource
(the backing bytes), which doesn't really help us much. Also, in practice,
although a bit icky, #1 #1 does
not seem like it will be a problem. ByteSource is no better for #2
#2, as the copies still have to
occur.

—
Reply to this email directly or view it on GitHub
#99 (comment).

herbyderby · 2015-02-20T16:16:23Z

Thanks for the information. Can you explain why the stream needs to be read-once? If that is not a hard requirement, then even your own ByteSource-like interface would be an improvement. e.g.:

interface Payload {
  InputStream openStream() throws IOException;
}

try-with-resources will work
the same method will both open and close the stream
would make it trivial for Guava users to adapt to ByteSource if they want

The underlying resources could still be released as soon as the method returns.

Even if you stick with InputStream, it is strange to document that the stream "must be closed by the listener" if that is not actually true. Also, it should probably say something about the stream not being usable after the method returns.

ejona86 · 2015-02-20T17:24:58Z

Transport is really an internal-only set of interfaces; normal users should have no reason to use them. Thus, in some ways we don't care how ugly they look as long as the get the job done. There are very few implementations of StreamListener: basically just ChannelImpl and ServerImpl.

I was actually a bit wrong about closing the InputStream when the method returns. We do make sure to close it, but from the implementation in Channel/Server, not when calling messageRead(). And so, the documentation is actually correct; full ownership has been passed, as we expect it to be consumed on a different thread.

So we must have some way to tell the transport that we are done with the bytes. ByteSource doesn't fit the bill, but "interface Payload" that was Closeable would. Note that in this case, "closing the InputStream" would do nothing, so we aren't any less likely to have bugs than before as now we are passing ownership with Payload instead of InputStream.

Try-with-resources is not useful to us in the near future, as it isn't available on Android and ChannelImpl is used on Android. Guava users aren't negatively impacted by the current API because they won't be using transport directly; instead, they will be using Channel/Call which provides a demarshalled object via a Marshaller.

If you look at Marshaller, you can see a ByteSource TODO there. For retries it could be nice to have a ByteSource-like object. It may not, though. The serialized representation is likely more compact in memory, and so may be better to retain while allowing the object to be GCed. However, in either case the library would consume the ByteSource; there would still be no benefit to Guava users.

herbyderby · 2015-02-27T02:13:41Z

Thanks for the background, that explanation makes more sense.

herbyderby closed this as completed Feb 27, 2015

lock bot locked as resolved and limited conversation to collaborators Sep 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace use of InputStream in transport APIs #99

Replace use of InputStream in transport APIs #99

herbyderby commented Feb 20, 2015

ejona86 commented Feb 20, 2015

maniksurtani commented Feb 20, 2015

herbyderby commented Feb 20, 2015

ejona86 commented Feb 20, 2015

herbyderby commented Feb 27, 2015

Replace use of InputStream in transport APIs #99

Replace use of InputStream in transport APIs #99

Comments

herbyderby commented Feb 20, 2015

ejona86 commented Feb 20, 2015

maniksurtani commented Feb 20, 2015

herbyderby commented Feb 20, 2015

ejona86 commented Feb 20, 2015

herbyderby commented Feb 27, 2015