Cloud storage v2 API, writeObject missing header x-goog-request-params for client streaming call #2121
Unfortunately, what you are running into is working as intended. The reason for this: when a ClientStream is opened, there is no guarantee of a request message from which the BucketName can be sourced. For a unary or server-streaming call, a request message must be provided before the request is initiated.

Having spent a significant amount of time integrating with the new pre-release GCS gRPC API, I strongly recommend against using StorageClient directly. If you would like to try out a preview implementation (all pre-GA disclaimers present), see the code snippet showing how to construct an instance that uses the gRPC transport in java-storage/samples/snippets/src/main/java/com/example/storage/QuickstartGrpcSample.java (lines 29 to 42 in c805051).
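In case the snippet link doesn't render here, the construction it demonstrates is essentially the following (a sketch; "my-bucket" is a placeholder):

```java
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

final class QuickstartGrpc {
  public static void main(String[] args) throws Exception {
    // Build a Storage instance backed by the gRPC transport instead of the
    // default JSON-over-HTTP transport.
    try (Storage storage = StorageOptions.grpc().build().getService()) {
      // The Storage interface is the same regardless of transport.
      System.out.println(storage.get("my-bucket"));
    }
  }
}
```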
Particularly when it comes to the media operations on objects (reading [1][2], writing [3][4]), the generated StorageClient is very rough to use from a Java IO perspective, whereas the channel-based methods on Storage are not. One large advantage of using the above code sample today is that you can choose which transport you want at construction time, and the code my team and I have written will take care of the details of what needs to happen for JSON over HTTP or for gRPC.

[1] https://cloud.google.com/storage/docs/streaming-uploads#storage-stream-upload-object-java
Hi Ben - thanks for the quick response. I looked at the old API for quite a while. We have one place, our data service, where we really want to use async. Using the old API would require a lot of extra plumbing and work-arounds to make that work. We were about to do this plumbing when I saw the new API appear in the SDK, and we decided to wait for it, because it solves a lot of problems for us.

Specifically, we have a generic data service where we want to run async, un-cached data transfers as streams, in both directions. As far as I can tell, all the options using the old API either use blocking calls or require a physical scratch location. Have I missed something? In the data service we're sending moderately large (few GB) datasets and doing format translation as they pass through (e.g. CSV -> Arrow -> Parquet, then Parquet -> Arrow -> JSON), so we really want to process chunk-by-chunk. Depending on the format, we might not know how big the dataset is, when a chunk will end, etc. We already have a framework that handles all the streaming, conversion, etc. for regular gRPC, so using StorageClient is pretty easy for us. Using the old Storage API, we'd need to synthesise a lot of this behaviour using worker pools and callbacks, which can be done, but it will create a lot of extra scaffolding and potential for errors.

With the streaming data pattern our data service is really lightweight. We can run it in a small container with no extra resources or attached volumes. We can completely avoid doing thread-per-request, blocking, buffering, etc. For all our other components we can get away with a nice simple thread-per-request model, but our data transfer service is where we really want the streaming functionality.

I can see lower down the stack in the SDK that the async capabilities are there, but they are all masked from client code in the Storage API; presumably this is by design? Is there any way to get streaming / async functionality using the regular Storage API?
Thank you for the additional detail. Are you having multiple things write to the same object concurrently?

When interacting with GCS, almost all operations are synchronous (i.e. listing objects proceeds a page at a time, using the token from each page to fetch the next one; writing to an object has head-of-line blocking; etc.). When using the WriteChannel returned by Storage#writer, the flush of buffered bytes to GCS is offloaded internally, so calling code is not blocked on each network round trip.

Encoding composition is also possible with the Read and Write channels we return. I'm not familiar with Parquet or Arrow's APIs, so I'll stick to something I do have experience with: compression. It's a contrived example, but it illustrates the composition pattern. Imagine you need to read an object, compress it, and write a base64-encoded representation of the compressed bytes to a new object.

```java
import com.google.cloud.ReadChannel;
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.BlobWriteOption;
import com.google.cloud.storage.StorageOptions;
import com.google.common.io.ByteStreams;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.util.zip.GZIPOutputStream;
import org.apache.commons.codec.binary.Base64OutputStream;
final class Compression {
public static void main(String[] args) throws Exception {
try (Storage storage = StorageOptions.grpc().build().getService()) {
BlobId readFrom = BlobId.of("bucket", "object-in");
BlobInfo writeTo = BlobInfo.newBuilder(BlobId.of("bucket", "object-out")).build();
try (
// 1. Open a read channel for the object to read from
ReadChannel readChannel = storage.reader(readFrom);
// 2. Make an InputStream from the channel so its bytes can be copied
InputStream inputStream = Channels.newInputStream(readChannel);
// 3. Open a write channel for the object to write to
WriteChannel writeChannel = storage.writer(writeTo, BlobWriteOption.doesNotExist());
// 4. Make a stream from the channel since gzip is an OutputStream
OutputStream writeStream = Channels.newOutputStream(writeChannel);
// 5. Wire in the base64 encoded representation
Base64OutputStream b64OutputStream = new Base64OutputStream(writeStream);
// 6. Create the gzip compression stream
GZIPOutputStream gzip = new GZIPOutputStream(b64OutputStream)
) {
// 7. define the acceptable chunk size that can be buffered into memory and can be written
// in a single request
writeChannel.setChunkSize(4 * 1024 * 1024);
// use a utility to copy all the bytes
ByteStreams.copy(inputStream, gzip);
}
}
}
}
```

All buffering in the library is currently exclusively in RAM, and can be fully controlled. The default is 16 MiB for the writer and 2 MiB for the reader (though in the case of the reader you can disable buffering altogether by setting chunkSize to 0). In the future we will offer the option to use disk for buffering instead of RAM, but that is not yet available.

For the simple unary calls, our existing API does not provide any sort of ApiFuture today, because we are presenting a single API for both transports. For the object read and write paths there isn't any async; in the places where we can, there is non-blocking IO, while in others there is backpressure due to the semantics of dealing with media in GCS. We try pretty much everything in our power to ensure that there isn't a library-user-visible transient error.

As for weight/impact to the running process: I've run benchmarks against the existing JSON implementation.
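For reference, the buffer controls mentioned above are set on the channels themselves (a sketch; bucket/object names are placeholders, and the setChunkSize calls are the same ones used in the compression example):

```java
// Read side: chunkSize 0 disables read buffering entirely.
ReadChannel reader = storage.reader(BlobId.of("bucket", "object-in"));
reader.setChunkSize(0);

// Write side: buffer 4 MiB in RAM per request instead of the 16 MiB default.
WriteChannel writer =
    storage.writer(BlobInfo.newBuilder(BlobId.of("bucket", "object-out")).build());
writer.setChunkSize(4 * 1024 * 1024);
```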
Hi Ben, thanks for all the extra info. In summary: the "Storage" class is a higher-level API which hides a lot of details from client code (buffering, retries, API transport, etc.) and is meant to "just work"; it is only available with a blocking API (although some write operations are offloaded internally). The "StorageClient" is a lower-level API which is more like a thin binding on the raw gRPC endpoints; it is available with blocking, future, or streaming APIs, as is standard for gRPC. Also, "StorageClient" and the underlying gRPC API are still in public beta. Is this essentially the difference? Most people will want the former, and it should be the default. However, in our use case I think we do want the latter (reasons below).

Regarding the original issue in this ticket: is it not possible to return a specialised delegate from writeObjectCallable which uses an interceptor to set the required metadata when the first message is sent (see the sketch at the end of this comment)? Per my understanding, the first message should always contain the object spec; otherwise the API will return an error anyway. If that's not possible, would it be worth adding a quick note in the method documentation on how to set the missing header? Just to close this ticket off.

Regarding backpressure on the gRPC interface: in regular gRPC you have ClientCallStreamObserver and ServerCallStreamObserver to manage this; we are already using ServerCallStreamObserver on our public APIs. However, it seems the GAX layer hides this away, even though there is a ClientCallStreamObserver just a couple of layers down. I'm guessing this needs to be raised as an issue in https://github.com/googleapis/sdk-platform-java rather than here?
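Here is the kind of interceptor I mean (a sketch only; note the params value here is pre-computed at construction rather than derived from the first message, since headers go out when the stream opens):

```java
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ForwardingClientCall.SimpleForwardingClientCall;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;

final class RequestParamsInterceptor implements ClientInterceptor {
  private static final Metadata.Key<String> REQUEST_PARAMS =
      Metadata.Key.of("x-goog-request-params", Metadata.ASCII_STRING_MARSHALLER);

  // e.g. "bucket=" + url-encoded "projects/_/buckets/my-bucket"
  private final String params;

  RequestParamsInterceptor(String params) {
    this.params = params;
  }

  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    return new SimpleForwardingClientCall<ReqT, RespT>(next.newCall(method, callOptions)) {
      @Override
      public void start(Listener<RespT> responseListener, Metadata headers) {
        // Headers are sent when the call starts (i.e. when the stream is
        // opened), before any request message is available.
        headers.put(REQUEST_PARAMS, params);
        super.start(responseListener, headers);
      }
    };
  }
}
```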
On why we want to use the StorageClient bindings. Probably not hugely relevant to the issue but might be interesting! There are a few reasons.
Some of these points are certainly debatable! Still, I think the event-based pattern is probably the right choice for our data component. For every other component we've done thread-per-request, and that's fine. Using event-based streaming for our data service is blazing fast and runs with very few resources. Maybe it's possible to build a blocking version that has good performance, but I think it would be a lot harder to get right, and it certainly wouldn't be better than what we have; plus it would be a hell of a lot of work. So I guess the conclusion is: we really want the async APIs! And gRPC is the absolute best for us, because we're already using it.
Thanks for the additional context and clarification. A few comments:
This is correct.
The StorageClient is an automatically generated client that provides additional features on top of what you would get from the raw gRPC stubs. Docs, methods, and proto messages are all derived from the service's proto definition.
The StorageClient is actually tagged as beta. The underlying GCS gRPC API, however, is still in private preview, so it is pre-alpha and has no SLA associated with its use at this time. Until such time as the GCS gRPC API is GA, your use of it will only receive best-effort support if an issue were to arise.
The metadata/header value must be present when the client stream is opened, irrespective of any message queued. Opening a client stream is an independent action from sending a message on the stream; the client stream could be opened and not have any message on it for some time.
When the GCS gRPC API enters its public preview phase, there will be API docs stating the constraints. These are not yet published, as the API is still in private preview.
Gax internally manages the underlying ClientCallStreamObserver. If you did need the full gRPC level of control, the raw gRPC stubs are available; however, you would lose all the conveniences of StorageClient and shift to manual channel management, request parameters for all requests, custom retries, etc.
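For illustration, dropping to the raw generated stubs looks roughly like this (a sketch, with channel settings and error handling reduced to a minimum):

```java
import com.google.auth.oauth2.GoogleCredentials;
import com.google.storage.v2.StorageGrpc;
import com.google.storage.v2.WriteObjectRequest;
import com.google.storage.v2.WriteObjectResponse;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.auth.MoreCallCredentials;
import io.grpc.stub.StreamObserver;
import java.io.IOException;

final class RawStubSketch {
  static StreamObserver<WriteObjectRequest> openWriteStream(
      StreamObserver<WriteObjectResponse> responseObserver) throws IOException {
    // Manual channel management: you own the channel lifecycle now.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("storage.googleapis.com", 443).build();
    StorageGrpc.StorageStub stub =
        StorageGrpc.newStub(channel)
            .withCallCredentials(
                MoreCallCredentials.from(GoogleCredentials.getApplicationDefault()));
    // Full gRPC-level control (e.g. flow control via ClientCallStreamObserver,
    // by passing a ClientResponseObserver as the response observer), but the
    // request-params header, retries, hashing, etc. are all on you.
    return stub.writeObject(responseObserver);
  }
}
```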
Thanks Ben, that is all clear and very helpful. Just one more question, if I can. Do you have any plans to release an async version of the higher-level "Storage" API, or async / streaming overloads for key methods? Both AWS and Azure do this by providing an alternative client interface (S3Client / S3AsyncClient and BlobClient / BlobAsyncClient). I'm sure the blocking / thread-per-request model is what most people want, but there are cases where event-based is preferable, for example middleware or platform products like ours, and especially in the storage / data space.
The idea of an async-centric API has been floated before, but we don't have an ETA at this time for when/how/if it will happen for the library.
The Google Cloud Storage product team plans to offer full support for use of the gRPC API once it is GA. Closing this issue at this time, as it does not affect the supported com.google.cloud.storage.Storage API.
Environment details
Steps to reproduce
The issue occurs using StorageClient writeObject as a client streaming call. Unary calls and server streaming calls do not appear to have this issue.
Looking through the SDK code, I found that for unary calls this header gets set to provide the bucket name in contextWithBucketName().
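A minimal sketch of the failing pattern (responseObserver and the bucket/object names here are placeholders, not the original code):

```java
import com.google.api.gax.rpc.ApiStreamObserver;
import com.google.storage.v2.BucketName;
import com.google.storage.v2.StorageClient;
import com.google.storage.v2.WriteObjectRequest;
import com.google.storage.v2.WriteObjectResponse;
import com.google.storage.v2.WriteObjectSpec;

final class ReproSketch {
  static void writeObject(ApiStreamObserver<WriteObjectResponse> responseObserver)
      throws Exception {
    try (StorageClient client = StorageClient.create()) {
      // The client stream is opened here; nothing attaches x-goog-request-params.
      ApiStreamObserver<WriteObjectRequest> requests =
          client.writeObjectCallable().clientStreamingCall(responseObserver);
      // The first message carries the object spec, including the bucket.
      requests.onNext(
          WriteObjectRequest.newBuilder()
              .setWriteObjectSpec(
                  WriteObjectSpec.newBuilder()
                      .setResource(
                          com.google.storage.v2.Object.newBuilder()
                              .setBucket(BucketName.of("_", "my-bucket").toString())
                              .setName("my-object")))
              .build());
      // ... send ChecksummedData messages, set finishWrite on the last one,
      // then requests.onCompleted()
    }
  }
}
```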
Code example
The addMissingRequestParams() method is what I have done to work around the error; it looks like this:
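A hypothetical reconstruction of the idea (not the original code): attach the header via GAX's GrpcCallContext when opening the stream.

```java
import com.google.api.gax.grpc.GrpcCallContext;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

final class WorkaroundSketch {
  // Hypothetical reconstruction: mirrors what contextWithBucketName() does for
  // unary calls by attaching "bucket=<full bucket resource name>" as the
  // x-goog-request-params header value.
  static GrpcCallContext addMissingRequestParams(String fullBucketName) {
    String params = "bucket=" + URLEncoder.encode(fullBucketName, StandardCharsets.UTF_8);
    return GrpcCallContext.createDefault()
        .withExtraHeaders(Map.of("x-goog-request-params", List.of(params)));
  }
}
```

The context is then passed when opening the stream, e.g. client.writeObjectCallable().clientStreamingCall(responseObserver, addMissingRequestParams("projects/_/buckets/my-bucket")).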
With the addMissingRequestParams() method the call succeeds just fine. However, without it we get an error and the stack trace below. I'm guessing this is not the expected behavior, and the storage SDK should add this header in for client streaming calls the same way it does for the others.
Stack trace
Any additional information below
I appreciate these are new APIs and the old APIs are still the recommended ones. Still, the new StorageClient on the v2 gRPC API is mostly working for me, and we really need the client streaming semantics, so we've been watching for when it became available. Hopefully this one is an easy fix!