remote cache ResourceExhausted/OutOfRange errors #20674

Closed
jasonwbarnett opened this issue Mar 14, 2024 · 10 comments · Fixed by #20708

jasonwbarnett (Contributor) commented Mar 14, 2024

Describe the bug

When remote caching is enabled and you use an adhoc_tool that outputs a directory with a very large number of files (e.g. 180K), Pants calls FindMissingBlobs (/build.bazel.remote.execution.v2.ContentAddressableStorage/FindMissingBlobs) with a message that exceeds 4MB, which leads to a ResourceExhausted or OutOfRange error depending on the configuration of the remote caching service. In my case I am using bazel-remote. I started by increasing bazel-remote's MaxRecvMsgSize, which revealed a different error (OutOfRange).

Before increasing MaxRecvMsgSize in bazel-remote

[WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5790951 vs. 4194304)"

This error message is coming from bazel-remote.

After increasing MaxRecvMsgSize in bazel-remote

[WARN] Failed to write to remote cache (1 occurrences so far): OutOfRange: "Error, message length too large: found 5790600 bytes, the limit is: 4194304 bytes"

This error is coming from tonic. I validated this by modifying tonic and updating Pants to use a branch on my tonic fork. I ended up submitting this PR: hyperium/tonic#1658.

I took this one step further: I raised tonic's DEFAULT_MAX_RECV_MESSAGE_SIZE (on the jwb/set-max-message-size-very-high branch) and validated that everything functions correctly afterwards. To be clear, both the client (i.e. Pants via tonic) and the server (i.e. bazel-remote) needed to be modified to support the large FindMissingBlobs message.
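
For a rough sense of scale, here is a back-of-the-envelope sketch. The ~70 bytes per serialized digest is my assumption (a 64-character hex SHA-256 hash string plus the size field and protobuf overhead), not a measured value:

```rust
// Rough estimate of a FindMissingBlobs request size (illustrative only).
// Assumption: each serialized REAPI Digest is ~70 bytes (64-char hex
// SHA-256 hash string + varint size_bytes + protobuf field overhead).
fn main() {
    const APPROX_BYTES_PER_DIGEST: usize = 70; // assumed, not measured
    const GRPC_DEFAULT_MAX: usize = 4 * 1024 * 1024; // 4 MiB default limit

    let num_files = 180_000; // directory output from the adhoc_tool
    let approx_request_size = num_files * APPROX_BYTES_PER_DIGEST;

    println!(
        "~{:.1} MiB vs. the {} MiB limit",
        approx_request_size as f64 / (1024.0 * 1024.0),
        GRPC_DEFAULT_MAX / (1024 * 1024),
    );
    // => ~12.0 MiB vs. the 4 MiB limit, so a single request can't fit.
}
```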

I see at least two possible solutions:

  1. Add a configuration option to Pants which controls tonic's max_decoding_message_size. To be clear, I'm not sure whether there is a relationship between max_decoding_message_size and DEFAULT_MAX_RECV_MESSAGE_SIZE, but I'm hoping so 🤞
  2. Update Pants so that it chunks or streams the FindMissingBlobs calls to the server.

Pants version

2.19.0

OS

Both Linux and Mac OS.

@jasonwbarnett changed the title to "remote cache ResourceExhausted/OutOfRange errors" Mar 14, 2024
@huonw added the remote label Mar 14, 2024

jasonwbarnett (Contributor, Author) commented:

I just tested the change below and confirmed that it works as an alternative to modifying tonic's DEFAULT_MAX_RECV_MESSAGE_SIZE constant.

```diff
diff --git a/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs b/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs
index 839ee07d26..e9f7bb6955 100644
--- a/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs
+++ b/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs
@@ -92,9 +92,10 @@ impl Provider {
             Some((options.timeout, Metric::RemoteStoreRequestTimeouts)),
         );
 
+        let limit = 25 * 1024 * 1024;
         let byte_stream_client = Arc::new(ByteStreamClient::new(channel.clone()));
 
-        let cas_client = Arc::new(ContentAddressableStorageClient::new(channel.clone()));
+        let cas_client = Arc::new(ContentAddressableStorageClient::new(channel.clone()).max_decoding_message_size(limit));
 
         let capabilities_client = Arc::new(CapabilitiesClient::new(channel));
```

benjyw (Contributor) commented Mar 21, 2024

Thanks for the legwork @jasonwbarnett. So do we want to add an option to control max_decoding_message_size? That wouldn't be too hard to do.

jasonwbarnett (Contributor, Author) commented:

@benjyw Yes, please. That would be amazing!

tgolsson (Contributor) commented:

Wouldn't it be better to stick to the defaults and make multiple requests when the payload is too large? That's my preferred solution whenever possible, since it ensures compatibility with all servers.

jasonwbarnett (Contributor, Author) commented Mar 22, 2024

@tgolsson 100%. But if that isn't going to happen because it's, say, 10x the effort to implement, I'd rather have something that unblocks me than remain in a position with no viable options. Plus, I still think there are legitimate use cases for increasing the maximum message size; I believe it's more efficient when you have a lot of large messages.

That said, my vote would be to add the ability to tune the maximum message size, because I think it's valuable both tactically and strategically in certain situations. Even more valuable would be updating the code paths to chunk messages into appropriate sizes based on the configured maximum message size, which, as mentioned, would make this compatible with every server.

tgolsson (Contributor) commented:

Depending on the content, it's fairly trivial to chunk a too-big request. It might be more efficient too, since the remote server can process the smaller chunks in parallel and you can decode the responses in parallel. In the case of FindMissingBlobs, since each digest has a fixed serialized size, we know exactly how many digests fit in 4MB and can chunk the request accordingly.
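
To make that concrete, here's a minimal, self-contained sketch of the batching arithmetic (illustrative only; the names and the ~70-byte serialized-digest estimate are assumptions, not the actual engine change in #20708):

```rust
// Split a digest list into batches that each stay under the 4 MiB gRPC
// default message size, so every FindMissingBlobs call fits. Sketch only;
// the real fix lives in the Pants engine.
const MAX_MESSAGE_SIZE: usize = 4 * 1024 * 1024; // gRPC default
const APPROX_BYTES_PER_DIGEST: usize = 70; // assumed serialized Digest size

fn digest_batches<T>(digests: &[T]) -> std::slice::Chunks<'_, T> {
    // ~59,900 digests per request under the assumptions above
    digests.chunks(MAX_MESSAGE_SIZE / APPROX_BYTES_PER_DIGEST)
}

fn main() {
    let digests: Vec<u64> = (0..180_000).collect(); // stand-ins for real Digests
    let batches = digest_batches(&digests).count();
    println!("180K digests -> {batches} FindMissingBlobs calls"); // -> 4 calls
}
```

One request per batch keeps every call under the default limit regardless of how the server is configured.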

tgolsson (Contributor) commented:

@jasonwbarnett if you have a repro, can you validate #20708 fixes it?

jasonwbarnett (Contributor, Author) commented:

> @jasonwbarnett if you have a repro, can you validate #20708 fixes it?

I totally missed this. Let me test tomorrow and let you know.

jasonwbarnett (Contributor, Author) commented:

@tgolsson I tested #20708 and never saw any errors! 🎉

tgolsson (Contributor) commented:

Thanks! I've been terribly busy lately but hopefully I'll get around to merging it soon.
