AC cache misses when action result exists on S3 backend #767
Comments
Hmm, interesting. I don't think any of the changes between 2.4.3 and 2.4.4 would be relevant to this. On the problematic bazel-remote instance, are you saying that you request this AC entry with curl, it shows the S3 DOWNLOAD OK log line, then the next log line is a GET 404 for the same AC entry? Are you running this bazel-remote instance with |
Yes, that's it! And when I request the same AC entry again later on, there's no S3 line; it directly logs GET 404.
We run with the following:
do note that other instances with the same configuration are able to serve the AC entry fine. Also, the AC entry is ~5MiB. |
On the first request, the AC entry is downloaded from S3 and stored in the local disk cache. The second request gets the AC entry directly from the local disk cache. If you're getting "not found" then I suspect the entry is failing validation. If the entry was validated successfully, then I think you would see log lines for each of the CAS blobs that are checked. I will try to improve the logging in this case; I might have a test build for you to try tomorrow. |
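The behaviour described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not bazel-remote's code: `tieredCache`, `actionResult` and `getValidated` are hypothetical names, and validation is modelled simply as "every referenced CAS blob is present locally", which may differ from what the real validation checks.

```go
package main

import (
	"errors"
	"fmt"
)

// actionResult is a stand-in for an AC entry: it only lists the CAS
// digests that the entry references.
type actionResult struct {
	outputDigests []string
}

// tieredCache is a hypothetical two-level cache: a local disk tier in
// front of a remote backend such as S3. It is not bazel-remote's type.
type tieredCache struct {
	localAC  map[string]actionResult
	remoteAC map[string]actionResult
	localCAS map[string]bool
}

var errNotFound = errors.New("not found")

// getValidated looks up an AC entry, falling back to the backend on a
// local miss, and reports where the entry came from. The entry is only
// returned if the (assumed) validation step passes.
func (c *tieredCache) getValidated(hash string) (actionResult, string, error) {
	source := "disk"
	ar, ok := c.localAC[hash]
	if !ok {
		ar, ok = c.remoteAC[hash]
		if !ok {
			return actionResult{}, "", errNotFound
		}
		// Keep a local copy so later requests skip the backend.
		c.localAC[hash] = ar
		source = "backend"
	}
	// Assumed validation: every referenced CAS blob must be present.
	for _, d := range ar.outputDigests {
		if !c.localCAS[d] {
			return actionResult{}, source, errNotFound
		}
	}
	return ar, source, nil
}

func main() {
	c := &tieredCache{
		localAC:  map[string]actionResult{},
		remoteAC: map[string]actionResult{"ac1": {outputDigests: []string{"blobA"}}},
		localCAS: map[string]bool{},
	}
	for i := 1; i <= 2; i++ {
		_, source, err := c.getValidated("ac1")
		fmt.Printf("request %d: source=%s err=%v\n", i, source, err)
	}
}
```

In this model the first request reaches the backend and the second is served from disk, yet both report "not found" because validation fails. That matches the pattern in the logs, although the actual reason for the failure on the real instance is still unknown.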
Thanks! If we can additionally migrate without wiping the disk cache, that would be ideal, as we're more likely to be able to reproduce the issue with that AC entry. |
I haven't found any interesting places to add extra logging yet, unfortunately. Are you using the http or grpc interface? Can you try upgrading the problematic instance to v2.4.4? You should not need to wipe the disk cache when doing this. |
Using grpc with Bazel but I used the HTTP interface to reproduce the issue with
Will do! I'll report back and let you know if the problem persists. Thanks for the quick feedback! |
@mostynb Updated to 2.4.4 and still seeing the issue. Here are the logs for an instance showing the issue (note that this is now happening on a different artifact, on a different instance, and most of the instances still work as expected):
Note how at |
By convention, Bazel uses the hash of the Action (the command to run) as the lookup key for the ActionResult that it creates. Bazel is probably uploading the Action for debugging purposes (or maybe for consistency with the remote execution setup), even though it is not strictly required when only using remote caching. This probably explains why you see the same hash used for both an AC entry and a CAS entry. There are two relevant functions to look at: diskCache.GetValidatedActionResult and diskCache.get. The calling code looks at the values returned by GetValidatedActionResult; if the *pb.ActionResult is nil and the error is nil, then a "not found" log is emitted. That happens in these places:
I added some extra debug logs to the places above in this branch in my fork:
If not, we can look into diskCache.Get: if diskCache.get (called by the diskCache.Get wrapper) returns a nil io.ReadCloser and a nil error, then the caller treats that blob as not found. This happens in these places:
|
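The "nil value and nil error means not found" convention described in the comment above can be sketched as follows. `blobStore`, `get` and `classify` are hypothetical names for illustration only; they are not bazel-remote's functions, and the real callers may handle more cases than these three.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// blobStore is a hypothetical in-memory store used only for this sketch.
type blobStore map[string]string

// get follows the convention described above: a missing blob is reported
// as (nil, nil) rather than as an error.
func (s blobStore) get(hash string) (io.ReadCloser, error) {
	if hash == "" {
		return nil, errors.New("empty hash")
	}
	data, ok := s[hash]
	if !ok {
		return nil, nil
	}
	return io.NopCloser(strings.NewReader(data)), nil
}

// classify shows the three outcomes a caller has to tell apart.
func classify(rc io.ReadCloser, err error) string {
	switch {
	case err != nil:
		return "error"
	case rc == nil:
		return "not found"
	default:
		rc.Close()
		return "hit"
	}
}

func main() {
	s := blobStore{"abc": "payload"}
	for _, h := range []string{"abc", "missing", ""} {
		fmt.Printf("%q -> %s\n", h, classify(s.get(h)))
	}
}
```

One consequence of this convention is that a "not found" response does not say why the lookup came back empty, which is why extra debug logging at the call sites helps narrow things down.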
Thanks a lot! I'm seeing a lot of Here's what's happening:
All instances have the exact same configuration:
Let me know if you need more logs/info/etc! |
I have a bazel-remote v2.4.3 instance that serves 404 on Action Results that exist in its configured S3 backend. Other instances connected to the same S3 backend, with the same configuration, are able to serve those Action Results.
One action cache entry that surfaces the issue is 6e328d045a314d775457aaec080af79c1ae24088e67e2c62bbc268632eff9560. I've tried querying the bazel-remote instance with curl to reproduce the issue, and the implementation seems close enough to the gRPC implementation for this to be relevant (both go through GetValidatedActionResult). I'm definitely seeing the same behavior when going through gRPC/bazel.
This is the request I'm sending:
The first log line I see appears to be a hit on the S3 backend for that entry:
However, the line after that suggests the artifact is missing and bazel-remote returned a 404 (curl also shows a 404):
What's slightly surprising is that the second time I run the curl command, I do not see the instance querying the S3 backend; instead it logs a 404 directly:
I can get the AC entry from another instance connected to the same backend:
protobuf output
2 { 1: "package.cache" 2 { 1: "903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa" 2: 2693 } 4: 1 } 2 { 1: "libHSprimitive-0.8.0.0.a" 2 { 1: "6d7227f669ed2c506033c2a5c4556b387b125f58d77c80770c97819acfc40a20" 2: 1170436 } 4: 1 } ... 6 { 1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" } 8 { 1: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" } ... }
and for each CAS blob returned by the (working) instance, I'm able to fetch them from the original/failing instance:
GET /cas/903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa -> 200
and yet the /ac/ call still returns 404, even after all the CAS entries were curled.
Any idea what might be happening here? The instance hasn't been upgraded to 2.4.4 in case something needs to be reproduced, though happy to close this issue if this has been fixed in 2.4.4.
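The manual check above (fetching each CAS blob referenced by the AC entry from the failing instance) could be scripted roughly as below. This is a sketch under assumptions: `casStatus` and `fakeServer` are hypothetical helpers, the `/cas/<hash>` path is taken from the request shown above, and the test server only stands in for a real instance so that the example runs on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// casStatus issues GET <baseURL>/cas/<hash> for each hash and records
// the HTTP status code that comes back.
func casStatus(client *http.Client, baseURL string, hashes []string) (map[string]int, error) {
	out := make(map[string]int, len(hashes))
	for _, h := range hashes {
		resp, err := client.Get(baseURL + "/cas/" + h)
		if err != nil {
			return nil, err
		}
		resp.Body.Close()
		out[h] = resp.StatusCode
	}
	return out, nil
}

// fakeServer is a stand-in for a cache instance: it answers 200 for
// known hashes and 404 for everything else.
func fakeServer(known map[string]bool) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := strings.TrimPrefix(r.URL.Path, "/cas/")
		if known[h] {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.NotFound(w, r)
	}))
}

func main() {
	hash := "903494c63057c6076dc4bdc43940cf93866a2d8a3af538964bfbf62e06b793fa"
	srv := fakeServer(map[string]bool{hash: true})
	defer srv.Close()

	statuses, err := casStatus(srv.Client(), srv.URL, []string{hash})
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	for h, code := range statuses {
		fmt.Printf("GET /cas/%s -> %d\n", h, code)
	}
}
```

Pointing `casStatus` at a real instance instead of the test server would only require passing that instance's base URL and an ordinary `http.Client`.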