Is Remote Build without the Bytes handled correctly? #185

Closed
brentleyjones opened this issue Feb 7, 2020 · 13 comments · Fixed by #187

@brentleyjones

I've recently reached the maximum size in my local instance of bazel-remote, and just after that I got a 404 during a build for a CAS item that wasn't found, but should have been there, since it was returned from the AC.

When bazel-remote returns something from the AC does it also increase the TTL of the corresponding CAS items (touching them or otherwise)?

@brentleyjones
Author

brentleyjones commented Feb 7, 2020

The cache is not under pressure; each build adds at most 1/20th of the cache's capacity.

@brentleyjones
Author

brentleyjones commented Feb 7, 2020

To be clear, the Bazel error we get because of this is:

ERROR: /Users/svbuildmobile/workspace/Flagship_iOS/Bazel_Unit_Test_Suite/AccountInterfaces/BUILD.bazel:3:1: Couldn't build file AccountInterfaces/Tests.xctest: Bundling, processing and signing Tests.__internal__.__test_bundle failed due to unexpected I/O exception: Failed to fetch file with hash 'f990e0f5a59e88ed0268b229f6b75193b60d966ae07848c2fd7778e82c4bb517' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.

@mostynb
Collaborator

mostynb commented Feb 7, 2020

bazel-remote evicts cache items in LRU order, and ActionCache items are checked to confirm that all referenced CAS items are available before returning a cache hit to the client (unless you specify --disable_http_ac_validation and you're using HTTP). So when the AC item is returned, it and all of the referenced items should be near the "most recently used" end of the cache index. How long those items will remain in the cache depends on the cache access pattern and the maximum size of the cache; there is no explicit TTL.

  • How large is your cache?
  • How large are the build artifacts for a single build?
  • How many clients are using the cache?
  • How many different configurations are the clients building?
  • Is your build hermetic/do you expect cache hits between developers building the same configuration?
  • How long does your build take?
  • Are you using a proxy backend? If so, what if any eviction algorithm does it use?
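
As a rough illustration of the LRU behavior described above, here is a minimal Go sketch of a size-bounded index in which a lookup moves the item to the most-recently-used end and an insert evicts from the other end. The names lruSketch, newLRUSketch, entry and the onEvict signature are made up for this example and are not bazel-remote's actual code; sizes are assumed to be non-negative.

```go
package main

import "container/list"

// entry is one cached blob: its key (hash) and its size in bytes.
type entry struct {
	key  string
	size int64
}

// lruSketch is a hypothetical size-bounded LRU index. It is an
// illustration only, not bazel-remote's sizedLRU type.
type lruSketch struct {
	maxSize int64
	curSize int64
	order   *list.List // front = most recently used, back = next to evict
	items   map[string]*list.Element
	onEvict func(key string) // called once for each evicted key; may be nil
}

func newLRUSketch(maxSize int64, onEvict func(key string)) *lruSketch {
	return &lruSketch{
		maxSize: maxSize,
		order:   list.New(),
		items:   make(map[string]*list.Element),
		onEvict: onEvict,
	}
}

// Add inserts or refreshes key, evicting from the LRU end until the new
// item fits. It returns false if the item is larger than the whole cache.
func (c *lruSketch) Add(key string, size int64) bool {
	if size > c.maxSize {
		return false
	}
	// Replacing an existing key: drop the old entry first (not an eviction).
	if el, ok := c.items[key]; ok {
		c.curSize -= el.Value.(*entry).size
		c.order.Remove(el)
		delete(c.items, key)
	}
	for c.curSize+size > c.maxSize {
		oldest := c.order.Back()
		e := oldest.Value.(*entry)
		c.order.Remove(oldest)
		delete(c.items, e.key)
		c.curSize -= e.size
		if c.onEvict != nil {
			c.onEvict(e.key)
		}
	}
	c.items[key] = c.order.PushFront(&entry{key: key, size: size})
	c.curSize += size
	return true
}

// Get reports whether key is present and, if so, moves it to the MRU end.
func (c *lruSketch) Get(key string) bool {
	el, ok := c.items[key]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}
```

In a model like this there is no timer involved: an item that was just touched by Get is only evicted after enough other items have been added or touched to push it to the back of the list.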

@brentleyjones
Author

brentleyjones commented Feb 7, 2020

How large is your cache?

20gb

How large are the build artifacts for a single build?

At most 1gb (clean build), normally less than 500mb

How many clients are using the cache?

1-2 (local to each CI instance), but when this has happened it's just the single client.

How many different configurations are the clients building?

All the same. It's a CI build with locked down options.

Is your build hermetic/do you expect cache hits between developers building the same configuration?

For this use case, yes, since it's just the local CI instance, not being used remotely.

How long does your build take?

4 minutes.

Are you using a proxy backend? If so, what if any eviction algorithm does it use?

No.


Based on what you said, I assume the AC entry is validated, but the CAS item is then evicted between validation and the client asking for it. Given my stats above, that shouldn't happen if validating the AC entry counts as a cache hit on its CAS entries (my assumption is that this step isn't happening).

@brentleyjones
Author

I'm also at f99b78c in case that matters.

@mostynb
Collaborator

mostynb commented Feb 7, 2020

Are you restarting bazel-remote between runs? If so, this might be caused by #118.

In order to debug this, I think you will need a build with some extra logging. In particular, you should make GetValidatedActionResult in cache/disk/disk.go produce a log line for each of the CAS entries it looks up and whether or not they were found (and log the AC hash on the same line). You should also modify the anonymous onEvict function (also in disk.go) to print the key that is being evicted.

Background: for each referenced CAS blob DiskCache.GetValidatedActionResult calls DiskCache.Contains which in turn calls sizedLRU.Get which moves the item (if it exists) to the MRU end of the index.

Then restart bazel-remote, and redirect the stderr and stdout to a log file, then make a build that triggers this error, and save the log file. Grep the log file for the missing hash blob, and share the results here.

I am not familiar with bazel's build-without-the-bytes implementation, but this should give us a timeline of what it's doing so we can figure out if bazel-remote is behaving properly.
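
The logging suggested above could look roughly like the following Go sketch: one log line per CAS lookup, tagged with the AC hash and whether the blob was found. Everything here (casIndex, validateWithLogging, mapIndex, the log format) is hypothetical and only mirrors the shape of the idea; it is not the code in disk.go, and the real function may return on the first miss rather than checking every blob.

```go
package main

import "log"

// casIndex is a hypothetical stand-in for the disk cache's index. Contains
// is assumed to move a found blob to the MRU end, as described in the
// background note in the comment above.
type casIndex interface {
	Contains(hash string) bool
}

// validateWithLogging looks up every CAS hash referenced by one AC entry
// and emits one log line per lookup, including the AC hash. It keeps going
// after a miss so that every lookup appears in the log.
// If logf is nil, log.Printf is used.
func validateWithLogging(idx casIndex, acHash string, casHashes []string,
	logf func(format string, args ...interface{})) bool {
	if logf == nil {
		logf = log.Printf
	}
	allFound := true
	for _, h := range casHashes {
		found := idx.Contains(h)
		logf("AC %s: CAS %s found=%t", acHash, h, found)
		if !found {
			allFound = false
		}
	}
	return allFound
}

// mapIndex is a trivial in-memory casIndex for trying the sketch out.
type mapIndex map[string]bool

func (m mapIndex) Contains(hash string) bool { return m[hash] }
```

Because the AC hash and the CAS hash sit on the same line, searching the log for the missing blob's hash shows every validation that touched it; an eviction log line for the same key would then place the eviction before or after those lookups.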

@brentleyjones
Author

Are you restarting bazel-remote between runs?

No.


I'll look into making the debug edits you suggested.

@brentleyjones
Author

This is going to take me a little while to get to, but I do plan on digging into it.

@mostynb
Collaborator

mostynb commented Feb 10, 2020

GetValidatedActionResult might need some work. In particular, OutputDirectories might need to be recursed into; I don't quite remember the distinction between OutputFiles and files in the set of OutputDirectories.

I will try to take a look tomorrow.
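
To make the distinction concrete: in the Remote Execution API, each output directory of an ActionResult points at a Tree blob, and that Tree in turn lists the files of the root directory and of its child directories. The Go sketch below shows a validation that checks those files as well as the Tree blob itself. The struct and interface names are simplified, hypothetical stand-ins (plain strings instead of protobuf digests), not the generated API types or bazel-remote's code.

```go
package main

// Simplified, hypothetical stand-ins for the Remote Execution API messages.
// The real types are generated protobuf structs with more fields.
type fileNode struct{ Digest string }

type directory struct{ Files []fileNode }

type tree struct {
	Root     directory
	Children []directory // all directories below the root, flattened
}

type actionResult struct {
	OutputFiles          []string // digests of individual output files
	OutputDirectoryTrees []string // digests of Tree blobs, one per output directory
}

// blobStore is a hypothetical view of the CAS.
type blobStore interface {
	Contains(digest string) bool
	// LoadTree returns the decoded Tree stored under digest, if present.
	LoadTree(digest string) (tree, bool)
}

// allBlobsPresent reports whether every blob reachable from ar is in the
// CAS: the output files, each Tree blob, and every file listed in each Tree.
func allBlobsPresent(cas blobStore, ar actionResult) bool {
	for _, d := range ar.OutputFiles {
		if !cas.Contains(d) {
			return false
		}
	}
	for _, td := range ar.OutputDirectoryTrees {
		t, ok := cas.LoadTree(td)
		if !ok {
			return false // the Tree blob itself is missing
		}
		dirs := append([]directory{t.Root}, t.Children...)
		for _, dir := range dirs {
			for _, f := range dir.Files {
				if !cas.Contains(f.Digest) {
					return false // a file inside the directory is missing
				}
			}
		}
	}
	return true
}

// memCAS is a tiny in-memory blobStore for trying the sketch out.
type memCAS struct {
	blobs map[string]bool
	trees map[string]tree
}

func (m memCAS) Contains(d string) bool { return m.blobs[d] }

func (m memCAS) LoadTree(d string) (tree, bool) {
	t, ok := m.trees[d]
	return t, ok
}
```

A check that stops at the Tree blob would report a hit even when a file inside the directory has been evicted, which matches the kind of failure a client sees when it later asks for that file.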

mostynb added a commit to mostynb/bazel-remote that referenced this issue Feb 11, 2020
Before returning ActionCache cache hits, we need to confirm that we
have all the referenced blobs in the CAS. Previously, we checked the
OutputDirectory references themselves, but not the blobs that they
referenced.  This might have broken bazel "builds without the bytes"
in some cases.

Possible fix for buchgr#185.
@mostynb
Collaborator

mostynb commented Feb 11, 2020

I wonder if you could test #187 ?

@brentleyjones
Author

Yes. I'll report back within 2 hours.

@brentleyjones
Author

#187 looks to have fixed the errors I was receiving.

mostynb added a commit to mostynb/bazel-remote that referenced this issue Feb 11, 2020
Before returning ActionCache cache hits, we need to confirm that we
have all the referenced blobs in the CAS. Previously, we checked the
OutputDirectory references themselves, but not the blobs that they
referenced.  This could break bazel "builds without the bytes" in
some cases.

Fixes buchgr#185.
mostynb added a commit that referenced this issue Feb 11, 2020
Before returning ActionCache cache hits, we need to confirm that we
have all the referenced blobs in the CAS. Previously, we checked the
OutputDirectory references themselves, but not the blobs that they
referenced.  This could break bazel "builds without the bytes" in
some cases.

Fixes #185.
@mostynb
Collaborator

mostynb commented Feb 11, 2020

Great, thanks for the bug report and for helping test this.
