
1787-Support_multiple_hash_functions #1791

Open · wants to merge 3 commits into main

Conversation


@Robinbux Robinbux commented Jul 4, 2024

Feature to support multiple hash functions, as described here.

I've flagged the parts I'm uncertain about in review comments. Once those are discussed, I will remove the comments and adjust the unit tests before a potential merge.

@Robinbux Robinbux requested a review from werkt as a code owner July 4, 2024 13:03

google-cla bot commented Jul 4, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@@ -71,20 +74,21 @@ public DigestFunction.Value getDigestFunction() {
return DigestFunction.Value.UNKNOWN;
}

// Correct me if I'm wrong but this function doesn't make any sense?
Author:

I am referring to the old function here.

@@ -80,6 +80,20 @@ java_library(
],
)

java_library(
name = "digestUtil",
srcs = ["DigestUtil.java"],
Collaborator:

is this also included in :common above?

Author:

Yes it is, but if I add common to src/main/java/build/buildfarm/common/resources/BUILD we get a circular dependency problem, so I created a separate target for it.

@@ -247,6 +243,13 @@ public SettableFuture<Long> load(Digest digest) {
private final transient Entry header = new SentinelEntry();
private volatile long unreferencedEntryCount = 0;

private static final int MAX_ENTRIES = 10000;
private static final Cache<String, DigestFunction.Value> digest_to_function_map =
    Caffeine.newBuilder()
        .expireAfterWrite(2, TimeUnit.MINUTES)
        .maximumSize(MAX_ENTRIES)
        .build();

Why do we need to introduce this cache?

Author:

Without it, this change requires a fairly large refactoring: changing many chained functions, the corresponding interfaces, and, as a result, all other implementations of those. This seemed like the cleanest solution to me.

Contributor (@amishra-u) left a comment:

Let's first target supporting multiple hashing algorithms for write requests, and update the PR summary accordingly.

@@ -101,6 +102,14 @@ Write getWrite(
Compressor.Value compression, Digest digest, UUID uuid, RequestMetadata requestMetadata)
throws EntryLimitException;

@Nullable
default Write getWrite(
Contributor:

There's no need to create a new overloaded method; you can add the additional parameter to the current method and update all existing implementations.
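
As a rough sketch of what this suggestion implies (the parameter order and placement are assumptions, not the PR's actual change), the existing interface method would carry the digest function directly instead of gaining an overload:

@Nullable
Write getWrite(
    Compressor.Value compression,
    Digest digest,
    DigestFunction.Value digestFunction, // hypothetical added parameter
    UUID uuid,
    RequestMetadata requestMetadata)
    throws EntryLimitException;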

src/main/java/build/buildfarm/cas/cfc/CASFileCache.java (outdated; resolved)
@@ -981,6 +987,8 @@ public synchronized FeedbackOutputStream getOutput(
throws IOException {
// caller will be the exclusive owner of this write stream. all other requests
Contributor:

misplaced comment

Author:

Not my comment, but I can remove it 🤷‍♂️

Contributor:

I mean the comment got misplaced because of this code in the middle:

// Map identifier to digest function
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());

@@ -981,6 +987,8 @@ public synchronized FeedbackOutputStream getOutput(
throws IOException {
// caller will be the exclusive owner of this write stream. all other requests
// will block until it is returned via a close.
// Map identifier to digest function
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());
Contributor:

This seems like a workaround to store the hashing algorithm for digests. It also introduces extra overhead to maintain a cache.

private static final int MAX_ENTRIES = 10000;
private static final Cache<String, DigestFunction.Value> digest_to_function_map =
    Caffeine.newBuilder()
        .expireAfterWrite(2, TimeUnit.MINUTES)
        .maximumSize(MAX_ENTRIES)
        .build();
Contributor:

What happens if there are more than 10k write requests in 2 minutes?

Author:

Doesn't matter; this is basically an LRU cache. As long as there aren't 10k requests (or 2 minutes) between the getOutput and putOrReferenceGuarded calls, it works. But we can also increase the size if you think that is not sufficient.
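
A minimal sketch of the pattern being debated (the lookup site in putOrReferenceGuarded is an assumption based on this thread, not quoted from the PR):

// In getOutput: remember which digest function produced this hash.
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());

// Later, in putOrReferenceGuarded: recover it by hash. If more than
// MAX_ENTRIES distinct writes (or 2 minutes) pass in between, the entry
// may have been evicted and getIfPresent returns null.
DigestFunction.Value function = digest_to_function_map.getIfPresent(digest.getHash());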

Contributor:

You may increase it, but doing so will add extra overhead. Using an additional data structure (accessible by multiple threads) to pass a value from one method to another may seem like a shortcut to avoid further code changes. However, it's better to make the necessary changes rather than introduce a new data structure.

Author:

Fair, but that then requires quite a large code change, and even more so if, as you suggested, we don't add a new overloaded method but instead change the existing interface methods as well as all implementations of them.

Contributor:

Yes, there will be small changes in multiple places in the code. Let's try to do it the proper way so it can be upstreamed.

Comment on lines +312 to +316
} else {
  // Infer the digest function based on the hash format
  // Should go here when it's not BLAKE3
  String hash = maybeDigestFunction;
  builder.setDigestFunction(DigestUtil.HashFunction.forHash(hash).getDigestFunction());
Contributor:

The DigestFunction is always included in the request, so this additional handling might not be needed. There are several other hashing algorithms with the same hash length as BLAKE3, so it might be better to remove the HashFunction.forHash method entirely and always rely on the request.

Author:

I don't see it always included in the request: for BLAKE3 it is, but for SHA it isn't. I assume that if we want to change that, we would need to change Bazel OSS first.

Contributor:

I thought the hash function was omitted from the request only for SHA256, since SHA256 is Bazel's default hashing algorithm. However, upon checking the Bazel code, I found that it classifies hashing algorithms into two groups: old and new. For the old ones, where the hashing algorithm can be identified from the hash length, the hash function is not part of the request. For the new ones, the hash function is included in the request. The hash detection logic should complement this Bazel method:
https://github.com/bazelbuild/bazel/blob/1dd5c20513b6ff479bf796fe2541bd492c661c94/src/main/java/com/google/devtools/build/lib/remote/ByteStreamUploader.java#L183
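
For context, the ByteStream upload resource names in the Remote Execution API reflect these two groups (the placeholders below follow the API convention and are illustrative, not quoted from Bazel):

// "Old" digest functions (SHA256, SHA1, MD5): inferable from hash length
{instance_name}/uploads/{uuid}/blobs/{hash}/{size}

// "New" digest functions (e.g. BLAKE3): named explicitly in the path
{instance_name}/uploads/{uuid}/blobs/{digest_function}/{hash}/{size}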

Author:

It is complementing that now, no? For new ones we extract it when it's part of the request; otherwise we determine it by length with this function I added:

public static HashFunction forHash(String hexDigest) {
  int hashLength = hexDigest.length();
  return switch (hashLength) {
    case 64 -> // 32 bytes * 2 characters per byte
        HashFunction.SHA256;
    case 40 -> // 20 bytes * 2 characters per byte
        HashFunction.SHA1;
    case 32 -> // 16 bytes * 2 characters per byte
        HashFunction.MD5;
    default -> throw new IllegalArgumentException("hash type unrecognized: " + hexDigest);
  };
}

Contributor:

Agreed, this complements it. Let's update the comment to something like "if the hash function is not part of the request, use the hash length to determine the hashing algorithm"; "BLAKE3" is too specific.

@@ -1906,7 +1898,7 @@ public boolean pollOperation(String operationName, ExecutionStage.Value stage) {

protected CacheCapabilities getCacheCapabilities() {
return CacheCapabilities.newBuilder()
.addDigestFunctions(digestUtil.getDigestFunction())
.addAllDigestFunctions(DigestUtil.getSupportedDigestFunctions())
Contributor:

The current change mostly focuses on supporting multiple hashing algorithms for write requests, but this change will signal to clients that Buildfarm supports multiple hashing algorithms for all APIs, including the ExecutionService.Execute API.

Author:

Then how would we make the change just for writes? If GetCapabilities doesn't say the server accepts SHA and BLAKE, the client will cancel the request when it uses a different algorithm.

Contributor:

This could stay an internal patch until support for multiple hash functions is enabled for all APIs.
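
For reference, a sketch of what the new getSupportedDigestFunctions in the diff above might return; the exact set is an assumption based on the functions discussed in this PR, not the PR's actual code:

public static List<DigestFunction.Value> getSupportedDigestFunctions() {
  // Hypothetical: advertise every function the CAS can serve, rather than
  // only the single configured one.
  return ImmutableList.of(
      DigestFunction.Value.SHA256,
      DigestFunction.Value.SHA1,
      DigestFunction.Value.MD5,
      DigestFunction.Value.BLAKE3);
}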

Contributor (@amishra-u) left a comment:

Please make sure these checks pass:
https://buildkite.com/bazel/buildfarm-farmer/builds/8542


Comment on lines +80 to +82
// It is always calling the same isValidHexDigest function with the same parameters.
// So it will always either return SHA256, or throw an exception.
// With the new method we still can't distinguish between SHA256 and BLAKE3, but at least it's better...
Contributor:

This doesn't seem correct. Yes, it calls the same function, but on a different instance of DigestFunction, so for each call the value of hash.bits will be different.

You may leave this code as it is, since it doesn't change the behavior.
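
To illustrate the point (the names below are assumptions; buildfarm's actual DigestUtil may differ): each instance wraps its own hash function, so the same validation code checks a different expected hex length per instance.

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

class DigestValidator {
  private final HashFunction hashFn;

  DigestValidator(HashFunction hashFn) {
    this.hashFn = hashFn;
  }

  // SHA256 -> 256 bits -> 64 hex chars; SHA1 -> 160 bits -> 40 hex chars.
  boolean isValidHexDigest(String hexDigest) {
    return hexDigest != null && hexDigest.length() * 4 == hashFn.bits();
  }
}

// Usage: the same method, different instances, different accepted lengths.
// new DigestValidator(Hashing.sha256()).isValidHexDigest(hash)
// new DigestValidator(Hashing.sha1()).isValidHexDigest(hash)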

