
1787-Support_multiple_hash_functions #1791

Open · wants to merge 3 commits into main

Conversation


@Robinbux Robinbux commented Jul 4, 2024

Feature to support multiple hash functions, as described here.

I've flagged the parts I'm uncertain about in review comments. Once those are discussed, I will remove the comments and adjust the unit tests before a potential merge.

@Robinbux Robinbux requested a review from werkt as a code owner July 4, 2024 13:03

google-cla bot commented Jul 4, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up-to-date status, view the checks section at the bottom of the pull request.

@@ -71,20 +74,21 @@ public DigestFunction.Value getDigestFunction() {
return DigestFunction.Value.UNKNOWN;
}

// Correct me if I'm wrong but this function doesn't make any sense?
Author:

I am referring to the old function here.

@@ -80,6 +80,20 @@ java_library(
],
)

java_library(
name = "digestUtil",
srcs = ["DigestUtil.java"],
Collaborator:

is this also included in :common above?

Author:

Yes it is, but if I add common to src/main/java/build/buildfarm/common/resources/BUILD we get a circular dependency problem, so I created a separate target for it.

@@ -247,6 +243,13 @@ public SettableFuture<Long> load(Digest digest) {
private final transient Entry header = new SentinelEntry();
private volatile long unreferencedEntryCount = 0;

private static final int MAX_ENTRIES = 10000;
private static final Cache<String, DigestFunction.Value> digest_to_function_map =
    Caffeine.newBuilder()
        .expireAfterWrite(2, TimeUnit.MINUTES)
        .maximumSize(MAX_ENTRIES)
        .build();

Why do we need to introduce this cache?

Author:

Without it, this change requires a fairly large refactoring: changing many chained functions, the corresponding interfaces, and, as a result, all other implementations of those. This seemed like the cleanest solution to me.

Contributor (@amishra-u) left a comment:

Let's first target supporting multiple hashing algorithms for write requests, and update the PR summary accordingly.

@@ -101,6 +102,14 @@ Write getWrite(
Compressor.Value compression, Digest digest, UUID uuid, RequestMetadata requestMetadata)
throws EntryLimitException;

@Nullable
default Write getWrite(
Contributor:

There's no need to create a new overloaded method; you can add the additional parameter to the current method and update all existing implementations.
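
As a rough sketch of what this suggestion implies (the parameter order and placement are assumptions, not the PR's actual change), the existing interface method would carry the digest function directly instead of gaining an overload:

@Nullable
Write getWrite(
    Compressor.Value compression,
    Digest digest,
    DigestFunction.Value digestFunction, // hypothetical added parameter
    UUID uuid,
    RequestMetadata requestMetadata)
    throws EntryLimitException;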

src/main/java/build/buildfarm/cas/cfc/CASFileCache.java (outdated; resolved)
@@ -981,6 +987,8 @@ public synchronized FeedbackOutputStream getOutput(
throws IOException {
// caller will be the exclusive owner of this write stream. all other requests
Contributor:

misplaced comment

Author:

Not my comment, but I can remove it 🤷‍♂️

Contributor:

I mean the comment got misplaced because of this code in the middle:

// Map identifier to digest function
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());

@@ -981,6 +987,8 @@ public synchronized FeedbackOutputStream getOutput(
throws IOException {
// caller will be the exclusive owner of this write stream. all other requests
// will block until it is returned via a close.
// Map identifier to digest function
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());
Contributor:

This seems like a workaround to store the hashing algorithm for digests. It also introduces extra overhead to maintain a cache.

private static final int MAX_ENTRIES = 10000;
private static final Cache<String, DigestFunction.Value> digest_to_function_map =
    Caffeine.newBuilder()
        .expireAfterWrite(2, TimeUnit.MINUTES)
        .maximumSize(MAX_ENTRIES)
        .build();
Contributor:

What happens if there are more than 10k write requests in 2 minutes?

Author:

Doesn't matter; this is basically an LRU cache. As long as there aren't 10k requests (or 2 minutes) between the getOutput and putOrReferenceGuarded calls, it works. But we can also increase the size if you think that is not sufficient.
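
A minimal sketch of the pattern being debated (the lookup site in putOrReferenceGuarded is an assumption based on this thread, not quoted from the PR):

// In getOutput: remember which digest function produced this hash.
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());

// Later, in putOrReferenceGuarded: recover it by hash. If more than
// MAX_ENTRIES distinct writes (or 2 minutes) pass in between, the entry
// may have been evicted and getIfPresent returns null.
DigestFunction.Value function = digest_to_function_map.getIfPresent(digest.getHash());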

Contributor:

You may increase it, but doing so will add extra overhead. Using an additional data structure (accessible by multiple threads) to pass a value from one method to another may seem like a shortcut to avoid further code changes. However, it's better to make the necessary changes rather than introduce a new data structure.

Author:

Fair, but that then requires quite a large code change, and even more so if, as you suggested, we don't add a new overloaded method but instead change the existing interface methods as well as all implementations of them.

Contributor:

Yes, there will be small changes in multiple places in the code. Let's try to do it the proper way so it can be upstreamed.

Comment on lines +312 to +316
} else {
  // Infer the digest function based on the hash format
  // Should go here when it's not BLAKE3
  String hash = maybeDigestFunction;
  builder.setDigestFunction(DigestUtil.HashFunction.forHash(hash).getDigestFunction());
Contributor:

The DigestFunction is always included in the request, so this additional handling might not be needed. There are several other hashing algorithms with the same hash length as BLAKE3, so it might be better to remove the HashFunction.forHash method entirely and always rely on the request.

Author:

I don't see it always included in the request: for BLAKE3 it is, but for SHA it isn't. I assume that if we want to change that, we would need to change Bazel OSS first.

Contributor:

I thought the hash function was omitted from the request only for SHA256, since SHA256 is Bazel's default hashing algorithm. However, upon checking the Bazel code, I found that it classifies hashing algorithms into two groups: old and new. For the old ones, where the hashing algorithm can be identified from the hash length, the hash function is not part of the request. For the new ones, the hash function is included in the request. The hash detection logic should complement this Bazel method:
https://github.com/bazelbuild/bazel/blob/1dd5c20513b6ff479bf796fe2541bd492c661c94/src/main/java/com/google/devtools/build/lib/remote/ByteStreamUploader.java#L183
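
For context, the ByteStream upload resource names in the Remote Execution API reflect these two groups (the placeholders below follow the API convention and are illustrative, not quoted from Bazel):

// "Old" digest functions (SHA256, SHA1, MD5): inferable from hash length
{instance_name}/uploads/{uuid}/blobs/{hash}/{size}

// "New" digest functions (e.g. BLAKE3): named explicitly in the path
{instance_name}/uploads/{uuid}/blobs/{digest_function}/{hash}/{size}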

Author:

It is complementing that now, no? For new ones we extract it when it's part of the request; otherwise we determine it by length with this function I added:

public static HashFunction forHash(String hexDigest) {
  int hashLength = hexDigest.length();
  return switch (hashLength) {
    case 64 -> // 32 bytes * 2 characters per byte
        HashFunction.SHA256;
    case 40 -> // 20 bytes * 2 characters per byte
        HashFunction.SHA1;
    case 32 -> // 16 bytes * 2 characters per byte
        HashFunction.MD5;
    default -> throw new IllegalArgumentException("hash type unrecognized: " + hexDigest);
  };
}

Contributor:

Agreed, this complements it. Let's update the comment to something like "if the hash function is not part of the request, use the hash length to determine the hashing algorithm"; "BLAKE3" is too specific.

@@ -1906,7 +1898,7 @@ public boolean pollOperation(String operationName, ExecutionStage.Value stage) {

protected CacheCapabilities getCacheCapabilities() {
return CacheCapabilities.newBuilder()
.addDigestFunctions(digestUtil.getDigestFunction())
.addAllDigestFunctions(DigestUtil.getSupportedDigestFunctions())
Contributor:

The current change mostly focuses on supporting multiple hashing algorithms for write requests, but this change will signal to clients that Buildfarm supports multiple hashing algorithms for all APIs, including the ExecutionService.Execute API.

Author:

Then how would we make the change just for writes? If GetCapabilities doesn't say the server accepts SHA and BLAKE, the client will cancel the request when it uses a different algorithm.

Contributor:

This could stay an internal patch until support for multiple hash functions is enabled for all APIs.
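
For reference, a sketch of what the new getSupportedDigestFunctions in the diff above might return; the exact set is an assumption based on the functions discussed in this PR, not the PR's actual code:

public static List<DigestFunction.Value> getSupportedDigestFunctions() {
  // Hypothetical: advertise every function the CAS can serve, rather than
  // only the single configured one.
  return ImmutableList.of(
      DigestFunction.Value.SHA256,
      DigestFunction.Value.SHA1,
      DigestFunction.Value.MD5,
      DigestFunction.Value.BLAKE3);
}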

Contributor (@amishra-u) left a comment:

Please make sure these checks pass:
https://buildkite.com/bazel/buildfarm-farmer/builds/8542


Comment on lines +80 to +82
// It is always calling the same isValidHexDigest function with the same parameters.
// So it will always either return SHA256, or throw an exception.
// With the new method we still can't distinguish between SHA256 and BLAKE3, but at least it's better...
Contributor:

This doesn't seem correct. Yes, it calls the same function, but on a different instance of DigestFunction, so for each call the value of hash.bits will be different.

You may leave this code as it is, since it doesn't change the behavior.
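
To illustrate the point (the names below are assumptions; buildfarm's actual DigestUtil may differ): each instance wraps its own hash function, so the same validation code checks a different expected hex length per instance.

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

class DigestValidator {
  private final HashFunction hashFn;

  DigestValidator(HashFunction hashFn) {
    this.hashFn = hashFn;
  }

  // SHA256 -> 256 bits -> 64 hex chars; SHA1 -> 160 bits -> 40 hex chars.
  boolean isValidHexDigest(String hexDigest) {
    return hexDigest != null && hexDigest.length() * 4 == hashFn.bits();
  }
}

// Usage: the same method, different instances, different accepted lengths.
// new DigestValidator(Hashing.sha256()).isValidHexDigest(hash)
// new DigestValidator(Hashing.sha1()).isValidHexDigest(hash)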

