[MCLEAN-109] Add METRO hash implementation #58

gnodet · 2023-04-05T09:45:32Z

This PR provides new METRO / METRO_MM hash algorithms based on Zero-Allocation-Hashing Metro hash implementation.
Given it's the fastest, the default is changed to it and the website modified accordingly.
This is not very clear when/if people should change.
Also, I would have assumed the MM versions of the hashes provide better performances, but the added perf test does not seem to indicate the
same:

Benchmark           Mode  Cnt  Score   Error   Units
PerfTest.METRO     thrpt    3  0.419 ± 0.445  ops/ms
PerfTest.METRO_MM  thrpt    3  0.364 ± 0.687  ops/ms
PerfTest.SHA1      thrpt    3  0.190 ± 0.050  ops/ms
PerfTest.SHA256    thrpt    3  0.147 ± 0.004  ops/ms
PerfTest.XX        thrpt    3  0.382 ± 0.582  ops/ms
PerfTest.XXMM      thrpt    3  0.333 ± 0.050  ops/ms

qweek · 2023-04-10T08:22:03Z

Hi, gnodet

I don't mind adding a new hashing algorithms, but wouldn't make it the default.
Last time original MetroHash repository was updated 5 years ago and is currently unmaintained.
jandrewrogers/MetroHash
Also, according to hash comparison benchmark MetroHash have same quality problems
rurban/smhasher
metrohash64_2: UB, LongNeighbors
rurban/smhasher/blob/master/doc/metrohash64_2.txt
We chose XX as most known, mature and fast option (with quality score 10)
Cyan4973/xxHash

Hash speed is highly dependent on operating system, as far as I remember in our tests on Linux, Memory Mapped version was faster, and on Windows it was slower (that's why we use both). It may also depend on the file system, CPU, etc
Unfortunately, Zero-Allocation-Hashing repository does not contain performance benchmark for Metro
OpenHFT/Zero-Allocation-Hashing#28
Smhasher benchmark is not relevant because it use C++ versions, but it says
"Fastest hash functions on x86_64 without quality problems are:
xxh3low
wyhash
ahash64
t1ha2_atonce
komihash
FarmHash (not portable, too machine specific: 64 vs 32bit, old gcc, ...)
halftime_hash128
Spooky32
pengyhash
nmhash32
mx3
MUM/mir (different results on 32/64-bit archs, lots of bad seeds to filter out)
fasthash32"
P.S. it seemed to me that we tested version with XX3, but I'm not sure that it got into final release

qweek · 2023-04-11T12:32:16Z

I have a couple more comments to your benchmark.

Throughput benchmark does not show you real picture, because it depends on IO performance and total load on the disk. This means that results will most likely contain outliers, and averaged throughput will show the wrong value. It is better in this case to use Mode.SampleTime with percentiles.
Benchmark results also depends on file size, disk cache, etc. Your benchmark includes small test suite of 149 java files with total size less than 1 MB.
I ran similar benchmark on Linux with random generated files 128B .. 128KB, 16 files for each size (total size ~4 MB).

Benchmark                  Mode    Cnt    Score      Error  Units
PerfTest.XX                sample  3232    2.783 ±   0.016  ms/op
PerfTest.XX:XX·p0.00       sample          2.49             ms/op
PerfTest.XX:XX·p0.50       sample          2.793            ms/op
PerfTest.XX:XX·p0.90       sample          2.99             ms/op
PerfTest.XX:XX·p0.95       sample          3.127            ms/op
PerfTest.XX:XX·p0.99       sample          3.502            ms/op
PerfTest.XX:XX·p0.999      sample          5.278            ms/op
PerfTest.XX:XX·p0.9999     sample          5.833            ms/op
PerfTest.XX:XX·p1.00       sample          5.833            ms/op

Benchmark                  Mode    Cnt    Score      Error  Units
PerfTest.XXMM              sample  2701    3.332 ±   0.014  ms/op
PerfTest.XXMM:XXMM·p0.00   sample          3.211            ms/op
PerfTest.XXMM:XXMM·p0.50   sample          3.297            ms/op
PerfTest.XXMM:XXMM·p0.90   sample          3.457            ms/op
PerfTest.XXMM:XXMM·p0.95   sample          3.502            ms/op
PerfTest.XXMM:XXMM·p0.99   sample          3.715            ms/op
PerfTest.XXMM:XXMM·p0.999  sample          6.555            ms/op
PerfTest.XXMM:XXMM·p0.9999 sample         11.682            ms/op
PerfTest.XXMM:XXMM·p1.00   sample         11.682            ms/op

In this benchmark XX with Memory Mapped file was 6 - 29% slower

With random generated files 128KB .. 128MB, 16 files for each size (total size ~4 GB, which is more realistic suite in our case)

Benchmark                  Mode    Cnt    Score     Error     Units
PerfTest.XX                sample  3      3836.39 ± 309.251   ms/op
PerfTest.XX:XX·p0.00       sample         3821.011            ms/op
PerfTest.XX:XX·p0.50       sample         3833.594            ms/op
PerfTest.XX:XX·p0.90       sample         3854.565            ms/op
PerfTest.XX:XX·p0.95       sample         3854.565            ms/op
PerfTest.XX:XX·p0.99       sample         3854.565            ms/op
PerfTest.XX:XX·p0.999      sample         3854.565            ms/op
PerfTest.XX:XX·p0.9999     sample         3854.565            ms/op
PerfTest.XX:XX·p1.00       sample         3854.565            ms/op

Benchmark                  Mode    Cnt    Score      Error    Units
PerfTest.XXMM              sample  7      1523.132 ± 233.943  ms/op
PerfTest.XXMM:XXMM·p0.00   sample         1413.48             ms/op
PerfTest.XXMM:XXMM·p0.50   sample         1533.018            ms/op
PerfTest.XXMM:XXMM·p0.90   sample         1663.042            ms/op
PerfTest.XXMM:XXMM·p0.95   sample         1663.042            ms/op
PerfTest.XXMM:XXMM·p0.99   sample         1663.042            ms/op
PerfTest.XXMM:XXMM·p0.999  sample         1663.042            ms/op
PerfTest.XXMM:XXMM·p0.9999 sample         1663.042            ms/op
PerfTest.XXMM:XXMM·p1.00   sample         1663.042            ms/op

In this benchmark XX with Memory Mapped file was 132 - 170% faster

You can adjust benchmark according to your load, but main problem is that test run takes a lot of time

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public static class HashState {
    private static final String USER_DIRECTORY = System.getProperty("user.dir");
    private static List<Path> paths;

    @Setup(Level.Iteration)
    public void setup() {
	paths = new ArrayList<>();

        startFileSize = Integer.parseInt(System.getProperty("from", "128"));
        endFileSize = Integer.parseInt(System.getProperty("to", "131072"));

        for (int fileSize = startFileSize; fileSize <= endFileSize; fileSize *= 2) {
            for (int i = 0; i < 16; i++) {
                String fileName = fileSize + "B_" + UUID.randomUUID() + ".bin";
                Path filePath = Paths.get(USER_DIRECTORY, "target", fileName);

                generateRandomBinaryFile(filePath, fileSize);
                paths.add(filePath);
            }
        }
    }

    private static void generateRandomBinaryFile(Path path, int fileSize) {
        SecureRandom random = new SecureRandom();
        byte[] buffer = new byte[fileSize];
        random.nextBytes(buffer);

        try {
            Files.createDirectories(path.getParent());
            Files.write(path, buffer);
            System.out.println("Random binary file generated successfully: " + path);
        } catch (IOException e) {
            System.err.println("Error while generating random binary file: " + e.getMessage());
        }
    }

    @TearDown(Level.Iteration)
    public void deleteAllPaths() {
        for (Path path : paths) {
            try {
                Files.delete(path);
                System.out.println("Deleted file: " + path);
            } catch (IOException e) {
                System.err.println("Error while deleting file: " + path + ", " + e.getMessage());
            }
        }
    }
}

code generated by GPT-4 ©

gnodet · 2023-04-11T13:33:02Z

It would be nice to provide some guidelines for choosing an algorithm. The fact they are missing is the original driver for this PR...

qweek · 2023-04-11T14:04:25Z

XX algorithm is enough for most cases.
MM optimized version may be useful for big projects with lots of module, files and dependencies.
SHA is only needed if you stick to the classic time-tested and NSA verified algorithms (it always loses in speed)

olamy · 2023-04-27T06:29:11Z

maybe would be good to add a comment regarding usage of *-MM implementations which need some extra setup with jdk17+
In top directory create a file .mvn/jvm.config with a line --add-opens java.base/sun.nio.ch=ALL-UNNAMED

olamy · 2023-05-08T22:47:40Z

@gnodet happy to merge it?

src/main/java/org/apache/maven/buildcache/hash/Zah.java

gnodet · 2023-06-13T10:14:11Z

@olamy

maybe would be good to add a comment regarding usage of *-MM implementations which need some extra setup with jdk17+
In top directory create a file .mvn/jvm.config with a line --add-opens java.base/sun.nio.ch=ALL-UNNAMED

Do you recall why you indicate some requirements on jdk17+ ? The tests seem to run fine without this flag, so I'm not sure it's actually required.
On a side note, I just noticed the build is broken on jdk >= 20, not sure why... but that seems completely unrelated to this issue.

olamy · 2023-06-13T10:31:47Z

@olamy

maybe would be good to add a comment regarding usage of *-MM implementations which need some extra setup with jdk17+
In top directory create a file .mvn/jvm.config with a line --add-opens java.base/sun.nio.ch=ALL-UNNAMED

Do you recall why you indicate some requirements on jdk17+ ? The tests seem to run fine without this flag, so I'm not sure it's actually required. On a side note, I just noticed the build is broken on jdk >= 20, not sure why... but that seems completely unrelated to this issue.

it's not requirement on jdk17. I'm just saying with 1.0.1 using <hashAlgorithm>XXMM</hashAlgorithm> and jdk 17 without this flag give me this error:

Exception in thread "main" java.lang.IllegalAccessError: class net.openhft.hashing.LongHashFunction (in unnamed module @0x3e3861d7) cannot access class sun.nio.ch.DirectBuffer (in module java.base) because module java.base does not export sun.nio.ch to unnamed module @0x3e3861d7
	at net.openhft.hashing.LongHashFunction.hashByteBuffer(LongHashFunction.java:536)
	at net.openhft.hashing.LongHashFunction.hashBytes(LongHashFunction.java:530)
	at org.apache.maven.buildcache.hash.XX$Checksum.digest(XX.java:83)
	at org.apache.maven.buildcache.hash.HashChecksum.digest(HashChecksum.java:58)

olamy@pop-os:~/dev/sources/jetty/jetty.project$ mvn -v
Apache Maven 3.9.2 (c9616018c7a021c1c39be70fb2843d6f5f9b8a1c)
Maven home: /home/olamy/.sdkman/candidates/maven/current
Java version: 17.0.7, vendor: Eclipse Adoptium, runtime: /home/olamy/.sdkman/candidates/java/17.0.7-tem
Default locale: en_AU, platform encoding: UTF-8
OS name: "linux", version: "6.2.6-76060206-generic", arch: "amd64", family: "unix"

I'm using https://github.com/eclipse/jetty.project.git branch jetty-12.0.x-build-cache just remove the last entry from .mvn/jvm.config

gnodet force-pushed the metro-hash branch from f2542d1 to 3b68b16 Compare April 5, 2023 09:55

AlexanderAshitkin approved these changes Apr 26, 2023

View reviewed changes

olamy added the enhancement New feature or request label Apr 27, 2023

olamy approved these changes May 8, 2023

View reviewed changes

gnodet marked this pull request as draft June 12, 2023 13:24

Add METRO hash implementation, fix the web site

7f54361

gnodet force-pushed the metro-hash branch from 3b68b16 to 7f54361 Compare June 12, 2023 13:35

bmarwell requested changes Jun 12, 2023

View reviewed changes

src/main/java/org/apache/maven/buildcache/hash/Zah.java Outdated Show resolved Hide resolved

Use an enum instead of a boolean to leave more room in the future

62396c2

gnodet marked this pull request as ready for review June 12, 2023 18:30

gnodet requested a review from bmarwell June 12, 2023 18:30

bmarwell approved these changes Jun 13, 2023

View reviewed changes

Add info about using MM algo on JDK 17+

fa5dcf6

gnodet changed the title ~~Add METRO hash implementation, change the default, fix the web site~~ Add METRO hash implementation, fix the doc Jun 14, 2023

gnodet changed the title ~~Add METRO hash implementation, fix the doc~~ [MCLEAN-109] Add METRO hash implementation Jun 14, 2023

gnodet merged commit 9a842bd into apache:master Jun 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MCLEAN-109] Add METRO hash implementation #58

[MCLEAN-109] Add METRO hash implementation #58

gnodet commented Apr 5, 2023 •

edited

Loading

qweek commented Apr 10, 2023 •

edited

Loading

qweek commented Apr 11, 2023 •

edited

Loading

gnodet commented Apr 11, 2023

qweek commented Apr 11, 2023

olamy commented Apr 27, 2023

olamy commented May 8, 2023

gnodet commented Jun 13, 2023

olamy commented Jun 13, 2023

[MCLEAN-109] Add METRO hash implementation #58

[MCLEAN-109] Add METRO hash implementation #58

Conversation

gnodet commented Apr 5, 2023 • edited Loading

qweek commented Apr 10, 2023 • edited Loading

qweek commented Apr 11, 2023 • edited Loading

gnodet commented Apr 11, 2023

qweek commented Apr 11, 2023

olamy commented Apr 27, 2023

olamy commented May 8, 2023

gnodet commented Jun 13, 2023

olamy commented Jun 13, 2023

gnodet commented Apr 5, 2023 •

edited

Loading

qweek commented Apr 10, 2023 •

edited

Loading

qweek commented Apr 11, 2023 •

edited

Loading