
Performance tests for hash functions #3918

Merged (2 commits into ClickHouse:master on Dec 25, 2018)

Conversation

@filimonov (Contributor) commented Dec 24, 2018

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is a non-significant change.

Category (leave one):

  • Build/Testing/Packaging Improvement

Short description (up to a few sentences):

Performance tests for hash functions (#3905).

Detailed description (optional):

Summary in table form (tested on an i5-2400 CPU @ 3.10 GHz workstation with 8 GB RAM).

javaHash and hiveHash have very poor quality (lots of collisions even on a small dataset).

| hash | year published | empty1 | empty4 | 10char1 | 10char4 | 1024char1 | 1024char4 |
|---|---|---|---|---|---|---|---|
| cityHash64 | 2011 | 801.9 | 2,219.4 | 328.7 | 968.8 | 16.1 | 37.1 |
| farmHash64 | 2014 | 814.3 | 1,928.3 | 318.1 | 923.1 | 19.1 | 39.0 |
| metroHash64 | 2015 | 658.6 | 1,883.4 | 293.7 | 939.0 | 18.3 | 38.9 |
| murmurHash2_32 | 2008 | 920.2 | 2,438.1 | 297.5 | 877.4 | 15.5 | 33.9 |
| murmurHash2_64 | 2008 | 798.5 | 1,707.0 | 293.9 | 955.5 | 15.9 | 38.2 |
| murmurHash3_32 | 2012 | 739.0 | 2,033.2 | 252.9 | 784.6 | 13.7 | 32.8 |
| murmurHash3_64 | 2012 | 448.2 | 1,612.5 | 230.6 | 716.5 | 16.3 | 37.2 |
| murmurHash3_128 | 2012 | 537.3 | 1,316.9 | 237.0 | 747.5 | 16.1 | 37.4 |
| javaHash | ? | 1,326.7 | 2,174.9 | 274.2 | 844.9 | 5.2 | 20.6 |
| hiveHash | ? | 1,305.0 | 2,790.2 | 276.2 | 860.1 | 5.2 | 20.8 |
| xxHash32 | 2012 | 807.2 | 1,800.8 | 273.9 | 865.5 | 16.2 | 33.8 |
| xxHash64 | 2012 | 711.5 | 1,998.0 | 263.8 | 859.8 | 18.7 | 39.9 |

Legend:

- numbers are MB/s (higher is better)
- empty1 - empty string, one thread (hash init/finalize cost)
- empty4 - empty string, 4 threads (hash init/finalize cost)
- 10char1 - 10-character (numeric) string, one thread
- 10char4 - 10-character (numeric) string, 4 threads
- 1024char1 - 1024-character string ("Lorem ipsum..."), one thread
- 1024char4 - 1024-character string ("Lorem ipsum..."), 4 threads
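For reference, a minimal sketch of the kind of query that measures one cell of this table (an illustration only, not the exact test added in this PR; the constant string, row count, and max_threads value are arbitrary):

```sql
-- Throughput of cityHash64 over a constant 10-character string, single thread.
-- materialize() turns the constant into a real column so the hash runs per row,
-- and ignore() discards the result so only the hashing work is measured.
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(cityHash64(materialize('0123456789')))
SETTINGS max_threads = 1;
```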

@blinkov (Contributor) commented Dec 24, 2018

This table could be published in the docs, but at the moment it lacks at least a description of the hardware used.

@filimonov filimonov closed this Dec 24, 2018
@filimonov filimonov reopened this Dec 24, 2018
@filimonov (Contributor, Author) commented Dec 24, 2018

> This table could be published in the docs, but at the moment it lacks at least a description of the hardware used.

I was checking it on a quite old workstation. Maybe it would be better to check it on a better CPU, and also with some extra metrics.

Also, the results are quite confusing, because I was expecting to get GB/sec even for long strings. See:
https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests/
https://github.com/Cyan4973/xxHash#benchmarks

Maybe I have an issue with compilation flags (I was experimenting a bit).

@alexey-milovidov (Member) commented Dec 24, 2018

Testing with a set of strings of constant length is not representative, because it makes the branch predictor too happy. It is better to test on a real dataset with a varied distribution of string lengths.

The numbers in MB/sec look like MB of source data per second, where the source data is something like the system.numbers table. That means this is not MB/sec of hashed data; divided by 8, it is MHashes/sec.

> expecting to get Gb/sec even for long strings

For long (kilobyte) strings, you can easily get tens of gigabytes per second on a single CPU core.
Example: https://github.com/Bulat-Ziganshin/FARSH
But testing on long strings is totally irrelevant for ClickHouse.

You can find a test of hash functions for hash tables with string keys inside our repository.
(It was evaluated on multiple representative datasets; the winner is ClickHouse's own hash function, which is intentionally not exported as an SQL function to avoid the possibility of hash interference.)
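To make the unit conversion concrete, a minimal sketch (an illustration only, not the test added in this PR; the row count and hash function are arbitrary) of a benchmark query whose reported throughput counts the 8-byte source rows rather than the hashed bytes, so dividing the MB/sec figure by 8 gives MHashes/sec:

```sql
-- Each source row is an 8-byte UInt64 from numbers(); the string fed to the hash
-- is derived per row, so e.g. 800 MB/sec of source data is roughly 100 MHashes/sec.
-- ignore() discards the hash so only the hashing work is measured.
SELECT count()
FROM numbers(100000000)
WHERE NOT ignore(cityHash64(toString(number)));
```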

@alexey-milovidov alexey-milovidov merged commit 3687980 into ClickHouse:master Dec 25, 2018
@filimonov (Contributor, Author) commented Dec 31, 2018

> The numbers in MB/sec look like MB of source data per second, where the source data is something like the system.numbers table. That means this is not MB/sec of hashed data; divided by 8, it is MHashes/sec.

Yep, looks like that is the reason for those weird results.

> Testing with a set of strings of constant length is not representative, because it makes the branch predictor too happy. It is better to test on a real dataset with a varied distribution of string lengths.

Yes, it gives a quite rough estimate of the speed. But to make reproducible performance tests possible, some 'predefined' dataset should be published.

@filimonov (Contributor, Author) commented Jan 2, 2019

https://gist.githubusercontent.com/alexey-milovidov/811ce0a62cc142227e4910e525c06116/raw/dfa39ccdf9f475ccc8abd6395de959f978cbb7b2/datasets_example.txt

I can make benchmarks or rewrite those tests to use those tables. Should I?

BTW, some synthetic data can also be used; that is useful for creating data with the needed properties.
I was playing with something like this:

```sql
CREATE TABLE ascii_random_data1
ENGINE = MergeTree
PARTITION BY tuple()
ORDER BY tuple() AS
WITH
    arrayStringConcat(
        arrayMap(x -> reinterpretAsString(toUInt8(rand(x) % 96 + 0x20)), range(1024))
    ) AS str1024,
    substring(str1024, 1, 512 + rand() % 512) AS str
SELECT str FROM numbers(5000000);
```
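For context, a possible way to feed that synthetic table into a hash benchmark (a sketch only; the choice of farmHash64 and the bare count() wrapper are illustrative):

```sql
-- Hash the variable-length synthetic strings; ignore() throws away the result
-- so the query mostly measures the hashing itself.
SELECT count()
FROM ascii_random_data1
WHERE NOT ignore(farmHash64(str));
```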

It looks like farmHash64 gives the best (or almost the best) performance in most of the checked cases.
In single-thread mode, xxHash64 is as good as farmHash64 for long strings (or a bit better), but about 40% slower for short strings (<8 chars).
In multi-thread mode, xxHash64 is 5-20% behind the winner (usually also farmHash64).

P.S. There is also https://github.com/rurban/smhasher

@alexey-milovidov (Member) commented Jan 2, 2019

> I can make benchmarks or rewrite those tests to use those tables. Should I?

This is worth doing. Tests with the
MobilePhoneModel, PageCharset, Params, URLDomain, UTMSource, Referer, URL, Title
columns should be representative.

This dataset will soon become our official dataset for benchmarks.
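A hedged sketch of what such a column-based test could look like (assuming the dataset from the gist above is loaded into a table named hits; the table name and the specific columns and hash functions here are placeholders, not something defined in this PR):

```sql
-- Hash real columns with a natural distribution of string lengths.
SELECT count() FROM hits WHERE NOT ignore(farmHash64(URL));
SELECT count() FROM hits WHERE NOT ignore(xxHash64(Title));
```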

BTW, we also have an isolated test of hash function performance in hash tables: https://github.com/yandex/ClickHouse/blob/master/dbms/src/Interpreters/tests/hash_map_string_3.cpp
(This is not the same as just raw performance, because it depends on some combination of performance and quality.)

We also have a presentation about hash functions in ClickHouse:
https://www.youtube.com/watch?v=EoX82TEz2sQ
(It also covers quality evaluation.)

@alexey-milovidov (Member) commented Jan 2, 2019

> It looks like farmHash64 gives the best (or almost the best) performance in most of the checked cases.

This is quite interesting, because I was not sure that we have enabled CPU dispatching for FarmHash in our build. BTW, FarmHash is not meant to be stable across different CPUs (the user should not save the result anywhere or use the hash as a key; that's why we don't document it, to avoid misuse). There is a variant of FarmHash named "FingerprintHash" that is portable.

PS. If you want to go further, you can also consider adding HighwayHash https://github.com/google/highwayhash as an alternative to SipHash.

@halayli commented Feb 8, 2019

I would also consider the t1ha hash family. t1ha_aes performs particularly well on CPUs with AES support.

https://github.com/leo-yuriev/t1ha
