[RFC] cache secret array in assembly code for aarch64 SVE #693

hzhuang1 · 2022-03-14T12:33:49Z

@Cyan4973 @easyaspi314
I try to cache a whole secret array in assembly code. And the performance is better. The maximum performance data could read 18xxx. How do you think about this idea? (https://github.com/hzhuang1/xxHash/tree/dirty_v0.2.1)
Of course, current code is still ugly. Let's discuss the idea first.

=== benchmarking 4 hash functions ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3 , 3516, 6130, 9465, 12771, 15576, 17283, 18393, 14563, 13763, 13596, 13689, 13672, 13775, 12043, 5684, 4912, 4834, 4923, 4874
XXH32 , 1338, 1436, 1486, 1521, 1534, 1542, 1546, 1536, 1505, 1507, 1507, 1507, 1508, 1458, 1264, 1206, 1213, 1212, 1208
XXH64 , 2509, 2799, 2973, 3066, 3122, 3102, 3154, 3136, 3057, 3063, 3066, 3066, 3064, 2899, 2197, 2005, 2009, 2007, 1993
XXH128 , 3392, 5864, 9421, 12495, 15354, 17187, 18306, 14565, 13733, 13822, 14105, 14172, 14208, 12229, 5779, 5016, 4898, 4898, 4851
Throughput small inputs of fixed size (from 2 to 2 bytes):
xxh3 , 72821732

easyaspi314 · 2022-03-14T19:22:07Z

Ok I'm having a little trouble following the assembly code (I'll need some time to digest it 😅), but by caching do you mean something like a ring buffer?

void XXH3_accumulate(...)
{
    u64 secret_cache[8];
    for (int i = 0; i < 8; i++) {
        secret_cache[i] = XXH_read64(secret);
        secret += XXH_SECRET_CONSUME_RATE;
    }
    int cache_idx = 0;
    for (int i = 0; i < nbStripes; i++) {
        for (int j = 0; j < 8; j++) {
            XXH_accumulate_512_round(..., secret_cache[(cache_idx + j) % 8];
        }
        secret_cache[cache_idx] = XXH_read64(secret);
        secret += XXH_SECRET_CONSUME_RATE;
        cache_idx = (cache_idx + 1) % 8;
        ...
    }
}

Edit I see it now.

hzhuang1 · 2022-03-15T00:22:25Z

Ok I'm having a little trouble following the assembly code (I'll need some time to digest it sweat_smile), but by caching do you mean something like a ring buffer?

No, it's not a ring buffer. There're 31 Z registers in SVE. So use some to them to store the value of secret array.
The logic is that secret array is shared among accumulating and scrambling. SVE costs a lot on accessing I/O. If I could make accessing I/O only once, it could improve performance. And it's proved by the performance result.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] cache secret array in assembly code for aarch64 SVE #693

[RFC] cache secret array in assembly code for aarch64 SVE #693

hzhuang1 commented Mar 14, 2022 •

edited

Loading

easyaspi314 commented Mar 14, 2022 •

edited

Loading

hzhuang1 commented Mar 15, 2022

[RFC] cache secret array in assembly code for aarch64 SVE #693

[RFC] cache secret array in assembly code for aarch64 SVE #693

Comments

hzhuang1 commented Mar 14, 2022 • edited Loading

easyaspi314 commented Mar 14, 2022 • edited Loading

hzhuang1 commented Mar 15, 2022

hzhuang1 commented Mar 14, 2022 •

edited

Loading

easyaspi314 commented Mar 14, 2022 •

edited

Loading