Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

membuffer: compare the keys of ART by chunk #1482

Merged
merged 5 commits into from
Nov 7, 2024

Conversation

you06
Copy link
Contributor

@you06 you06 commented Oct 28, 2024

ref pingcap/tidb#55287

In order to perform path compression in ART, we need to find the longest common prefix the keys in many places, when the common prefix is long, this can be slow.

This PR compares the keys by chunk. Because a chunk is a uint64 which can contains 8 bytes, this can improve the performance a lot when the common prefix is long.

  • Long Common Prefix
# before
BenchmarkMemBufferSetGetLongKey
BenchmarkMemBufferSetGetLongKey/ART
BenchmarkMemBufferSetGetLongKey/ART-32           1000000               874.3 ns/op          2149 B/op          0 allocs/op

# this PR
BenchmarkMemBufferSetGetLongKey
BenchmarkMemBufferSetGetLongKey/ART
BenchmarkMemBufferSetGetLongKey/ART-32           1000000               524.5 ns/op          2149 B/op          0 allocs/op
  • Common Cases
# before
BenchmarkPut
BenchmarkPut/ART
BenchmarkPut/ART-32              1000000                91.70 ns/op          335 B/op          0 allocs/op
BenchmarkPutRandom
BenchmarkPutRandom/ART
BenchmarkPutRandom/ART-32                1000000               182.1 ns/op           404 B/op          0 allocs/op
BenchmarkGet
BenchmarkGet/ART
BenchmarkGet/ART-32                      1000000                31.31 ns/op            0 B/op          0 allocs/op
BenchmarkGetRandom
BenchmarkGetRandom/ART
BenchmarkGetRandom/ART-32                1000000               142.7 ns/op    

# this PR
BenchmarkPut
BenchmarkPut/ART
BenchmarkPut/ART-32              1000000                89.18 ns/op          335 B/op          0 allocs/op
BenchmarkPutRandom
BenchmarkPutRandom/ART
BenchmarkPutRandom/ART-32                1000000               175.8 ns/op           404 B/op          0 allocs/op
BenchmarkGet
BenchmarkGet/ART
BenchmarkGet/ART-32                      1000000                31.74 ns/op            0 B/op          0 allocs/op
BenchmarkGetRandom
BenchmarkGetRandom/ART
BenchmarkGetRandom/ART-32                1000000               132.5 ns/op   

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 28, 2024
ekexium
ekexium previously approved these changes Oct 29, 2024
Copy link
Contributor

@ekexium ekexium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor grammar tweaks would make these comments better. May let LLM do this. I commented on the wrong PR 😆

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Oct 29, 2024
@ekexium ekexium dismissed their stale review October 29, 2024 04:36

Approval by mistake. I haven't reviewed this one

@ti-chi-bot ti-chi-bot bot removed the approved label Oct 29, 2024
Copy link
Contributor

@ekexium ekexium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to compare it with something like this?

    // Use unsafe for faster byte comparison
    p1 := unsafe.Pointer(&l1Key[depth])
    p2 := unsafe.Pointer(&l2Key[depth])
    
    // Compare 8 bytes at a time
    remaining := minLen - depth
    for remaining >= 8 {
        if *(*uint64)(p1) != *(*uint64)(p2) {
            // Find first different byte using trailing zeros
            xor := *(*uint64)(p1) ^ *(*uint64)(p2)
            return depth + uint32(bits.TrailingZeros64(xor) >> 3)
        }
        p1 = unsafe.Pointer(uintptr(p1) + 8)
        p2 = unsafe.Pointer(uintptr(p2) + 8)
        depth += 8
        remaining -= 8
}

internal/unionstore/art/art_node.go Outdated Show resolved Hide resolved
internal/unionstore/art/art_node.go Outdated Show resolved Hide resolved
@you06
Copy link
Contributor Author

you06 commented Oct 30, 2024

Would you like to compare it with something like this?

The unsafe implementation.

  • Long Common Prefix

That's about +15% speed.

BenchmarkMemBufferSetGetLongKey
BenchmarkMemBufferSetGetLongKey/ART
BenchmarkMemBufferSetGetLongKey/ART-32           1000000               458.1 ns/op          2149 B/op          0 allocs/op

The common cases aren't that stable on my PC, and even the master code runs much slower than when the result in the description now (there's not much difference anyway).

The chunk code runs faster primarily due to better CPU pipeline utilization and more accurate prediction through chunk comparison.

The unsafe implementation runs faster, the only concern is it seems to only work on little-endian machines, and I'm not sure if we need to worry about big-endian architectures...

@ekexium
Copy link
Contributor

ekexium commented Oct 30, 2024

The unsafe implementation runs faster, the only concern is it seems to only work on little-endian machines, and I'm not sure if we need to worry about big-endian architectures...

We can check internal/goarch, and fallback to other methods for big-endian architectures.

@you06
Copy link
Contributor Author

you06 commented Oct 30, 2024

We can check internal/goarch, and fallback to other methods for big-endian architectures.

That's an internal package, it'll get compile error like.

package github.com/tikv/client-go/v2/internal/unionstore
        imports github.com/tikv/client-go/v2/internal/unionstore/art
        internal/unionstore/art/art_node.go:20:2: use of internal package internal/goarch not allowed

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 30, 2024
return 0
}

p1 := unsafe.Pointer(&l1Key[depth])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On some architectures (e.g., older ARM architectures), unaligned memory access is significantly slower than aligned access.

Skip considering it is generally safe and keeps code simpler.

Copy link
Contributor

@ekexium ekexium Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is said that unaligned access may even crash. Shall we consider an alignment check like
isAligned := (uintptr(p1) & 7) == 0 && (uintptr(p2) & 7) == 0?

Can we add a GOARCH check and only do this optimization for selected architectures?

We can use runtime.GOARCH directly. @you06

@ti-chi-bot ti-chi-bot bot added lgtm approved and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Oct 31, 2024
@cfzjywxk cfzjywxk requested a review from ekexium October 31, 2024 06:42
@ti-chi-bot ti-chi-bot bot removed the lgtm label Nov 1, 2024
Signed-off-by: you06 <you1474600@gmail.com>

compare byte slices in chunk

Signed-off-by: you06 <you1474600@gmail.com>

remove unrelated test

Signed-off-by: you06 <you1474600@gmail.com>

compare by unsafe casting

Signed-off-by: you06 <you1474600@gmail.com>

remove unused chunk cast func

Signed-off-by: you06 <you1474600@gmail.com>

only compare by chunk for amd64 and arm64

Signed-off-by: you06 <you1474600@gmail.com>
@you06 you06 force-pushed the art/chunk-compare branch from eecef6b to 30733e7 Compare November 1, 2024 07:14
Signed-off-by: you06 <you1474600@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 1, 2024
@@ -355,10 +349,49 @@ func (an *artNode) asNode256(a *artAllocator) *node256 {
// longestCommonPrefix returns the length of the longest common prefix of two keys.
// the LCP is calculated from the given depth, you need to guarantee l1Key[:depth] equals to l2Key[:depth] before calling this function.
func longestCommonPrefix(l1Key, l2Key artKey, depth uint32) uint32 {
switch runtime.GOARCH {
Copy link
Contributor

@ekexium ekexium Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can introduce a global boolean and do the comparison only once

Signed-off-by: you06 <you1474600@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 1, 2024
@you06
Copy link
Contributor Author

you06 commented Nov 1, 2024

/hold

The client-go's test seems ok, but the memory is violated when running with tidb.

fatal error: checkptr: pointer arithmetic result points to invalid allocation

goroutine 21334 gp=0xc002051a40 m=12 mp=0xc00088aa08 [running]:
runtime.throw({0x97a1a64?, 0x216ae1d?})
	GOROOT/src/runtime/panic.go:1067 +0x48 fp=0xc004e40608 sp=0xc004e405d8 pc=0x21dc848
runtime.checkptrArithmetic(0xc005f7bc0f?, {0xc004e406a8, 0x1, 0x219e3e9?})
	GOROOT/src/runtime/checkptr.go:69 +0x9c fp=0xc004e40638 sp=0xc004e40608 pc=0x216afbc
github.com/tikv/client-go/v2/internal/unionstore/art.longestCommonPrefixByChunk({0xc00603a9d8, 0x8, 0x18?}, {0xc004f97138, 0x9, 0x21e7bd9?}, 0x0)
	external/com_github_tikv_client_go_v2/internal/unionstore/art/art_node.go:388 +0xfc fp=0xc004e406c0 sp=0xc004e40638 pc=0x3db30dc
github.com/tikv/client-go/v2/internal/unionstore/art.longestCommonPrefix({0xc00603a9d8, 0x8, 0x8}, {0xc004f97138, 0x9, 0x14}, 0x0)
	external/com_github_tikv_client_go_v2/internal/unionstore/art/art_node.go:355 +0x69 fp=0xc004e40710 sp=0xc004e406c0 pc=0x3db2f69
github.com/tikv/client-go/v2/internal/unionstore/art.(*nodeBase).match(...)
	external/com_github_tikv_client_go_v2/internal/unionstore/art/art_node.go:314

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 1, 2024
@cfzjywxk
Copy link
Contributor

cfzjywxk commented Nov 1, 2024

/hold

The client-go's test seems ok, but the memory is violated when running with tidb.

fatal error: checkptr: pointer arithmetic result points to invalid allocation

goroutine 21334 gp=0xc002051a40 m=12 mp=0xc00088aa08 [running]:
runtime.throw({0x97a1a64?, 0x216ae1d?})
	GOROOT/src/runtime/panic.go:1067 +0x48 fp=0xc004e40608 sp=0xc004e405d8 pc=0x21dc848
runtime.checkptrArithmetic(0xc005f7bc0f?, {0xc004e406a8, 0x1, 0x219e3e9?})
	GOROOT/src/runtime/checkptr.go:69 +0x9c fp=0xc004e40638 sp=0xc004e40608 pc=0x216afbc
github.com/tikv/client-go/v2/internal/unionstore/art.longestCommonPrefixByChunk({0xc00603a9d8, 0x8, 0x18?}, {0xc004f97138, 0x9, 0x21e7bd9?}, 0x0)
	external/com_github_tikv_client_go_v2/internal/unionstore/art/art_node.go:388 +0xfc fp=0xc004e406c0 sp=0xc004e40638 pc=0x3db30dc
github.com/tikv/client-go/v2/internal/unionstore/art.longestCommonPrefix({0xc00603a9d8, 0x8, 0x8}, {0xc004f97138, 0x9, 0x14}, 0x0)
	external/com_github_tikv_client_go_v2/internal/unionstore/art/art_node.go:355 +0x69 fp=0xc004e40710 sp=0xc004e406c0 pc=0x3db2f69
github.com/tikv/client-go/v2/internal/unionstore/art.(*nodeBase).match(...)
	external/com_github_tikv_client_go_v2/internal/unionstore/art/art_node.go:314

@you06 Is this a new bug introduced in #1477?

you06 added 2 commits November 5, 2024 13:02
Signed-off-by: you06 <you1474600@gmail.com>
Signed-off-by: you06 <you1474600@gmail.com>
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Nov 6, 2024
@you06
Copy link
Contributor Author

you06 commented Nov 7, 2024

@you06 Is this a new bug introduced in #1477?

#1477 introduced a bug in the baseIter.next.

The error arithmetic checkptr: pointer arithmetic result points to invalid allocation is caused by unsafe.Pointer(uintptr(p1) + 8), we should use unsafe.Add instead.

The CI in TiDB can pass after fixing both bugs.

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 7, 2024
Copy link

ti-chi-bot bot commented Nov 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

ti-chi-bot bot commented Nov 7, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-10-29 04:29:28.293159262 +0000 UTC m=+324081.132314807: ☑️ agreed by ekexium.
  • 2024-10-31 06:25:53.999172374 +0000 UTC m=+503866.838327915: ☑️ agreed by cfzjywxk.
  • 2024-11-01 06:35:21.329078835 +0000 UTC m=+590834.168234380: ✖️🔁 reset by you06.
  • 2024-11-06 08:04:46.324380548 +0000 UTC m=+1028199.163536093: ☑️ agreed by ekexium.
  • 2024-11-07 08:41:01.626543941 +0000 UTC m=+1116774.465699487: ☑️ agreed by cfzjywxk.

@cfzjywxk cfzjywxk removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024
@ti-chi-bot ti-chi-bot bot merged commit c154447 into tikv:master Nov 7, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants