
Simplify IDGenerator and reduce collision rate #223

Merged

Conversation

@dkropachev (Collaborator) commented Aug 1, 2024:

Since offset is always truncated to s.numBuckets, atomic.Add could be used instead of atomic.Load + atomic.Swap

  }
- offset = (offset + 1) % s.numBuckets
+ // Reduce collision by offsetting the starting point
+ offset := atomic.AddUint32(&s.offset, s.numBuckets/10+1)
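For illustration, a minimal sketch of the two patterns; the streams struct and method names are hypothetical stand-ins for the generator state, and the pre-change code is assumed to look roughly like the Load + Swap mentioned above:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// streams is a hypothetical stand-in for the ID generator state.
type streams struct {
	offset     uint32
	numBuckets uint32
}

// nextOffsetOld mirrors the assumed pre-change pattern: two atomic
// operations with a window in between where another goroutine can
// observe the same offset.
func (s *streams) nextOffsetOld() uint32 {
	offset := atomic.LoadUint32(&s.offset)
	atomic.SwapUint32(&s.offset, (offset+1)%s.numBuckets)
	return offset
}

// nextOffsetNew uses a single atomic.Add; because every consumer
// truncates the offset modulo numBuckets, the raw counter is free
// to grow past numBuckets without harm.
func (s *streams) nextOffsetNew() uint32 {
	return atomic.AddUint32(&s.offset, 1) % s.numBuckets
}

func main() {
	s := &streams{numBuckets: 8}
	for i := 0; i < 10; i++ {
		fmt.Print(s.nextOffsetNew(), " ")
	}
	fmt.Println() // 1 2 3 4 5 6 7 0 1 2
}
```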
Collaborator:

why s.numBuckets/10?

@dkropachev (Collaborator, Author) replied Aug 1, 2024:

The divisor should be big enough to provide some cycles before reaching numBuckets, and small enough to leave a significant gap between invocations, to ensure that the collision chance is actually reduced.
The resulting number should also divide numBuckets with a remainder, to ensure it does not land on the same indexes all the time.

Let me put this into a comment. I also think we need more logic on how to pick the correct offset.
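To make the "divides with a remainder" requirement concrete: stepping by step modulo numBuckets visits every bucket exactly once per cycle if and only if gcd(step, numBuckets) == 1, which is a stronger condition than step merely not dividing numBuckets evenly. A quick illustrative check of the s.numBuckets/10+1 formula from the diff (not part of the PR):

```go
package main

import "fmt"

// gcd returns the greatest common divisor via the Euclidean algorithm.
func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

func main() {
	for _, buckets := range []int{32, 100, 128, 130} {
		step := buckets/10 + 1 // the formula used in the diff above
		fmt.Printf("buckets=%d step=%d coversAllBuckets=%v\n",
			buckets, step, gcd(step, buckets) == 1)
	}
}
```

For 32 buckets, for example, the formula yields a step of 4, which visits only a quarter of the buckets before cycling back to its start.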

@dkropachev (Collaborator, Author):

Please take a look

@Lorak-mmk left a comment:

The change with atomic.AddUint32 seems fine and will be an improvement.

I have some questions about offset change:

  • Why do this? Have we identified it as a problem somewhere? Given that each subsequent request will be given the next bucket, the problem you described could only happen when all the buckets are nearly full - and in such conditions I doubt that stream assignment will be the bottleneck.
  • If we do have some benchmark / test case that showcases the problem, did you verify that your patch fixes the problem using such test case? I don't see any such information in the PR description.
  • I'm asking those questions because this change makes the code more complicated - so we should have a good reason for it.
  • I think the implementation isn't correct for all bucket counts. By correct I mean that all buckets will be used. I checked that with the following Python program (by the way, such verification should be a test case in this PR - see the Go test sketch after the program):
Verificator
def findOffsetStep(buckets):
    match buckets:
        case 1:
            return 1
        case 2:
            return 1
        case 3:
            return 2
    
    if buckets < 8:
        offsetStep = buckets // 2
        if offsetStep == 0:
            return 1
        if buckets%2 != 0:
            return offsetStep
        return offsetStep + 1
        
    for offsetStep in [13, 11, 9, 7, 5, 3]:
        if buckets // offsetStep < 4:
            # Ensure that it provides at least 4 cycles before reaching `buckets`
            continue
        if buckets % offsetStep != 0:
            return offsetStep
        offsetStep += 1
        if buckets%offsetStep != 0:
            return offsetStep
    if buckets % 2 != 0:
        return 2
    return 1


def verify_for_buckets(buckets):
    hits = {}
    offset = findOffsetStep(buckets)
    index = 0
    while True:
        hits.setdefault(index, 0)
        hits[index] += 1
        index = (index + offset) % buckets
        if index == 0:
            break
    if len(hits) != buckets:
        print(f'Failed for {buckets}: len {len(hits)}')

if __name__ == '__main__':
    for i in range(1, 512):
        print(f'Test {i}')
        verify_for_buckets(i)
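A hedged sketch of how that verification could look as a Go test in this PR (assuming findOffsetStep lives in the same package; the package name and test placement are guesses):

```go
package streams

import "testing"

// TestFindOffsetStepCoversAllBuckets mirrors the Python verificator:
// for every bucket count, stepping by findOffsetStep's result must
// visit all buckets before the index returns to its starting point.
func TestFindOffsetStepCoversAllBuckets(t *testing.T) {
	for buckets := 1; buckets < 512; buckets++ {
		step := findOffsetStep(buckets)
		seen := make(map[int]bool)
		index := 0
		for {
			seen[index] = true
			index = (index + step) % buckets
			if index == 0 {
				break
			}
		}
		if len(seen) != buckets {
			t.Errorf("buckets=%d step=%d: visited only %d buckets", buckets, step, len(seen))
		}
	}
}
```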

Comment on lines 50 to 92
// findOffsetStep returns the offset step to use for the given number of buckets.
// `offset` provides a way to reduce collision by offsetting the starting point.
// `offsetStep` should be small enough to provide some cycles before reaching `buckets`,
// and big enough to leave a significant gap between invocations, to ensure that the
// collision chance is actually reduced. It should also not divide `buckets` without
// remainder, to ensure it does not land on the same indexes all the time.
func findOffsetStep(buckets int) int {
	switch buckets {
	case 1:
		return 1
	case 2:
		return 1
	case 3:
		return 2
	}
	if buckets < 8 {
		offsetStep := buckets / 2
		if offsetStep == 0 {
			return 1
		}
		if buckets%2 != 0 {
			return offsetStep
		}
		return offsetStep + 1
	}
	for _, offsetStep := range []int{13, 11, 9, 7, 5, 3} {
		if buckets/offsetStep < 4 {
			// Ensure that it provides at least 4 cycles before reaching `buckets`
			continue
		}
		if buckets%offsetStep != 0 {
			return offsetStep
		}

		offsetStep++
		if buckets%offsetStep != 0 {
			return offsetStep
		}
	}
	if buckets%2 != 0 {
		return 2
	}
	return 1
}


This code has a lot of non-obvious conditionals and magic numbers. There should be comments explaining not only what the code does, but also how and why.

@dkropachev (Collaborator, Author):

Agreed. It is supposed to reduce collisions in the case when buckets are highly populated, but given its complexity I am starting to think it is not worth it. Let's just remove it and go back to +1.

Since offset is always truncated to s.numBuckets, atomic.Add could be used instead of atomic.Load + atomic.Swap.

Also, offsetting the bucket by 1 does not leave a lot of room for collision avoidance: if the first call hits a bucket that is fully in use, the second call is going to share the same bucket with the first call.
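A toy sketch of that scenario (names hypothetical): with a step of 1, two consecutive callers start their scans one bucket apart, so when a run of buckets is full they both land on the same first free bucket:

```go
package main

import "fmt"

// firstFree scans forward from start and returns the first bucket that
// still has capacity (a stand-in for the real stream search).
func firstFree(start, numBuckets int, full map[int]bool) int {
	for i := 0; i < numBuckets; i++ {
		if b := (start + i) % numBuckets; !full[b] {
			return b
		}
	}
	return -1
}

func main() {
	const numBuckets = 16
	full := map[int]bool{0: true, 1: true, 2: true, 3: true}

	// Step of 1: consecutive calls start at buckets 0 and 1, but both
	// scans stop at bucket 4 -- the callers contend on the same bucket.
	fmt.Println(firstFree(0, numBuckets, full)) // 4
	fmt.Println(firstFree(1, numBuckets, full)) // 4

	// A larger step (here 5) places the second caller at bucket 5,
	// which is free, so the two calls use different buckets.
	fmt.Println(firstFree(5, numBuckets, full)) // 5
}
```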
@dkropachev (Collaborator, Author) commented:

> The change with atomic.AddUint32 seems fine and will be an improvement. I have some questions about offset change: […]

Thanks, let's just remove it.

@dkropachev merged commit 77db780 into scylladb:master on Aug 8, 2024
1 check passed