Data skew in low-concurrency scenarios #44

robertsong9972 · 2022-04-15T03:04:36Z

Data skew in low-concurrency scenarios

According to the source code, it will set the step to 0 once the time increased, it will cause a problem when we use the id as the sharding key for sharding strategy.
The code affects

if now == n.time {
		n.step = (n.step + 1) & n.stepMask

		if n.step == 0 {
			for now <= n.time {
				now = time.Since(n.epoch).Nanoseconds() / 1000000
			}
		}
	} else {
		n.step = 0
	}

Consider the following circumstance:

Scenario

table:order_tab_00000000-order_tab_00001000;
An api for user to create new order: create_order;
use the above logic to generate id when QPS is less than 10;

Impact

After calculating with mod all of the result of id generated will be less than 10, which means most of the records will be stored into order_tab_00000000-order_tab_00000009, it caused data skew;

Improve

Make the step round by round, and do not set it to 0;

Code May be referenced

	n.mu.Lock()
	now := time.Since(n.epoch).Nanoseconds() / 1000000
	n.step = (n.step + 1) & n.stepMask
	if now == n.time && n.step == 0 {
		for now <= n.time {
			now = time.Since(n.epoch).Nanoseconds() / 1000000
		}
	}
	n.time = now
	r := ID((now)<<n.timeShift |
		(n.node << n.nodeShift) |
		(n.step),
	)
	n.mu.Unlock()
	return r

patchthecode · 2022-11-18T15:07:59Z

does this mean this lib is prone to errors?

bwmarrin · 2022-11-23T15:32:42Z

@robertsong9972

Thanks for reporting an issue :)

However, I'm struggling to follow what you're trying to explain and how the current method is causing a problem and why it should be changed :( I'm not sure if there's any more explanation you can provide or if there's someone else that might be familiar with the topic you're discussing and can help clarify the issue.

Are you saying the library can generate non-unique ID's with low concurrency? Or are you saying that the way the IDs would sort would be a problem? What does "Make the step round by round" mean?

"After calculating with mod all of the result of id generated will be less than 10, which means most of the records will be stored into order_tab_00000000-order_tab_00000009, it caused data skew;"

You're doing something to truncate the numbers or reduce them? None of the IDs would be less than 10 normally.

I'm also not really clear on the importance of the "order_tab_00000000-order_tab_00000009" tables.

I'm just all around confused, sorry. :(

bwmarrin · 2022-11-23T15:35:27Z

@patchthecode - I'm not sure :(. Though I've not heard any other reports of the library not generating properly unique IDs. If you run into any issues or if you can help explain the above concern I'd be happy to look into it.

bwmarrin · 2022-11-23T16:28:07Z

I think I'm putting together what's being asked here, maybe.

@robertsong9972

Are you talking about data sharding via MongoDB (or maybe some other DB)?

By "order_tab_00000000-order_tab_00001000" are you saying you have 1001 order tables?

Are you taking the modulus of the snowflake ID and 1001 (the number of tables you have) to try and get a random
number between 0 and 1000?

If you're dividing by 1001 I think it should work but if you happen to be using 1000 I could see how that would return 0 since the last four digits of a showflake using 12 stepbits would be 0000 and the entire number would divide into 1000 leaving no remainder. But if you divide by 1001 that shouldn't be an issue. Another option would be to change the step bits to a smaller value.

By "round by round" do you mean just allow the step count to continue to count up and only reset to zero once it overflows (hits the max number it could be) ? If we did that the library performance under heavy load would be impacted negatively and since there's other ways to solve the problem I'm assuming you're working on I'd rather not do that.

Finally, would just using a rand(1000) method to get a random number from 0 to 1000 to tell the DB what shard to shove the data into work?

If I'm interpreting your issue wrong - please let me know! Hopefully this helps some, I know it's a late attempt to help.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data skew in low-concurrency scenarios #44

Data skew in low-concurrency scenarios #44

robertsong9972 commented Apr 15, 2022

patchthecode commented Nov 18, 2022

bwmarrin commented Nov 23, 2022

bwmarrin commented Nov 23, 2022

bwmarrin commented Nov 23, 2022

Data skew in low-concurrency scenarios #44

Data skew in low-concurrency scenarios #44

Comments

robertsong9972 commented Apr 15, 2022

Data skew in low-concurrency scenarios

Scenario

Impact

Improve

Code May be referenced

patchthecode commented Nov 18, 2022

bwmarrin commented Nov 23, 2022

bwmarrin commented Nov 23, 2022

bwmarrin commented Nov 23, 2022