ranger: fix prefix index when charset is UTF-8 #7194
if length != types.UnspecifiedLength && length < len(v.GetBytes()) {
v.SetBytes(v.GetBytes()[:length])
// In case of UTF8, prefix should be cut by characters rather than bytes
if v.Kind() == types.KindString || v.Kind() == types.KindBytes {
For other types, should we consider length?
I doubt it is possible to have a prefix index on other types...
For string columns, indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:
- Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY key parts.
- Prefixes must be specified for BLOB and TEXT key parts. Additionally, BLOB and TEXT columns can be indexed only for InnoDB, MyISAM, and BLACKHOLE tables.
util/ranger/ranger.go
Outdated
"github.com/pingcap/tidb/util/codec"
"unicode/utf8"
Move this to the upper group.
Use gofmt.
@@ -178,7 +178,7 @@ func (s *testPlanSuite) TestDAGPlanBuilderSimpleCase(c *C) {
// Test index filter condition push down.
{
sql: "select * from t use index(e_d_c_str_prefix) where t.c_str = 'abcdefghijk' and t.d_str = 'd' and t.e_str = 'e'",
best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\",\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",
best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"abcdefghij\",\"e\" \"d\" \"abcdefghij\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",
Why is this changed?
When the charset is UTF-8, the prefix index value is now set as a string rather than as bytes, so the expected range prints as a string.
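A minimal sketch of why the expected output can change shape, assuming the range bound is printed with Go's default formatting (the `render` helper is hypothetical and is not TiDB's actual printing code): a byte slice prints as a list of numbers, while a string prints as text.

```go
package main

import "fmt"

// render formats a value with Go's default %v verb.
// Hypothetical helper for illustration; TiDB's own printing path is not shown in this thread.
func render(v interface{}) string {
	return fmt.Sprintf("%v", v)
}

func main() {
	// The same ten characters, first as bytes and then as a string.
	fmt.Println(render([]byte("abcdefghij"))) // [97 98 99 100 101 102 103 104 105 106]
	fmt.Println(render("abcdefghij"))         // abcdefghij
}
```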
/run-all-tests
LGTM
@birdstorm Please fix the CI.
Please address comments.
/run-all-tests
/run-unit-test
lgtm
util/ranger/ranger.go
Outdated
// In case of UTF8, prefix should be cut by characters rather than bytes
if v.Kind() == types.KindString || v.Kind() == types.KindBytes {
colCharset := tp.Charset
if colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4 {
How about:
colValue := v.GetBytes()
isUTF8Charset := colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4
if isUTF8Charset && length != types.UnspecifiedLength && utf8.RuneCount(colValue) > length {
...
} else if length != types.UnspecifiedLength && len(colValue) > length {
v.SetBytes(colValue[:length])
}
@zz-jason Actually, I just copied the same logic from
Line 133 in e0034f9
func (c *index) truncateIndexValuesIfNeeded(indexedValues []types.Datum) []types.Datum {
I left it unchanged because it is hard to decide where to put the reusable code. Should both pieces of logic be changed now, or should I address this in a follow-up PR?
I think we can change both pieces of logic in this PR.
Done. In addition, isUTF8Charset must be on the outer if-statement to maintain the correct logic.
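To illustrate the point about the outer if-statement, here is a rough sketch with two hypothetical helpers (`truncateFlat` and `truncateNested` are not the PR's code, and the `types.UnspecifiedLength` check is left out for brevity). With the flat shape, a UTF-8 value that is short enough in characters but long in bytes falls through to the byte branch and is cut anyway.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateFlat mirrors the flat if / else-if shape suggested in the review.
// When isUTF8 is true but the rune count is within length, control falls
// through to the byte-length branch.
func truncateFlat(v []byte, length int, isUTF8 bool) []byte {
	if isUTF8 && utf8.RuneCount(v) > length {
		return []byte(string([]rune(string(v))[:length]))
	} else if len(v) > length {
		return v[:length]
	}
	return v
}

// truncateNested keeps the charset check on the outer if, so UTF-8 values
// are only ever measured and cut in characters.
func truncateNested(v []byte, length int, isUTF8 bool) []byte {
	if isUTF8 {
		if utf8.RuneCount(v) > length {
			return []byte(string([]rune(string(v))[:length]))
		}
		return v
	}
	if len(v) > length {
		return v[:length]
	}
	return v
}

func main() {
	v := []byte("你好") // 2 characters, 6 bytes in UTF-8
	fmt.Println(string(truncateFlat(v, 3, true)))   // 你
	fmt.Println(string(truncateNested(v, 3, true))) // 你好
}
```

With a prefix length of 3, the two-character value should be kept whole. The flat version cuts it at 3 bytes and drops the second character.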
LGTM
LGTM
@zz-jason Should we cherry-pick this PR to release-2.0?
Yes, and I think this PR needs a release note.
@zz-jason Yes, I have it in the issue description. BTW, please note that this PR might slow down prefix index handling because it adds rune processing that was not needed before (and it might be pretty SLOW).
LGTM
What have you changed? (mandatory)
Fixes #7115. This PR fixes prefix index handling when the charset is UTF-8. Previously, the index prefix was cut by bytes rather than by characters.
What is the type of the changes? (mandatory)
How has this PR been tested? (mandatory)
Unit Test
Does this PR affect documentation (docs/docs-cn) update? (mandatory)
NO
Does this PR affect tidb-ansible update? (mandatory)
NO
Does this PR need to be added to the release notes? (mandatory)
Refer to a related PR or issue link (optional)
Benchmark result if necessary (optional)
Add a few positive/negative examples (optional)