ranger: fix prefix index when charset is UTF-8 #7194

birdstorm · 2018-07-30T14:45:05Z

What have you changed? (mandatory)

Fixes #7115. This PR fixes prefix index handling when the charset is UTF-8: previously the index value was truncated by bytes rather than by characters.

What is the type of the changes? (mandatory)

Bug fix (non-breaking change which fixes an issue)

How has this PR been tested? (mandatory)

Unit Test

Does this PR affect documentation (docs/docs-cn) update? (mandatory)

NO

Does this PR affect tidb-ansible update? (mandatory)

NO

Does this PR need to be added to the release notes? (mandatory)

release note:
Fix prefix index: when the charset is utf8 or utf8mb4, truncate the value by runes rather than by bytes.

Refer to a related PR or issue link (optional)

Benchmark result if necessary (optional)

Add a few positive/negative examples (optional)

shenli · 2018-07-30T16:37:16Z

util/ranger/ranger.go

-	if length != types.UnspecifiedLength && length < len(v.GetBytes()) {
-		v.SetBytes(v.GetBytes()[:length])
+	// In case of UTF8, prefix should be cut by characters rather than bytes
+	if v.Kind() == types.KindString || v.Kind() == types.KindBytes {


For other types, should we consider length?

I doubt it is possible to have a prefix index on other types...

For string columns, indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:

Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY key parts.

Prefixes must be specified for BLOB and TEXT key parts. Additionally, BLOB and TEXT columns can be indexed only for InnoDB, MyISAM, and BLACKHOLE tables.

shenli · 2018-07-30T16:37:31Z

util/ranger/ranger.go

 	"github.com/pingcap/tidb/util/codec"
+	"unicode/utf8"


Move this to the upper group.

Use gofmt.

shenli · 2018-07-30T16:37:53Z

plan/physical_plan_test.go

@@ -178,7 +178,7 @@ func (s *testPlanSuite) TestDAGPlanBuilderSimpleCase(c *C) {
 		// Test index filter condition push down.
 		{
 			sql:  "select * from t use index(e_d_c_str_prefix) where t.c_str = 'abcdefghijk' and t.d_str = 'd' and t.e_str = 'e'",
-			best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\",\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",
+			best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"abcdefghij\",\"e\" \"d\" \"abcdefghij\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",


Why is this changed?

The prefix index value is now set as a string rather than as bytes when the charset is UTF-8, so the expected plan shows it as a string.
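The change in the expected plan string matches how Go's default formatting treats the two representations. The sketch below is only an illustration of that formatting behaviour; the `render` helper is hypothetical, and it is an assumption that the plan printer ends up using an equivalent default format.

```go
package main

import "fmt"

// render formats a value with Go's default %v verb.
// (Hypothetical helper; assumed to resemble how the test output is built.)
func render(v interface{}) string {
	return fmt.Sprintf("%v", v)
}

func main() {
	prefix := "abcdefghijk"[:10] // first 10 characters of the test value

	// A byte slice prints as a list of decimal byte values.
	fmt.Println(render([]byte(prefix))) // [97 98 99 100 101 102 103 104 105 106]

	// A string prints as text.
	fmt.Println(render(prefix)) // abcdefghij
}
```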

shenli · 2018-07-30T16:38:07Z

/run-all-tests

jackysp

LGTM

shenli · 2018-07-31T02:28:49Z

@birdstorm Please fix the CI.

winkyao

Please address comments.

zhexuany · 2018-07-31T04:34:54Z

/run-all-tests

winoros · 2018-07-31T05:46:31Z

/run-unit-test

winoros

lgtm

zz-jason · 2018-07-31T06:34:00Z

util/ranger/ranger.go

+	// In case of UTF8, prefix should be cut by characters rather than bytes
+	if v.Kind() == types.KindString || v.Kind() == types.KindBytes {
+		colCharset := tp.Charset
+		if colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4 {


how about:

	colValue := v.GetBytes()
	isUTF8Charset := colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4
	if isUTF8Charset && length != types.UnspecifiedLength && utf8.RuneCount(colValue) > length {
		...
	} else if length != types.UnspecifiedLength && len(colValue) > length {
		v.SetBytes(colValue[:length])
	}

@zz-jason Actually I just copied the same logic from

tidb/table/tables/index.go

Line 133 in e0034f9

func (c *index) truncateIndexValuesIfNeeded(indexedValues []types.Datum) []types.Datum {

I left it the same because it seems hard to decide where to put the reusable code. Should both pieces of logic be changed now, or should I address this in a separate PR?

Uh, I think we can change both pieces of logic here in this PR.

Done. In addition, isUTF8Charset must be checked in the outer if-statement to keep the logic correct.
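A small sketch of why the charset check has to sit on the outer level. It compares the flat `if / else if` shape with a nested one on plain byte slices. The function names and the `unspecifiedLength` constant are hypothetical stand-ins, and the rune-truncation body is an assumption, since the suggestion elides it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// unspecifiedLength is a hypothetical stand-in for types.UnspecifiedLength.
const unspecifiedLength = -1

// cutRunes keeps the first n runes of v (assumed truncation body).
func cutRunes(v []byte, n int) []byte {
	return []byte(string([]rune(string(v))[:n]))
}

// truncateFlat mirrors the flat if / else-if shape: a UTF-8 value that
// is short enough in runes still falls through to the byte-length branch.
func truncateFlat(v []byte, isUTF8 bool, length int) []byte {
	if isUTF8 && length != unspecifiedLength && utf8.RuneCount(v) > length {
		return cutRunes(v, length)
	} else if length != unspecifiedLength && len(v) > length {
		return v[:length]
	}
	return v
}

// truncateNested decides on the charset first, so UTF-8 values are
// only ever measured and cut in runes.
func truncateNested(v []byte, isUTF8 bool, length int) []byte {
	if length == unspecifiedLength {
		return v
	}
	if isUTF8 {
		if utf8.RuneCount(v) > length {
			return cutRunes(v, length)
		}
		return v
	}
	if len(v) > length {
		return v[:length]
	}
	return v
}

func main() {
	v := []byte("中文") // 2 characters, 6 bytes

	fmt.Printf("%q\n", truncateFlat(v, true, 3))   // "中": wrongly cut by bytes
	fmt.Printf("%q\n", truncateNested(v, true, 3)) // "中文": left intact
}
```

With a prefix length of 3, the value has only 2 characters and should not be truncated at all, but the flat form still cuts it because its byte length is 6.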

zhexuany

LGTM

zz-jason

LGTM

shenli · 2018-07-31T07:20:57Z

@zz-jason Should we cherry-pick this PR to release-2.0?

zz-jason · 2018-07-31T07:27:01Z

Yes, and I think this PR needs a release note.

birdstorm · 2018-07-31T07:31:41Z

@zz-jason Yes, I have it in the description. BTW, please note that this PR might slow down prefix index handling, because it adds an extra rune conversion compared with before (and that might be pretty SLOW).
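As a rough illustration of the cost concern, one possible way to truncate by characters without allocating a `[]rune` is to walk the bytes with `utf8.DecodeRune` and slice the original value. This is only a sketch of an alternative, not what the PR does; `truncateRunesNoAlloc` is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunesNoAlloc returns the leading part of b that holds at most
// n runes. It stops after n runes and reuses the backing array instead
// of converting to []rune. (Hypothetical helper, for illustration only.)
func truncateRunesNoAlloc(b []byte, n int) []byte {
	if n < 0 {
		return b
	}
	pos := 0
	for i := 0; i < n && pos < len(b); i++ {
		// DecodeRune reports the width in bytes of the next rune;
		// invalid bytes are reported with width 1.
		_, size := utf8.DecodeRune(b[pos:])
		pos += size
	}
	return b[:pos]
}

func main() {
	v := []byte("中文字符")
	fmt.Println(string(truncateRunesNoAlloc(v, 2))) // 中文
}
```

Whether this is measurably faster in the ranger code path is not established here; it would need a benchmark.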

zhexuany

LGTM

birdstorm added 3 commits July 30, 2018 21:39

fix prefix index when charset is UTF-8

979f987

add prefix index tests and fix previous incorrect tests.

e2320b8

format

2f2b51b

birdstorm requested review from winkyao and winoros July 30, 2018 14:45

birdstorm mentioned this pull request Jul 30, 2018

Fix utf8 prefix index pingcap/tispark#400

Merged

shenli added the type/bugfix ("This PR fixes a bug.") label Jul 30, 2018

shenli reviewed Jul 30, 2018

jackysp previously approved these changes Jul 31, 2018

shenli added the sig/planner ("SIG: Planner") label Jul 31, 2018

winkyao reviewed Jul 31, 2018

address comments

d81c10e

birdstorm dismissed jackysp’s stale review via d81c10e July 31, 2018 04:16

Merge branch 'master' into birdstorm/fix_utf8_prefix

91effa8

winoros added the status/all tests passed label Jul 31, 2018

winoros previously approved these changes Jul 31, 2018

add unit-test.

e0034f9

winoros dismissed their stale review via e0034f9 July 31, 2018 06:22

zz-jason reviewed Jul 31, 2018

zhexuany previously approved these changes Jul 31, 2018

address comments

afa3245

birdstorm dismissed zhexuany’s stale review via afa3245 July 31, 2018 07:13

zz-jason approved these changes Jul 31, 2018

Merge branch 'master' into birdstorm/fix_utf8_prefix

7224e7c

winoros changed the title ~~Fix prefix index when charset is UTF-8~~ ranger: fix prefix index when charset is UTF-8 Jul 31, 2018

zz-jason added the release-note ("Denotes a PR that will be considered when it comes time to generate release notes.") label Jul 31, 2018

zhexuany approved these changes Jul 31, 2018

Merge branch 'master' into birdstorm/fix_utf8_prefix

fd2d6e6

zhexuany merged commit 42bba99 into master Jul 31, 2018

zhexuany deleted the birdstorm/fix_utf8_prefix branch July 31, 2018 08:14

birdstorm added a commit to birdstorm/tidb that referenced this pull request Aug 1, 2018

ranger: fix prefix index when charset is UTF-8 (pingcap#7194)

d69275e

birdstorm mentioned this pull request Aug 1, 2018

ranger: fix prefix index when charset is UTF-8 (#7194) #7231

Merged

coocood pushed a commit that referenced this pull request Aug 2, 2018

ranger: fix prefix index when charset is UTF-8 (#7194) (#7231)

7ccecdd
