Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ranger: fix prefix index when charset is UTF-8 #7194

Merged
merged 9 commits into from
Jul 31, 2018

Conversation

birdstorm
Copy link
Contributor

What have you changed? (mandatory)

Fix #7115, this PR will fix prefix index when charset is UTF-8. Previously the index was cut by bytes rather than characters.

What is the type of the changes? (mandatory)

  • Bug fix (non-breaking change which fixes an issue)

How has this PR been tested? (mandatory)

Unit Test

Does this PR affect documentation (docs/docs-cn) update? (mandatory)

NO

Does this PR affect tidb-ansible update? (mandatory)

NO

Does this PR need to be added to the release notes? (mandatory)

release note:
fix prefix index, when the charset is utf8 or utf8mb4, truncate it from runes rather than bytes.

Refer to a related PR or issue link (optional)

Benchmark result if necessary (optional)

Add a few positive/negative examples (optional)

@shenli shenli added the type/bugfix This PR fixes a bug. label Jul 30, 2018
if length != types.UnspecifiedLength && length < len(v.GetBytes()) {
v.SetBytes(v.GetBytes()[:length])
// In case of UTF8, prefix should be cut by characters rather than bytes
if v.Kind() == types.KindString || v.Kind() == types.KindBytes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other types, should we consider length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt if it is possible to have prefix index on other types...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For string columns, indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:

  • Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY key parts.

  • Prefixes must be specified for BLOB and TEXT key parts. Additionally, BLOB and TEXT columns can be indexed only for InnoDB, MyISAM, and BLACKHOLE tables.

"github.com/pingcap/tidb/util/codec"
"unicode/utf8"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to the upper group.

Copy link
Contributor

@winkyao winkyao Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use gofmt .

@@ -178,7 +178,7 @@ func (s *testPlanSuite) TestDAGPlanBuilderSimpleCase(c *C) {
// Test index filter condition push down.
{
sql: "select * from t use index(e_d_c_str_prefix) where t.c_str = 'abcdefghijk' and t.d_str = 'd' and t.e_str = 'e'",
best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\",\"e\" \"d\" \"[97 98 99 100 101 102 103 104 105 106]\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",
best: "IndexLookUp(Index(t.e_d_c_str_prefix)[[\"e\" \"d\" \"abcdefghij\",\"e\" \"d\" \"abcdefghij\"]], Table(t)->Sel([eq(test.t.c_str, abcdefghijk)]))",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You see, now prefix index is set by string when charset is UTF-8 rather than bytes.

@shenli
Copy link
Member

shenli commented Jul 30, 2018

/run-all-tests

jackysp
jackysp previously approved these changes Jul 31, 2018
Copy link
Member

@jackysp jackysp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shenli
Copy link
Member

shenli commented Jul 31, 2018

@birdstorm Please fix the CI.

@shenli shenli added the sig/planner SIG: Planner label Jul 31, 2018
Copy link
Contributor

@winkyao winkyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address comments.

@zhexuany
Copy link
Contributor

/run-all-tests

@winoros
Copy link
Member

winoros commented Jul 31, 2018

/run-unit-test

winoros
winoros previously approved these changes Jul 31, 2018
Copy link
Member

@winoros winoros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

// In case of UTF8, prefix should be cut by characters rather than bytes
if v.Kind() == types.KindString || v.Kind() == types.KindBytes {
colCharset := tp.Charset
if colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:

colValue := v.GetBytes()
isUTF8Charset := colCharset == charset.CharsetUTF8 || colCharset == charset.CharsetUTF8MB4
if isUTF8Charset && length != types.UnspecifiedLength && utf8.RuneCount(colValue) > length {
	...
} else if length != types.UnspecifiedLength && len(colValue) > length {
	v.SetBytes(colValue[:length])
}

Copy link
Contributor Author

@birdstorm birdstorm Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zz-jason Actually I just copied the same logic from

func (c *index) truncateIndexValuesIfNeeded(indexedValues []types.Datum) []types.Datum {

I left it the same because it seems hard to decide where to put the reusable code. Should both logic be changed at the moment or I could move on to another PR to solve this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, I think we can change both logic here in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. In addition, isUTF8Charset must be on the outer if-statement to maintain the correct logic.

zhexuany
zhexuany previously approved these changes Jul 31, 2018
Copy link
Contributor

@zhexuany zhexuany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@zz-jason zz-jason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shenli
Copy link
Member

shenli commented Jul 31, 2018

@zz-jason Should we cherry-pick this PR to the release-2.0?

@winoros winoros changed the title Fix prefix index when charset is UTF-8 ranger: fix prefix index when charset is UTF-8 Jul 31, 2018
@zz-jason
Copy link
Member

Yes, and I think this PR needs a release note.

@zz-jason zz-jason added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jul 31, 2018
@birdstorm
Copy link
Contributor Author

@zz-jason yes, I have it in issue description. BTW, please note that this PR might affect speed of prefix index because it has an extra Rune than before (and it might be pretty SLOW).

Copy link
Contributor

@zhexuany zhexuany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhexuany zhexuany merged commit 42bba99 into master Jul 31, 2018
@zhexuany zhexuany deleted the birdstorm/fix_utf8_prefix branch July 31, 2018 08:14
birdstorm added a commit to birdstorm/tidb that referenced this pull request Aug 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/planner SIG: Planner type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prefix index breaks for ASCII 0xFF(ÿ)
7 participants