Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collations Module Integration #9018

Merged
merged 10 commits into from
Nov 3, 2021
Merged

Collations Module Integration #9018

merged 10 commits into from
Nov 3, 2021

Conversation

king-11
Copy link
Contributor

@king-11 king-11 commented Oct 19, 2021

Description

Integration of Collation Module into Vitess Engine

Adds in the collation module into the Vitess Code which can then be used to replace instances of the weight_string function used in the vitess codebase. The major change that this PR makes is with the NullSafeComparator

func NullsafeCompare(v1, v2 sqltypes.Value) (int, error) {

which now takes in additional collation info parameter which is used by the collations.LookupByID to find out the collation and then use it to collate the varchar strings based on their collation set.

Update Function Calls

So given the NullSafeComparator needs access to collation info we intend to provide the information available to it by updating the function calls and trying to fetch charset info from the result or Fields available at the least distant parent function calls and trickle it down the succeeding calls.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@king-11 king-11 changed the title Collation Support Integration Collations Module Integration Oct 19, 2021
go/vt/vtgate/engine/memory_sort.go Outdated Show resolved Hide resolved
go/vt/vtgate/engine/merge_sort.go Outdated Show resolved Hide resolved
go/vt/vtgate/vindexes/consistent_lookup.go Outdated Show resolved Hide resolved
go/vt/wrangler/vdiff.go Outdated Show resolved Hide resolved
@vmg
Copy link
Collaborator

vmg commented Oct 20, 2021

So I ran some integration tests for this yesterday, and we got some bad results. Here's a log:

 collations_test.go:175: exec: CREATE TABLE unicode (rowid INTEGER PRIMARY KEY, str_coll_utf8mb4_0900_ai_ci CHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci, str_coll_utf8mb4_unicode_ci CHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, str_coll_utf8mb4_unicode_520_ci CHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci);
    collations_test.go:175: exec: INSERT INTO unicode VALUES (0 , _utf8mb4 X'61626320c3a6c3b8c3a520e697a5e69cace8aa9e' COLLATE "utf8mb4_0900_ai_ci", _utf8mb4 X'61626320c3a6c3b8c3a520e697a5e69cace8aa9e' COLLATE "utf8mb4_unicode_ci", _utf8mb4 X'61626320c3a6c3b8c3a520e697a5e69cace8aa9e' COLLATE "utf8mb4_unicode_520_ci")
    collations_test.go:175: exec: SELECT str_coll_utf8mb4_0900_ai_ci, WEIGHT_STRING(str_coll_utf8mb4_0900_ai_ci), str_coll_utf8mb4_unicode_ci, WEIGHT_STRING(str_coll_utf8mb4_unicode_ci), str_coll_utf8mb4_unicode_520_ci, WEIGHT_STRING(str_coll_utf8mb4_unicode_520_ci) FROM unicode
    collations_test.go:269: field0(str_coll_utf8mb4_0900_ai_ci) charsetnr = 33
    collations_test.go:269: field1(WEIGHT_STRING(str_coll_utf8mb4_0900_ai_ci)) charsetnr = 63
    collations_test.go:269: field2(str_coll_utf8mb4_unicode_ci) charsetnr = 33
    collations_test.go:269: field3(WEIGHT_STRING(str_coll_utf8mb4_unicode_ci)) charsetnr = 63
    collations_test.go:269: field4(str_coll_utf8mb4_unicode_520_ci) charsetnr = 33
    collations_test.go:269: field5(WEIGHT_STRING(str_coll_utf8mb4_unicode_520_ci)) charsetnr = 63

As you can see, we're creating a simple table where each column has a different charset + collation and inserting the same string on it. When performing a SELECT on the different columns, we can see that the charsetnr field in the returned Fields (i.e. the field that this PR is using for detecting the collation for a column) is always set to the default collation for the connection, not to the actual collation from the original table.

This is bad! It looks like a bug in MySQL but this behavior is not going to change in the future, so we need to work around it.

@harshit-gangal had some ideas on using Vitess' table schemas to detect these collations instead of depending on the Field data. Harshit, could you please explain these to @king-11? How confident are you feeling about this approach?

@systay
Copy link
Collaborator

systay commented Oct 21, 2021

General comment - this PR all about collations and not so much about charsets. We should use the word collation and not charset for this. https://dev.mysql.com/doc/refman/8.0/en/charset-general.html

@systay
Copy link
Collaborator

systay commented Oct 21, 2021

Another general comment - since we need to do planner work to make sure that the engine primitives and evalengine get the right info, what about focusing this PR to make it work well with grouping (orderedAggregate)? We can work on the rest of the primitives once we are happy with grouping.

@vmg
Copy link
Collaborator

vmg commented Oct 21, 2021

@systay: Yes, but do note that the field in Request.Fields that contains the numeric identifier for the collation is actually called charsetnr, even though it's a collation identifier. MySQL is silly like that.

@vmg
Copy link
Collaborator

vmg commented Oct 21, 2021

what about focusing this PR to make it work well with grouping (orderedAggregate)? We can work on the rest of the primitives once we are happy with grouping.

This sounds like a great idea. Let's make grouping work perfectly and leave the other callsites for another time. This means that everywhere else we don't pass a collation ID to the comparison function. That should be an easy way to unblock @king-11.

@systay
Copy link
Collaborator

systay commented Oct 21, 2021

@systay: Yes, but do note that the field in Request.Fields that contains the numeric identifier for the collation is actually called charsetnr, even though it's a collation identifier. MySQL is silly like that.

My reading of https://dev.mysql.com/doc/internals/en/character-set.html#packet-Protocol::CharacterSet makes me think that it is the charset, and not the collation the charsetnr is representing. With charsets come a default collation, which is how we can get the collation from a charset. 🤷

You are right, it's all very confusing. Anyway - does it make sense to not perpetuate this confusion more than necessary in our code base? :)

@king-11 king-11 force-pushed the collate branch 2 times, most recently from 6cea81e to 720ad5e Compare October 24, 2021 10:17
go/vt/vtgate/evalengine/arithmetic.go Outdated Show resolved Hide resolved
go/vt/vtgate/evalengine/arithmetic.go Outdated Show resolved Hide resolved
go/vt/vtgate/evalengine/arithmetic.go Show resolved Hide resolved
@king-11 king-11 force-pushed the collate branch 2 times, most recently from 96dd315 to 42e5731 Compare October 28, 2021 12:51
@king-11
Copy link
Contributor Author

king-11 commented Nov 1, 2021

@systay @frouioui awaiting your review made the changes and added in tests

@systay systay added Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature) release notes labels Nov 2, 2021
Copy link
Member

@frouioui frouioui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing more than what @systay wrote to add. His comments need to be addressed, otherwise lgtm

Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
reduce map size and also check with a constant not literal
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Collate returns +ve and -ve value add switch case on it
to return -1, 1 or 0 only
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
Signed-off-by: Lakshya Singh <lakshay.singh1108@gmail.com>
@king-11
Copy link
Contributor Author

king-11 commented Nov 3, 2021

@systay made the changes as per your review.

Copy link
Collaborator

@systay systay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@systay systay merged commit ab2f0b4 into vitessio:main Nov 3, 2021
@shlomi-noach
Copy link
Contributor

I'm not sure how, but this seems to have broken main. The following line:

collation := collations.FromID(collationID)

Does not compile:

# vitess.io/vitess/go/vt/vtgate/evalengine
go/vt/vtgate/evalengine/arithmetic.go:227:16: undefined: collations.FromID

right now unit tests and some vreplication tests are failing.

@systay
Copy link
Collaborator

systay commented Nov 3, 2021

#9131 contains the fix for main ☝️

@king-11 king-11 deleted the collate branch November 3, 2021 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Query Serving Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants