Vindex varchar support #1649

ashudeep-sharma-zz · 2016-04-24T20:39:14Z

This Pull Request contains the following change:

Adding Vindex support for varchar and varbinary Column Type.

This change is

sougou · 2016-04-25T00:13:28Z

go/vt/vtgate/vindexes/varchar_hash.go

+// GetBytes returns bytes for a given value
+func GetBytes(key interface{}) ([]byte, error) {
+ var buf bytes.Buffer
+ enc := gob.NewEncoder(&buf)


I don't think gob hashes the input. You can use this function instead: https://github.com/youtube/vitess/blob/master/go/vt/vtgate/vindexes/hash.go#L118. Should look something like this:

func binHash(key interface{}) ([]byte, error) { source, ok := key.([]byte) if !ok { return nil, fmt.Errorf("unexpected data type for binHash: %T", key) } dest := make([]byte, len(source)) block3DES.Encrypt(dest, source) return dest, nil }

…re of in GetBytes()

ashudeep-sharma-zz · 2016-04-25T10:16:27Z

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions.

go/vt/vtgate/vindexes/varchar_hash.go, line 42 [r3] (raw file):
Made the changes to convert the data into lowercase before hashing. I have added a test-case with UpperCase,LowerCase and mixedCase letters to validate it. Let me know if anything else needs to be added.

go/vt/vtgate/vindexes/varchar_hash.go, line 48 [r3] (raw file):
created the binHash Function which gets the bytes, convert it into lowercase and then apply block3DES.Encrypt().

Comments from Reviewable

sougou · 2016-04-25T16:36:51Z

go/vt/vtgate/vindexes/varchar_hash.go, line 54 [r5] (raw file):
I don't think this will work as expected, because gob encodes the type info into the bytes. For example, int(1) and uint(1) will encode differently even though they're the same value. So, it's better if we asserted the value into the specific types we want to allow (like I showed in the binHash example). For now, I think allowing only []byte is enough. If we want to allow more types, we can extend it.

Comments from Reviewable

sougou · 2016-04-25T16:53:17Z

Reviewed 2 of 4 files at r4.
Review status: 2 of 4 files reviewed at latest revision, 1 unresolved discussion.

go/vt/vtgate/vindexes/varchar_hash.go, line 54 [r5] (raw file):
Sorry, my previous comment got sent out prematurely.

So, I think I confused you with my original comment. Basically, the binHash implementation I gave should have just replaced GetBytes. Everything else was fine. The older getHash function for varchar that performed the lower-casing was also fine, except that it would work only if you lower-cased before hashing.

Comments from Reviewable

ashudeep-sharma-zz · 2016-04-25T22:03:07Z

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions.

go/vt/vtgate/vindexes/utf8cihash.go, line 54 [r5] (raw file):
I have made the changes as per your suggestions, the binhash function now accepts a byte array and all the additional checks for normalization ha been moved to external functions

go/vt/vtgate/vindexes/varchar_hash.go, line 42 [r3] (raw file):
Done.

Comments from Reviewable

sougou · 2016-04-25T22:32:55Z

LGTM
I'll look at how to make it utf8 collation compliant after the submit.

Reviewed 6 of 6 files at r6.
Review status: all files reviewed at latest revision, all discussions resolved.

Comments from Reviewable

sougou · 2016-04-26T22:17:53Z

I've been doing some reading and testing about unicode collations. The good news is that I've found the perfect Go library to perform the correct normalization. It produces a normalized 'key' that is the lexical equivalent of the string.
The bad news is that the 3DES hashing function doesn't work as expected. It basically works in 64-bit blocks. Each call to block3DES.Encrypt only encrypts the first 64 bits. It actually crashes for short strings like "T". So, I have the following options:

Produce a 64-bit keyspace id using the first 4 characters of the input; 4 bytes of input produces 8 bytes (64-bits) of normalized key.
Prefix the string with the hash of the first 4 characters.
Hash the entire input by dividing it into 64-bit chunks.

I'll provide an initial implementation using option 1. If that's insufficient, we can look at the other options.

enisoc · 2016-04-26T22:37:05Z

Option 1 seems likely to bite us down the road. It may work as a custom vindex plugin for a given user, but it shouldn't be shipped along with Vitess core in that case because it has misleading properties. I could imagine wanting to shard on a varchar that often starts with the same 4 characters, for example an app-specific entity ID string like "uid-1234".

For option 3, how would you combine the resulting encrypted chunks? Would you just concatenate and have variable-length keyspace IDs?

Can we just use an irreversible fixed-width hash instead, since normalization is not reversible anyway?

sougou · 2016-04-26T22:51:21Z

Hmm. Fixed prefix would cause problems even for option 3 because all keys will start with the hash of the first 4 characters.
We'll have to go for stronger encryption, which won't be cheap. Or the simpler solution will be to not hash at all, instead go for non-uniform sharding. I'm leaning towards the latter.
@ashudeep-sharma what do you think?

enisoc · 2016-04-26T22:59:58Z

What about option 4: checksum the whole normalized key with something like MD5 or SHA-N?

sougou · 2016-04-27T00:55:23Z

MD5 looks good. I also looked up CRC, which they say is more efficient, but less random.
If people can put up with non-uniform sharding, some benefit can be derived from the unicode ordering. I'm leaning towards MD5, while we wait for @ashudeep-sharma's response.

sougou · 2016-04-27T01:47:37Z

Since vindex behavior should never be changed (once people start using it), I'm going to rename these to more accurately represent what they do:
varbinaryHash -> binaryMD5, vindex name: "binary_md5".
UTF8ciHash -> UnicodeLooseMD5, vindex name: "unicode_loose_md5"
The Loose term was coined by Go to indicate the most permissive matching. The technically correct term would be L1 (http://www.unicode.org/reports/tr10/#Multi_Level_Comparison), but it may be harder to explain.
Open to opinions, suggestions.

ashudeep-sharma-zz · 2016-04-27T04:40:07Z

The first option would definitely be a problem for use-cases where-in teams use some prefix, in our case lets say a tracking-id always starts from "FMCP".I tried searching for some other options but i think MD5 looks more promising.

ashudeep-sharma-zz · 2016-04-27T05:13:39Z

In my earlier companies, we have used Jenkins Hash which ensures evenness in distribution, but came across this post, it seems like it is not that optimized in GO.
http://stackoverflow.com/questions/23436358/bob-jenkins-hash-getting-bad-performance

sougou · 2016-04-27T05:17:32Z

I'm working on the change. You can see progress here https://github.com/sougou/vitess/commits/v3.
I'll send the PR out once I resolve the []byte vs string issue @ashudeep-sharma encountered earlier today.

* Increase retries and timeout for expectation (vitessio#1628) This still flakes with the current setup, but the total wait time today is only a maximum of 50ms. With increasing the retries and the time between them, we now retry for up to 5 seconds. Anecdotally, Actions can see stalls up to like 2 seconds so we need a large retry time window here to improve test reliability. This also means we can remove a lot of other sleeps here since the expectation will retry anyway, improving the happy path test performance. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * boost: Implement logic to mimic global routing in Vitess for Boost (vitessio#1636) * boost: Implement logic to mimic global routing in Vitess for Boost In Vitess if there's no keyspace currently active, it's still possible to route a query to the right keyspace. We don't currently support this in Boost which means there's a mismatch between how it runs directly vs. when Boost is used. We need this for supporting planetscale/boost-internal#21 but also need to relax some constraints on Singularity as well as reverting the changes in planetscale/api-bb#6463 & planetscale/api-bb#6476. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * Improve error message for no keyspace Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * Handle routing cached queries with no keyspace When you install a query with no keyspace configured, we want to be able to match that whether you have a keyspace selected or not. This allows that to work by excluding the keyspace from the hash but matching it after the fact against the given keyspace, or allowing it if it's empty. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> --------- Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * boost: Retry leadership election if it fails (vitessio#1649) When we have an etcd blip right at the moment we also want to do the leadership election here, we don't correctly retry. The change here sets up a separate context we can cancel and then retry so we close the old watcher and spin up a new one in case this happens. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * Collations & Evalengine Backport (vitessio#1646) * sqlparser: Update tooling Signed-off-by: Vicent Marti <vmg@strn.cat> * BugFix: Escaping Percentage and Underscore require special handling (vitessio#11823) * test: add failing end to end test for escaped % and _ Signed-off-by: Manan Gupta <manan@planetscale.com> * feat: preserve escaping of \ and % in the parser Signed-off-by: Manan Gupta <manan@planetscale.com> * feat: add end to end test and planner test to verify that % and _ handling is correct in Vitess on vtgate level Signed-off-by: Manan Gupta <manan@planetscale.com> * feat: add test to evaluation engine to verify that backslash and underscore and handled correctly even when escaped Signed-off-by: Manan Gupta <manan@planetscale.com> * feat: remove unrequired changes Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: fix sorting in UCA900 collations (vitessio#12555) * collations: fix sorting in UCA900 collations When using the fast iterator to _compare_ two strings with an UCA collation, we need to keep in mind that the weights in the collation are in BIG ENDIAN (this is the output format for the weight strings, so we store the weights this way), so comparing them directly will not result in the proper collation order. They need to be byte-swapped before they can be compared with an arithmetic operation! Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: comment Signed-off-by: Vicent Marti <vmg@strn.cat> --------- Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: Refactorings & fixes (vitessio#12554) * evalengine: add a compiler struct Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: refactor arithmetic ops Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: rename type check APIs Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/json: extract APIs Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/bit: add comment for behavior Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/call: make interface an Expr Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/json: extract conversion API Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/compare: fix multi-comparison behavior Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/convert: extract decimal precision API Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/bit: extract error Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/bytes: extract toNumericHex Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine/integration: move all tests into their own package Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: expose the charset package Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: remove unused file Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: add license headers Signed-off-by: Vicent Marti <vmg@strn.cat> --------- Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: upgrade to MySQL 8.0.32 (vitessio#12557) * collations: update from MySQL 8.0.32 Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: use an array to store collations Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: remove Init from Collation interface Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: perform eager initialiation and simplify APIs Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: apply utf8mb3 <-> utf8 in both directions Signed-off-by: Vicent Marti <vmg@strn.cat> * collations: expose the Binary collation Signed-off-by: Vicent Marti <vmg@strn.cat> --------- Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: More cleanup (vitessio#12573) * evalengine: fix multi-comparisons Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: add a limit for REPEAT Signed-off-by: Vicent Marti <vmg@strn.cat> * evalengine: Additional fixes for edge cases This solves edge cases around decimal promotion which follows a conversion to int64 (not uint64) and then is casted to uint64 for all bitwise operations. Also adds a bunch of tests for the hex() function and implemented conversions for non string like types. It also found more bugs in MySQL CASE handling, yay. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> --------- Signed-off-by: Vicent Marti <vmg@strn.cat> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com> * boost: update collation APIs Signed-off-by: Vicent Marti <vmg@strn.cat> --------- Signed-off-by: Vicent Marti <vmg@strn.cat> Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * boost: Improve upquery generation (vitessio#1650) This fixes upquery generation for a union with group bys where there's a literal projection by ensuring we match the generation of the query to what we do without a join. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * [boost planner] Union on top of union (vitessio#1645) * Add tests for broken unions Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * attempt to allow building of unions on top of unions Signed-off-by: Andres Taylor <andres@planetscale.com> * use the right columns even for UNION ALL Signed-off-by: Andres Taylor <andres@planetscale.com> * fix derived tables and projections after TopK Signed-off-by: Andres Taylor <andres@planetscale.com> * use group by with count(*) instead of distinct Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr> * flownode: implement proper ordering trace for Readers Signed-off-by: Vicent Marti <vmg@strn.cat> * check one the right node Signed-off-by: Andres Taylor <andres@planetscale.com> * allow expressions from inside derived table to be handled correctly Signed-off-by: Andres Taylor <andres@planetscale.com> * boost: fix indeterministic union parents Signed-off-by: Vicent Marti <vmg@strn.cat> * Removed unused distinct flownode Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> --------- Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Signed-off-by: Andres Taylor <andres@planetscale.com> Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr> Signed-off-by: Vicent Marti <vmg@strn.cat> Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com> Co-authored-by: Florent Poinsard <florent.poinsard@outlook.fr> Co-authored-by: Vicent Marti <vmg@strn.cat> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * Fix test expectation Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> --------- Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Signed-off-by: Vicent Marti <vmg@strn.cat> Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: Andres Taylor <andres@planetscale.com> Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr> Co-authored-by: Vicent Martí <42793+vmg@users.noreply.github.com> Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Co-authored-by: Andres Taylor <andres@planetscale.com> Co-authored-by: Florent Poinsard <florent.poinsard@outlook.fr> Co-authored-by: Vicent Marti <vmg@strn.cat>

ashudeep-sharma-zz added 3 commits April 24, 2016 00:26

VarChar/VarBinary Column Support Changes in Vindex

347cf71

Adding tests for varchar and varbinary Vindexes

03c10ff

Removing Commented Code

1c773b1

googlebot added the cla: yes label Apr 24, 2016

sougou reviewed Apr 25, 2016
View reviewed changes

ashudeep-sharma-zz added 2 commits April 25, 2016 15:39

Incorporating Review Comments

d0265aa

Removed the bytes.ToLower in block3DES.Encrypt func, already taken ca…

33880c5

…re of in GetBytes()

Incorporating Review Comments

ac81c13

sougou merged commit 2ce36f3 into master Apr 25, 2016

ashudeep-sharma-zz deleted the vindex_varchar_support branch April 26, 2016 04:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vindex varchar support #1649

Vindex varchar support #1649

ashudeep-sharma-zz commented Apr 24, 2016

sougou Apr 25, 2016

ashudeep-sharma-zz commented Apr 25, 2016

sougou commented Apr 25, 2016

sougou commented Apr 25, 2016

ashudeep-sharma-zz commented Apr 25, 2016

sougou commented Apr 25, 2016

sougou commented Apr 26, 2016

enisoc commented Apr 26, 2016

sougou commented Apr 26, 2016

enisoc commented Apr 26, 2016

sougou commented Apr 27, 2016

sougou commented Apr 27, 2016

ashudeep-sharma-zz commented Apr 27, 2016

ashudeep-sharma-zz commented Apr 27, 2016

sougou commented Apr 27, 2016

Vindex varchar support #1649

Vindex varchar support #1649

Conversation

ashudeep-sharma-zz commented Apr 24, 2016

sougou Apr 25, 2016

Choose a reason for hiding this comment

ashudeep-sharma-zz commented Apr 25, 2016

sougou commented Apr 25, 2016

sougou commented Apr 25, 2016

ashudeep-sharma-zz commented Apr 25, 2016

sougou commented Apr 25, 2016

sougou commented Apr 26, 2016

enisoc commented Apr 26, 2016

sougou commented Apr 26, 2016

enisoc commented Apr 26, 2016

sougou commented Apr 27, 2016

sougou commented Apr 27, 2016

ashudeep-sharma-zz commented Apr 27, 2016

ashudeep-sharma-zz commented Apr 27, 2016

sougou commented Apr 27, 2016