Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vindex varchar support #1649

Merged
merged 6 commits into from
Apr 25, 2016
Merged

Vindex varchar support #1649

merged 6 commits into from
Apr 25, 2016

Conversation

ashudeep-sharma-zz
Copy link
Contributor

This Pull Request contains the following change:

  • Adding Vindex support for varchar and varbinary Column Type.

This change is Reviewable

// GetBytes returns bytes for a given value
func GetBytes(key interface{}) ([]byte, error) {
var buf bytes.Buffer
enc := gob.NewEncoder(&buf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think gob hashes the input. You can use this function instead: https://github.com/youtube/vitess/blob/master/go/vt/vtgate/vindexes/hash.go#L118. Should look something like this:

func binHash(key interface{}) ([]byte, error) {
  source, ok := key.([]byte)
  if !ok {
    return nil, fmt.Errorf("unexpected data type for binHash: %T", key)
  }
  dest := make([]byte, len(source))
  block3DES.Encrypt(dest, source)
  return dest, nil
}

@ashudeep-sharma-zz
Copy link
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions.


go/vt/vtgate/vindexes/varchar_hash.go, line 42 [r3] (raw file):
Made the changes to convert the data into lowercase before hashing. I have added a test-case with UpperCase,LowerCase and mixedCase letters to validate it. Let me know if anything else needs to be added.


go/vt/vtgate/vindexes/varchar_hash.go, line 48 [r3] (raw file):
created the binHash Function which gets the bytes, convert it into lowercase and then apply block3DES.Encrypt().


Comments from Reviewable

@sougou
Copy link
Contributor

sougou commented Apr 25, 2016

go/vt/vtgate/vindexes/varchar_hash.go, line 54 [r5] (raw file):
I don't think this will work as expected, because gob encodes the type info into the bytes. For example, int(1) and uint(1) will encode differently even though they're the same value. So, it's better if we asserted the value into the specific types we want to allow (like I showed in the binHash example). For now, I think allowing only []byte is enough. If we want to allow more types, we can extend it.


Comments from Reviewable

@sougou
Copy link
Contributor

sougou commented Apr 25, 2016

Reviewed 2 of 4 files at r4.
Review status: 2 of 4 files reviewed at latest revision, 1 unresolved discussion.


go/vt/vtgate/vindexes/varchar_hash.go, line 54 [r5] (raw file):
Sorry, my previous comment got sent out prematurely.

So, I think I confused you with my original comment. Basically, the binHash implementation I gave should have just replaced GetBytes. Everything else was fine. The older getHash function for varchar that performed the lower-casing was also fine, except that it would work only if you lower-cased before hashing.


Comments from Reviewable

@ashudeep-sharma-zz
Copy link
Contributor Author

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions.


go/vt/vtgate/vindexes/utf8cihash.go, line 54 [r5] (raw file):
I have made the changes as per your suggestions, the binhash function now accepts a byte array and all the additional checks for normalization ha been moved to external functions


go/vt/vtgate/vindexes/varchar_hash.go, line 42 [r3] (raw file):
Done.


Comments from Reviewable

@sougou
Copy link
Contributor

sougou commented Apr 25, 2016

LGTM
I'll look at how to make it utf8 collation compliant after the submit.


Reviewed 6 of 6 files at r6.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

Approved with PullApprove

@sougou sougou merged commit 2ce36f3 into master Apr 25, 2016
@ashudeep-sharma-zz ashudeep-sharma-zz deleted the vindex_varchar_support branch April 26, 2016 04:28
@sougou
Copy link
Contributor

sougou commented Apr 26, 2016

I've been doing some reading and testing about unicode collations. The good news is that I've found the perfect Go library to perform the correct normalization. It produces a normalized 'key' that is the lexical equivalent of the string.
The bad news is that the 3DES hashing function doesn't work as expected. It basically works in 64-bit blocks. Each call to block3DES.Encrypt only encrypts the first 64 bits. It actually crashes for short strings like "T". So, I have the following options:

  1. Produce a 64-bit keyspace id using the first 4 characters of the input; 4 bytes of input produces 8 bytes (64-bits) of normalized key.
  2. Prefix the string with the hash of the first 4 characters.
  3. Hash the entire input by dividing it into 64-bit chunks.

I'll provide an initial implementation using option 1. If that's insufficient, we can look at the other options.

@enisoc
Copy link
Member

enisoc commented Apr 26, 2016

Option 1 seems likely to bite us down the road. It may work as a custom vindex plugin for a given user, but it shouldn't be shipped along with Vitess core in that case because it has misleading properties. I could imagine wanting to shard on a varchar that often starts with the same 4 characters, for example an app-specific entity ID string like "uid-1234".

For option 3, how would you combine the resulting encrypted chunks? Would you just concatenate and have variable-length keyspace IDs?

Can we just use an irreversible fixed-width hash instead, since normalization is not reversible anyway?

@sougou
Copy link
Contributor

sougou commented Apr 26, 2016

Hmm. Fixed prefix would cause problems even for option 3 because all keys will start with the hash of the first 4 characters.
We'll have to go for stronger encryption, which won't be cheap. Or the simpler solution will be to not hash at all, instead go for non-uniform sharding. I'm leaning towards the latter.
@ashudeep-sharma what do you think?

@enisoc
Copy link
Member

enisoc commented Apr 26, 2016

What about option 4: checksum the whole normalized key with something like MD5 or SHA-N?

@sougou
Copy link
Contributor

sougou commented Apr 27, 2016

MD5 looks good. I also looked up CRC, which they say is more efficient, but less random.
If people can put up with non-uniform sharding, some benefit can be derived from the unicode ordering. I'm leaning towards MD5, while we wait for @ashudeep-sharma's response.

@sougou
Copy link
Contributor

sougou commented Apr 27, 2016

Since vindex behavior should never be changed (once people start using it), I'm going to rename these to more accurately represent what they do:
varbinaryHash -> binaryMD5, vindex name: "binary_md5".
UTF8ciHash -> UnicodeLooseMD5, vindex name: "unicode_loose_md5"
The Loose term was coined by Go to indicate the most permissive matching. The technically correct term would be L1 (http://www.unicode.org/reports/tr10/#Multi_Level_Comparison), but it may be harder to explain.
Open to opinions, suggestions.

@ashudeep-sharma-zz
Copy link
Contributor Author

The first option would definitely be a problem for use-cases where-in teams use some prefix, in our case lets say a tracking-id always starts from "FMCP".I tried searching for some other options but i think MD5 looks more promising.

@ashudeep-sharma-zz
Copy link
Contributor Author

In my earlier companies, we have used Jenkins Hash which ensures evenness in distribution, but came across this post, it seems like it is not that optimized in GO.
http://stackoverflow.com/questions/23436358/bob-jenkins-hash-getting-bad-performance

@sougou
Copy link
Contributor

sougou commented Apr 27, 2016

I'm working on the change. You can see progress here https://github.com/sougou/vitess/commits/v3.
I'll send the PR out once I resolve the []byte vs string issue @ashudeep-sharma encountered earlier today.

frouioui added a commit to planetscale/vitess that referenced this pull request Nov 21, 2023
* Increase retries and timeout for expectation (vitessio#1628)

This still flakes with the current setup, but the total wait time today
is only a maximum of 50ms. With increasing the retries and the time
between them, we now retry for up to 5 seconds.

Anecdotally, Actions can see stalls up to like 2 seconds so we need a
large retry time window here to improve test reliability.

This also means we can remove a lot of other sleeps here since the
expectation will retry anyway, improving the happy path test
performance.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* boost: Implement logic to mimic global routing in Vitess for Boost (vitessio#1636)

* boost: Implement logic to mimic global routing in Vitess for Boost

In Vitess if there's no keyspace currently active, it's still possible
to route a query to the right keyspace. We don't currently support this
in Boost which means there's a mismatch between how it runs directly vs.
when Boost is used.

We need this for supporting
planetscale/boost-internal#21 but also need to
relax some constraints on Singularity as well as reverting the changes
in planetscale/api-bb#6463 &
planetscale/api-bb#6476.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Improve error message for no keyspace

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Handle routing cached queries with no keyspace

When you install a query with no keyspace configured, we want to be able
to match that whether you have a keyspace selected or not. This allows
that to work by excluding the keyspace from the hash but matching it
after the fact against the given keyspace, or allowing it if it's empty.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* boost: Retry leadership election if it fails (vitessio#1649)

When we have an etcd blip right at the moment we also want to do the
leadership election here, we don't correctly retry.

The change here sets up a separate context we can cancel and then retry
so we close the old watcher and spin up a new one in case this happens.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Collations & Evalengine Backport (vitessio#1646)

* sqlparser: Update tooling

Signed-off-by: Vicent Marti <vmg@strn.cat>

* BugFix: Escaping Percentage and Underscore require special handling (vitessio#11823)

* test: add failing end to end test for escaped % and _

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: preserve escaping of \ and % in the parser

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add end to end test and planner test to verify that % and _ handling is correct in Vitess on vtgate level

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: add test to evaluation engine to verify that backslash and underscore and handled correctly even when escaped

Signed-off-by: Manan Gupta <manan@planetscale.com>

* feat: remove unrequired changes

Signed-off-by: Manan Gupta <manan@planetscale.com>

Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: fix sorting in UCA900 collations (vitessio#12555)

* collations: fix sorting in UCA900 collations

When using the fast iterator to _compare_ two strings with an UCA
collation, we need to keep in mind that the weights in the collation are
in BIG ENDIAN (this is the output format for the weight strings, so we
store the weights this way), so comparing them directly will not result
in the proper collation order. They need to be byte-swapped before they
can be compared with an arithmetic operation!

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: comment

Signed-off-by: Vicent Marti <vmg@strn.cat>

---------

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: Refactorings & fixes (vitessio#12554)

* evalengine: add a compiler struct

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: refactor arithmetic ops

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: rename type check APIs

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/json: extract APIs

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/bit: add comment for behavior

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/call: make interface an Expr

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/json: extract conversion API

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/compare: fix multi-comparison behavior

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/convert: extract decimal precision API

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/bit: extract error

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/bytes: extract toNumericHex

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine/integration: move all tests into their own package

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: expose the charset package

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: remove unused file

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: add license headers

Signed-off-by: Vicent Marti <vmg@strn.cat>

---------

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: upgrade to MySQL 8.0.32 (vitessio#12557)

* collations: update from MySQL 8.0.32

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: use an array to store collations

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: remove Init from Collation interface

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: perform eager initialiation and simplify APIs

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: apply utf8mb3 <-> utf8 in both directions

Signed-off-by: Vicent Marti <vmg@strn.cat>

* collations: expose the Binary collation

Signed-off-by: Vicent Marti <vmg@strn.cat>

---------

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: More cleanup (vitessio#12573)

* evalengine: fix multi-comparisons

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: add a limit for REPEAT

Signed-off-by: Vicent Marti <vmg@strn.cat>

* evalengine: Additional fixes for edge cases

This solves edge cases around decimal promotion which follows a
conversion to int64 (not uint64) and then is casted to uint64 for all
bitwise operations.

Also adds a bunch of tests for the hex() function and implemented
conversions for non string like types.

It also found more bugs in MySQL CASE handling, yay.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com>

* boost: update collation APIs

Signed-off-by: Vicent Marti <vmg@strn.cat>

---------

Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com>
Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* boost: Improve upquery generation (vitessio#1650)

This fixes upquery generation for a union with group bys
where there's a literal projection by ensuring we match the generation
of the query to what we do without a join.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* [boost planner] Union on top of union (vitessio#1645)

* Add tests for broken unions

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* attempt to allow building of unions on top of unions

Signed-off-by: Andres Taylor <andres@planetscale.com>

* use the right columns even for UNION ALL

Signed-off-by: Andres Taylor <andres@planetscale.com>

* fix derived tables and projections after TopK

Signed-off-by: Andres Taylor <andres@planetscale.com>

* use group by with count(*) instead of distinct

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>

* flownode: implement proper ordering trace for Readers

Signed-off-by: Vicent Marti <vmg@strn.cat>

* check one the right node

Signed-off-by: Andres Taylor <andres@planetscale.com>

* allow expressions from inside derived table to be handled correctly

Signed-off-by: Andres Taylor <andres@planetscale.com>

* boost: fix indeterministic union parents

Signed-off-by: Vicent Marti <vmg@strn.cat>

* Removed unused distinct flownode

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Signed-off-by: Andres Taylor <andres@planetscale.com>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com>
Co-authored-by: Florent Poinsard <florent.poinsard@outlook.fr>
Co-authored-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Fix test expectation

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Signed-off-by: Vicent Marti <vmg@strn.cat>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Andres Taylor <andres@planetscale.com>
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Co-authored-by: Vicent Martí <42793+vmg@users.noreply.github.com>
Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com>
Co-authored-by: Andres Taylor <andres@planetscale.com>
Co-authored-by: Florent Poinsard <florent.poinsard@outlook.fr>
Co-authored-by: Vicent Marti <vmg@strn.cat>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants