Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for vitess' hashing algorithm #20971

Closed
bezmax opened this issue Nov 11, 2020 · 9 comments · Fixed by #23493
Closed

Support for vitess' hashing algorithm #20971

bezmax opened this issue Nov 11, 2020 · 9 comments · Fixed by #23493
Labels
feature/accepted This feature request is accepted by product managers type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@bezmax
Copy link
Contributor

bezmax commented Nov 11, 2020

Feature Request

Is your feature request related to a problem? Please describe:
As a part of the migration from Vitess to TiDB we need to have a safe way to replicate the changes across two databases and safely go back in case of a disaster. For this process to work and be scalable, it is imperative to be able to efficiently calculate which shard on Vitess' side the data belongs into - so that we can synchronize the data and/or move it back.

Vitess shards data explicitly - provided a column it's value is transformed (using different hashing algorithms), and the hashed value is used to determine which shard to put data into. One of the algorithms for hashing, that Vitess names simply "hash", is uses DES encryption in ECB mode (no iv) with a null-key.

Describe the feature you'd like:
A function in TiDB that is able to replicate described above vitess hashing algorithm: vitess_hash(int|number|string): string.

Describe alternatives you've considered:

  • Currently we are reading every row into memory to determine if it belongs to a shard we are interested in. This is inefficient and not scalable.
  • We have thought about implementing a mysql-compatible des_encrypt function instead. However, mysql does not support ECB mode, therefore the syntax and/or behaviour of this des_encrypt would be different from what is in MySQL specification. Given that DES in general is deprecated - there's no reason to overcomplicate this, and we would rather implement a vitess-specific function instead.

Teachability, Documentation, Adoption, Migration Strategy:
Any company that decides to do a live or semi-live migration from Vitess to TiDB (or vice-versa) would find this function extremely useful.

@bezmax
Copy link
Contributor Author

bezmax commented Nov 11, 2020

I have prepared a set of POC/Draft PRs implementing this feature. Please take a look.

In order PRs need to be merged:

  1. Adding vitess_hash function code to tipb tipb#198
  2. Adding vitess_hash function to ast parser#1089
  3. expression: Implementation of Vitess hashing algorithm. #23493

@bezmax bezmax changed the title Support for vitess' hashing algorithm Draft: Support for vitess' hashing algorithm Nov 11, 2020
@bezmax bezmax changed the title Draft: Support for vitess' hashing algorithm Support for vitess' hashing algorithm Nov 11, 2020
@zz-jason
Copy link
Member

there are lots of vindexes defined in Vitess, see Predefined Vindexes. Should we support all of them?

@bezmax
Copy link
Contributor Author

bezmax commented Nov 11, 2020

@zz-jason The most important ones are "Functional Unique" vindex types, as those can be used as primary vindexes. From what I see, most of them are already supported in one way or another in TiDB:

  • binary, numeric, null - noop
  • binary_md5 - md5 is supported by TiDB
  • numeric_static_map - can be implemented using control flow operators
  • region_experimental, region_json - this is a very niche type of vindexes used in geo-partitioning. Argument could be made for implementing it, but I can also argue that hash vindex is magnitudes more widespread.
  • reverse_bits - TiDB supports bitwise operators
  • hash - this is what this feature request is about
  • unicode_loose_md5 - not exactly sure, wouldn't be surprised if it can be implemented with some collation + md5, or string-transformation + md5.
  • xxhash, unicode_loose_xxhash - might be worth adding support for xxhash into TiDB by itself, not in the scope of this feature request.

@zz-jason
Copy link
Member

We have thought about implementing a mysql-compatible des_encrypt function instead. However, mysql does not support ECB mode, therefore the syntax and/or behaviour of this des_encrypt would be different from what is in MySQL specification. Given that DES in general is deprecated - there's no reason to overcomplicate this, and we would rather implement a vitess-specific function instead.

seems the following is also a workaround instead of introducing a vitess specific function:

  1. introduce a new block_encryption_mode like DES_ECB
  2. support ECB mode for des_encrypt()

@tirsen
Copy link
Contributor

tirsen commented Nov 11, 2020

there are lots of vindexes defined in Vitess, see Predefined Vindexes. Should we support all of them?

The hash vindex is generally the one used to assign shards to rows. Other vindexes are for global indexes ("lookup vindexes") and geo partitioning and other things. We don't need to support those.

@tirsen
Copy link
Contributor

tirsen commented Nov 11, 2020

We have thought about implementing a mysql-compatible des_encrypt function instead. However, mysql does not support ECB mode, therefore the syntax and/or behaviour of this des_encrypt would be different from what is in MySQL specification. Given that DES in general is deprecated - there's no reason to overcomplicate this, and we would rather implement a vitess-specific function instead.

seems the following is also a workaround instead of introducing a vitess specific function:

  1. introduce a new block_encryption_mode like DES_ECB
  2. support ECB mode for des_encrypt()

I would be very wary of supporting DES. It is officially retired https://www.cryptomathic.com/news-events/blog/3des-is-officially-being-retired and is being removed from MySQL.

@zz-jason
Copy link
Member

Since there isn’t a perfect solution to this case, supporting a vitess_hash() builtin function LGTM.

@zz-jason zz-jason added the feature/accepted This feature request is accepted by product managers label Nov 12, 2020
@bezmax
Copy link
Contributor Author

bezmax commented Nov 16, 2020

@zz-jason @kolbe What are the next steps to get my PRs into the master branch? Should I assign them to someone specific? Or is it just a matter of time? Thanks!

@SunRunAway
Copy link
Contributor

We'll take a look shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/accepted This feature request is accepted by product managers type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
4 participants