lightning: specify collation when parquet value to string datum #38391
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
/run-integration-br-test
/cc @lance6716 @D3Hunter
@@ -458,7 +458,7 @@ func setDatumByString(d *types.Datum, v string, meta *parquet.SchemaElement) {
 		ts = ts.UTC()
 		v = ts.Format(utcTimeLayout)
 	}
-	d.SetString(v, "")
+	d.SetString(v, "utf8mb4_bin")
String encoding needs to be considered in several places: one is the string data in the parquet file, and another is the string variables in the Lightning process's memory that the parquet reader produces. Since a Go string is always assumed to be UTF-8 encoded, I think this PR is OK. However, I'm not sure whether a parquet file can use another encoding for string data, in which case the go-parquet reader would wrongly cast it to a Go string without any encoding or decoding.
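To illustrate the concern about casting raw bytes to a Go string: a Go string can hold arbitrary bytes, so a plain `string(raw)` conversion performs no validation. The following is a minimal sketch of an explicit check using the standard `unicode/utf8` package. The helper name `bytesToUTF8String` is hypothetical and is not part of Lightning or the parquet reader.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bytesToUTF8String is a hypothetical helper: it converts raw bytes
// (for example, a byte-array value read from a file) to a Go string
// only if the bytes are valid UTF-8.
func bytesToUTF8String(raw []byte) (string, error) {
	if !utf8.Valid(raw) {
		return "", fmt.Errorf("value is not valid UTF-8: % x", raw)
	}
	return string(raw), nil
}

func main() {
	inputs := [][]byte{
		[]byte("héllo"),    // valid UTF-8
		{0xff, 0xfe, 0x41}, // 0xff can never appear in valid UTF-8
	}
	for _, raw := range inputs {
		s, err := bytesToUTF8String(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", s)
	}
}
```

Whether such a check is needed here depends on what the parquet reader guarantees about string columns, which this sketch does not assume.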
/merge
This pull request has been accepted and is ready to merge. Commit hash: 62d9688
In response to a cherrypick label: new pull request created: #38487.
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
In response to a cherrypick label: new pull request created: #38488.
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
In response to a cherrypick label: new pull request created: #38489.
TiDB MergeCI notify: 🔴 Bad News! [1] CI still failing after this PR merged.
What problem does this PR solve?
Issue Number: close #38351
Problem Summary:
What is changed and how it works?
In the parquet parser, when setting a value into the string datum, use the "utf8mb4_bin" collation instead of an empty collation. This prevents the string conversion logic from reporting errors, which improves performance.
Check List
Tests
Side effects
Documentation
Release note