table: check valid UTF8 value for UTF8 column #1818

Merged 4 commits on Oct 14, 2016

Conversation

coocood
Member

@coocood coocood commented Oct 11, 2016

We need to check that the input data is a valid UTF-8 string when the column charset is UTF8.
Otherwise, invalid data may cause more serious errors later that are hard to fix.

This prevents future issues like #1808.

@shenli
Member

shenli commented Oct 13, 2016

Will this PR hurt performance?

@coocood
Member Author

coocood commented Oct 13, 2016

@shenli Yes, when we insert into or update a utf8 column, but that is a cost we have to pay.
I checked the Go implementation, which checks the string one byte at a time. The Rust implementation checks the string one word (64 bits) at a time, which should be more efficient. We can optimize this if it becomes a bottleneck.

utf8.Valid adds about 2 ns per byte; in other words, it can process about 500 MB of data per second, which is a very small proportion of the cost of our insert operation.
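
For anyone who wants to reproduce the estimate, here is a minimal sketch that converts the quoted per-byte cost into throughput and takes a rough timing of utf8.Valid. The helper names `bytesPerSecond` and `measureValid` and the sample string are made up for illustration and are not part of this PR; measured numbers will vary with hardware and Go version.

```go
package main

import (
	"fmt"
	"strings"
	"time"
	"unicode/utf8"
)

// bytesPerSecond converts a per-byte cost in nanoseconds into throughput.
// (Hypothetical helper, not part of the PR.)
func bytesPerSecond(nsPerByte float64) float64 {
	return 1e9 / nsPerByte
}

// measureValid times utf8.Valid over data for the given number of rounds
// and returns the average cost per byte in nanoseconds.
// (Hypothetical helper; a rough timing, not a rigorous benchmark.)
func measureValid(data []byte, rounds int) float64 {
	start := time.Now()
	for i := 0; i < rounds; i++ {
		if !utf8.Valid(data) {
			panic("sample data should be valid UTF-8")
		}
	}
	elapsed := time.Since(start)
	return float64(elapsed.Nanoseconds()) / float64(len(data)*rounds)
}

func main() {
	// 2 ns per byte is the figure quoted in the comment above.
	fmt.Printf("%.0f MB/s at 2 ns/byte\n", bytesPerSecond(2)/1e6)

	// Mixed ASCII and multi-byte sample text, chosen arbitrarily.
	data := []byte(strings.Repeat("héllo, 世界 ", 1<<16))
	fmt.Printf("measured: %.3f ns/byte\n", measureValid(data, 20))
}
```

For a more careful measurement, the same loop could be moved into a `testing.B` benchmark and run with `go test -bench`.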

@shenli
Member

shenli commented Oct 13, 2016

LGTM

@@ -212,6 +213,9 @@ func (t *Table) UpdateRecord(ctx context.Context, h int64, oldData []types.Datum
}
currentData[i] = defaultVal
}
if col.Charset == "utf8" && !utf8.Valid(currentData[i].GetBytes()) {
Member

Should we also check col.Tp?

Member Author

Only string types use the 'utf8' charset.
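
To make the check under review easier to follow outside the diff, here is a standalone sketch of the same condition. The `column` struct, the `checkUTF8` helper, the error text, and the "binary" charset value are assumptions made for illustration; only the `Charset == "utf8"` comparison and the call to `utf8.Valid` come from the diff above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// column is a hypothetical stand-in for the table column metadata.
// Only the Charset field mirrors the name used in the diff.
type column struct {
	Name    string
	Charset string
}

// checkUTF8 returns an error when a utf8 column is given bytes that are not
// valid UTF-8. Columns with any other charset are not checked.
func checkUTF8(col column, value []byte) error {
	if col.Charset == "utf8" && !utf8.Valid(value) {
		return fmt.Errorf("invalid UTF-8 value for column %q", col.Name)
	}
	return nil
}

func main() {
	name := column{Name: "name", Charset: "utf8"}
	blob := column{Name: "payload", Charset: "binary"} // assumed charset value
	bad := []byte{0xff, 0xfe}                          // never valid in UTF-8

	fmt.Println(checkUTF8(name, []byte("héllo"))) // valid input, no error
	fmt.Println(checkUTF8(name, bad))             // rejected
	fmt.Println(checkUTF8(blob, bad))             // not a utf8 column, skipped
}
```

Because the charset test comes first, non-utf8 columns skip the validation pass entirely, so the extra cost is only paid for utf8 columns.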

@coocood
Member Author

coocood commented Oct 14, 2016

@zimulala PTAL

@zimulala
Contributor

LGTM

@zimulala zimulala merged commit ff1c7b1 into master Oct 14, 2016
@zimulala zimulala deleted the coocood/check-utf8-validity branch October 14, 2016 08:00