Convert column to destination charset in DML applications #27
@jbielick |
@shlomi-noach I was wondering about that, but I found some interesting behavior and the docker image for
One thing I was playing around with was instead of For example:
Since we're sending the row + column data to replace as binary (is this correct?), there's no prior context that the charset was
The patch I'm working with locally, which passes all tests on MySQL 5.5, is this:
diff --git a/go/logic/inspect.go b/go/logic/inspect.go
index 8b78c76..e53565c 100644
--- a/go/logic/inspect.go
+++ b/go/logic/inspect.go
@@ -192,7 +192,7 @@ func (this *Inspector) inspectOriginalAndGhostTables() (err error) {
this.migrationContext.MappedSharedColumns.SetEnumValues(column.Name, column.EnumValues)
}
if column.Name == mappedColumn.Name && column.Charset != mappedColumn.Charset {
- this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name)
+ this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset)
}
}
diff --git a/go/sql/builder.go b/go/sql/builder.go
index c219f42..cd980b6 100644
--- a/go/sql/builder.go
+++ b/go/sql/builder.go
@@ -42,8 +42,8 @@ func buildColumnsPreparedValues(columns *ColumnList) []string {
token = fmt.Sprintf("ELT(?, %s)", column.EnumValues)
} else if column.Type == JSONColumnType {
token = "convert(? using utf8mb4)"
- } else if column.charsetConversion {
- token = fmt.Sprintf("convert(? using %s)", column.Charset)
+ } else if column.charsetConversion != nil {
+ token = fmt.Sprintf("convert(convert(? using %s) using %s)", column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)
} else {
token = "?"
}
@@ -116,8 +116,8 @@ func BuildSetPreparedClause(columns *ColumnList) (result string, err error) {
setToken = fmt.Sprintf("%s=ELT(?, %s)", EscapeName(column.Name), column.EnumValues)
} else if column.Type == JSONColumnType {
setToken = fmt.Sprintf("%s=convert(? using utf8mb4)", EscapeName(column.Name))
- } else if column.charsetConversion {
- setToken = fmt.Sprintf("%s=convert(? using %s)", EscapeName(column.Name), column.Charset)
+ } else if column.charsetConversion != nil {
+ setToken = fmt.Sprintf("%s=convert(convert(? using %s) using %s)", EscapeName(column.Name), column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)
} else {
setToken = fmt.Sprintf("%s=?", EscapeName(column.Name))
}
diff --git a/go/sql/types.go b/go/sql/types.go
index f7256cf..cc57e52 100644
--- a/go/sql/types.go
+++ b/go/sql/types.go
@@ -32,6 +32,11 @@ type TimezoneConversion struct {
ToTimezone string
}
+type CharsetConversion struct {
+ ToCharset string
+ FromCharset string
+}
+
type Column struct {
Name string
IsUnsigned bool
@@ -40,7 +45,7 @@ type Column struct {
EnumValues string
timezoneConversion *TimezoneConversion
enumToTextConversion bool
- charsetConversion bool
+ charsetConversion *CharsetConversion
// add Octet length for binary type, fix bytes with suffix "00" get clipped in mysql binlog.
// https://github.com/github/gh-ost/issues/909
BinaryOctetLength uint
@@ -212,12 +217,12 @@ func (this *ColumnList) SetEnumValues(columnName string, enumValues string) {
this.GetColumn(columnName).EnumValues = enumValues
}
-func (this *ColumnList) SetCharsetConversion(columnName string) {
- this.GetColumn(columnName).charsetConversion = true
+func (this *ColumnList) SetCharsetConversion(columnName string, fromCharset string, toCharset string) {
+ this.GetColumn(columnName).charsetConversion = &CharsetConversion{FromCharset: fromCharset, ToCharset: toCharset}
}
func (this *ColumnList) IsCharsetConversion(columnName string) bool {
- return this.GetColumn(columnName).charsetConversion
+ return this.GetColumn(columnName).charsetConversion != nil
}
func (this *ColumnList) String() string {
Are the changes above preferable? I can push to see what happens in CI first; I can always remove that commit if not preferred.
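For reference, the token construction from the patch above can be sketched in isolation. This is only an illustration, not the gh-ost code itself: `CharsetConversion` is copied from the diff, while `preparedValueToken` is a hypothetical helper name that stands in for the relevant branch of `buildColumnsPreparedValues`.

```go
package main

import "fmt"

// CharsetConversion mirrors the struct added in the patch above.
type CharsetConversion struct {
	ToCharset   string
	FromCharset string
}

// preparedValueToken is a hypothetical helper (not part of gh-ost).
// It returns the placeholder used for one column in a prepared statement:
// a plain "?" when no conversion is configured, otherwise the double
// convert that first tags the bound bytes with the source charset and
// then transcodes them to the destination charset.
func preparedValueToken(c *CharsetConversion) string {
	if c == nil {
		return "?"
	}
	return fmt.Sprintf("convert(convert(? using %s) using %s)", c.FromCharset, c.ToCharset)
}

func main() {
	fmt.Println(preparedValueToken(nil))
	fmt.Println(preparedValueToken(&CharsetConversion{FromCharset: "latin1", ToCharset: "utf8mb4"}))
}
```

Using a pointer rather than a bool lets the nil check double as the "is a conversion needed" flag, which is the same choice the patch makes for `charsetConversion`.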
Not sure how |
@shlomi-noach sounds good—I'll ignore that version. Are you okay if I keep the changes adding the double-convert? i.e. |
About the double-convert. You know, I just happened to be looking into the same a couple weeks back. I have to say I don't have clear conclusions. Things don't always behave consistently. Sometimes you have to
I have a question that's bugging me: in what way is the test https://github.com/openark/gh-ost/tree/master/localtests/alter-charset-all-dml lacking? Why did it not catch the issue you're solving here?
Related work in Vitess: vitessio/vitess#8322 |
Agreed. Some would say the binary convert works for all cases, but it doesn't work for this case (unless I'm doing something wrong)! Converting
Great question. What I concluded was that all of the characters used in that test are either:
AND/OR (and this is the more important aspect solved here)
Those 1-byte characters, which can be validly stored in latin1 columns, are what's missing from that test. Because the |
this test assumes a latin1-encoded table with content containing bytes in the \x80-\xFF range, which are invalid single-byte characters in utf8 and cannot be inserted in the altered table when the column containing these characters is changed to utf8(mb4). since these characters cannot be inserted, gh-ost fails.
addresses github#290

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within MySQL. However, when applying DML events (UPDATE, DELETE, etc.) the values are sprintf'd into a prepared statement, and because the text column data being migrated may contain characters that are invalid in the destination charset, a conversion step is often necessary.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain characters that are invalid single-byte utf8 characters. Characters in the \x80-\xFF range are most common. When written to a utf8mb4 column without conversion, they fail as they do not exist in the utf8 codepage.

Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.

I only point out the "Note:" above because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert. The former is a test that fails prior to this commit. The latter is a test that succeeds prior to this commit. Both are affected by the code in this commit.

convert text to original charset, then destination

converting text first to the original charset and then to the destination charset produces the most consistent results, as inserting the binary into a utf8-charset column may encounter an error if there is no prior context of latin1 encoding.
mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

as seen in this failure on 5.5.62:

Error 1300: Invalid utf8mb4 character string: 'BD'; query= replace /* gh-ost `test`.`_gh_ost_test_gho` */ into `test`.`_gh_ost_test_gho` (`id`, `t`) values (?, convert(? using utf8mb4))
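The same byte-level behavior can be reproduced outside MySQL. The sketch below is an illustration in Go, not part of this PR: `needsConversion` and `latin1ToUTF8` are hypothetical helper names, and the conversion assumes a plain ISO-8859-1 mapping where each byte equals its Unicode code point. MySQL's latin1 is documented as cp1252, so bytes in \x80-\x9F may map differently there; byte 0xBD (char(189)) is the same in both.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"unicode/utf8"
)

// needsConversion reports whether the raw bytes are not valid UTF-8 as-is,
// i.e. whether writing them unchanged into a utf8 column would be rejected.
// Hypothetical helper for illustration only.
func needsConversion(b []byte) bool {
	return !utf8.Valid(b)
}

// latin1ToUTF8 re-encodes bytes as UTF-8, assuming ISO-8859-1 input where
// every byte value is also the Unicode code point of the character.
// Hypothetical helper for illustration only.
func latin1ToUTF8(b []byte) []byte {
	out := make([]byte, 0, len(b)*2)
	var buf [utf8.UTFMax]byte
	for _, c := range b {
		n := utf8.EncodeRune(buf[:], rune(c))
		out = append(out, buf[:n]...)
	}
	return out
}

func main() {
	raw := []byte{0xBD} // same byte as char(189) in the queries above

	fmt.Println(needsConversion(raw))                  // true
	fmt.Println(hex.EncodeToString(latin1ToUTF8(raw))) // c2bd
}
```

A lone 0xBD is a UTF-8 continuation byte with no lead byte, which is why it is rejected on its own, while interpreting it as latin1 first yields U+00BD and the two-byte encoding C2 BD seen in the second query.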
ignored_versions added with |
Why do I need to keep clicking "Approve to run"? The author of this PR should be clear to go indefinitely once I "approved to run" just once. /rant. |
The changes look good to me. Now, how this works is: I can merge this downstream, but I prefer to have this PR run first on GitHub's test flow in production, seeing that there's a fair amount of risk in this PR (namely what if some text now gets transformed incorrectly). My suggestion is (and sorry for the bureaucracy) : push the branch again to github#1003, and I'll comment there to kick the discussion with the GitHub maintainers, so they can then run this through their test cycle. Makes sense? |
@shlomi-noach can do. Will push shortly and tag you. |
Hi. Can someone please share a build that solves that char set issue? We need to convert a table from Latin1 to UTF8. |
Related issue: github#290
Description
Avoid DML apply errors by converting character data when charset changes for a column.

Background
When applying DML events (UPDATE, DELETE, etc.) the values are sprintf'd into a prepared statement with the row snapshot (values). Because the text column data being migrated may contain characters in the source table that are invalid in the destination table (due to charset), a conversion step is often necessary. This conversion does not occur when applying DML events, and an error occurs writing invalid byte sequences to the ghost table.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain characters that are invalid single-byte utf8 characters. Characters in the \x80-\xFF range are most common. When written to a utf8mb4 column without conversion, they fail as they do not exist in the utf8 codepage.

Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within MySQL. I only point this out because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert. The former is a test that fails prior to this commit. The latter is a test that succeeds prior to this commit. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated into convert-utf8mb4. They were helpful to do TDD.

script/cibuild returns with no formatting errors, build errors or unit test errors.