
Convert column to destination charset in DML applications #27

Closed
wants to merge 2 commits

Conversation

jbielick

Related issue: github#290

Description

Avoid DML apply errors by converting character data when the charset changes for a column.

Background

When applying DML events (UPDATE, DELETE, etc.), the values are sprintf'd into a prepared statement with the row snapshot (values). Because a source table's text columns may contain characters that are invalid in the destination column's charset, a conversion step is often necessary. This conversion does not occur when applying DML events, so an error occurs when writing the invalid byte sequences to the ghost table.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain bytes that are invalid as single-byte utf8 sequences. Bytes in the \x80-\xFF range are the most common. When written to the utf8mb4 column without conversion, they fail because they are not valid utf8 encodings.

Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within mysql. I only point this out because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert. The former is a test that fails prior to this commit; the latter succeeds prior to this commit. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated into convert-utf8mb4. They were helpful for TDD.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@shlomi-noach

Tests approved to run

@shlomi-noach

 Error 1300: Invalid utf8mb4 character string: 'BD'; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, convert(? using utf8mb4))

@jbielick utf8mb4 does not exist in 5.5. Please add a file named ignore_versions with the content (5.5), similarly to https://github.com/openark/gh-ost/blob/master/localtests/tz/ignore_versions

@jbielick
Author

@shlomi-noach I was wondering about that, but I found some interesting behavior and the docker image for 5.5.62 does have utf8mb4. Is the dbdeployer version different (compiled without?)?

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.5.62    |
+-----------+
1 row in set (0.00 sec)

mysql> show charset like '%utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+
2 rows in set (0.00 sec)

One thing I was playing around with: instead of convert({binary} using utf8mb4), first convert to the original charset and then to the destination, i.e. convert(convert({binary} using oldcharset) using newcharset), so that mysql knows what the original charset is and can do replacements more accurately. Does it make sense to do this? Crude example below.

For example:

char(189) is ½. In latin1 this is stored as \xBD. In UTF8(mb4) this is \xC2\xBD.

Since we're sending the row + column data to replace as binary (is this correct?), there's no prior context that the charset was latin1 (in this case). So when it's writing the data to the utf8 column it just errors. I honestly don't know why this works correctly in MySQL >=5.6.

mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

The patch I'm working with locally that passes all tests on MySQL 5.5 is this:

diff --git a/go/logic/inspect.go b/go/logic/inspect.go
index 8b78c76..e53565c 100644
--- a/go/logic/inspect.go
+++ b/go/logic/inspect.go
@@ -192,7 +192,7 @@ func (this *Inspector) inspectOriginalAndGhostTables() (err error) {
 			this.migrationContext.MappedSharedColumns.SetEnumValues(column.Name, column.EnumValues)
 		}
 		if column.Name == mappedColumn.Name && column.Charset != mappedColumn.Charset {
-			this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name)
+			this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset)
 		}
 	}
 
diff --git a/go/sql/builder.go b/go/sql/builder.go
index c219f42..cd980b6 100644
--- a/go/sql/builder.go
+++ b/go/sql/builder.go
@@ -42,8 +42,8 @@ func buildColumnsPreparedValues(columns *ColumnList) []string {
 			token = fmt.Sprintf("ELT(?, %s)", column.EnumValues)
 		} else if column.Type == JSONColumnType {
 			token = "convert(? using utf8mb4)"
-		} else if column.charsetConversion {
-			token = fmt.Sprintf("convert(? using %s)", column.Charset)
+		} else if column.charsetConversion != nil {
+			token = fmt.Sprintf("convert(convert(? using %s) using %s)", column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)
 		} else {
 			token = "?"
 		}
@@ -116,8 +116,8 @@ func BuildSetPreparedClause(columns *ColumnList) (result string, err error) {
 			setToken = fmt.Sprintf("%s=ELT(?, %s)", EscapeName(column.Name), column.EnumValues)
 		} else if column.Type == JSONColumnType {
 			setToken = fmt.Sprintf("%s=convert(? using utf8mb4)", EscapeName(column.Name))
-		} else if column.charsetConversion {
-			setToken = fmt.Sprintf("%s=convert(? using %s)", EscapeName(column.Name), column.Charset)
+		} else if column.charsetConversion != nil {
+			setToken = fmt.Sprintf("%s=convert(convert(? using %s) using %s)", EscapeName(column.Name), column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)
 		} else {
 			setToken = fmt.Sprintf("%s=?", EscapeName(column.Name))
 		}
diff --git a/go/sql/types.go b/go/sql/types.go
index f7256cf..cc57e52 100644
--- a/go/sql/types.go
+++ b/go/sql/types.go
@@ -32,6 +32,11 @@ type TimezoneConversion struct {
 	ToTimezone string
 }
 
+type CharsetConversion struct {
+	ToCharset   string
+	FromCharset string
+}
+
 type Column struct {
 	Name                 string
 	IsUnsigned           bool
@@ -40,7 +45,7 @@ type Column struct {
 	EnumValues           string
 	timezoneConversion   *TimezoneConversion
 	enumToTextConversion bool
-	charsetConversion    bool
+	charsetConversion    *CharsetConversion
 	// add Octet length for binary type, fix bytes with suffix "00" get clipped in mysql binlog.
 	// https://github.com/github/gh-ost/issues/909
 	BinaryOctetLength uint
@@ -212,12 +217,12 @@ func (this *ColumnList) SetEnumValues(columnName string, enumValues string) {
 	this.GetColumn(columnName).EnumValues = enumValues
 }
 
-func (this *ColumnList) SetCharsetConversion(columnName string) {
-	this.GetColumn(columnName).charsetConversion = true
+func (this *ColumnList) SetCharsetConversion(columnName string, fromCharset string, toCharset string) {
+	this.GetColumn(columnName).charsetConversion = &CharsetConversion{FromCharset: fromCharset, ToCharset: toCharset}
 }
 
 func (this *ColumnList) IsCharsetConversion(columnName string) bool {
-	return this.GetColumn(columnName).charsetConversion
+	return this.GetColumn(columnName).charsetConversion != nil
 }
 
 func (this *ColumnList) String() string {

Are the changes above preferable? I can push to see what happens in CI first, and can always remove that commit if not preferred.

@shlomi-noach

I was wondering about that, but I found some interesting behavior and the docker image for 5.5.62 does have utf8mb4. Is the dbdeployer version different (compiled without?)?
I honestly don't know why this works correctly in MySQL >=5.6.

Not sure how 5.5.62 would have utf8mb4 as it was introduced in 5.6. I'd say this is a non-interesting problem to dwell on. We don't expect 5.5 to support utf8mb4 therefore we shouldn't spend time wondering about it.

@jbielick
Author

@shlomi-noach sounds good—I'll ignore that version. Are you okay if I keep the changes adding the double-convert? i.e. fmt.Sprintf("%s=convert(convert(? using %s) using %s)", EscapeName(column.Name), column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)

@shlomi-noach

About the double-convert. You know, I just happened to be looking into the same a couple weeks back. I have to say I don't have clear conclusions. Things don't always behave consistently. Sometimes you have to convert(convert(? using binary) using utf8mb4), and sometimes without using binary and sometimes it's like you suggested. I think something is misleading here: the actual character used in the original column. I think, and I'm not sure, that in a latin1 column there can be more than one way to write, say, the character á. But honestly I did not find conclusive evidence.

I have a question that's bugging me: in what way is the test https://github.com/openark/gh-ost/tree/master/localtests/alter-charset-all-dml lacking? Why did it not catch the issue you're solving here?

@shlomi-noach

Related work in Vitess: vitessio/vitess#8322

@jbielick
Author

jbielick commented Jul 13, 2021

Things don't always behave consistently. Sometimes you have to convert(convert(? using binary) using utf8mb4), and sometimes without using binary and sometimes it's like you suggested.

Agreed. Some would say the binary convert works for all cases, but it doesn't work for this case (unless I'm doing something wrong)! Converting char(189) to binary first and then to utf8mb4 yields nothing, no bytes at all:

select convert(convert(0xBD using binary) using utf8mb4);
=> (EMPTY)
select convert(convert(0xBD using latin1) using utf8mb4);
=> ½

I have a question that's bugging me: in what way is the test https://github.com/openark/gh-ost/tree/master/localtests/alter-charset-all-dml lacking? Why did it not catch the issue you're solving here?

Great question. What I concluded was that all of the characters used in that test are either:

  • Already utf8mb4-encoded when written to the latin1 column, because of the latin1 column charset combined with the utf8mb4 connection settings. Sending their binary straight into the utf8mb4 column causes no issues because they're already multi-byte (where applicable).

AND/OR (and this is the more important aspect solved here)

  • All of the characters used in that test have direct utf8 replacements, so no "conversion" is necessary. latin1 allows (and defines) characters in the \x80-\xFF range. If you send a one-byte character in that range to a utf8 column, it's going to error. Those characters, mapped to the utf8 code page, must be two bytes (the first half are prefixed with C2, the latter half with C3; see the jump from 7F to C2A0).

Those 1-byte characters, which can be validly stored in latin1 columns, are what's missing from that test. Because the REPLACE statements generated from the binlog events send the text as binary, the latin1 context is gone and no convert() occurs, causing an error when inserting that 1-byte character sequence into the utf8 column.
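The C2/C3 split described above can be checked directly. A Go sketch (covering 0xA0-0xFF, where MySQL's latin1 matches ISO-8859-1 byte-for-byte): every one-byte latin1 character in that range becomes a two-byte UTF-8 sequence whose lead byte is C2 for 0xA0-0xBF and C3 for 0xC0-0xFF.

```go
package main

import "fmt"

// leadByte returns the first UTF-8 byte of the encoding of the
// Unicode code point equal to the latin1 byte b.
func leadByte(b byte) byte {
	return []byte(string(rune(b)))[0]
}

func main() {
	for b := 0xA0; b <= 0xFF; b++ {
		enc := []byte(string(rune(b)))
		want := byte(0xC2)
		if b >= 0xC0 {
			want = 0xC3 // upper half of the range gets a C3 lead byte
		}
		if len(enc) != 2 || enc[0] != want {
			fmt.Printf("unexpected encoding for %#x: % X\n", b, enc)
			return
		}
	}
	fmt.Println("all of 0xA0-0xFF encode as 2 bytes with a C2/C3 lead byte")
}
```

This is why an unconverted one-byte value in that range can never be a valid utf8 column value on its own.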

@jbielick jbielick force-pushed the latin1-utf8mb4-oa branch 2 times, most recently from 0b16a1e to d53534f Compare July 13, 2021 15:36
this test assumes a latin1-encoded table with content containing bytes
in the \x80-\xFF range, which are invalid single-byte characters in
utf8 and cannot be inserted into the altered table when the column
containing these characters is changed to utf8(mb4).

since these characters cannot be inserted, gh-ost fails.
addresses github#290

Note: there is currently no issue backfilling the ghost table when the
character set changes, likely because it's an insert-into-select-from
and it all occurs within mysql.

However, when applying DML events (UPDATE, DELETE, etc) the values are
sprintf'd into a prepared statement and due to the possibility of
migrating text column data containing invalid characters in the
destination charset, a conversion step is often necessary.

For example, when migrating a table/column from latin1 to utf8mb4, the
latin1 column may contain characters that are invalid single-byte utf8
characters. Characters in the \x80-\xFF range are most common. When
written to utf8mb4 column without conversion, they fail as they do not
exist in the utf8 codepage.

Converting these texts/characters to the destination charset using
convert(? using {charset}) will convert appropriately and the
update/replace will succeed.

I only point out the "Note:" above because there are two tests added
for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert

The former is a test that fails prior to this commit. The latter is a
test that succeeds prior to this commit. Both are affected by the code
in this commit.

convert text to original charset, then destination

converting text first to the original charset and then to the
destination charset produces the most consistent results, as inserting
the binary into a utf8-charset column may encounter an error if there is
no prior context of latin1 encoding.

mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

as seen in this failure on 5.5.62

 Error 1300: Invalid utf8mb4 character string: 'BD'; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, convert(? using utf8mb4))
@jbielick
Author

ignore_versions added with (5.5)

@shlomi-noach

Why do I need to keep clicking "Approve to run"? The author of this PR should be clear to go indefinitely once I "approved to run" just once. /rant.

@shlomi-noach

The changes look good to me. Now, how this works is: I can merge this downstream, but I prefer to have this PR run first on GitHub's test flow in production, seeing that there's a fair amount of risk in this PR (namely what if some text now gets transformed incorrectly).

My suggestion is (and sorry for the bureaucracy): push the branch again to github#1003, and I'll comment there to kick off the discussion with the GitHub maintainers, so they can then run this through their test cycle. Makes sense?

@jbielick
Author

@shlomi-noach can do. Will push shortly and tag you.

@yoavnativ

Hi. Can someone please share a build that solves this charset issue? We need to convert a table from latin1 to utf8.

@jbielick jbielick closed this Mar 20, 2024