
Convert column to destination charset in DML applications #27

Closed
wants to merge 2 commits

Conversation

jbielick

Related issue: github#290

Description

Avoid DML apply errors by converting character data when the charset changes for a column.

Background

When applying DML events (UPDATE, DELETE, etc.), the values are sprintf'd into a prepared statement with the row snapshot (values). Because a source table's text columns may contain characters that are invalid in the destination column's charset, a conversion step is often necessary. This conversion does not occur when applying DML events, so an error occurs when writing the invalid byte sequences to the ghost table.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain bytes that are invalid as single-byte utf8 sequences. Bytes in the \x80-\xFF range are the most common. When written to the utf8mb4 column without conversion, they fail because they are not valid utf8 encodings.

Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within mysql. I only point this out because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert. The former is a test that fails prior to this commit; the latter succeeds prior to this commit. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated into convert-utf8mb4. They were helpful for TDD.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@shlomi-noach

Tests approved to run

@shlomi-noach

 Error 1300: Invalid utf8mb4 character string: 'BD'; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, convert(? using utf8mb4))

@jbielick utf8mb4 does not exist in 5.5. Please add a file named ignore_versions with the content (5.5), similarly to https://github.com/openark/gh-ost/blob/master/localtests/tz/ignore_versions

@jbielick
Author

@shlomi-noach I was wondering about that, but I found some interesting behavior and the docker image for 5.5.62 does have utf8mb4. Is the dbdeployer version different (compiled without?)?

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.5.62    |
+-----------+
1 row in set (0.00 sec)

mysql> show charset like '%utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
+---------+---------------+--------------------+--------+
2 rows in set (0.00 sec)

One thing I was playing around with: instead of convert({binary} using utf8mb4), first convert to the original charset and then to the destination, i.e. convert(convert({binary} using oldcharset) using newcharset), so that mysql knows what the original charset is and can do replacements more accurately. Does it make sense to do this? Crude example below.

For example:

char(189) is ½. In latin1 this is stored as \xBD. In UTF8(mb4) this is \xC2\xBD.

Since we're sending the row + column data to replace as binary (is this correct?), there's no prior context that the charset was latin1 (in this case). So when it's writing the data to the utf8 column it just errors. I honestly don't know why this works correctly in MySQL >=5.6.

mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

The patch I'm working with locally that passes all tests on MySQL 5.5 is this:

diff --git a/go/logic/inspect.go b/go/logic/inspect.go
index 8b78c76..e53565c 100644
--- a/go/logic/inspect.go
+++ b/go/logic/inspect.go
@@ -192,7 +192,7 @@ func (this *Inspector) inspectOriginalAndGhostTables() (err error) {
 			this.migrationContext.MappedSharedColumns.SetEnumValues(column.Name, column.EnumValues)
 		}
 		if column.Name == mappedColumn.Name && column.Charset != mappedColumn.Charset {
-			this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name)
+			this.migrationContext.MappedSharedColumns.SetCharsetConversion(column.Name, column.Charset, mappedColumn.Charset)
 		}
 	}
 
diff --git a/go/sql/builder.go b/go/sql/builder.go
index c219f42..cd980b6 100644
--- a/go/sql/builder.go
+++ b/go/sql/builder.go
@@ -42,8 +42,8 @@ func buildColumnsPreparedValues(columns *ColumnList) []string {
 			token = fmt.Sprintf("ELT(?, %s)", column.EnumValues)
 		} else if column.Type == JSONColumnType {
 			token = "convert(? using utf8mb4)"
-		} else if column.charsetConversion {
-			token = fmt.Sprintf("convert(? using %s)", column.Charset)
+		} else if column.charsetConversion != nil {
+			token = fmt.Sprintf("convert(convert(? using %s) using %s)", column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)
 		} else {
 			token = "?"
 		}
@@ -116,8 +116,8 @@ func BuildSetPreparedClause(columns *ColumnList) (result string, err error) {
 			setToken = fmt.Sprintf("%s=ELT(?, %s)", EscapeName(column.Name), column.EnumValues)
 		} else if column.Type == JSONColumnType {
 			setToken = fmt.Sprintf("%s=convert(? using utf8mb4)", EscapeName(column.Name))
-		} else if column.charsetConversion {
-			setToken = fmt.Sprintf("%s=convert(? using %s)", EscapeName(column.Name), column.Charset)
+		} else if column.charsetConversion != nil {
+			setToken = fmt.Sprintf("%s=convert(convert(? using %s) using %s)", EscapeName(column.Name), column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)
 		} else {
 			setToken = fmt.Sprintf("%s=?", EscapeName(column.Name))
 		}
diff --git a/go/sql/types.go b/go/sql/types.go
index f7256cf..cc57e52 100644
--- a/go/sql/types.go
+++ b/go/sql/types.go
@@ -32,6 +32,11 @@ type TimezoneConversion struct {
 	ToTimezone string
 }
 
+type CharsetConversion struct {
+	ToCharset   string
+	FromCharset string
+}
+
 type Column struct {
 	Name                 string
 	IsUnsigned           bool
@@ -40,7 +45,7 @@ type Column struct {
 	EnumValues           string
 	timezoneConversion   *TimezoneConversion
 	enumToTextConversion bool
-	charsetConversion    bool
+	charsetConversion    *CharsetConversion
 	// add Octet length for binary type, fix bytes with suffix "00" get clipped in mysql binlog.
 	// https://github.com/github/gh-ost/issues/909
 	BinaryOctetLength uint
@@ -212,12 +217,12 @@ func (this *ColumnList) SetEnumValues(columnName string, enumValues string) {
 	this.GetColumn(columnName).EnumValues = enumValues
 }
 
-func (this *ColumnList) SetCharsetConversion(columnName string) {
-	this.GetColumn(columnName).charsetConversion = true
+func (this *ColumnList) SetCharsetConversion(columnName string, fromCharset string, toCharset string) {
+	this.GetColumn(columnName).charsetConversion = &CharsetConversion{FromCharset: fromCharset, ToCharset: toCharset}
 }
 
 func (this *ColumnList) IsCharsetConversion(columnName string) bool {
-	return this.GetColumn(columnName).charsetConversion
+	return this.GetColumn(columnName).charsetConversion != nil
 }
 
 func (this *ColumnList) String() string {

Are the changes above preferable? I can push to see what happens in CI first, and can always remove that commit if not preferred.

@shlomi-noach

I was wondering about that, but I found some interesting behavior and the docker image for 5.5.62 does have utf8mb4. Is the dbdeployer version different (compiled without?)?
I honestly don't know why this works correctly in MySQL >=5.6.

Not sure how 5.5.62 would have utf8mb4 as it was introduced in 5.6. I'd say this is a non-interesting problem to dwell on. We don't expect 5.5 to support utf8mb4 therefore we shouldn't spend time wondering about it.

@jbielick
Author

@shlomi-noach sounds good—I'll ignore that version. Are you okay if I keep the changes adding the double-convert? i.e. fmt.Sprintf("%s=convert(convert(? using %s) using %s)", EscapeName(column.Name), column.charsetConversion.FromCharset, column.charsetConversion.ToCharset)

@shlomi-noach

About the double-convert. You know, I just happened to be looking into the same a couple weeks back. I have to say I don't have clear conclusions. Things don't always behave consistently. Sometimes you have to convert(convert(? using binary) using utf8mb4), and sometimes without using binary and sometimes it's like you suggested. I think something is misleading here: the actual character used in the original column. I think, and I'm not sure, that in a latin1 column there can be more than one way to write, say, the character á. But honestly I did not find conclusive evidence.

I have a question that's bugging me: in what way is the test https://github.com/openark/gh-ost/tree/master/localtests/alter-charset-all-dml lacking? Why did it not catch the issue you're solving here?

@shlomi-noach

Related work in Vitess: vitessio/vitess#8322

@jbielick
Author

jbielick commented Jul 13, 2021

Things don't always behave consistently. Sometimes you have to convert(convert(? using binary) using utf8mb4), and sometimes without using binary and sometimes it's like you suggested.

Agreed. Some would say the binary convert works for all cases, but it doesn't work for this case (unless I'm doing something wrong)! Converting char(189) to binary first and then to utf8mb4 yields nothing, no bytes at all:

select convert(convert(0xBD using binary) using utf8mb4);
=> (EMPTY)
select convert(convert(0xBD using latin1) using utf8mb4);
=> ½

I have a question that's bugging me: in what way is the test https://github.com/openark/gh-ost/tree/master/localtests/alter-charset-all-dml lacking? Why did it not catch the issue you're solving here?

Great question. What I concluded was that all of the characters used in that test are either:

  • Already utf8mb4-encoded when written to the latin1 column, because of the latin1 column charset combined with the utf8mb4 connection settings. Sending their binary straight into the utf8mb4 column causes no issues because they're already multi-byte (where applicable).

AND/OR (and this is the more important aspect solved here)

  • All of the characters used in that test have direct utf8 replacements, so no "conversion" is necessary. latin1 allows (and defines) characters in the \x80-\xFF range. If you send a one-byte character in that range to a utf8 column, it's going to error. Those characters, mapped to the utf8 code page, must be two bytes (the first half are prefixed with C2, the latter half with C3; see the jump from 7F to C2A0).

Those 1-byte characters, which can be validly stored in latin1 columns, are what's missing from that test. Because the REPLACE statements generated from the binlog events send the text as binary, the latin1 context is gone and no convert() occurs, causing an error when inserting that 1-byte character sequence into the utf8 column.
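The C2/C3 split described above can be checked directly. A Go sketch (covering 0xA0-0xFF, where MySQL's latin1 matches ISO-8859-1 byte-for-byte): every one-byte latin1 character in that range becomes a two-byte UTF-8 sequence whose lead byte is C2 for 0xA0-0xBF and C3 for 0xC0-0xFF.

```go
package main

import "fmt"

// leadByte returns the first UTF-8 byte of the encoding of the
// Unicode code point equal to the latin1 byte b.
func leadByte(b byte) byte {
	return []byte(string(rune(b)))[0]
}

func main() {
	for b := 0xA0; b <= 0xFF; b++ {
		enc := []byte(string(rune(b)))
		want := byte(0xC2)
		if b >= 0xC0 {
			want = 0xC3 // upper half of the range gets a C3 lead byte
		}
		if len(enc) != 2 || enc[0] != want {
			fmt.Printf("unexpected encoding for %#x: % X\n", b, enc)
			return
		}
	}
	fmt.Println("all of 0xA0-0xFF encode as 2 bytes with a C2/C3 lead byte")
}
```

This is why an unconverted one-byte value in that range can never be a valid utf8 column value on its own.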

@jbielick jbielick force-pushed the latin1-utf8mb4-oa branch 2 times, most recently from 0b16a1e to d53534f Compare July 13, 2021 15:36
this test assumes a latin1-encoded table with content containing bytes
in the \x80-\xFF range, which are invalid single-byte characters in
utf8 and cannot be inserted into the altered table when the column
containing these characters is changed to utf8(mb4).

since these characters cannot be inserted, gh-ost fails.
addresses github#290

Note: there is currently no issue backfilling the ghost table when the
character set changes, likely because it's an insert-into-select-from
and it all occurs within mysql.

However, when applying DML events (UPDATE, DELETE, etc) the values are
sprintf'd into a prepared statement and due to the possibility of
migrating text column data containing invalid characters in the
destination charset, a conversion step is often necessary.

For example, when migrating a table/column from latin1 to utf8mb4, the
latin1 column may contain characters that are invalid single-byte utf8
characters. Characters in the \x80-\xFF range are most common. When
written to utf8mb4 column without conversion, they fail as they do not
exist in the utf8 codepage.

Converting these texts/characters to the destination charset using
convert(? using {charset}) will convert appropriately and the
update/replace will succeed.

I only point out the "Note:" above because there are two tests added
for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert

The former is a test that fails prior to this commit. The latter is a
test that succeeds prior to this commit. Both are affected by the code
in this commit.

convert text to original charset, then destination

converting text first to the original charset and then to the
destination charset produces the most consistent results, as inserting
the binary into a utf8-charset column may encounter an error if there is
no prior context of latin1 encoding.

mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

as seen in this failure on 5.5.62

 Error 1300: Invalid utf8mb4 character string: 'BD'; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, convert(? using utf8mb4))
@jbielick
Author

ignore_versions added with (5.5)

@shlomi-noach

Why do I need to keep clicking "Approve to run"? The author of this PR should be clear to go indefinitely once I "approved to run" just once. /rant.

@shlomi-noach

The changes look good to me. Now, how this works is: I can merge this downstream, but I prefer to have this PR run first on GitHub's test flow in production, seeing that there's a fair amount of risk in this PR (namely what if some text now gets transformed incorrectly).

My suggestion is (and sorry for the bureaucracy): push the branch again to github#1003, and I'll comment there to kick off the discussion with the GitHub maintainers, so they can then run this through their test cycle. Makes sense?

@jbielick
Author

@shlomi-noach can do. Will push shortly and tag you.

@yoavnativ

Hi. Can someone please share a build that solves this charset issue? We need to convert a table from latin1 to utf8.

@jbielick jbielick closed this Mar 20, 2024