Convert column to destination charset in DML applications #27

Closed
wants to merge 2 commits into from

Commits on Jul 13, 2021

  1. add failing test for github#290, invalid utf8 char

    this test assumes a latin1-encoded table with content containing bytes
    in the \x80-\xFF range, which are invalid single-byte characters in
    utf8 and cannot be inserted in the altered table when the column
    containing these characters is changed to utf8(mb4).
    
    since these characters cannot be inserted, gh-ost fails.
    jbielick committed Jul 13, 2021
    b3f487c
  2. copy and update text using convert when charset changes

    addresses github#290
    
    Note: there is currently no issue backfilling the ghost table when the
    character set changes, likely because it's an insert-into-select-from
    and it all occurs within mysql.
    
    However, when applying DML events (UPDATE, DELETE, etc) the values are
    sprintf'd into a prepared statement and due to the possibility of
    migrating text column data containing invalid characters in the
    destination charset, a conversion step is often necessary.
    
    For example, when migrating a table/column from latin1 to utf8mb4, the
    latin1 column may contain characters that are invalid single-byte utf8
    characters. Characters in the \x80-\xFF range are most common. When
    written to a utf8mb4 column without conversion, they fail because
    these bytes are not valid utf8 sequences on their own.
    
    Converting these texts/characters to the destination charset using
    convert(? using {charset}) will convert appropriately and the
    update/replace will succeed.
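
The byte-level problem described above can be reproduced outside MySQL. The following is a minimal Python sketch, not gh-ost code: the helper names `is_valid_utf8` and `latin1_to_utf8` are hypothetical, and Python's `latin-1` codec (ISO-8859-1) is assumed as a stand-in for MySQL's latin1, which is closer to cp1252 and differs for some bytes in the \x80-\x9F range.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(raw: bytes) -> bytes:
    """Interpret raw as latin1 text, then re-encode it as UTF-8.

    Assumption: Python's 'latin-1' codec approximates MySQL's latin1.
    """
    return raw.decode("latin-1").encode("utf-8")


# Each byte in \x80-\xFF is a complete latin1 character, but none of them
# is a complete UTF-8 sequence when it stands alone.
high_bytes = [bytes([b]) for b in range(0x80, 0x100)]
invalid_alone = [b for b in high_bytes if not is_valid_utf8(b)]

if __name__ == "__main__":
    print(len(invalid_alone), "of", len(high_bytes), "bytes are invalid alone")
    print(latin1_to_utf8(b"\xbd").hex().upper())
```

Declaring the source encoding before re-encoding is what makes the write succeed; the raw byte alone carries no information about which charset it came from.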
    
    I only point out the "Note:" above because there are two tests added
    for this: latin1text-to-utf8mb4 and latin1text-to-utf8mb4-insert
    
    The former is a test that fails prior to this commit. The latter is a
    test that succeeds prior to this commit. Both are affected by the code
    in this commit.
    
    convert text to original charset, then destination
    
    converting text first to the original charset and then to the
    destination charset produces the most consistent results, as inserting
    the binary into a utf8-charset column may encounter an error if there is
    no prior context of latin1 encoding.
    
    mysql> select hex(convert(char(189) using utf8mb4));
    +---------------------------------------+
    | hex(convert(char(189) using utf8mb4)) |
    +---------------------------------------+
    |                                       |
    +---------------------------------------+
    1 row in set, 1 warning (0.00 sec)
    
    mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
    +-------------------------------------------------------------+
    | hex(convert(convert(char(189) using latin1) using utf8mb4)) |
    +-------------------------------------------------------------+
    | C2BD                                                        |
    +-------------------------------------------------------------+
    1 row in set (0.00 sec)
    
    as seen in this failure on 5.5.62
    
     Error 1300: Invalid utf8mb4 character string: 'BD'; query=
    			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
    				`test`.`_gh_ost_test_gho`
    					(`id`, `t`)
    				values
    					(?, convert(? using utf8mb4))
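
The failing query above uses a single convert; the approach this commit message describes wraps the placeholder in two. A rough Python sketch of how such a placeholder expression could be assembled follows. It is illustrative only and is not gh-ost's implementation (gh-ost is written in Go); the function name `convert_placeholder` and the charset-name check are assumptions added here.

```python
import re

# Charset names are interpolated into SQL text, so restrict them to a
# conservative identifier pattern (an assumption made for this sketch).
_CHARSET_NAME = re.compile(r"[A-Za-z0-9_]+")


def convert_placeholder(from_charset: str, to_charset: str) -> str:
    """Build the value placeholder for a column whose charset changes.

    Hypothetical helper: returns a plain '?' when the charset is unchanged,
    otherwise wraps the placeholder so the value is first interpreted in the
    original charset and then converted to the destination charset.
    """
    for name in (from_charset, to_charset):
        if not _CHARSET_NAME.fullmatch(name):
            raise ValueError(f"unexpected charset name: {name!r}")
    if from_charset == to_charset:
        return "?"
    return f"convert(convert(? using {from_charset}) using {to_charset})"


if __name__ == "__main__":
    placeholders = ["?", convert_placeholder("latin1", "utf8mb4")]
    print("values (" + ", ".join(placeholders) + ")")
```

The value itself is still bound as a parameter; only the charset names become part of the statement text, which is why the sketch validates them.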
    jbielick committed Jul 13, 2021
    799041e