Add 'character set' specification for XRAS ingestors #122
Add a character set specification for XRAS ingestors to handle 4-byte UTF8 data ingested from Postgres.
See also PR ubccr/xdmod-xsede#31.
Description
The XRAS data includes text, likely copied from word processing files, that contains multi-byte UTF8 characters. The Postgres UTF8 encoding supports characters of up to 4 bytes, but the MySQL UTF8 character set supports at most 3 bytes per character unless the UTF8MB4 character set is used.
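The byte-width difference can be illustrated with a minimal Python sketch. The sample characters and the helper names `utf8_width` and `needs_utf8mb4` are chosen here for illustration only; the PR does not say which characters in the XRAS data triggered the failures.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies above U+FFFF, i.e. takes 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Illustrative samples: ASCII letter, accented letter, curly quote,
# and a mathematical italic letter from a supplementary plane.
samples = ["A", "\u00e9", "\u201c", "\U0001d465"]
for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_width(ch)} byte(s) in UTF-8")
```

Only characters above U+FFFF need 4 bytes, so typical word-processor punctuation such as curly quotes still fits in 3-byte MySQL UTF8; it is the supplementary-plane characters that require UTF8MB4.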
From the MySQL LOAD DATA INFILE documentation:
When loading the data on one host into a database on another host, the server uses the character set indicated by the character_set_database system variable (set to utf8 in our case) to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause.
To mitigate the risk of affecting other data, this change will only be applied to ingestor classes whose names start with "XRAS".
Configurable support for specifying a character set will also be added to ETLv2.
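The approach described above can be sketched roughly as follows. This is not the code from this PR: the function names, the example class names, the file path, the use of LOCAL, and the tab field terminator are all assumptions made for illustration. The only parts taken from the description are the "XRAS" class-name prefix check and the use of the CHARACTER SET clause of LOAD DATA INFILE.

```python
from typing import Optional


def charset_for_ingestor(class_name: str) -> Optional[str]:
    """Return a character set only for ingestor classes whose name starts with "XRAS"."""
    # Assumption: utf8mb4 is the character set applied to XRAS ingestors.
    return "utf8mb4" if class_name.startswith("XRAS") else None


def build_load_statement(infile: str, table: str, class_name: str) -> str:
    """Assemble a LOAD DATA statement, adding CHARACTER SET only when one applies."""
    parts = [
        f"LOAD DATA LOCAL INFILE '{infile}'",
        f"INTO TABLE `{table}`",
    ]
    charset = charset_for_ingestor(class_name)
    if charset is not None:
        # In MySQL syntax the clause follows INTO TABLE and precedes FIELDS.
        parts.append(f"CHARACTER SET {charset}")
    parts.append("FIELDS TERMINATED BY '\\t'")
    return "\n".join(parts)


# Hypothetical class, table, and file names, for illustration only.
print(build_load_statement("/tmp/example.tsv", "example_table", "XRASExampleIngestor"))
print(build_load_statement("/tmp/example.tsv", "example_table", "OtherIngestor"))
```

Keeping the check in one small function means all other ingestors continue to produce exactly the statement they produced before, which matches the stated goal of limiting the risk to other data.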
Motivation and Context
Fix ingestion failures for XRAS data.
Tests performed
Ingestion has been failing for XRAS data for some time. A full ingestion was performed before the changes were made (verifying failure of several XRAS tables) and after the change was made (verifying success of all tables). In the case of XRAS supporting grants, the table data was verified to be correct.
Create a backup of the table without the new character set conversion:
Run ingestion and verify the data:
Types of changes
Checklist: