Add 'character set' specification for XRAS ingestors #122

Merged
merged 1 commit into from
May 4, 2017

smgallo
Contributor

@smgallo smgallo commented May 2, 2017

Add a character set specification for XRAS ingestors to handle 4-byte UTF8 data ingested from Postgres.

See also PR ubccr/xdmod-xsede#31.

Description

The XRAS data contains text, likely copied from word-processing files, that includes 4-byte UTF-8 characters. Postgres's UTF8 encoding supports characters of up to 4 bytes, but MySQL's utf8 character set stores at most 3 bytes per character; 4-byte characters require the utf8mb4 character set.
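
To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch (not part of this change) that reports how many bytes a character needs in UTF-8 and picks out the characters that do not fit in MySQL's 3-byte utf8. The sample string and function names are made up for illustration.

```python
from typing import List


def utf8_width(ch: str) -> int:
    """Number of bytes needed to encode a single character in UTF-8."""
    return len(ch.encode("utf-8"))


def four_byte_chars(text: str) -> List[str]:
    """Characters above U+FFFF, which are the ones that need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Illustrative sample: an accented letter, curly quotes, and an emoji.
sample = "caf\u00e9 \u201cquoted\u201d \U0001F600"

for ch in ("e", "\u00e9", "\u201c", "\U0001F600"):
    print(f"U+{ord(ch):04X} needs {utf8_width(ch)} byte(s)")

print(four_byte_chars(sample))
```

Only the last character in the loop needs 4 bytes; the accented letter and the curly quote fit in 2 and 3 bytes respectively, so they are not the ones that require utf8mb4.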

From the MySQL LOAD DATA INFILE documentation:

When loading the data on one host into a database on another host, the server uses the character set indicated by the character_set_database system variable (set to utf8 in our case) to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause.

To limit the risk of affecting other data, this change is applied only to ingestor classes whose names start with "XRAS".
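
The idea can be sketched as follows. This is a hedged Python illustration only, not the code in this PR: the function names, the class names in the usage lines, the file paths, the use of LOCAL, and the FIELDS clause are all assumptions. The only parts taken from the description above are the "XRAS" name prefix, the utf8mb4 character set, and the CHARACTER SET clause of LOAD DATA INFILE.

```python
# Hypothetical names throughout; this is not XDMoD's actual ingestor code.
XRAS_PREFIX = "XRAS"
FILE_CHARSET = "utf8mb4"


def charset_for(ingestor_class_name):
    """Return a character set only for ingestors whose class name starts with XRAS."""
    if ingestor_class_name.startswith(XRAS_PREFIX):
        return FILE_CHARSET
    return None


def build_load_statement(ingestor_class_name, infile, table):
    """Assemble a LOAD DATA statement, adding CHARACTER SET when one applies."""
    parts = [f"LOAD DATA LOCAL INFILE '{infile}'", f"INTO TABLE {table}"]
    charset = charset_for(ingestor_class_name)
    if charset is not None:
        # In MySQL's grammar the CHARACTER SET clause follows the table name
        # and precedes the FIELDS clause.
        parts.append(f"CHARACTER SET {charset}")
    parts.append("FIELDS TERMINATED BY '\\t'")
    return " ".join(parts)


# Illustrative usage with made-up class names and paths.
print(build_load_statement("XRASPeopleIngestor", "/tmp/people.tsv", "modw_xras.people"))
print(build_load_statement("JobsIngestor", "/tmp/jobs.tsv", "modw.jobs"))
```

Keying the behavior off the class name keeps every non-XRAS ingestor on its existing code path, which is the risk-limiting property described above.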

Configurable support for specifying a character set will also be added to ETLv2.

Motivation and Context

Fix ingestion failures for XRAS data.

Tests performed

Ingestion has been failing for XRAS data for some time. A full ingestion was performed before the change (confirming that several XRAS tables failed) and again after the change (confirming that all tables succeeded). In the case of XRAS supporting grants, the table data was verified to be correct.

Create a backup of the table without the new character set conversion:

create table modw_xras.people_orig like modw_xras.people;
insert into modw_xras.people_orig select * from modw_xras.people;

Run ingestion and verify the data:

$ php ~/xdmod-6.6-test/lib/dw_extract_transform_load.php -l 0
$ ~/xdmod-6.6-test/share/tools/dev/verify_table_data.php -c modw_xras -s modw_xras -t people=people_orig -v debug -n 2;
2017-05-02 15:24:08 [notice] Compare tables src=modw_xras.people, dest=modw_xras.people_orig
2017-05-02 15:24:08 [info] 11 columns
2017-05-02 15:24:08 [info] Row counts: modw_xras.people = 48,380; modw_xras.people_orig = 48,380
2017-05-02 15:24:08 [debug] 
SELECT src.*
FROM `modw_xras`.`people` src
LEFT OUTER JOIN `modw_xras`.`people_orig` dest ON (src.allocations_process_id <=> dest.allocations_process_id
AND src.country <=> dest.country
AND src.email <=> dest.email
AND src.first_name <=> dest.first_name
AND src.last_name <=> dest.last_name
AND src.middle_name <=> dest.middle_name
AND src.organization_id <=> dest.organization_id
AND src.person_id <=> dest.person_id
AND src.phone <=> dest.phone
AND src.position <=> dest.position
AND src.username <=> dest.username)
WHERE dest.allocations_process_id IS NULL
LIMIT 2
2017-05-02 15:24:09 [notice] Identical
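
The comparison query in the debug output joins the two tables on every column using MySQL's NULL-safe equality operator `<=>`, so rows holding NULL in the same column still match, and then keeps source rows that found no partner. A rough Python sketch of how a query of that shape could be generated from a column list is shown below. The function name and signature are hypothetical; the internals of verify_table_data.php are not shown in this PR.

```python
def build_compare_query(schema, src_table, dest_table, columns, limit=2):
    """Build a query returning src rows that have no NULL-safe match in dest."""
    # One NULL-safe comparison per column, joined with AND.
    on_clause = "\nAND ".join(f"src.{col} <=> dest.{col}" for col in columns)
    return (
        "SELECT src.*\n"
        f"FROM `{schema}`.`{src_table}` src\n"
        f"LEFT OUTER JOIN `{schema}`.`{dest_table}` dest ON ({on_clause})\n"
        # As in the logged query, an unmatched row is detected through the first column.
        f"WHERE dest.{columns[0]} IS NULL\n"
        f"LIMIT {limit}"
    )


# Illustrative usage with two of the eleven columns from the log above.
print(build_compare_query(
    "modw_xras", "people", "people_orig",
    ["allocations_process_id", "country"],
))
```

An empty result from such a query corresponds to the "Identical" notice in the log.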

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@smgallo smgallo added bug Bugfixes Category:ETL Extract Transform Load labels May 2, 2017
@smgallo smgallo added this to the v6.7.0 milestone May 2, 2017
@smgallo smgallo requested review from plessbd and jpwhite4 May 2, 2017 19:29
@smgallo smgallo merged commit 9d3ca5e into ubccr:xdmod6.7 May 4, 2017
@smgallo smgallo deleted the utf8mb4 branch May 4, 2017 11:32
tyearke added a commit that referenced this pull request May 5, 2017
@tyearke tyearke modified the milestones: v7.0.0, v6.7.0 Jun 6, 2017