Fix issues with character encoding in non-UTF-8 environments #36

jrober84 · 2019-11-05T16:35:35Z

If LANG on the system is not UTF-8 and the output contains special characters, mob_recon fails to read the mob_typer files with the following error: "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 824: ordinal not in range(128)". This does not occur when LC_ALL=UTF-8 is explicitly set for the environment. All MOB-suite functions that read data from disk should check the character encoding and handle these cases.
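For anyone trying to reproduce this without changing their locale, the error can be triggered directly on a byte string. This is only a minimal sketch: the sample character is an assumption chosen because its UTF-8 encoding starts with the byte 0xc3 named in the traceback, and it is not taken from the actual mob_typer output.

```python
# "é" is an assumed sample character; its UTF-8 encoding is the two bytes 0xc3 0xa9.
raw = "\u00e9".encode("utf-8")

try:
    # This mimics what happens when the locale makes Python decode with ASCII.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding the same bytes as UTF-8 succeeds.
decoded = raw.decode("utf-8")

print(error_message)
print(decoded == "\u00e9")
```

The message matches the one in the report apart from the byte position, which depends on where the first non-ASCII byte sits in the file.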

kbessonov1984 · 2019-11-05T17:07:05Z

The failure occurs at line 200 of mob_recon, where readlines() decodes the file using the system character encoding, which is not UTF-8. I will try to fix it either by reading the file with a pandas dataframe function or by calling open(filename,'r',encoding='utf-8') with an explicit encoding parameter.
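A rough sketch of the second option, assuming the report is a plain tab-separated text file. The helper name read_lines_utf8, the file name and the sample row are all made up for illustration and are not the actual mob_recon code.

```python
import os
import tempfile


def read_lines_utf8(path):
    """Read a text file as UTF-8 regardless of the system locale (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read().splitlines()


# Demonstration on a throwaway file containing a non-ASCII replicon name.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_report.txt")  # made-up file name
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("rep_type\tIncI1-I\u03b3\n")
    demo_lines = read_lines_utf8(demo_path)

print(demo_lines)
```

Because the encoding is passed explicitly, the result no longer depends on LANG or LC_ALL. The pandas route would be similar in spirit, since pandas.read_csv also accepts an encoding argument.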

kbessonov1984 · 2019-11-07T15:17:45Z

We addressed this issue in the new release 2.0.2 by passing an explicit UTF-8 encoding to all functions that open text files (both write and read). In addition, in the mob_recon module we changed the code to use a pandas dataframe with utf-8 encoding. See line 191.

jrober84 · 2019-11-15T16:05:38Z

Because the host range files host_range_literature_plasmidDB.csv and host_range_ncbirefseq_plasmidDB.csv contain non-ASCII characters, any stray print statement involving data from these files can raise an error. I have stripped the non-ASCII characters from both files, but to be safe we will still need to update the code to handle non-ASCII characters properly at run time for all input data.
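One possible way to make stray prints safe at run time is to escape whatever the output stream cannot encode. This is a sketch under assumptions: safe_print is a hypothetical helper, not something that exists in MOB-suite, and the ASCII-only stream below only simulates a terminal in a non-UTF-8 locale.

```python
import io
import sys


def safe_print(text, stream=None):
    """Write text, escaping characters the stream's encoding cannot represent."""
    stream = stream if stream is not None else sys.stdout
    encoding = getattr(stream, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    stream.write(printable + "\n")


# Simulate an ASCII-only terminal with an in-memory stream.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii", newline="\n")
safe_print("IncI1-I\u03b3", stream=ascii_stream)
ascii_stream.flush()
printed = ascii_stream.buffer.getvalue().decode("ascii")
```

On a UTF-8 stream the text passes through unchanged, so the helper only alters output where the plain print would have failed.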

kbessonov1984 · 2019-11-16T02:18:54Z

I did not realize that printing non-ASCII characters could cause issues on systems with non-Unicode encodings ... I tried to re-create this behaviour in a Linux container with the POSIX locale, but so far I have not seen any issues during the tests. The caveat is that these host range tests did not target hits with non-ASCII characters (e.g. IncI1-Iγ). I would like to create a couple of test functions that verify the host range database behaves correctly in such environments.

Issue:
During host range database production, non-ASCII characters (e.g. Ê, é, ú, ñ) were unintentionally introduced by copying and pasting from literature sources (e.g. Spanish and German author names). The non-ASCII characters in host_range_ncbirefseq_plasmidDB.csv and host_range_literature_plasmidDB.csv could cause unpredictable behaviour, either when the files are read or when the results are rendered, on systems that do not support Unicode characters.

Partial solution:
I went through the files and replaced the non-ASCII characters with their ASCII equivalents.

I used grep to find and highlight characters outside the ASCII hexadecimal range \x00-\x7F. In vi, the same search is possible with the /[^\x00-\x7F] command.

grep --color='auto' -P -n "[\x80-\xFF]" host_range_ncbirefseq_plasmidDB.csv
11856:CP022018,8798Ê,57.11525347,-,-,-,-,-,-,-,-,N
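The same scan could also be done from Python, for example as part of a check before committing database files. This is only a sketch: find_non_ascii_lines is a hypothetical helper, the demo file name is made up, and the first demo row is invented (the second is the row reported by grep above).

```python
import os
import tempfile


def find_non_ascii_lines(path):
    """Return (line_number, text) pairs for lines containing bytes above 0x7F."""
    hits = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            if any(byte > 0x7F for byte in raw):
                text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                hits.append((number, text))
    return hits


# Demonstration on a throwaway two-line file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = os.path.join(tmp_dir, "demo_host_range.csv")  # made-up file name
    with open(demo_csv, "w", encoding="utf-8") as handle:
        handle.write("CP000000,1000,50.0,-,-,-,-,-,-,-,-,N\n")  # invented clean row
        handle.write("CP022018,8798\u00ca,57.11525347,-,-,-,-,-,-,-,-,N\n")
    hits = find_non_ascii_lines(demo_csv)

for number, text in hits:
    print(f"{number}:{text}")
```

Reading in binary mode keeps the check independent of the locale, which is the condition that caused the original failure.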

I've patched both files and submitted commit 5edbabf0b to the string_encoding branch.

jrober84 · 2019-12-23T15:00:40Z

closing this for now but will re-open if it pops up again

jrober84 added the bug label Nov 5, 2019

jrober84 mentioned this issue Nov 6, 2019

String encoding #37

Merged

kbessonov1984 closed this as completed Nov 7, 2019

jrober84 reopened this Nov 15, 2019

jrober84 closed this as completed Dec 23, 2019
