Fix issues with character encoding in non-UTF-8 environments #36

Closed
jrober84 opened this issue Nov 5, 2019 · 5 comments
Labels
bug Something isn't working

Comments

@jrober84
Collaborator

jrober84 commented Nov 5, 2019

If LANG on the system is not set to a UTF-8 locale and the output contains special characters, mob_recon fails to read the mob_typer files with the following error: 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 824: ordinal not in range(128)'. This does not occur when LC_ALL=UTF-8 is explicitly set for the environment. All MOB-suite functions that read data from disk should check the character encoding and handle these cases.
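For anyone trying to picture the failure, here is a minimal sketch that triggers the same kind of error without touching the locale. It is not MOB-suite code: the helper `decode_as` and the sample string are made up for illustration. The point is that UTF-8 encodes accented Latin characters with a leading 0xc3 byte, which the ASCII codec rejects.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given codec (hypothetical helper)."""
    return raw.decode(encoding)


# Illustrative sample only: "é" is stored as the two bytes 0xc3 0xa9 in UTF-8.
raw = "Pérez".encode("utf-8")

try:
    decode_as(raw, "ascii")
except UnicodeDecodeError as err:
    # Same error class and message pattern as in the report above.
    print(err)

# Decoding the same bytes as UTF-8 succeeds.
print(decode_as(raw, "utf-8"))
```

Under a locale where Python's default text encoding is ASCII, a plain open(filename) followed by readlines() performs this kind of decode implicitly.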

@jrober84 jrober84 added the bug Something isn't working label Nov 5, 2019
@kbessonov1984
Collaborator

The failure occurs at line 200 of mob_recon (link), where readlines() decodes the file using the system character encoding, which is not UTF-8. I will try to fix it either by reading the file with a pandas DataFrame function or by calling open(filename, 'r', encoding='utf-8') with an explicit encoding parameter.
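A rough sketch of the open(..., encoding='utf-8') option, using only the standard library. The function name `read_lines_utf8`, the file name, and the file contents are placeholders, not the actual mob_recon code.

```python
import os
import tempfile


def read_lines_utf8(path):
    """Read all lines as UTF-8, regardless of the system locale."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.readlines()


# Write a small placeholder file, then read it back with explicit encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_report.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("plasmid_1\tIncI1-Iγ\n")
    lines = read_lines_utf8(path)

print(lines)
```

For the pandas route, pandas.read_csv also accepts an encoding argument, so the same explicit 'utf-8' can be passed there.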

@jrober84 jrober84 mentioned this issue Nov 6, 2019
@kbessonov1984
Collaborator

kbessonov1984 commented Nov 7, 2019

We have addressed this issue in the new release 2.0.2 by specifying explicit UTF-8 encoding in all calls that open text files (both write and read). In addition, in the mob_recon module we changed the code to read the file into a pandas DataFrame with UTF-8 encoding. See line 191.

@jrober84
Collaborator Author

Because the host range files host_range_literature_plasmidDB.csv and host_range_ncbirefseq_plasmidDB.csv contain non-ASCII characters, any stray print statement involving data from these files may raise an error. I have stripped the non-ASCII characters from both files, and to be safe we will need to update the code to properly handle non-ASCII characters at run time for all input data.
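One possible way to make print statements safe at run time, sketched under the assumption that escaping unsupported characters is acceptable in console output. The helper `to_printable` is hypothetical and not part of MOB-suite.

```python
import sys


def to_printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes (hypothetical helper)."""
    target = encoding or sys.stdout.encoding or "ascii"
    return text.encode(target, errors="backslashreplace").decode(target)


# Force ASCII here to mimic a terminal that cannot display Greek letters.
print(to_printable("IncI1-Iγ", encoding="ascii"))  # prints IncI1-I\u03b3
```

This keeps the output readable and never raises UnicodeEncodeError, at the cost of showing an escape sequence instead of the original character.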

@kbessonov1984
Collaborator

kbessonov1984 commented Nov 16, 2019

I did not realize that printing non-ASCII characters could cause problems on systems with non-Unicode encodings. I tried to reproduce this behaviour in a Linux container with the POSIX locale, but so far I have not seen any issues during the tests. The caveat is that these host range tests did not target hits with non-ASCII characters (e.g. IncI1-Iγ). I would like to create a couple of test functions that check the correct behaviour of the host range database in non-UTF-8 environments.
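As a starting point for such test functions, here is a sketch that simulates an ASCII-only output stream inside the test process, so no special container or locale is needed. The function name `prints_cleanly_to_ascii` is an assumption for illustration; the real tests would exercise the host range code paths instead of a bare string.

```python
import io


def prints_cleanly_to_ascii(text):
    """Return True if text can be printed to a stream that only accepts ASCII."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError:
        return False
    return True


def test_ascii_hit_prints():
    assert prints_cleanly_to_ascii("IncI1")


def test_non_ascii_hit_fails_on_ascii_stream():
    # A hit name containing a Greek letter cannot be encoded as ASCII.
    assert not prints_cleanly_to_ascii("IncI1-Iγ")


test_ascii_hit_prints()
test_non_ascii_hit_fails_on_ascii_stream()
```

The second test documents the current failure mode; once printing is made encoding-safe, it would be inverted to assert success.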

Issue:
During host range database production, non-ASCII characters (e.g. Ê, é, ú, ñ) were unintentionally introduced by copying and pasting from literature sources (e.g. Spanish and German author names). The non-ASCII characters in host_range_ncbirefseq_plasmidDB.csv and host_range_literature_plasmidDB.csv could cause unpredictable behaviour, either when the files are read or when the results are rendered, on systems that do not support Unicode characters.

Partial solution:
I went through the files and replaced the non-ASCII characters with their ASCII equivalents.
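The replacement itself was done by hand. For future database updates, a sketch like the following could automate it for accented Latin letters. The helper `to_ascii` is hypothetical, and it relies on Unicode decomposition, so characters without an ASCII base letter (such as γ) are simply dropped and would still need manual handling.

```python
import unicodedata


def to_ascii(text):
    """Replace accented letters with their ASCII base letters.

    Characters with no ASCII decomposition are removed, not transliterated.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("Êéúñ"))      # prints Eeun
print(to_ascii("IncI1-Iγ"))  # prints IncI1-I (the Greek letter is dropped)
```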

I used grep to find and highlight characters outside the ASCII hexadecimal range \x00-\x7F.

In vi, non-ASCII characters can also be searched for with the /[^\x00-\x7F] command.

grep --color='auto' -P -n "[\x80-\xFF]" host_range_ncbirefseq_plasmidDB.csv
11856:CP022018,8798Ê,57.11525347,-,-,-,-,-,-,-,-,N
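The same check can be sketched in Python, for example as a pre-commit or test step on the database files. The function `find_non_ascii_lines` is a hypothetical name, and the first sample row is made up; the second is the line reported by grep above.

```python
def find_non_ascii_lines(lines):
    """Return (line_number, line) pairs for lines containing non-ASCII characters."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.isascii():  # str.isascii() requires Python 3.7+
            hits.append((number, line.rstrip("\n")))
    return hits


sample = [
    "accession_A,100,50.0,N\n",  # made-up clean row
    "CP022018,8798Ê,57.11525347,-,-,-,-,-,-,-,-,N\n",
]

for number, line in find_non_ascii_lines(sample):
    print(f"{number}:{line}")
```

To scan a real file, open it with encoding='utf-8' and pass the file object to the function.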

I've patched both files and submitted commit 5edbabf0b to the string_encoding branch

@jrober84
Collaborator Author

closing this for now but will re-open if it pops up again
