utf encoding issue with output from the reference parser #11

Open
grahamgower opened this issue Mar 8, 2023 · 6 comments

Comments

@grahamgower
Member

The reference parser output for the valid test case unicode_deme_name_04.yaml from the demes-spec repo is encoded as utf16, when it should be encoded as utf8.

@grahamgower
Member Author

See #9.

@grahamgower
Member Author

@IsabelMarleen I think we had a similar problem when using the reference parser in the demes-python test suite (but only in continuous integration when running on Windows). It turned out to be an issue with Python choosing the encoding for stdout to match the OS-configured locale, which on Windows was utf16 by default. The solution was to call python with the -X utf8 option to override the default encoding.
https://github.com/popsim-consortium/demes-python/blob/392c6a0eb5e70223a00d6659df2134317a94bdf0/tests/test_spec.py#L33-L34

I guess you're using a locale on your computer for which the default encoding is utf16? Could you try adding the -X utf8 option when calling the reference parser here:

py_command <- paste("import os; os.system('python", path_ref_implementation, input_file, ">", path_preparsed_file, "')")
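For illustration, here is a minimal Python sketch of calling a child interpreter in UTF-8 mode. It is not the code used in this repository: the helper name `run_with_utf8_mode` and the one-line child program are hypothetical stand-ins for the reference parser call. Note that -X utf8 is an interpreter option, so it has to come before the script path; anything placed after the script path is passed to the script as an argument instead.

```python
import subprocess
import sys


def run_with_utf8_mode(args):
    """Run a Python child process in UTF-8 mode and return its stdout as raw bytes.

    Hypothetical helper for illustration only.
    """
    # Interpreter options such as -X utf8 go before the script path (or -c).
    cmd = [sys.executable, "-X", "utf8", *args]
    return subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout


# Stand-in for the reference parser: a child program printing one non-BMP character.
child_code = "print('\\U00029E3D')"
raw = run_with_utf8_mode(["-c", child_code])

# In UTF-8 mode the child's stdout is UTF-8 regardless of the locale.
text = raw.decode("utf-8").strip()
```

With the real parser, the argument list would hold the script path and the input file in place of `-c` and the inline program.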

@grahamgower
Member Author

Some discussions about encoding here: popsim-consortium/demes-spec#129

@IsabelMarleen
Collaborator

I tried it just now and it did not make a difference. I ran python3 reference_implementation/resolve_yaml.py test-cases/valid/unicode_deme_name_04.yaml -X utf8 > tmp.json and in the output the property in question looks like "name": "\ud867\ude3d". When trying to read tmp.json with yaml::read_yaml() I get the following error:

Error in yaml.load(string, error.label = error.label, ...) :
(tmp.json) Scanner error: while parsing a quoted scalar at line 9, column 15 found invalid Unicode character escape code at line 9, column 18

Without specifying -X utf8, the yaml parser worked when I specified fileEncoding=UTF-16, but that now throws a different error. The scanner error, however, is the same one I encountered before.
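As a side note on the "\ud867\ude3d" value above: in JSON, a character outside the Basic Multilingual Plane can be written as a pair of \u escapes (a UTF-16 surrogate pair) inside an otherwise plain ASCII file. The sketch below uses only Python's standard json module to show the round trip. Whether the reference parser writes its output this way is an assumption, not something confirmed in this thread.

```python
import json

# JSON text with the escaped value quoted above (the backslashes are literal).
escaped = '{"name": "\\ud867\\ude3d"}'

# Python's JSON decoder combines the surrogate pair into one code point.
data = json.loads(escaped)
name = data["name"]
code_point = f"U+{ord(name):X}"

# By default json.dumps escapes non-ASCII characters (ensure_ascii=True).
default_dump = json.dumps(data)

# With ensure_ascii=False the character is written as-is, to be encoded
# by whatever encoding the output stream uses.
raw_dump = json.dumps(data, ensure_ascii=False)
```

If that assumption holds, the file itself would be ASCII rather than UTF-16, and the scanner error would come from how the YAML reader handles the escaped surrogate pair.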

@grahamgower
Member Author

What operating system are you using?

@IsabelMarleen
Collaborator

I'm using macOS.
