
read_Sav regression "Unable to convert string to the requested encoding (invalid byte sequence)" #615

Open
pdbailey0 opened this issue Jul 15, 2021 · 24 comments
Labels
bug (an unexpected problem or unintended behavior), readstat, wip (work in progress)

Comments

@pdbailey0

In haven 2.4.0, 2.4.1 (and 2.4.1.9000) I get an error when reading in the International Association for the Evaluation of Educational Achievement's ePIRLS data available here.

After unzipping that to ~/ePIRLS/2016/ I do

x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav")
# Error: Failed to parse [snip]/ePIRLS/2016/asausae1.sav: Unable to convert string to the requested encoding (invalid byte sequence).

However, with haven 2.3.1 it reads in without errors. If I write the data out with haven 2.3.1's write_sav, then haven 2.4.1 can read that file cleanly.
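
A minimal sketch of that round-trip workaround (the rewritten file name is just an example), run while haven 2.3.1 is the installed version:

x <- haven::read_sav("~/ePIRLS/2016/asausae1.sav")            # works under haven 2.3.1
haven::write_sav(x, "~/ePIRLS/2016/asausae1_rewritten.sav")
# after upgrading, haven 2.4.1 reads the rewritten file without the encoding error
haven::read_sav("~/ePIRLS/2016/asausae1_rewritten.sav")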

@pdbailey0
Author

pdbailey0 commented Jul 15, 2021

Maybe this is addressed in the read_dta/read_stata documentation, which reads:

If you encounter an error such as "Unable to convert string to the requested encoding", try encoding = "latin1"

and this runs cleanly (no error on exit)

x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav", encoding="latin1")

@pdbailey0
Author

Nevertheless, it's odd that there is nothing in the NEWS since 2.1.0 about read_sav even though its behavior changed. Is there a way to see what encoding is being used?

@hadley
Member

hadley commented Jul 30, 2021

Unfortunately there's no easy way to track down what went wrong here 😞

@hadley hadley closed this as completed Jul 30, 2021
@Deleetdk

Deleetdk commented Aug 4, 2021

The file here https://worldsofjournalism.org/data-d79/data-and-key-tables-2012-2016/ produces the same error, and it is solved by @pdbailey0's suggestion above.

> woj2 = read_sav("data/WJS2 open V4-02 030517.sav")
Error: Failed to parse /science/projects/world of journalism/data/WJS2 open V4-02 030517.sav: Unable to convert string to the requested encoding (invalid byte sequence).
> woj2 = read_sav("data/WJS2 open V4-02 030517.sav", encoding = "latin1")

@pdbailey0
Author

It would be really great to be able to see what encoding is being used when one is not set.

@hadley hadley reopened this Aug 4, 2021
@sam-crawley

Hi,

The problem with the suggested workaround is that it can break the encoding. For example, loading Afrobarometer Wave 6 (available here: https://afrobarometer.org/data/merged-round-6-data-36-countries-2016) in haven 2.3.1:

afb6 <- read_sav("merged_r6_data_2016_36countries2.sav")

levels(haven::as_factor(unique(afb6$COUNTRY)))[[25]]
[1] "São Tomé and Príncipe"

But in haven 2.4.3:

afb6 <- read_sav("merged_r6_data_2016_36countries2.sav", encoding = "latin1")

levels(haven::as_factor(unique(afb6$COUNTRY)))[[25]]
[1] "São Tomé and Príncipe"

The file is UTF-8, and may have something broken about it, but loading it as latin1 may not be the solution?

@hadley
Member

hadley commented Aug 5, 2021

@gorcha does this ring any bells with you? I don't think anything has changed in haven relating to this, so it might be a readstat bug?

@pdbailey0
Author

haven only started documenting ReadStat version numbers in the NEWS in haven 2.4.0 (which uses ReadStat 1.1.5), but the 2.3.0 NEWS mentions SAS "any" encoding, so that was probably ReadStat 1.1.2.

@gorcha
Member

gorcha commented Aug 5, 2021

Hey @hadley, definitely nothing haven related.

I've done a bit of digging (thanks @pdbailey0 for the version tip!) and it's because of this change in ReadStat 1.1.4 WizardMac/ReadStat@a8b0466 - reverting this line to the old code loads this file successfully with the default encoding.

So it looks like something is falling over in iconv, but I'm not sure what exactly. I'll have a poke around and see if I can find something definitive.

@gorcha
Member

gorcha commented Aug 5, 2021

This is pretty obscure, but the short version is SPSS is probably not our friend and doesn't encode UTF-8 properly.

The issue is that SPSS (or at least the version that produced the offending files) appears to store multi-byte Unicode characters using the code point (a single byte) instead of code units (which can be 1 to 4 bytes).

The string that's causing the issue in the Afrobarometer file is "VOTAÇÃO", which shows up in row 39619.
Having a look at the raw SPSS file, the hex representation is 56 4f 54 41 c7 c3 4f - the c7 and c3 represent the Ç and à characters respectively.

The problem is that, using à as an example, C3 is the "code point" representation, but the correct UTF-8 encoding is two bytes - C3 83 (see https://en.wikipedia.org/wiki/%C3%83).

So SPSS uses the correct "code point" representations of these two characters, but they're not the correct binary encoding for UTF-8. They should both be stored as multi-byte characters in a correct UTF-8 encoding.

I'm not deep enough in the ReadStat code to know why it was working fine before, but it's failing now because the string is being forced through iconv (for other very necessary reasons) under the totally fair but incorrect assumption that SPSS was encoding things the way it said it was.
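
To see the iconv failure in miniature, here is a quick R sketch (not part of ReadStat, just illustrative) using the bytes from the hex dump above:

bytes <- as.raw(c(0x56, 0x4f, 0x54, 0x41, 0xc7, 0xc3, 0x4f))  # "VOTAÇÃO" stored as single latin1 bytes
iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")        # succeeds: "VOTAÇÃO"
iconv(rawToChar(bytes), from = "UTF-8", to = "UTF-8")         # NA: c7 c3 is not a valid UTF-8 sequence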

You can get it down to the exact offending cell using:

read_sav("~/Downloads/merged_r6_data_2016_36countries2.sav", col_select = "Q29B", skip = 39618, n_max = 1)

@evanmiller can you please have a look?

@gorcha
Member

gorcha commented Aug 5, 2021

@pdbailey0, any idea what version of SPSS produced these files? It could be a version-specific thing.

@evanmiller
Collaborator

Just FYI, what you are calling the "code point" representation is actually Latin-1, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Is SPSS producing files containing both UTF-8 and Latin-1 data?

@evanmiller
Collaborator

@pdbailey0 If you download the standalone readstat utility, it will report the file's self-reported encoding.

$ readstat binlfp2.sav
Format: SPSS binary file (SAV)
Columns: 8
Rows: 753
Table label: binlfp2
Format version: 2
Text encoding: UTF-8
Byte order: little-endian
Timestamp: 28 Oct 2015 14:34

@pdbailey0
Author

@gorcha I didn't write it. You would have to ask Boston College. Maybe @sam-crawley knows what version wrote his?

@sam-crawley

I was not involved in creating the Afrobarometer file either.

FWIW, I opened the file in SPSS 27. It opens happily, but shows broken/box characters for that one field.

I guess somehow a string with broken encoding got inserted in that field when the file was created (which SPSS perhaps should have detected and thrown an error about?)

However, since messy/broken data is a fact of life, perhaps the right approach is to warn about the problem, rather than to throw an error?

Thanks to everyone for their time on this issue so far.

@gorcha
Member

gorcha commented Aug 6, 2021

Oh my mistake, thanks @evanmiller!

I've had another look, and it seems that in the Afrobarometer file there are just a handful of records with latin1-encoded characters. For example, the Ã shows up mostly as UTF-8 (C3 83) but a few times as latin1 (C3).

So I think you're right @sam-crawley, some funky characters have crept in at some point and SPSS doesn't properly enforce the encoding.

@evanmiller how would you feel about ReadStat copying over invalid bytes unedited rather than throwing an error? Obviously not ideal, but at least consistent with what SPSS does. I've hacked together something along those lines and it fixes this error, but I'm not sure what other nasty flow-on effects there might be or how this would interact with other systems that ReadStat supports.

@gorcha gorcha added the bug and readstat labels Aug 7, 2021
@skalteis

Hi @gorcha,
would you mind sharing your patch for ReadStat/haven to deal with this problem? I could not find anything about it in your forked ReadStat repo, and I have a broken SPSS file here that I have to deal with. I'd very much appreciate it and would be very happy. :-)

Thank you & best regards,
Simon

@gorcha
Member

gorcha commented Aug 19, 2021

Hey @skalteis,

Of course, always happy to help! 🙂
I've pushed the change to the invalid-bytes branch on my ReadStat fork if you want to have a look.

@deschen1

I have the same issue (same error message) with one of my data sets. I didn't understand all of what was said above, but wanted to check whether there's a fix or workaround to prevent/solve this issue?

I'm currently using an older version of haven, but this does not work well with other packages (e.g. the labelled package).

@gorcha
Member

gorcha commented Feb 3, 2022

Hi @deschen1, a fix is in progress (requiring some changes in the underlying ReadStat library).

Unfortunately there's no simple workaround in the meantime, but hoping to get this fixed soon!

@deschen1

deschen1 commented Feb 3, 2022

Thanks for the update nonetheless. And thanks for working on this bug/issue.

@deschen1

FWIW, I have opened a bug report with SPSS, just in case they might be able to do something about the behaviour. Here's their response. Not sure if it helps to solve the issue, though. I highlighted in bold two potentially helpful pieces.

The discussion here #615 (comment)
is about end users employing a third party product ReadStat to read an existing SPSS Statistics system file (*.sav) into their application (not SPSS Statistics).
So immediately we have questions about:
Did SPSS Statistics produce this system file?
If it did, did it warn upon loading these data?
Was SPSS Statistics correctly setup when it built this file? (we have many troubles with our customers randomly switching back and forth between Codepage mode and Unicode mode)
See the continued discussion in your first link:
#615 (comment)
The problem is a mix of UTF-8 and Latin1 (Codepage) characters in the example data file.
SPSS Statistics will treat whatever you put into the system file as valid. In this case, the file was created with garbage text. I suspect, if it was created in SPSS Statistics, a warning was thrown when the original data was ingested prior to saving as '.sav'.

@gorcha
Member

gorcha commented Feb 21, 2022

Thanks @deschen1! Good to know, it confirms that SPSS doesn't enforce the specified character encoding.

@aito123

aito123 commented Oct 12, 2024

A solution that worked for me was to turn off Unicode mode inside SPSS.
Open SPSS, create new syntax, run this code:

SET UNICODE OFF.

After that, open the dataset and save it again. Credit to this post for the solution:
https://stackoverflow.com/questions/3136293/read-spss-file-into-r

I wish there were a more automatic solution inside R, though...
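
For what it's worth, here is a rough R-side sketch of that sort of repair (assuming the file parses once encoding = "latin1" is passed and that the garbled cells follow the usual UTF-8-read-as-latin1 pattern; the file name and the choice to skip labelled columns are illustrative only):

library(haven)

x <- read_sav("merged_r6_data_2016_36countries2.sav", encoding = "latin1")

fix_mixed <- function(v) {
  # only touch plain character columns; labelled columns would need their value labels repaired too
  if (!is.character(v) || inherits(v, "haven_labelled")) return(v)
  bytes <- iconv(v, from = "UTF-8", to = "latin1")     # undo the latin1 -> UTF-8 step: back to the file's raw bytes
  utf8  <- iconv(bytes, from = "UTF-8", to = "UTF-8")  # reinterpret those bytes as UTF-8; NA where they were genuinely latin1
  ifelse(is.na(utf8), v, utf8)                         # keep the latin1 reading for the truly broken cells
}

x[] <- lapply(x, fix_mixed)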
