read_sav regression "Unable to convert string to the requested encoding (invalid byte sequence)" #615
Comments
Maybe this is addressed in the
and this runs cleanly (no error on exit) x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav", encoding="latin1") |
Nevertheless, it's odd that there is nothing in the NEWS since 2.1.0 about |
Unfortunately there's no easy way to track down what went wrong here 😞 |
The file here https://worldsofjournalism.org/data-d79/data-and-key-tables-2012-2016/ produces the same error, and it is solved by @pdbailey0's suggestion above.
It would be really great to be able to see what encoding is being used when one is not set. |
Hi, The problem with the suggested workaround is that it can break encoding. e.g. loading Afrobarometer Wave 6 (available here: https://afrobarometer.org/data/merged-round-6-data-36-countries-2016), in haven 2.3.1:
But in haven 2.4.3:
The file is UTF-8, and may have something broken about it, but loading it as latin1 may not be the solution? |
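To illustrate the concern about the workaround, here is a minimal base R sketch (it does not use haven, and the sample character is an assumption chosen for illustration, not taken from the Afrobarometer file). It shows what happens when bytes that are already valid UTF-8 are declared to be latin1: each byte is converted separately, so one character becomes two.

```r
# "é" (U+00E9) is stored in UTF-8 as the two bytes c3 a9.
utf8_text <- "\u00e9"
raw_bytes <- charToRaw(utf8_text)

# Misreading those bytes as latin1 converts each byte on its own:
# c3 becomes "Ã" (UTF-8 c3 83) and a9 becomes "©" (UTF-8 c2 a9).
misread <- iconv(rawToChar(raw_bytes), from = "latin1", to = "UTF-8")

print(raw_bytes)
print(charToRaw(misread))
```

So passing encoding = "latin1" for a file that is mostly UTF-8 may avoid the error while silently corrupting every correctly encoded non-ASCII string.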
@gorcha does this ring any bells with you? I don't think anything has changed in haven relating to this, so it might be a readstat bug? |
haven only started documenting ReadStat version numbers in the NEWS in haven 2.4.0 (which uses ReadStat 1.1.5), but the 2.3.0 NEWS mentions SAS "any" encoding, so that was probably ReadStat 1.1.2.
Hey @hadley, definitely nothing haven related. I've done a bit of digging (thanks @pdbailey0 for the version tip!) and it's because of this change in ReadStat 1.1.4 WizardMac/ReadStat@a8b0466 - reverting this line to the old code loads this file successfully with the default encoding. So it looks like something is falling over in iconv, but not sure what exactly. I'll have a poke around and see if I can find something definitive. |
This is pretty obscure, but the short version is SPSS is probably not our friend and doesn't encode UTF-8 properly.

The issue is that SPSS (or at least the version that produced the offending files) appears to store multi-byte unicode characters using the code point (a single byte) instead of code units (which can be 1 to 4 bytes). The string that's causing the issue in the AfroBarometer file is "VOTAÇÃO", which shows up in row 39619.

The problem is that, using Ã as an example, C3 is the "code point" representation, but the correct UTF-8 encoding is two bytes - C3 83 (see https://en.wikipedia.org/wiki/%C3%83). So SPSS uses the correct "code point" representations of these two characters, but they're not the correct binary encoding for UTF-8. They should both be stored as multi-byte characters in a correct UTF-8 encoding.

I'm not deep enough in the ReadStat code to know why it was working fine, but it's failing now because it's being forced through iconv (for other very necessary reasons) under the totally fair but incorrect assumption that SPSS was encoding things in the way that it said it was.

You can get it down to the exact offending cell using:
read_sav("~/Downloads/merged_r6_data_2016_36countries2.sav", col_select = "Q29B", skip = 39618, n_max = 1)

@evanmiller can you please have a look?
@pdbailey0, any idea what version of SPSS produced these files? It could be a version specific thing |
Just FYI what you are calling the "code point" representation is actually Latin-1, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1
Is SPSS producing files containing both UTF-8 and Latin-1 data?
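The byte-level difference being discussed can be checked in base R. This is only an illustrative sketch using iconv(), charToRaw() and validUTF8(); the variable names are made up for the example, and nothing here reads the actual .sav file.

```r
# "Ã" is Unicode code point U+00C3.
a_tilde <- "\u00c3"

# UTF-8 encodes U+00C3 as two bytes: c3 83.
utf8_bytes <- charToRaw(enc2utf8(a_tilde))

# Latin-1 (ISO/IEC 8859-1) encodes it as the single byte c3.
latin1_bytes <- charToRaw(iconv(a_tilde, from = "UTF-8", to = "latin1"))

# A lone c3 byte is an incomplete sequence when interpreted as UTF-8,
# which is the kind of input that makes a strict converter fail.
lone_byte <- rawToChar(as.raw(0xc3))
lone_is_valid_utf8 <- validUTF8(lone_byte)

print(utf8_bytes)
print(latin1_bytes)
print(lone_is_valid_utf8)
```

A file that declares UTF-8 but contains the single-byte form therefore holds Latin-1 data in a UTF-8 container, which matches the question raised above.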
@pdbailey0 If you download the standalone
$ readstat binlfp2.sav
Format: SPSS binary file (SAV)
Columns: 8
Rows: 753
Table label: binlfp2
Format version: 2
Text encoding: UTF-8
Byte order: little-endian
Timestamp: 28 Oct 2015 14:34
@gorcha I didn't write it. You would have to ask Boston College. Maybe @sam-crawley knows what version wrote his? |
I was not involved in creating the Afrobarometer file either. FWIW, I opened the file in SPSS 27. It happily opens the file, but shows broken/box characters for that one field. I guess somehow a string with broken encoding got inserted in that field when the file was created (which SPSS perhaps should have detected and thrown an error about?) However, since messy/broken data is a fact of life, perhaps the right approach is to warn about the problem, rather than to throw an error? Thanks to everyone for their time on this issue so far. |
Oh my mistake, thanks @evanmiller! I've had another look and it looks like in the Afrobarometer file there are just a handful of records with latin1 encoded characters. For example, the Ã shows up mostly as UTF-8 (C3 83) but a few times as latin1 (C3). So I think you're right @sam-crawley, some funky characters have crept in at some point and SPSS doesn't properly enforce the encoding. @evanmiller how would you feel about ReadStat copying over invalid bytes unedited rather than throwing an error? Obviously not ideal, but consistent with what SPSS does at least. I've hacked together something along those lines and it fixes this error, but I'm not sure what other nasty flow-on effects there might be and how this would interact with other systems that ReadStat supports.
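For anyone who already has such a column in memory (for example, loaded with an older haven), the mixed-encoding situation can be sketched in base R as follows. This is not the ReadStat change discussed here; repair_mixed_encoding() is a hypothetical helper, and it assumes every string that is not valid UTF-8 is really Latin-1, which may not hold for other files.

```r
# Hypothetical helper (not part of haven or ReadStat): re-encode only the
# elements that are not valid UTF-8, assuming those elements are Latin-1.
repair_mixed_encoding <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
  x
}

# "VOTAÇÃO" stored correctly as UTF-8 (Ç = c3 87, Ã = c3 83) ...
good <- rawToChar(as.raw(c(0x56, 0x4f, 0x54, 0x41, 0xc3, 0x87, 0xc3, 0x83, 0x4f)))
# ... and the same word with stray Latin-1 bytes (Ç = c7, Ã = c3).
stray <- rawToChar(as.raw(c(0x56, 0x4f, 0x54, 0x41, 0xc7, 0xc3, 0x4f)))

mixed <- c(good, stray)
print(validUTF8(mixed))

fixed <- repair_mixed_encoding(mixed)
print(validUTF8(fixed))
```

Checking each string separately, rather than converting the whole column as latin1, leaves the correctly encoded values untouched.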
Hi @gorcha, Thank you & best regards, |
Hey @skalteis, Of course, always happy to help! 🙂 |
I have the same issue (same error message) with one of my data sets. I didn't understand all of what was said above, but wanted to check if there's a fix or workaround to prevent/solve this issue? I'm currently using an older version of haven, but this does not work well with other packages (i.e. the |
Hi @deschen1, a fix is in progress (requiring some changes in the underlying ReadStat library). Unfortunately there's no simple workaround in the meantime, but hoping to get this fixed soon! |
Thanks for the update nonetheless. And thanks for working on this bug/issue. |
FWIW, I have opened a bug report in SPSS, just in case they might have been able to do something about the behaviour. Here's their response. Not sure if it helps to solve the issue, though. I highlighted in bold two potentially helpful pieces.
Thanks @deschen1! Good to know, it confirms that SPSS doesn't enforce the specified character encoding. |
A solution that worked for me was to turn OFF Unicode inside SPSS.
After that, open the dataset and save it again. Credits to this post for this solution:
I wish there were a more automatic solution inside R...
In haven 2.4.0, 2.4.1 (and 2.4.1.9000) I get an error when reading in the International Association for the Evaluation of Educational Achievement's ePIRLS data available here.
After unzipping that to ~/ePIRLS/2016/ I do
however, with haven 2.3.1 it reads in without errors. If I write it out with the haven 2.3.1 write_sav, then haven 2.4.1 can read in that file cleanly.