
read_Sav regression "Unable to convert string to the requested encoding (invalid byte sequence)" #615

Open
pdbailey0 opened this issue Jul 15, 2021 · 24 comments
Labels
bug (an unexpected problem or unintended behavior), readstat, wip (work in progress)

Comments

@pdbailey0

In haven 2.4.0, 2.4.1 (and 2.4.1.9000) I get an error when reading in the International Association for the Evaluation of Educational Achievement's ePIRLS data available here.

After unzipping that to ~/ePIRLS/2016/ I do

x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav")
# Error: Failed to parse [snip]/ePIRLS/2016/asausae1.sav: Unable to convert string to the requested encoding (invalid byte sequence).

However, with haven 2.3.1 it reads in without errors. If I write the data out with haven 2.3.1's write_sav, then haven 2.4.1 can read that file cleanly.
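
A minimal sketch of that round-trip workaround (the rewritten file name is just an example), run while haven 2.3.1 is the installed version:

x <- haven::read_sav("~/ePIRLS/2016/asausae1.sav")            # works under haven 2.3.1
haven::write_sav(x, "~/ePIRLS/2016/asausae1_rewritten.sav")
# after upgrading, haven 2.4.1 reads the rewritten file without the encoding error
haven::read_sav("~/ePIRLS/2016/asausae1_rewritten.sav")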

@pdbailey0
Author

pdbailey0 commented Jul 15, 2021

Maybe this is addressed in the read_dta/read_stata documentation, which reads:

If you encounter an error such as "Unable to convert string to the requested encoding", try encoding = "latin1"

and this runs cleanly (no error on exit)

x <- haven::read_sav("~/ePIRLS/2016//asausae1.sav", encoding="latin1")

@pdbailey0
Author

Nevertheless, it's odd that there is nothing in the NEWS since 2.1.0 about read_sav even though its behavior changed. Is there a way to see what encoding is being used?

@hadley
Member

hadley commented Jul 30, 2021

Unfortunately there's no easy way to track down what went wrong here 😞

@hadley hadley closed this as completed Jul 30, 2021
@Deleetdk

Deleetdk commented Aug 4, 2021

The file here https://worldsofjournalism.org/data-d79/data-and-key-tables-2012-2016/ produces the same error, and it is solved by @pdbailey0's suggestion above.

> woj2 = read_sav("data/WJS2 open V4-02 030517.sav")
Error: Failed to parse /science/projects/world of journalism/data/WJS2 open V4-02 030517.sav: Unable to convert string to the requested encoding (invalid byte sequence).
> woj2 = read_sav("data/WJS2 open V4-02 030517.sav", encoding = "latin1")

@pdbailey0
Author

It would be really great to be able to see what encoding is being used when one is not set.

@hadley hadley reopened this Aug 4, 2021
@sam-crawley

Hi,

The problem with the suggested workaround is that it can break the encoding. For example, loading Afrobarometer Wave 6 (available here: https://afrobarometer.org/data/merged-round-6-data-36-countries-2016) in haven 2.3.1:

afb6 <- read_sav("merged_r6_data_2016_36countries2.sav")

levels(haven::as_factor(unique(afb6$COUNTRY)))[[25]]
[1] "São Tomé and Príncipe"

But in haven 2.4.3:

afb6 <- read_sav("merged_r6_data_2016_36countries2.sav", encoding = "latin1")

levels(haven::as_factor(unique(afb6$COUNTRY)))[[25]]
[1] "São Tomé and Príncipe"

The file is UTF-8, and may have something broken about it, but loading it as latin1 may not be the solution?

@hadley
Member

hadley commented Aug 5, 2021

@gorcha does this ring any bells with you? I don't think anything has changed in haven relating to this, so it might be a readstat bug?

@pdbailey0
Author

haven only started documenting ReadStat version numbers in the NEWS in haven 2.4.0 (which uses ReadStat 1.1.5), but the 2.3.0 NEWS mentions SAS "any" encoding, so that was probably ReadStat 1.1.2.

@gorcha
Member

gorcha commented Aug 5, 2021

Hey @hadley, definitely nothing haven related.

I've done a bit of digging (thanks @pdbailey0 for the version tip!) and it's because of this change in ReadStat 1.1.4 WizardMac/ReadStat@a8b0466 - reverting this line to the old code loads this file successfully with the default encoding.

So it looks like something is falling over in iconv, but I'm not sure what exactly. I'll have a poke around and see if I can find something definitive.

@gorcha
Member

gorcha commented Aug 5, 2021

This is pretty obscure, but the short version is SPSS is probably not our friend and doesn't encode UTF-8 properly.

The issue is that SPSS (or at least the version that produced the offending files) appears to store multi-byte Unicode characters using the code point (a single byte) instead of code units (which can be 1 to 4 bytes).

The string that's causing the issue in the Afrobarometer file is "VOTAÇÃO", which shows up in row 39619.
Having a look at the raw SPSS file, the hex representation is 56 4f 54 41 c7 c3 4f - the c7 and c3 represent the Ç and à characters respectively.

The problem is that, using à as an example, C3 is the "code point" representation, but the correct UTF-8 encoding is two bytes - C3 83 (see https://en.wikipedia.org/wiki/%C3%83).

So SPSS uses the correct "code point" representations of these two characters, but they're not the correct binary encoding for UTF-8. They should both be stored as multi-byte characters in a correct UTF-8 encoding.

I'm not deep enough in the ReadStat code to know why it was working fine before, but it's failing now because the string is being forced through iconv (for other very necessary reasons) under the totally fair but incorrect assumption that SPSS was encoding things the way it said it was.
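
To see the iconv failure in miniature, here is a quick R sketch (not part of ReadStat, just illustrative) using the bytes from the hex dump above:

bytes <- as.raw(c(0x56, 0x4f, 0x54, 0x41, 0xc7, 0xc3, 0x4f))  # "VOTAÇÃO" stored as single latin1 bytes
iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")        # succeeds: "VOTAÇÃO"
iconv(rawToChar(bytes), from = "UTF-8", to = "UTF-8")         # NA: c7 c3 is not a valid UTF-8 sequence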

You can get it down to the exact offending cell using:

read_sav("~/Downloads/merged_r6_data_2016_36countries2.sav", col_select = "Q29B", skip = 39618, n_max = 1)

@evanmiller can you please have a look?

@gorcha
Member

gorcha commented Aug 5, 2021

@pdbailey0, any idea what version of SPSS produced these files? It could be a version-specific thing.

@evanmiller
Collaborator

Just FYI, what you are calling the "code point" representation is actually Latin-1, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Is SPSS producing files containing both UTF-8 and Latin-1 data?

@evanmiller
Collaborator

@pdbailey0 If you download the standalone readstat utility, it will report the file's self-reported encoding.

$ readstat binlfp2.sav
Format: SPSS binary file (SAV)
Columns: 8
Rows: 753
Table label: binlfp2
Format version: 2
Text encoding: UTF-8
Byte order: little-endian
Timestamp: 28 Oct 2015 14:34

@pdbailey0
Author

@gorcha I didn't write it. You would have to ask Boston College. Maybe @sam-crawley knows what version wrote his?

@sam-crawley

I was not involved in creating the Afrobarometer file either.

FWIW, I opened the file in SPSS 27. It opens happily, but shows broken/box characters for that one field.

I guess somehow a string with broken encoding got inserted in that field when the file was created (which SPSS perhaps should have detected and thrown an error about?)

However, since messy/broken data is a fact of life, perhaps the right approach is to warn about the problem, rather than to throw an error?

Thanks to everyone for their time on this issue so far.

@gorcha
Member

gorcha commented Aug 6, 2021

Oh my mistake, thanks @evanmiller!

I've had another look, and it seems that in the Afrobarometer file there are just a handful of records with latin1-encoded characters. For example, the Ã shows up mostly as UTF-8 (C3 83) but a few times as latin1 (C3).

So I think you're right @sam-crawley, some funky characters have crept in at some point and SPSS doesn't properly enforce the encoding.

@evanmiller how would you feel about ReadStat copying over invalid bytes unedited rather than throwing an error? Obviously not ideal, but at least consistent with what SPSS does. I've hacked together something along those lines and it fixes this error, but I'm not sure what other nasty flow-on effects there might be or how this would interact with other systems that ReadStat supports.

@gorcha gorcha added the bug and readstat labels Aug 7, 2021
@skalteis

Hi @gorcha,
would you mind sharing your patch for ReadStat/haven to deal with this problem? I could not find anything about it in your forked ReadStat repo, and I have a broken SPSS file here that I have to deal with. I'd very much appreciate it and would be very happy. :-)

Thank you & best regards,
Simon

@gorcha
Member

gorcha commented Aug 19, 2021

Hey @skalteis,

Of course, always happy to help! 🙂
I've pushed the change to the invalid-bytes branch on my ReadStat fork if you want to have a look.

@deschen1

I have the same issue (same error message) with one of my data sets. I didn't understand all of what was said above, but wanted to check whether there's a fix or workaround to prevent/solve this issue?

I'm currently using an older version of haven, but this does not work well with other packages (e.g. the labelled package).

@gorcha
Member

gorcha commented Feb 3, 2022

Hi @deschen1, a fix is in progress (requiring some changes in the underlying ReadStat library).

Unfortunately there's no simple workaround in the meantime, but hoping to get this fixed soon!

@deschen1

deschen1 commented Feb 3, 2022

Thanks for the update nonetheless. And thanks for working on this bug/issue.

@deschen1

FWIW, I have opened a bug report with SPSS, just in case they might be able to do something about the behaviour. Here's their response. Not sure if it helps to solve the issue, though. I highlighted in bold two potentially helpful pieces.

The discussion here #615 (comment)
is about end users employing a third party product ReadStat to read an existing SPSS Statistics system file (*.sav) into their application (not SPSS Statistics).
So immediately we have questions about:
Did SPSS Statistics produce this system file?
If it did, did it warn upon loading these data?
Was SPSS Statistics correctly setup when it built this file? (we have many troubles with our customers randomly switching back and forth between Codepage mode and Unicode mode)
See the continued discussion in your first link:
#615 (comment)
The problem is a mix of UTF-8 and Latin1 (Codepage) characters in the example data file.
SPSS Statistics will treat whatever you put into the system file as valid. In this case, the file was created with garbage text. I suspect, if it was created in SPSS Statistics, a warning was thrown when the original data was ingested prior to saving as '.sav'.

@gorcha
Member

gorcha commented Feb 21, 2022

Thanks @deschen1! Good to know, it confirms that SPSS doesn't enforce the specified character encoding.

@aito123

aito123 commented Oct 12, 2024

A solution that worked for me was to turn off Unicode mode inside SPSS.
Open SPSS, create new syntax, run this code:

SET UNICODE OFF.

After that, open the dataset and save it again. Credit to this post for the solution:
https://stackoverflow.com/questions/3136293/read-spss-file-into-r

I wish there were a more automatic solution inside R, though...
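
For what it's worth, here is a rough R-side sketch of that sort of repair (assuming the file parses once encoding = "latin1" is passed and that the garbled cells follow the usual UTF-8-read-as-latin1 pattern; the file name and the choice to skip labelled columns are illustrative only):

library(haven)

x <- read_sav("merged_r6_data_2016_36countries2.sav", encoding = "latin1")

fix_mixed <- function(v) {
  # only touch plain character columns; labelled columns would need their value labels repaired too
  if (!is.character(v) || inherits(v, "haven_labelled")) return(v)
  bytes <- iconv(v, from = "UTF-8", to = "latin1")     # undo the latin1 -> UTF-8 step: back to the file's raw bytes
  utf8  <- iconv(bytes, from = "UTF-8", to = "UTF-8")  # reinterpret those bytes as UTF-8; NA where they were genuinely latin1
  ifelse(is.na(utf8), v, utf8)                         # keep the latin1 reading for the truly broken cells
}

x[] <- lapply(x, fix_mixed)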
