writing .sav files with international chars in variable names leads to crash #5

jonathon-love · 2020-05-27T02:20:03Z

hey @ofajardo,

do you have this problem?

with thanks

ofajardo · 2020-05-27T06:19:29Z

hi

no, actually I don't see the issue on my side, either in pyreadstat or in pyreadr ...

let me know if I can be of any help

jonathon-love · 2020-05-27T06:45:55Z

hmm, no, i'm able to save files with international chars ...

are you able to export this .csv as a .sav?

fred.csv.zip

ofajardo · 2020-05-27T13:22:53Z

ah this is an interesting one. Right now I can't save it with this error:

pyreadstat._readstat_parser.PyreadstatError: variable name  השכלה starts with an illegal (non-alphabetic) character

Now, this error comes from my code. I added it because if you have a variable name starting with a number, Readstat exits with an error saying something like "illegal character", but you don't know which variable it comes from, so I added a check on each variable name to give a better error message:

if not variable_name[0].isalpha():
    raise PyreadstatError("variable name %s starts with an illegal (non-alphabetic) character" % variable_name)

so, now I wonder what's the best way to check this. let me deactivate that guard anyway to see what readstat would do with this variable.

Also it's interesting that earlier I tried the variable name ábaco and that worked well. Also interesting that this is not the first variable, so somehow Python thinks that the character ה is non-alphabetic. This variable is the only one with that characteristic.

ofajardo · 2020-05-27T13:33:47Z

and, as expected, if I disable my guard the Readstat error appears:

ReadstatError: A provided name contains an illegal character

ofajardo · 2020-05-27T13:45:30Z

ohhh, this variable name has a space in front, meaning the first character is 32.

so it seems pyreadstat is doing it correctly, although the error message was a bit difficult to interpret
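To see why the guard fired, it helps to look at the first character directly. This is only a minimal sketch: the helper `describe_first_char` is hypothetical (not part of pyreadstat), and the name is written with escapes so the leading space stays visible regardless of how the Hebrew is rendered.

```python
def describe_first_char(name):
    """Return (code point, isalpha) for the first character of a name."""
    first = name[0]
    return ord(first), first.isalpha()

# The column name from the error message, including its leading space.
name = " \u05d4\u05e9\u05db\u05dc\u05d4"

print(describe_first_char(name))          # (32, False): the space
print(describe_first_char(name.strip()))  # (1492, True): the Hebrew letter he

# Formatting with %r instead of %s makes stray whitespace visible in a message.
print("variable name %r starts with a non-alphabetic character" % name)
```

Using `%r` in the error message is just one option for making this kind of problem easier to spot; it is not what pyreadstat does.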

ofajardo · 2020-05-27T14:00:58Z

OK, now I stripped the space in front of that character and pyreadstat does not complain anymore, however Readstat does:

A provided name contains an illegal character

it doesn't like the name of column 16, because it has a space in it. I just tried and it seems that indeed Readstat doesn't like spaces, so yet another thing to check!

I replaced all the spaces in the variables names by _ and now it is saved fine
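A rough sketch of that kind of clean-up, in plain Python. The function `sanitize_names` and its `replacement` parameter are made up for illustration; it only handles whitespace and does not check for duplicate names or any other rule Readstat may enforce.

```python
def sanitize_names(names, replacement="_"):
    """Strip surrounding whitespace and replace inner whitespace runs."""
    cleaned = []
    for name in names:
        # str.split() with no arguments drops leading/trailing whitespace
        # and splits on any run of whitespace characters.
        cleaned.append(replacement.join(name.split()))
    return cleaned

print(sanitize_names([" age", "first name", "ok"]))  # ['age', 'first_name', 'ok']
```

With a pandas dataframe this could be applied as `df.columns = sanitize_names(df.columns)`; passing `replacement=""` removes the spaces instead of replacing them.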

jonathon-love · 2020-05-28T06:23:15Z

ah, thanks, this is very helpful ... but where/how is that error thrown?

readstat_variable_t *readstat_add_variable(readstat_writer_t *writer, const char *name, readstat_type_t type, 
        size_t storage_width);

this doesn't return an error code.

jonathon-love · 2020-05-28T07:28:55Z

oh righto, you have to call readstat_begin_row() ...

jonathon-love · 2020-05-28T07:30:11Z

"I replaced all the spaces in the variables names by _ and now it is saved fine"

but does that same file open fine?

ofajardo · 2020-05-28T07:34:15Z

for me yes

jonathon-love · 2020-05-28T07:47:49Z

oh! well you're doing better than me and haven.

could you send me through that file?

with thanks

ofajardo · 2020-05-28T08:11:12Z

OK sorry, I was wrong: if I replace the spaces by _ then the file saves but cannot be opened:

Invalid file, or file has unsupported features

however if I just remove the spaces (replace " " by "") then the file saves and can be opened correctly. I can send you the file if still interested.

But yeah, it seems there is a Readstat bug here.

ofajardo · 2020-05-28T08:20:14Z

interestingly this works, meaning it is not the international characters per se, there is something else:

>>> import pandas as pd
>>> import pyreadstat
>>> df2 = pd.DataFrame([[1,2],[3,4]], columns = ["á_1","á_2"])
>>> pyreadstat.write_sav(df2, "underscore.sav")
>>> df3, meta3 = pyreadstat.read_sav("underscore.sav")
>>> df3
   á_1  á_2
0  1.0  2.0
1  3.0  4.0

jonathon-love · 2020-05-28T08:31:38Z

yeah, i'm coming down on the side of "something else" too.

ofajardo · 2020-05-28T09:44:44Z

it is column 37 specifically. Not sure what is special about it.

jonathon-love · 2020-05-28T09:46:31Z

i think it's more likely that it's got itself into a bad state, and it just happens to fall over at column 37 ... try deleting the first 36 columns and see if it still fails.

ofajardo · 2020-05-28T09:47:10Z

no, I have a dataframe with only column 37 and the error is reproducible
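Testing each column on its own can be sketched like this. The function `failing_columns` and the `roundtrips_ok` callback are hypothetical; the stand-in check below only pretends that some names fail, so the example runs without any .sav file.

```python
def failing_columns(columns, roundtrips_ok):
    """Return the columns for which the round-trip check fails on their own."""
    bad = []
    for col in columns:
        try:
            ok = roundtrips_ok([col])
        except Exception:
            # A raised error counts as a failure for that column.
            ok = False
        if not ok:
            bad.append(col)
    return bad

# Stand-in check for illustration only: pretend names containing "x_" fail.
def fake_check(cols):
    return not any("x_" in c for c in cols)

print(failing_columns(["a", "bx_c", "d"], fake_check))  # ['bx_c']
```

A real check would presumably write a one-column dataframe with `pyreadstat.write_sav` and read it back with `pyreadstat.read_sav`, returning False (or raising) when either step fails.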

jonathon-love · 2020-05-28T09:48:19Z

oh! good work!

ofajardo · 2020-05-28T09:49:48Z

and in particular it is the 4th (counting from 0) character in that name

jonathon-love · 2020-05-28T09:50:21Z

chuckles, oh wow!

ofajardo · 2020-05-28T09:50:41Z

ok wait, I made that last statement too fast

ofajardo · 2020-05-28T09:52:08Z

so to reproduce the problem I need the first 5 characters of the name. If the first or the last is not included, the error doesn't occur
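Narrowing a name down to the smallest piece that still triggers the error can be done mechanically. A minimal sketch, assuming a hypothetical predicate `fails(name)` that reports whether a given name reproduces the problem; the predicate used here is a toy stand-in, not the real write/read test.

```python
def shortest_failing_substring(name, fails):
    """Return the shortest contiguous substring of name for which fails() is true.

    Substrings are tried from shortest to longest, left to right. Returns
    None if no substring (including the full name) fails.
    """
    n = len(name)
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            candidate = name[start:start + length]
            if fails(candidate):
                return candidate
    return None

# Toy predicate: "fails" only if the piece starts with "b" and ends with "e".
toy_fails = lambda s: s.startswith("b") and s.endswith("e")

print(shortest_failing_substring("abcdefg", toy_fails))  # bcde
```

Slicing a Python `str` works on characters (code points), not bytes, so the pieces never cut a multi-byte character in half.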

jonathon-love · 2020-05-28T09:54:02Z

do you mean 5 characters, or 5 bytes?

if the first byte is missing, that would change all the unicode boundaries?

ofajardo · 2020-05-28T09:54:33Z

characters, I am handling it as a str object in Python

ofajardo · 2020-05-28T09:55:58Z

among those 5 characters we have the underscore. Removing the underscore also cures the issue

jonathon-love · 2020-05-28T09:56:50Z

hahahaha ... this is a good one :P

ofajardo · 2020-05-28T09:57:48Z

these are the bytes
b'\xd7\x95\xd7\xaa\xd7\xa7_\xd7\x91'
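For the characters-versus-bytes question, the posted bytes can be decoded and counted both ways. A small sketch, assuming the bytes are UTF-8 (which they decode as cleanly):

```python
raw = b'\xd7\x95\xd7\xaa\xd7\xa7_\xd7\x91'
name = raw.decode("utf-8")

print(len(raw))   # 9 bytes
print(len(name))  # 5 characters (code points)

# Bytes used by each character when encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in name]
print(widths)     # [2, 2, 2, 1, 2]
```

Each Hebrew letter takes two bytes and the underscore takes one, so dropping the first character removes two bytes, not one.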

ofajardo · 2020-05-28T09:59:05Z

well, I have no idea what's going on. But now that the issue is isolated, should we file an issue in Readstat?

jonathon-love · 2020-05-28T10:00:34Z

those bytes work for me:

ofajardo · 2020-05-28T10:02:34Z

what is interesting is that visually the word looks different in your screenshot or if I paste it here (ותק_ב) compared to what I see on my console
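One possible reason for the visual difference (an assumption, not something confirmed in this thread) is that Hebrew letters are laid out right-to-left while the underscore has no direction of its own, and consoles, editors and browsers handle that differently. Printing the code points sidesteps rendering entirely; the helper `describe` below is hypothetical.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, bidi class) for each character."""
    return [
        ("U+%04X" % ord(ch),
         unicodedata.name(ch, "<unnamed>"),
         unicodedata.bidirectional(ch))
        for ch in text
    ]

name = "\u05d5\u05ea\u05e7_\u05d1"  # same characters as the bytes posted above
for row in describe(name):
    print(row)
```

If two strings produce the same rows here, they are the same string in memory, however differently they appear on screen.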

jonathon-love · 2020-05-28T10:04:05Z

yeah ok ... there's certainly something unusual going on. i've gotta go shopping, and then probably to bed, but if you're confident with the issue then feel free to report it to readstat.

ofajardo · 2020-05-28T10:14:39Z

this is really insane: I saved the thing to a csv to share in the Readstat issue ... if I open that in an editor it again shows different characters. If I read the csv in pandas, it differs again ... but it is fully reproducible: if I read the csv and try to save it to sav, same issue. I wonder if it is a Python bug rather than a Readstat one? Maybe it's worth trying with the Readstat command line before submitting the issue. I'll try to do it if I get time.

ofajardo · 2020-05-28T10:41:22Z

OK, I didn't manage to convert the csv to anything with the readstat CLI; it is supposed to support such an operation but it doesn't work for me.
Attached the suspicious csv. Maybe tomorrow or some other day you can try to see if jamovi handles it correctly?

hebrew.csv.zip

jonathon-love · 2020-05-28T10:56:37Z

yeah, i tried to use the cli, but the docker image wouldn't build for me.

jamovi will open that .csv though:

jonathon-love · 2020-05-28T10:57:58Z

oh wait, of course jamovi will open the .csv ... chuckles ... total brain fart ... one sec.

jonathon-love · 2020-05-28T11:00:22Z

yup, if i read that in, and write it with jamovi, the .sav file can't be opened. same with writing the file with haven.

ofajardo · 2020-06-03T08:13:56Z

after the last Readstat fix, the whole file can be written to sav and read from sav again (after replacing spaces by underscore)

jonathon-love mentioned this issue May 28, 2020

readstat won't write a .sav file with a variable named ותק_ב correctly WizardMac/ReadStat#206

Closed
