writing .sav files with international chars in variable names leads to crash #5

Open
jonathon-love opened this issue May 27, 2020 · 37 comments

@jonathon-love
Member

hey @ofajardo,

do you have this problem?

with thanks

@ofajardo

hi

no, actually I don't see the issue on my side, either in pyreadstat or in pyreadr ...

let me know if I can be of any help

@jonathon-love
Member Author

hmm, no, i'm able to save files with international chars ...

are you able to export this .csv as a .sav?

fred.csv.zip

@ofajardo

ofajardo commented May 27, 2020

ah this is an interesting one. Right now I can't save it with this error:

pyreadstat._readstat_parser.PyreadstatError: variable name  השכלה starts with an illegal (non-alphabetic) character

Now, this error comes from my code. I added it because if a variable name starts with a number, Readstat exits with an error saying something like "illegal character", but you don't know which variable caused it. So I added a check on each variable name to give a better error message:

if not variable_name[0].isalpha():
    raise PyreadstatError("variable name %s starts with an illegal (non-alphabetic) character" % variable_name)

so, now I wonder what's the best way to check this. Let me deactivate that guard anyway to see what Readstat would do with this variable.

Also it's interesting that I previously tried the variable name ábaco and that worked well. Also interesting: this is not the first variable, so somehow Python thinks that the character ה is non-alphabetic. This variable is the only one with that characteristic.

@ofajardo

and as expected, if I disable my guard the Readstat error appears:

ReadstatError: A provided name contains an illegal character

@ofajardo

ohhh, this variable name has a space in front, meaning the first character is code point 32 (a space).

so it seems pyreadstat is doing it correctly, although the error message was a bit difficult to interpret
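
The leading space was hard to spot because `%s` prints it invisibly. A minimal sketch of a guard that formats the name with `repr()` so whitespace shows up in the message; `check_first_char` is a hypothetical helper and it raises `ValueError` to stay self-contained, so it is not pyreadstat's actual code:

```python
def check_first_char(variable_name):
    """Hypothetical guard: raise if the name does not start with a letter."""
    first = variable_name[:1]
    if not first.isalpha():
        # %r (repr) puts quotes around the name, so a leading space is visible
        raise ValueError(
            "variable name %r starts with an illegal (non-alphabetic) "
            "character %r (code point %s)"
            % (variable_name, first, ord(first) if first else "n/a")
        )


# The Hebrew name from the error above, written with escapes
hebrew_name = "\u05d4\u05e9\u05db\u05dc\u05d4"
check_first_char(hebrew_name)  # passes: the first character is a letter

try:
    check_first_char(" " + hebrew_name)  # same name with a leading space
except ValueError as err:
    message = str(err)
    print(message)
```

With the quoted name and the code point in the message, the space is easier to notice than in the original error text.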

@ofajardo

ofajardo commented May 27, 2020

OK, now I stripped the space in front of that character and pyreadstat does not complain anymore, however Readstat does:

A provided name contains an illegal character

it doesn't like the name of column 16: this time it is because the name has a space in it. I just tried, and it seems that Readstat indeed doesn't like spaces, so yet another thing to check!

I replaced all the spaces in the variable names with _ and now it saves fine
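
A minimal sketch of that cleanup step, assuming the column names are plain Python strings; `clean_names` and its `replacement` parameter are hypothetical names used only for illustration:

```python
def clean_names(names, replacement="_"):
    """Strip surrounding whitespace and replace inner whitespace runs."""
    cleaned = []
    for name in names:
        # str.split() with no arguments splits on any run of whitespace
        # and ignores leading/trailing whitespace
        cleaned.append(replacement.join(name.split()))
    return cleaned


columns = [" age", "first name", "\u00e1baco"]
with_underscores = clean_names(columns)
without_spaces = clean_names(columns, "")
print(with_underscores)
print(without_spaces)
```

With a pandas dataframe this could presumably be applied as `df.columns = clean_names(df.columns)`. Note that cleaning can make two different names collide, which this sketch does not check for.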

@jonathon-love
Member Author

ah, thanks, this is very helpful ... but where/how is that error thrown?

readstat_variable_t *readstat_add_variable(readstat_writer_t *writer, const char *name, readstat_type_t type, 
        size_t storage_width);

this doesn't return an error code.

@jonathon-love
Member Author

oh righto, you have to call readstat_begin_row() ...

@jonathon-love
Member Author

I replaced all the spaces in the variable names with _ and now it saves fine

but does that same file open fine?

@ofajardo

for me yes

@jonathon-love
Member Author

oh! well you're doing better than me and haven.

could you send me through that file?

with thanks

@ofajardo

OK sorry, I was wrong: if I replace the spaces with _ then the file saves but cannot be opened:

Invalid file, or file has unsupported features

however if I just remove the spaces (replace " " by "") then the file saves and can be opened correctly. I can send you the file if still interested.

But yeah, it seems there is a Readstat bug here.

@ofajardo

interestingly this works, meaning it is not the international characters per se, there is something else:

>>> df2 = pd.DataFrame([[1,2],[3,4]], columns = ["á_1","á_2"])
>>> pyreadstat.write_sav(df2, "underscore.sav")
>>> df3, meta3 = pyreadstat.read_sav("underscore.sav")
>>> df3
   á_1  á_2
0  1.0  2.0
1  3.0  4.0

@jonathon-love
Member Author

yeah, i'm coming down on the side of "something else" too.

@ofajardo

ofajardo commented May 28, 2020

it is column 37 specifically. Not sure what is special about it

@jonathon-love
Member Author

i think it's more likely that it's got itself into a bad state, and it just happens to fall over at column 37 ... try deleting the first 36 columns and see if it still fails.

@ofajardo

no, I have a dataframe with only column 37 and the error is reproducible
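
Testing each column on its own, as described here, can be sketched roughly as follows. `failing_columns` and the `write_fails` callable are hypothetical; in practice `write_fails` would wrap an attempt to write a dataframe containing only those columns and read it back, while the stand-in below just flags names with a space:

```python
def failing_columns(columns, write_fails):
    """Return the columns that still fail when written on their own."""
    return [col for col in columns if write_fails([col])]


def fake_write_fails(cols):
    # Stand-in predicate for illustration only: pretend that any
    # name containing a space makes the write fail.
    return any(" " in c for c in cols)


suspects = failing_columns(["a", "b c", "d"], fake_write_fails)
print(suspects)
```

If a column fails in isolation, the problem is tied to that column rather than to state accumulated from earlier columns.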

@jonathon-love
Member Author

oh! good work!

@ofajardo

and in particular it is the 4th (counting from 0) character in that name

@jonathon-love
Member Author

chuckles, oh wow!

@ofajardo

ok wait, I made that last statement too fast

@ofajardo

so to reproduce the problem I need the first 5 characters of the name. If the first or the last of them is not included, the error doesn't occur
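
Narrowing a name down to the shortest piece that still reproduces the error can be sketched as a greedy shrink from both ends. `shrink` and the `fails` predicate are hypothetical; the real predicate would try to write and re-read a file with that variable name, while the stand-in below only looks for a fixed substring:

```python
def shrink(name, fails):
    """Greedily drop characters from either end while the failure persists."""
    if not fails(name):
        raise ValueError("start from a name that reproduces the problem")
    changed = True
    while changed and len(name) > 1:
        changed = False
        # Try removing the first character, then the last one
        for candidate in (name[1:], name[:-1]):
            if fails(candidate):
                name = candidate
                changed = True
                break
    return name


def fake_fails(s):
    # Stand-in predicate for illustration only
    return "x_y" in s


minimal = shrink("abx_ycd", fake_fails)
print(minimal)
```

Because the slicing works on a `str`, it removes whole characters, never partial UTF-8 sequences.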

@jonathon-love
Member Author

do you mean 5 characters, or 5 bytes?

if the first byte is missing, that would change all the unicode boundaries?

@ofajardo

characters, I am handling it as a str object in Python

@ofajardo

among those 5 characters we have the underscore. Removing the underscore also cures the issue

@jonathon-love
Member Author

hahahaha ... this is a good one :P

@ofajardo

these are the bytes
b'\xd7\x95\xd7\xaa\xd7\xa7_\xd7\x91'
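
To relate these bytes to the "5 characters or 5 bytes" question, a short sketch that decodes them as UTF-8 and lists each character with its code point and encoded length (only the standard library is used; the variable names are illustrative):

```python
import unicodedata

raw = b'\xd7\x95\xd7\xaa\xd7\xa7_\xd7\x91'
name = raw.decode("utf-8")

print(len(raw), "bytes,", len(name), "characters")
for ch in name:
    encoded = ch.encode("utf-8")
    # Each Hebrew letter takes 2 bytes in UTF-8, the underscore takes 1
    print("U+%04X" % ord(ch), unicodedata.name(ch), len(encoded), "byte(s)")
```

So the 5 characters occupy 9 bytes: four 2-byte Hebrew letters plus a 1-byte underscore.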

@ofajardo

well, I have no idea what's going on. But now that the issue is isolated, should we file an issue in Readstat?

@jonathon-love
Member Author

those bytes work for me:

Screen Shot 2020-05-28 at 20 00 01

@ofajardo

what is interesting is that visually the word looks different in your screenshot or if I paste it here (ותק_ב) compared to what I see on my console
image
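
One possible explanation for the visual difference, not confirmed anywhere in this thread, is that consoles, editors and browsers lay out right-to-left text differently, especially around a neutral character such as the underscore. A sketch that compares the string by code point instead of by appearance:

```python
import unicodedata

name = b'\xd7\x95\xd7\xaa\xd7\xa7_\xd7\x91'.decode("utf-8")

# Bidirectional class of each character: 'R' is right-to-left,
# 'ON' is "other neutral", whose placement depends on the renderer
classes = [unicodedata.bidirectional(ch) for ch in name]
print(classes)

# An ASCII-only view of the string that no renderer can reorder
escaped = name.encode("unicode_escape").decode("ascii")
print(escaped)
```

If two strings have the same escaped form, they are the same string regardless of how each tool draws them.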

@jonathon-love
Member Author

yeah ok ... there's certainly something unusual going on. i've gotta go shopping, and then probably to bed, but if you're confident with the issue then feel free to report it to readstat.

@ofajardo

this is really insane. I saved the thing to a csv to share in the Readstat issue ... if I open that in an editor it again shows different characters. If I read the csv in pandas, it looks different again ... but it is fully reproducible: if I read the csv and try to save it to sav, same issue. I wonder if it is a Python bug rather than a Readstat one? Maybe it is worth trying with the Readstat command line before submitting the issue. I'll try to do it if I get time.

@ofajardo

Ok, I didn't manage to convert the csv to anything with the Readstat CLI; it is supposed to support such an operation, but for me it doesn't work.
Attached is the suspicious csv. Maybe tomorrow or some other day you can try to see if jamovi handles it correctly?

hebrew.csv.zip

@jonathon-love
Member Author

yeah, i tried to use the cli, but the docker image wouldn't build for me.

jamovi will open that .csv though:

Screen Shot 2020-05-28 at 20 55 35

@jonathon-love
Member Author

oh wait, of course jamovi will open the .csv ... chuckles ... total brain fart ... one sec.

@jonathon-love
Member Author

yup, if i read that in, and write it with jamovi, the .sav file can't be opened. same with writing the file with haven.

@ofajardo

ofajardo commented Jun 3, 2020

after the latest Readstat fix, the whole file can be written to sav and read back from sav (after replacing spaces with underscores)
