writing .sav files with international chars in variable names leads to crash #5
Comments
hi, no, actually I don't see the issue on my side, either in pyreadstat or in pyreadr ... let me know if I can be of any help |
hmm, no, i'm able to save files with international chars ... are you able to export this .csv as a .sav? |
ah this is an interesting one. Right now I can't save it; I get this error:
Now, this error comes from my code. I added it because if you have a variable name starting with a number, Readstat exits with an error saying something like illegal character, but you don't know where it comes from, so I added a check on each variable name to give a better error message:
so, now I wonder what's the best way to check this. Let me deactivate that guard anyway to see what Readstat would do with this variable. It's also interesting that earlier I tried the variable name ábaco and that worked well. Also interesting is that this is not the first variable, so somehow Python thinks that the character ה is non-alphabetic. This variable is the only one with that characteristic. |
and as expected if I disable my guard the expected Readstat error comes:
|
ohhh, this variable name has a space in front, meaning the first character is code point 32. So it seems pyreadstat is doing it correctly, although the error message was a bit difficult to interpret |
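For anyone following along, the behaviour described above can be reproduced with plain Python. This is only a minimal sketch of a first-character check, not pyreadstat's actual guard; the function name `check_first_character` is hypothetical.

```python
def check_first_character(name):
    """Return None if the first character is alphabetic, else a readable reason.

    Hypothetical helper for illustration; not the code used inside pyreadstat.
    """
    if not name:
        return "variable name is empty"
    first = name[0]
    if not first.isalpha():
        # Report the code point so invisible characters (like a space) stand out.
        return (f"variable name {name!r} starts with {first!r} "
                f"(code point {ord(first)}), which is not alphabetic")
    return None


# The Hebrew letter itself is alphabetic; the leading space is what fails.
print(check_first_character("ábaco"))
print(check_first_character("ה"))
print(check_first_character(" ה"))
```

Printing the code point in the message is a design choice to make a leading space visible, since it is easy to miss when the name is shown on its own.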
OK, now I stripped the space in front of that character and pyreadstat does not complain anymore, however Readstat does:
it doesn't like the name of column 16: this one is because it has a space in it. I just tried, and it seems that Readstat indeed doesn't like spaces, so yet another thing to check! I replaced all the spaces in the variable names with _ and now it saves fine |
ah, thanks, this is very helpful ... but where/how is that error thrown?
this doesn't return an error code. |
oh righto, you have to call |
but does that same file open fine? |
for me yes |
oh! well you're doing better than me and haven. could you send me through that file? with thanks |
OK sorry, I was wrong: if I replace the spaces with _ then the file saves but cannot be opened:
however, if I just remove the spaces (replace " " with "") then the file saves and can be opened correctly. I can send you the file if you're still interested. But yeah, it seems there is a Readstat bug here. |
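The two clean-ups compared above (removing spaces versus replacing them with an underscore) can be sketched in plain Python as follows. The helper name `clean_names` is hypothetical, and it also strips leading and trailing whitespace, which is an assumption based on the earlier leading-space problem.

```python
def clean_names(names, replacement=""):
    """Strip surrounding whitespace and replace internal whitespace runs.

    Hypothetical helper: replacement="" removes spaces, replacement="_"
    substitutes an underscore. Distinct names may collide after cleaning,
    so the result should be checked for duplicates.
    """
    # str.split() with no argument drops leading/trailing whitespace and
    # splits on any run of whitespace, so join() rebuilds the cleaned name.
    return [replacement.join(name.split()) for name in names]


names = [" ה", "my var"]
print(clean_names(names))        # spaces removed
print(clean_names(names, "_"))   # spaces replaced by underscores
```

With a pandas dataframe this could be applied as `df.columns = clean_names(df.columns)` before writing the file.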
interestingly this works, meaning it is not the international characters per se, there is something else:
|
yeah, i'm coming down on the side of "something else" too. |
it is column 37 specifically. Not sure what is special about it |
i think it's more likely that it's got itself into a bad state, and it just happens to fall over at column 37 ... try deleting the first 36 columns and see if it still fails. |
no, I have a dataframe with only column 37 and the error is reproducible |
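The isolation step used here (testing one column at a time) can be sketched generically. Both `find_failing_columns` and the `round_trip_ok` callback are hypothetical names; in this thread the callback would presumably write a single-column dataframe with `pyreadstat.write_sav` and try to read it back with `pyreadstat.read_sav`, but the sketch below uses a stand-in predicate so it runs without any files.

```python
def find_failing_columns(columns, round_trip_ok):
    """Return the columns that fail when written and read back on their own.

    round_trip_ok is a caller-supplied function (an assumption of this sketch)
    that takes a list of column names and returns True if a file containing
    only those columns can be saved and reopened.
    """
    return [col for col in columns if not round_trip_ok([col])]


# Stand-in predicate for illustration only: pretend names with a space fail.
def fake_round_trip_ok(cols):
    return all(" " not in col for col in cols)


print(find_failing_columns(["age", "my var", "income"], fake_round_trip_ok))
```

Testing each column alone distinguishes a problem with one specific name from a writer that has drifted into a bad state after earlier columns.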
oh! good work! |
and in particular it is the 4th (counting from 0) character in that name |
chuckles, oh wow! |
ok wait, I made that last statement too fast |
so to reproduce the problem I need the first 5 characters of the name. If the first or the last is not included the error doesn't come |
do you mean 5 characters, or 5 bytes? if the first byte is missing, that would change all the unicode boundaries? |
characters, I am dealing with it as a str object in python |
among those 5 characters we have the underscore. Removing the underscore also cures the issue |
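To make the characters-versus-bytes distinction concrete, here is a small sketch. The name used is a made-up example (four Hebrew letters plus an underscore), not the actual variable name from the file, and `describe` is a hypothetical helper.

```python
def describe(name):
    """Return (number of characters, number of UTF-8 bytes) for a str."""
    return len(name), len(name.encode("utf-8"))


# Hypothetical 5-character name: four Hebrew letters and an underscore.
name = "\u05d0\u05d1\u05d2\u05d3_"
encoded = name.encode("utf-8")

print(describe(name))   # characters vs. bytes differ for non-ASCII text
print(encoded.hex(" "))  # the raw bytes, two per Hebrew letter

# Dropping the first *character* removes a whole 2-byte sequence,
# so the rest of the string keeps valid UTF-8 boundaries.
assert name[1:].encode("utf-8") == encoded[2:]

# Dropping only the first *byte* leaves a stray continuation byte,
# which is no longer valid UTF-8.
try:
    encoded[1:].decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid after dropping one byte:", exc.reason)
```

Slicing a Python `str` always cuts at character boundaries, so truncating the name character by character does not by itself shift the byte boundaries of the remaining text.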
hahahaha ... this is a good one :P |
these are the bytes |
well, I have no idea what's going on. But now that the issue is isolated, should we file an issue in Readstat? |
yeah ok ... there's certainly something unusual going on. i've gotta go shopping, and then probably to bed, but if you're confident with the issue then feel free to report it to readstat. |
this is really insane, I saved the thing to a csv to share in the Readstat issue ... if I open that in an editor it shows different characters again. If I read the csv in pandas, they look different yet again ... but it is fully reproducible: if I read the csv and try to save it to sav, I get the same issue. I wonder if it is a Python bug rather than a Readstat one? Maybe it's worth trying with the Readstat command line before submitting the issue. I'll try to do it if I get time. |
OK, I didn't manage to convert the csv to anything with the Readstat CLI; it is supposed to support such an operation, but for me it doesn't work. |
oh wait, of course jamovi will open the .csv ... chuckles ... total brain fart ... one sec. |
yup, if i read that in, and write it with jamovi, the .sav file can't be opened. same with writing the file with haven. |
after the last Readstat fix, the whole file can be written to sav and read from sav again (after replacing spaces with underscores) |
hey @ofajardo,
do you have this problem?
with thanks