df.to_stata should automatically write in format 117 with wide strings #23564
Comments
It would also be good to remove this warning in the documentation, which says that writing strings wider than 244 characters is prohibited. Should I make a new issue for that?
I would argue against automatic promotion to 117 since this is adding some magic -- the default command has I think a more helpful error would be good, one that mentions that setting version=117 allows for arbitrary length strings. The error message instructing users to consider 117 should also inform them of the limitations -- specifically that files exported in format 117 can only be read by Stata 13 or newer. (FWIW 114 goes back to Stata 10
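To make the suggestion concrete, here is a rough sketch of what such an error could look like. It is only an illustration: `check_string_width` and `MAX_FIXED_WIDTH_114` are hypothetical names, not part of pandas, and the 244-character limit and the Stata 13 requirement are the ones quoted in this thread.

```python
# Widest fixed-width string allowed in format 114, as quoted in this thread.
MAX_FIXED_WIDTH_114 = 244


def check_string_width(column: str, width: int, version: int = 114) -> None:
    """Hypothetical check (not pandas' implementation).

    Raises a ValueError that points the user at version=117 and states
    the compatibility cost of switching formats.
    """
    if version == 114 and width > MAX_FIXED_WIDTH_114:
        raise ValueError(
            f"Column '{column}' holds strings up to {width} characters wide, "
            f"but format 114 is limited to {MAX_FIXED_WIDTH_114}. "
            "Pass version=117 to write longer strings; note that files in "
            "format 117 can only be read by Stata 13 or newer."
        )
```

The format is never changed silently here; the user has to opt in to 117 after reading about the trade-off.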
What about changing the default Stata 13 was released 5.5 years ago, so I think it's a safe wager that the vast majority of users have 13, 14, or 15.
I always find automagic choices to be too magic. I would prefer to move to 117 by default with a standard deprecation cycle than to have data-dependent format choices.
I think you might be surprised how many institutions are not current on Stata and only upgrade idiosyncratically.
Perhaps you're right. I think updating the docs and error message would be helpful at least. Additionally, the 117 format can write fixed-width strings up to 2045 characters long. Currently, writing a string that's 250 characters long in the 117 format saves it as an
@bashtage : Deprecation cycle also works. However, keep in mind that the proposal of auto-changing to 117 would have been accompanied by a warning message if we had to auto-convert, for transparency reasons (anti-magic), as you have alluded to.
Yes, but I think if one wants to move to 117 as default it is OK. 117 has strictly more features, with the only misfeature a lack of backward compat. 114 will still be available.
@kylebarron A PR to enable writing for up to 2045 would be fine. IIRC this choice was implemented to simplify and improve reuse. IMO there are no real advantages to writing 1500-character fixed-width strings when compared to strLs, and it can really blow up the size of a file if the string sizes vary much. If you want to contribute, the most important contribution would be to add support for the 118 format, which adds unicode. This is a non-trivial change, since the main writer would need to be rewritten because the output is (sort of) variable width.
I'd love to try out a non-trivial PR! It doesn't appear that there's an existing issue for writing unicode to Stata. (#9882 looks like it was only concerning reading version 118 files). Should I create one? I agree with your conclusion that there isn't really an advantage to wider fixed width strings, so I won't touch that.
Sure. Adds a useful export/import capability.
I ran into this just now and noticed that (a) there's 117 writing functionality, and (b) there's no suggestion in the error. I use pandas and Stata a lot, so I was surprised to see it.
In my crowd of colleagues (i.e. academics who aren't the fastest upgraders), one of these rules would work (assuming Stata stays on a roughly two-year cycle):
For an initial change, maybe the 0.25.0 to 1.0 cycle would be a good time to change it. If your code is (possibly) going to break anyway, why not add one more easy fix?
I made a PR for improving the error message: #23629.
I think this can be closed. I think the consensus best choice was implemented in #23629, which is to give users the option. I think the only other issue is whether 117 should become the default output format. This would require a deprecation cycle, and by the time it is finished Stata capable of reading 117 will be 6+ years old, so I think it is probably a good idea.
Sounds good. We can always revisit if need be. Thanks!
I've been trying to use version 117 exactly because of this, but I get the following error: TypeError: to_stata() got an unexpected keyword argument 'version'. Any clues? Edit: I'm using Python 3.6
You probably need to update pandas
Code Sample, a copy-pastable example if possible
Problem description
The last line above raises an exception:
but is solved with:
This functionality (writing in dta format 117) was added in version 0.23. In my opinion, the Stata writer should automatically switch to version 117 if one of the columns is wider than 244 characters. At the least, the error message should be changed to note that as of version 0.23, it's possible to write long strings to Stata files by adding version=117.
I'd be happy to submit a PR if this functionality is desired.
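For reference, a minimal sketch of the behavior described above, assuming pandas 0.23 or newer where `to_stata` accepts `version`. The column name, the 300-character value and the temporary file path are made up for illustration; the exact wording of the exception depends on the pandas version.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: one string column wider than the 244-character limit
# of the default dta format.
df = pd.DataFrame({"long_text": ["x" * 300]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wide.dta")

    # Default format: expected to be rejected because of the wide column.
    try:
        df.to_stata(path, write_index=False)
        default_failed = False
    except ValueError as exc:
        default_failed = True
        print("default format failed:", exc)

    # Format 117 (readable by Stata 13 or newer) accepts the wide column.
    df.to_stata(path, write_index=False, version=117)
    roundtrip = pd.read_stata(path)
```

Reading the file back with `pd.read_stata` is a quick way to confirm that the long value was written in full.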
Expected Output
Stata file written to disk.
Output of pd.show_versions()
cc: @bashtage