df.to_stata should automatically write in format 117 with wide strings #23564

Closed
kylebarron opened this issue Nov 8, 2018 · 16 comments
Labels
API Design Docs IO Stata read_stata, to_stata
@kylebarron
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['x' * 250]})
df.to_stata('test')

Problem description

The last line above raises an exception:

ValueError:
Fixed width strings in Stata .dta files are limited to 244 (or fewer)
characters.  Column 'a' does not satisfy this restriction.

but is solved with:

df.to_stata('test', version=117)

This functionality (writing in dta format 117) was added in version 0.23. In my opinion, the Stata writer should automatically switch to version 117 if one of the columns is wider than 244 characters. At the least, the error message should be changed to note that as of version 0.23, it's possible to write long strings to Stata files by adding version=117.

I'd be happy to submit a PR if this functionality is desired.

Expected Output

Stata file written to disk.

Output of pd.show_versions()

pd.show_versions()
No module named 'dask'

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+948.g82120016e
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.10.9
IPython: 7.1.1
sphinx: 1.8.1
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.13
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.1.2

cc: @bashtage

@kylebarron
Contributor Author

It would also be good to remove this warning in the documentation, which says that writing strings wider than 244 characters is prohibited. Should I make a new issue for that?

@gfyoung gfyoung added IO Stata read_stata, to_stata API Design Docs labels Nov 8, 2018
@gfyoung
Member

gfyoung commented Nov 8, 2018

  • Automatic switching to format 117 makes sense. However, let's issue a warning so that end users are aware that formats are being changed.

  • Good catch (re: the docs)! No need to open a separate issue. A PR that fixes everything that you've mentioned in this issue will be fine.

@bashtage
Contributor

bashtage commented Nov 8, 2018

I would argue against automatic promotion to 117 since this is adding some magic -- the default command has version=114, which would then be ignored based on the data. Better to be explicit (we don't want to be like R, full of magic and pixie dust, with a user wondering why they got a 117 format file and not understanding it was because they have a long string that they might not care about).

I think a more helpful error would be good, one that mentions that setting version=117 allows for arbitrary length strings. The error message instructing users to consider 117 should also inform them of the limitations -- specifically that files exported in format 117 can only be read by Stata 13 or newer. (FWIW 114 goes back to Stata 10.)
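For illustration, the kind of message described above could look roughly like this. It is a minimal sketch in plain Python, not the pandas implementation: `check_string_widths` and the `columns` mapping are hypothetical names, and the only facts taken from this thread are the 244-character limit for format 114, the `version=117` option, and the Stata 13 requirement.

```python
MAX_FIXED_WIDTH_114 = 244  # limit quoted in the current error message


def check_string_widths(columns, version=114):
    """Raise a descriptive ValueError if a column is too wide for format 114.

    `columns` is a hypothetical stand-in for a DataFrame: a dict mapping
    column names to lists of strings.
    """
    if version != 114:
        return  # only the format 114 limit is checked in this sketch
    for name, values in columns.items():
        widest = max((len(value) for value in values), default=0)
        if widest > MAX_FIXED_WIDTH_114:
            raise ValueError(
                f"Fixed width strings in Stata .dta files are limited to "
                f"{MAX_FIXED_WIDTH_114} (or fewer) characters. Column "
                f"{name!r} does not satisfy this restriction. Pass "
                f"version=117 to write longer strings; note that format 117 "
                f"files can only be read by Stata 13 or newer."
            )
```

The point of the wording is that the user learns both the fix and its cost in one place, without any format being chosen on their behalf.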

@kylebarron
Contributor Author

What about changing the default version=114 to version=None, with options 114 and 117? If None, then try writing to version 114, and if impossible, automatically switch to 117. If version=114 is passed, raise an error with wide strings.
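A rough sketch of that proposal, assuming the caller already knows the longest string in the frame. `resolve_dta_version` is a hypothetical helper, not an existing pandas function, and the warning on fallback follows the suggestion made earlier in the thread rather than anything pandas does.

```python
import warnings

MAX_FIXED_WIDTH_114 = 244  # fixed-width string limit of format 114


def resolve_dta_version(max_string_length, version=None):
    """Pick the .dta format to write under the proposed version=None default."""
    if version not in (None, 114, 117):
        raise ValueError("version must be None, 114, or 117")
    too_wide = max_string_length > MAX_FIXED_WIDTH_114
    if version is None:
        if not too_wide:
            return 114  # everything fits, keep the older format
        warnings.warn(
            "Strings longer than 244 characters found; writing format 117, "
            "which requires Stata 13 or newer.",
            UserWarning,
            stacklevel=2,
        )
        return 117
    if version == 114 and too_wide:
        # An explicit request for 114 is never silently overridden.
        raise ValueError(
            "Format 114 cannot store strings longer than 244 characters; "
            "use version=117."
        )
    return version
```

With this shape, `resolve_dta_version(250)` falls back to 117 with a warning, while `resolve_dta_version(250, version=114)` raises.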

Stata 13 was released 5.5 years ago, so I think it's a safe wager that the vast majority of users have 13, 14, or 15.

@bashtage
Contributor

bashtage commented Nov 8, 2018

I always find automagic choices to be too magic. I would prefer to move to 117 by default with a standard deprecation cycle than to have data-dependent format choices.

Stata 13 was released 5.5 years ago, so I think it's a safe wager that the vast majority of users have 13, 14, or 15.

I think you might be surprised how many institutions are not current on Stata and only upgrade idiosyncratically.

@kylebarron
Contributor Author

Perhaps you're right. I think updating the docs and error message would be helpful at least.

Additionally, the 117 format can write fixed-width strings up to 2045 characters long. Currently, writing a string that's 250 characters long in the 117 format saves it as an strL. Do you think it would be beneficial to change the 117 writer to allow writing wider widths as fixed widths?

@gfyoung
Member

gfyoung commented Nov 8, 2018

@bashtage : A deprecation cycle also works. However, keep in mind that the proposal to auto-change to 117 would have been accompanied by a warning message whenever we had to auto-convert, for the transparency (anti-magic) reasons you have alluded to.

@bashtage
Contributor

bashtage commented Nov 8, 2018

Yes, but I think if one wants to move to 117 as default it is OK. 117 has strictly more features, with the only misfeature a lack of backward compat. 114 will still be available.

@bashtage
Contributor

bashtage commented Nov 8, 2018

@kylebarron A PR to enable writing for up to 2045 would be fine.

IIRC this choice was implemented to simplify and improve reuse. IMO there are no real advantages to writing 1500 character fixed width strings when compared to strLs, and it can really blow up the size of a file if the string sizes vary much.
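A back-of-the-envelope illustration of the size argument, in plain Python. This is only a sketch: it assumes a fixed-width column pads every row to the widest value, and it models variable-length storage as the string bytes plus a per-value overhead. The 24-byte default for `per_value_overhead` is a placeholder chosen for the example, not the actual strL layout.

```python
def fixed_width_bytes(lengths):
    """Every row is padded to the widest value in the column (non-empty input)."""
    return len(lengths) * max(lengths)


def variable_width_bytes(lengths, per_value_overhead=24):
    """Each value stores its own bytes plus an assumed fixed overhead."""
    return sum(length + per_value_overhead for length in lengths)


# 99 short strings and a single 1500-character outlier.
lengths = [10] * 99 + [1500]
print(fixed_width_bytes(lengths))     # 150000
print(variable_width_bytes(lengths))  # 4890
```

One long outlier forces every row to 1500 bytes in the fixed-width case, which is where the blow-up comes from.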

If you want to contribute, the most important contribution would be to add support for 118 format which adds unicode. This is a non-trivial change since the main writer would need to be rewritten since the output is (sort of) variable width.

@kylebarron
Contributor Author

I'd love to try out a non-trivial PR! It doesn't appear that there's an existing issue for writing unicode to Stata. (#9882 looks like it was only concerning reading version 118 files). Should I create one?

I agree with your conclusion that there isn't really an advantage to wider fixed width strings, so I won't touch that.

@bashtage
Contributor

bashtage commented Nov 8, 2018

Sure. Adds a useful export/import capability.

@jtkiley
Contributor

jtkiley commented Nov 11, 2018

I ran into this just now and noticed that (a) there's 117 writing functionality, and (b) there's no suggestion in the error. I use pandas and Stata a lot, so I was surprised to see it.

I always find automagic choices to be too magic. I would prefer to move to 117 by default with a standard deprecation cycle than to have data-dependent format choices.

Stata 13 was released 5.5 years ago, so I think it's a safe wager that the vast majority of users have 13, 14, or 15.

I think you might be surprised how many institutions are not current on Stata and only upgrade idiosyncratically.

In my crowd of colleagues (i.e. academics who aren't the fastest upgraders), one of these rules would work (assuming Stata stays on a roughly two-year cycle):

  1. When new Stata version n is released, begin the deprecation process to change the default to the format supporting Stata n-2.
  2. Same as 1 but substituting n-3.

For an initial change, maybe the 0.25.0 to 1.0 cycle would be a good time to change it. If your code is (possibly) going to break anyway, why not add one more easy fix?
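The two rules can be sketched as a small lookup. This is illustrative only: `default_format` is a hypothetical helper, the table holds just the two format/version pairs stated in this thread (114 goes back to Stata 10, 117 needs Stata 13), and it assumes Stata releases are numbered consecutively and that a higher format number is the preferred one.

```python
# Oldest Stata release able to read each format, as stated in this thread.
MIN_STATA_FOR_FORMAT = {114: 10, 117: 13}


def default_format(newest_stata, lag):
    """Newest format still readable by Stata release `newest_stata - lag`."""
    target = newest_stata - lag
    readable = [
        fmt
        for fmt, min_stata in MIN_STATA_FOR_FORMAT.items()
        if min_stata <= target
    ]
    if not readable:
        raise ValueError(f"no known format is readable by Stata {target}")
    return max(readable)


print(default_format(15, lag=2))  # rule 1 with Stata 15 as newest: 117
print(default_format(15, lag=3))  # rule 2 with Stata 15 as newest: 114
```

Under rule 1 the default would already move to 117 with Stata 15 as the newest release; under rule 2 it would move one Stata release later.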

I would argue against automatic promotion to 117 since this is adding some magic -- the default command has version=114, which would then be ignored based on the data. Better to be explicit (we don't want to be like R, full of magic and pixie dust, with a user wondering why they got a 117 format file and not understanding it was because they have a long string that they might not care about).

I think a more helpful error would be good, one that mentions that setting version=117 allows for arbitrary length strings. The error message instructing users to consider 117 should also inform them of the limitations -- specifically that files exported in format 117 can only be read by Stata 13 or newer. (FWIW 114 goes back to Stata 10.)

I made a PR for improving the error message: #23629.

@bashtage
Contributor

I think this can be closed. I think the consensus best choice was implemented in #23629 which is to give users the option.

I think the only other issue is whether 117 should become the default output format. This would require a deprecation cycle, and by the time it is finished Stata capable of reading 117 will be 6+ years old, so I think it is probably a good idea.

@gfyoung
Member

gfyoung commented Nov 15, 2018

Sounds good. We can always revisit if need be. Thanks!

@gfyoung gfyoung closed this as completed Nov 15, 2018
@gfyoung gfyoung added this to the 0.24.0 milestone Nov 15, 2018
@carolinalq

carolinalq commented Nov 20, 2018

I've been trying to use version 117 exactly because of this, but I get the following error:

TypeError: to_stata() got an unexpected keyword argument 'version'

Any clues?

Edit: I'm using Python 3.6

@kylebarron
Contributor Author

You probably need to update pandas. The version argument to to_stata was only added in 0.23.
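A quick way to check whether an installed pandas is new enough, sketched under the assumption (from the first comment in this thread) that `version` was added in 0.23. `supports_stata_version_kwarg` is a hypothetical helper, not part of pandas; pass it the string from `pd.__version__`.

```python
import re


def supports_stata_version_kwarg(pandas_version):
    """Return True if this pandas version string is 0.23 or newer."""
    match = re.match(r"(\d+)\.(\d+)", pandas_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {pandas_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 23)


print(supports_stata_version_kwarg("0.22.0"))                      # False
print(supports_stata_version_kwarg("0.24.0.dev0+948.g82120016e"))  # True
```

If this returns False, `to_stata(..., version=117)` will fail with the `TypeError` shown above, and upgrading pandas is the fix.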
