Respect numpy fixed length strings #10351

Closed · maxnoe opened this issue Jun 14, 2015 · 8 comments

Labels: API Design · Dtype Conversions (Unexpected or buggy dtype conversions) · Needs Discussion (Requires discussion from core team before further action)

Comments


maxnoe commented Jun 14, 2015

pandas converts all string columns to dtype 'O' (object).

For consistency, and for writing to files, it would be good if pandas respected numpy's fixed-length string dtypes.

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': np.array(['2015-10-10', '2015-01-01'], dtype='S10')})
print(df.date.dtype)

This prints dtype('O'), not dtype('S10').


jreback commented Jun 14, 2015

This would add quite a bit of complexity, so what exactly would be the gain here?
As a side note, you are representing a datetime, which should almost certainly be coerced to datetime64[ns] for any usage.


maxnoe commented Jun 14, 2015

The date was just an example.

The gain would be that numpy operations on the underlying string array would be possible, and that one would not have to care about min_itemsize in to_hdf or root_pandas' to_root.
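As a sketch of the first point: numpy's vectorized string routines operate directly on a fixed-width array, but the width is lost once the data enters a DataFrame (np.char.replace is just one example of such an operation):

```python
import numpy as np
import pandas as pd

arr = np.array(['2015-10-10', '2015-01-01'], dtype='S10')

# numpy keeps the fixed width and supports vectorized string ops
assert arr.dtype == np.dtype('S10')
swapped = np.char.replace(arr, b'-', b'/')  # still a fixed-width array

# pandas widens the column to object, dropping the S10 information
df = pd.DataFrame({'date': arr})
print(df['date'].dtype)  # object
```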


jreback commented Jun 14, 2015

You don't need to specify min_itemsize unless you need to anticipate a bigger string in another chunk; it will be computed automatically otherwise.

However, this would add much greater complexity across the entire code base.

bashtage commented

This seems nearly impossible. What should happen in the following case?

df = pd.DataFrame({'date': np.array(['2015-10-10', '2015-01-01'], dtype='S10')})

df.iloc[0,0] = 'Some other string that is large'

The only manner I could imagine something like this being implemented would be to add a fixed-width-string column type, like Categorical. This column would have to be aware of the fixed size, pad strings that are too short (or raise), and raise for strings that are too long. It seems like a lot of effort to avoid a small number of function calls when writing data.
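For comparison, numpy itself resolves the oversized-assignment case by silently truncating to the declared width, which would likely surprise pandas users (a minimal demonstration):

```python
import numpy as np

a = np.array([b'2015-10-10', b'2015-01-01'], dtype='S10')
a[0] = b'Some other string that is large'

# numpy neither raises nor widens: the value is cut to 10 bytes
print(a[0])  # b'Some other'
```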

@jreback jreback added API Design Needs Discussion Requires discussion from core team before further action Dtype Conversions Unexpected or buggy dtype conversions labels Jun 17, 2015

jreback commented Jun 17, 2015

Yeah, I think I went through this exercise before. The problem is that assignment might have to astype the entire array, which is completely inefficient. The S type would be nice if one could mark a column as immutable (e.g. for an Index type this could actually make sense). However, in almost all cases you can just use Categoricals, and the efficiency problem is solved anyhow.

closing, but if you want to discuss more, happy to.
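The Categorical suggestion pays off when values repeat: the codes array costs a small integer per row and each distinct string is stored once (exact sizes are machine-dependent, so only the relative comparison matters here):

```python
import pandas as pd

s = pd.Series(['2015-10-10', '2015-01-01'] * 1000)  # object dtype
c = s.astype('category')                            # integer codes + 2 categories

# the categorical representation is much smaller for repetitive data
print(s.memory_usage(deep=True) > c.memory_usage(deep=True))  # True
```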

@jreback jreback closed this as completed Jun 17, 2015

jreback commented Jun 17, 2015

this is a dupe of #5261 as well.

allComputableThings commented

Since pandas supports custom types now (https://pandas.pydata.org/pandas-docs/version/0.23.1/whatsnew.html#whatsnew-023-enhancements-extension), can the same mechanism be used to support fixed-width strings?
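A sketch of what that could look like under the extension interface. FixedStringDtype and its width parameter are hypothetical names, and a full implementation would also need a matching ExtensionArray subclass providing the storage and take/concat machinery:

```python
from pandas.api.extensions import ExtensionDtype

class FixedStringDtype(ExtensionDtype):
    """Hypothetical parametrized dtype carrying a fixed byte width."""

    def __init__(self, width=10):
        self._width = width

    @property
    def name(self):
        return f"fixedstr[{self._width}]"

    @property
    def type(self):
        # scalar type of individual elements
        return bytes

    @classmethod
    def construct_array_type(cls):
        # a real implementation returns an ExtensionArray subclass that
        # stores an S<width> numpy array and enforces the width on setitem
        raise NotImplementedError

dtype = FixedStringDtype(10)
print(dtype.name)  # fixedstr[10]
```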


jreback commented Jul 3, 2018

In theory, yes, but it would require a community pull request to make this happen.

No branches or pull requests

4 participants