Using assign to place values from a dict into an empty dataframe adds the column names, but no values #17847

Closed
rs481 opened this issue Oct 11, 2017 · 9 comments


rs481 commented Oct 11, 2017

import pandas as pd

summary = pd.DataFrame()

myData = dict()

myData['B'] = 7
myData['C'] = -9

summary = summary.assign(**myData)

The result of

print(summary)

is

Empty DataFrame
Columns: [B, C]
Index: []

This occurs without error or warning messages.

If this had been done to a dataframe with some data already in a column called 'A' then the result would have been a data frame with:

A B C
1 7 -9

I believe that assign should work consistently whether or not there is already data in the dataframe, creating an index if necessary. The current behaviour, which adds the columns but no data rows and raises no exception or warning, allows problems to occur silently.

I would expect the outcome of assigning myData to an empty dataframe to be:

B C
7 -9
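The contrast described above can be reproduced with a short sketch. It assumes a reasonably recent pandas in which assigning a scalar to a frame with no rows still succeeds silently; the variable names are illustrative and not taken from the original script.

```python
import pandas as pd

my_data = {"B": 7, "C": -9}

# Frame that already has one row: each scalar is broadcast to that row.
with_row = pd.DataFrame({"A": [1]}).assign(**my_data)

# Frame with no rows: the columns are created, but there is nothing to fill.
no_rows = pd.DataFrame().assign(**my_data)

print(with_row)
print(list(no_rows.columns), len(no_rows))
```

The second frame ends up with the column labels B and C and zero rows, which matches the "Empty DataFrame" output reported above.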

This problem is worse than it initially appears to be, as operations adding more data to the "empty" dataframe (with some columns) via assign will succeed.

Once written out to a file, the dataframe will have fewer data values than it has columns:

import pandas as pd

summary = pd.DataFrame()

myData = dict()

myData['B'] = 7
myData['C'] = -9

summary = summary.assign(**myData)

myOtherData = dict()

myOtherData['D'] = 3
myOtherData['E'] = 4

summary = summary.assign(**myData)

summary.to_csv("summary.txt", sep=" ", header=True, index_label='rep')

gives as summary.txt

B C D E
3 4

Which is clearly malformed.

Output of pd.show_versions()

pandas.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-126-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.3
pytest: None
pip: 1.5.4
setuptools: 3.3
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.13.3
xarray: None
IPython: 1.2.1
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.2
feather: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Member

toobaz commented Oct 11, 2017

The fact that summary = summary.assign(**myData) does not insert any data in a DataFrame with empty index is not a bug. It couldn't be otherwise. Assigning a scalar to a column means "set that column to that number for all rows" - if there are no rows, there is nothing to do.

This said, as a consequence of "fixing" #16823 (I respectfully disagree on the fact that it was a bug in the first place), your code will now raise an error in the git version of pandas (but why not forbid pd.DataFrame({'a' : 3}, index=[]) too then for coherence?)

Notice that your code calls summary.assign(**myData) twice; you probably meant the second call to be summary.assign(**myOtherData). Anyway, the current version of pandas does not produce a malformed csv when saving a DataFrame with an empty index:

pd.DataFrame({'B' : 7, 'C' : -9}, index=[]).to_csv("summary.txt", sep=" ", header=True, index_label='rep')

results in

rep B C

which is correct.
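A quick way to check this without touching the filesystem is to call to_csv with no path, in which case it returns the text it would have written. This is only a sketch of the same call as above; the variable names are illustrative.

```python
import pandas as pd

empty = pd.DataFrame({"B": 7, "C": -9}, index=[])

# With no path argument, to_csv returns the CSV text instead of writing a file.
text = empty.to_csv(sep=" ", header=True, index_label="rep")
print(repr(text))
```

Only the header line is produced, since there are no rows to write.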

So I think this can be closed.

Author

rs481 commented Oct 12, 2017

Ok, it seems the code I sent (even with your typo correction) failed to reproduce the issue (despite being my actual code with renamed variables) as I didn't have the right data types.

import pandas as pd

summary= pd.DataFrame()

print("OUT 1")
print(summary)

myData = dict()

myData['B'] = 7
myData['C'] = -9

summary = summary.assign(**myData)

print("OUT 2")
print(summary)

myOtherData = dict()

myOtherData['D'] = [3]
myOtherData['E'] = [4]

summary = summary.assign(**myOtherData)

print("OUT 3")
print(summary)

summary.to_csv("summary.txt", sep=" ", header=True, index_label='rep')

The issue was that 'D' and 'E' were actually vectors of length 1, not scalars. This means that the assign "just works".

The output of the above is:

OUT 1
Empty DataFrame
Columns: []
Index: []
OUT 2
Empty DataFrame
Columns: [B, C]
Index: []
OUT 3
B C D E
0 NaN NaN 3 4

and the output of cat summary.txt is

rep B C D E
0   3 4

Here the empty fields leave the 3 and 4 aligned beneath B and C rather than D and E, which is what threw me (and the script that was reading the file) off.

If we change the output call to summary.to_csv("summary.txt", sep=",", header=True, index_label='rep') (a comma separator instead of a space), the file becomes:

rep,B,C,D,E
0,,,3,4

This makes it obvious that the NaNs were being written as empty strings.

So the behaviour of the second assign, with the length-1 vectors, was actually exactly as expected and as desired.
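The sequence can be condensed into a small sketch, assuming a recent pandas where assigning a length-1 list to a frame with no rows creates row 0. It also uses the na_rep parameter of to_csv, which is not mentioned in the thread, as one possible way to keep missing values visible in a space-separated file.

```python
import pandas as pd

summary = pd.DataFrame().assign(B=7, C=-9)   # columns B and C, zero rows
summary = summary.assign(D=[3], E=[4])       # length-1 lists create row 0

# B and C had no value for the new row, so they are filled with NaN.
print(summary)

# to_csv writes NaN as an empty field by default; na_rep makes it visible.
default_text = summary.to_csv(sep=" ", index_label="rep")
explicit_text = summary.to_csv(sep=" ", index_label="rep", na_rep="NaN")
print(default_text)
print(explicit_text)
```

With na_rep="NaN" the data line keeps five fields that a whitespace-splitting reader can count, instead of three consecutive spaces.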

So I think in effect the thing I actually want is that pandas would treat scalars as vectors of length 1 (doing whatever index stuff was necessary for the user), which would mean that the behaviour of this and #16823 would end up with a data frame with 1 row once the scalar data had been added.

I feel that treating a scalar the same as a length-1 vector is unambiguous and desirable behaviour, but obviously I'm not a dev on this project and I haven't reviewed the previous 17846 issues to find a reason why this is not the case.

Member

toobaz commented Oct 12, 2017

I feel that treating a scalar the same as a length-1 vector is unambiguous and desirable behaviour

I'm afraid we all feel differently :-) But incidentally, a Series behaves in a very similar way (that is, if you forget that it contains scalars rather than vectors) to a DataFrame which automatically treats scalars as length-1 vectors!
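One reading of the Series remark, sketched below: a Series built from a dict of scalars keeps every value, and transposing it gives a one-row frame. The to_frame().T step is an illustration added here, not something proposed in the thread.

```python
import pandas as pd

my_data = {"B": 7, "C": -9}

# A Series built from a dict of scalars keeps every value, one per label.
s = pd.Series(my_data)

# Transposing the one-column frame gives a single-row layout.
one_row = s.to_frame().T
print(one_row)
```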

Author

rs481 commented Oct 12, 2017

Ok, well thank you for taking a look at the issue. I suppose I will just have to be very careful and explicit in the future.

As a matter of interest, is there an explanation written anywhere on why there is a decision (perhaps as a consequence of some overarching principle) to make scalars behave very differently to length-1 vectors?

Contributor

jreback commented Oct 12, 2017

closing as this is a usage error

jreback closed this as completed Oct 12, 2017

Member

toobaz commented Oct 12, 2017

As a matter of interest, is there an explanation written anywhere on why there is a decision (perhaps as a consequence of some overarching principle) to make scalars behave very differently to length-1 vectors?

If you mean "in general"... then there is simply no reason why they should be equal, in Maths as in programming... can't think of a specific reference, but anyway this is something pandas inherits from numpy.
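As a rough illustration of the numpy side (a sketch, not a statement of pandas internals): a scalar and a length-1 array have different shapes, and a scalar is broadcast to whatever length the target has, including zero.

```python
import numpy as np

scalar = np.asarray(7)      # zero-dimensional: shape ()
vector = np.asarray([7])    # one-dimensional: shape (1,)
print(scalar.shape, vector.shape)

# A scalar is broadcast to whatever length the target has, including zero.
print(np.broadcast_to(scalar, (3,)))
print(np.broadcast_to(scalar, (0,)).size)
```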

Author

rs481 commented Oct 12, 2017

Yes, ok they are different types and so we can expect nothing, but I disagree mathematically as a scalar, a 1x1 matrix and a Rank-1 (0?) Tensor are all equivalent.

If the answer is that numpy does this and consistency with numpy is a critical concern then that is the end of it.

My feeling on this is that turning a dict with scalar elements into a pandas dataframe with one row has a single, unambiguous meaning, and that this is a useful feature. This is also the way it works in R:

> d = list("a" = 7, "b" = -9)
> as.data.frame(d)
  a  b
1 7 -9

and I would be baffled if the above R code was equivalent to:

> data.frame(a = numeric(0), b = numeric(0))
[1] a b
<0 rows> (or 0-length row.names)

Really, I'm not arguing that it should be defined for scalars as a direct consequence of being defined for vectors, but that there is a good, and useful, definition of many operations for scalar values.

Member

toobaz commented Oct 12, 2017

I disagree mathematically as a scalar, a 1x1 matrix and a Rank-1 (0?) Tensor are all equivalent

Uhm... the underlying set is the same, the set of operations you define on it differ. But yeah, right, not a particularly enlightening comparison.

Anyway, coherence with numpy certainly matters, but then there's also coherence with the fact that df[col] = 1 never changes the index, coupled with the fact that what you would like pd.DataFrame({'a' : 0, 'b' : 1}) to do can be easily done - this time, really unambiguously - with pd.DataFrame({'a' : [0], 'b' : [1]}).
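The three points in this reply can be checked with a short sketch; the index labels 10 and 20 are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(index=[10, 20])
df["a"] = 1                      # broadcast to the existing rows; index unchanged
print(df.index.tolist())

# All-scalar dict without an index: pandas refuses rather than guessing a length.
try:
    pd.DataFrame({"a": 0, "b": 1})
    raised = False
except ValueError:
    raised = True

# Length-1 lists state the row count explicitly.
one_row = pd.DataFrame({"a": [0], "b": [1]})
print(raised, one_row.shape)
```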

Author

rs481 commented Oct 13, 2017

I have similar disagreements with the behaviour of df[col] = 1 (which is of course equivalent to df[col] = 0), but I'll agree it is too late to change this behaviour now as I'm sure code relies on this.

The workaround you suggest of checking and converting all elements in the dictionary to lists probably requires as much code as constructing a new DataFrame from the dict and using pd.concat().
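For comparison, both approaches can be sketched in a few lines. The listify helper is hypothetical (it is not part of pandas), and passing a list containing the dict to the DataFrame constructor is one assumed way of "constructing a new DataFrame from the dict" as a single row.

```python
import pandas as pd

my_data = {"B": 7, "C": -9}
my_other_data = {"D": [3], "E": [4]}

def listify(d):
    """Wrap scalar values in a list; leave list-likes untouched (hypothetical helper)."""
    return {k: [v] if pd.api.types.is_scalar(v) else v for k, v in d.items()}

# Option 1: normalise the dicts, then build the frame in one go.
via_lists = pd.DataFrame({**listify(my_data), **listify(my_other_data)})

# Option 2: build one frame per dict and join them column-wise.
# A list containing one dict is read as a single row.
via_concat = pd.concat(
    [pd.DataFrame([my_data]), pd.DataFrame(my_other_data)], axis=1
)

print(via_lists)
print(via_concat)
```

Both produce one row with columns B, C, D and E, so the choice is mostly a matter of taste.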
