Skip to content

Bug: Join with Series inconsistent when DataFrame is empty vs when it is not #11719

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hannahdiels opened this issue Nov 29, 2015 · 4 comments
Closed
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@hannahdiels
Copy link

Expected

Regardless of the arguments specified, the return of DataFrame.join always has the left side's index values and name. In the case of how='right', there should be np.nan in the index when the right side does not 'match' a row on the left.

In this case:

>>> pd.DataFrame([], columns=['gene'], index=pd.Index([], name='family')).join(s), how='right', on='gene')
     gene  gene_id
family              
NaN     NaN        1

Actual / reproduce

Right index is returned for empty dataframe:

>>> import pandas as pd
>>> s = pd.Series([1], name='gene_id', index=pd.Index([1], name='gene')
>>> pd.DataFrame([], columns=['gene'], index=pd.Index([], name='family')).join(s), how='right', on='gene')
     gene  gene_id
gene              
1     NaN        1

Note that it doesn't happen with non-empty frames:

>>> pd.DataFrame([1], columns=['gene'], index=pd.Index([1], name='family')).join(s), how='right', on='gene')
        gene  gene_id
family               
1          1        1

Version

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.5-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: None
pip: 7.1.2
setuptools: 18.5
Cython: None
numpy: 1.10.1
scipy: 0.16.1
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.0b1
pymysql: None
psycopg2: None
@jreback
Copy link
Contributor

jreback commented Nov 29, 2015

Regardless of the arguments specified, the return of DataFrame.join always has the left side's index values and name. In the case of how='right', there should be np.nan in the index when the right side does not 'match' a row on the left.

where does this come from?

In this case there are no matches, yet you need an index. What would you have it do?

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design labels Nov 29, 2015
@hannahdiels
Copy link
Author

The docstring leaves it unspecified, or at least it's not very clear; so combined with a python interpreter, I assume it always returns the left side's index. Further, I interpreted how as inner, left outer, right outer, outer join. Knowing which index will be returned in each case is very useful for scripts at least.

Now, back to this case. I have an algorithm whose inputs sometimes contains an empty DataFrame. This will cause a lot of NaNs in outer joins and that's fine for my case. I don't have to check for it being empty anywhere, it just keeps working; except in the case I mentioned above. It breaks like so:

df = df.join(s, how='right', on='gene')
# later on
df.reset_index().groupby('gene_id')['family']  # <-- no 'family' column if input df was empty, yet worked just fine when input df wasn't empty

@jreback
Copy link
Contributor

jreback commented Nov 30, 2015

ok, can you make a short tests which exercises all of these cases (e.g. 4 x how x (left empty, right non-empty + left non-empty, right empty + both empty + both non-empty), and show results? this can confirm what is existing behavior.

essentialy this is a nested loop over the cases (and which can directly be turned into a test)

@jreback jreback closed this as completed in 6693a72 Feb 5, 2016
@jreback jreback reopened this Feb 5, 2016
@mroeschke
Copy link
Member

Looks like this needs more information of the expected behavior of multiple cases, so closing until there is more clarity what the intended behavior should be

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

3 participants