Skip to content

add concat #137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
May 15, 2023
Merged

add concat #137

merged 15 commits into from
May 15, 2023

Conversation

MarcoGorelli
Copy link
Contributor

No description provided.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Marco! concat being vertical-only matches my recollection, and overall this looks good modulo my minor comments. I'll comment on gh-136 on more of what was previously discussed about no horizontal concat.

@@ -214,6 +214,22 @@ def get_column_names(self) -> Sequence[str]:
"""
...

def concat(self, other: Sequence[DataFrame]) -> DataFrame:
"""
Concatenate current and other DataFrames vertically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about mentioning that column names must match? And must ordering match too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to document any row ordering guarantees or lack thereof as well. I.E. does this guarantee that for df1.concat([df2, df3]) that the row order is df1 rows, then df2 rows, then df3 rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume so - or should that not be guaranteed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe ANSI SQL doesn't guarantee any ordering of UNION ALL (which is the dataframe concat equivalent) unless there's also an ORDER BY clause (sort). For example, postgres doesn't guarantee any maintained ordering in a UNION ALL operation. So DataFrame APIs backed by a relational database would possibly struggle to meet that guarantee.

In general though, the often desired behavior is to maintain the order based on the passed in objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wasn't aware - ok thanks have added a note

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.E. Ibis (https://github.com/ibis-project/ibis) backed by basically any of its backends (https://github.com/ibis-project/ibis#backends) where the DataFrame(s) in question are the result of running joins.

There's no guarantees made in output ordering for the joins as well as no guarantees made in output ordering for the concatenation. It could potentially be worked around by doing something like using SQL ROW_NUMBER() functions to get ordering of the input tables to the concatenate call and then sort by that generated column afterwards, but that's somewhat expensive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need resolution here before we can merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've got Column.__getitem__, what does that return for an unordered column?

I don't know what the resolution is here, this discussion's a bit abstract for me - what do you suggest?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we've got Column.__getitem__, what does that return for an unordered column?

Potentially a random row or set of rows with no guarantee about determinism.

Maybe we can be a bit looser in the language and leave it how it currently is which is a bit underdefined which basically makes it implementation defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - anything you'd like changed, or ok to merge?

@MarcoGorelli MarcoGorelli changed the title (draft) add concat add concat Apr 17, 2023
@rgommers
Copy link
Member

One thing that I missed the first time around looking at this is: why is concat a method rather than a free function? I looked for a previous discussion about it, but could not find it. The few I did find only addressed that the input should be a list of df's - I suspect most people would assume this would be a free function, just like in pandas.

@jorisvandenbossche
Copy link
Member

I would also expect a function. But do we already have a full concept of a "namespace" from which this function can be accessed?

@rgommers
Copy link
Member

But do we already have a full concept of a "namespace" from which this function can be accessed?

We have the concept I think, and a package where it can live. Just need to add a DataFrame.__dataframe_namespace__ method - I'll open a PR. Also need it to add a null object I believe.

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented Apr 27, 2023

I wasn't aware of such a namespace - so in practice, if I wanted to concatenate a sequence of dataframes dfs, I'd do something like

    namespace = dfs[0].__dataframe_namespace__
    return namespace.concat(dfs)

?
Sounds fine if so. I'll update after Ralf's PR then

@rgommers
Copy link
Member

I wasn't aware of such a namespace - so in practice, if I wanted to concatenate a sequence of dataframes dfs, I'd do something like

See gh-156.

@MarcoGorelli
Copy link
Contributor Author

I've added a commit to make this a top-level function rather than a DataFrame method - OK to get this in?

@MarcoGorelli
Copy link
Contributor Author

can add to index.rst after https://github.com/data-apis/dataframe-api/pull/157/files is merged

@MarcoGorelli
Copy link
Contributor Author

MarcoGorelli commented May 10, 2023

I've amended the description to note that the ordering guarantees are only there if the DataFrame implementation has a concept of ordered columns - OK to get this in?

(arguably the same comment might have to be made for sort, which would be undefined for dataframes which don't have ordered columns)

Co-authored-by: Keith Kraus <keith.j.kraus@gmail.com>
@MarcoGorelli
Copy link
Contributor Author

thanks all for reviews and comments!

@MarcoGorelli MarcoGorelli merged commit 2aeaf00 into data-apis:main May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants