Add nest, unnest, extract and extract! #3258
base: main
Agreed.
Let's continue discussion at #3258 (comment).
src/abstractdataframe/nest.jl
`cols` (default `:setequal`) and `promote` (default `true`) keyword arguments
have the same meaning as in [`push!`](@ref).
Maybe repeat the explanation? Usually we don't require users to read other docstrings like this.
When I started describing it I realized that anything except `cols=:union` and `promote=true` is not really useful. We can always add it later. So I opted for a simpler design for now and functions do not take these arguments.
I propose to discuss this PR step by step. Start with some data frame:
First, already doing
(note that such grouped data frame also supports convenient indexing to get the nested data frames, something which is not easily available if we indeed nest data) Now, a basic pattern to nest some columns as
The only thing user needs to remember here is that If one wants to nest data frames one can just write:
Now the columns are copied (as Finally note that operation specification syntax also works with row nesting (something we have not discussed but is useful):
In the implementation in the PR I used a bit different pattern:
but it was only because the implementation should support all cases (and In summary - my question is. Given these considerations do we need to add |
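The grouping-based pattern discussed in this comment can be sketched with functions that already exist in DataFrames.jl. The data frame `df`, its column names, and the choice to copy each group's values are assumptions made only for illustration; this is not the code from the PR.

```julia
using DataFrames

# Hypothetical example data; the column names are illustrative only.
df = DataFrame(a = [1, 1, 2], b = [10, 20, 30], c = ["x", "y", "z"])

# Grouping alone already gives convenient access to each sub-table by key.
gdf = groupby(df, :a)
sub = gdf[(a = 1,)]            # SubDataFrame with the rows where a == 1

# Nest columns as vectors: wrapping the result in Ref stores the whole
# vector in a single cell instead of spreading it over several rows.
nested = combine(gdf,
                 :b => (b -> Ref(copy(b))) => :b,
                 :c => (c -> Ref(copy(c))) => :c)
```

Here `copy` materializes each group's view into a fresh `Vector`, so the nested cells do not alias the parent data frame.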
e.g.
julia> combine(groupby(df, :a), AsTable([:b, :c]) => DataFrame => :bc)
ERROR: ArgumentError: a single value or vector result is required (got DataFrame)
Is the reason for this error documented somewhere? i.e. why doesn't this just work? The error message could be a good place to document the I don't like having to specify the columns twice. If it's okay for `combine(groupby(df, :a), AsTable(:) => Ref∘DataFrame => :bc)` but sometimes it's desired that
The reason is safety (in other words: to allow for a non-breaking change in the future if we decide it is needed). For example, if a user writes:
it would be ambiguous whether the user wants the produced data frame to be stored in one cell of a data frame or expanded (as intuitively users might expect it to be expanded just as vectors are expanded). This is especially relevant when trying to detect rare cases when some function can either return a table or a scalar (i.e. when the return type of the operation is not type stable).
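To make the two readings concrete, here is a minimal sketch using existing DataFrames.jl functionality. The example data is hypothetical. Wrapping the result in `Ref` asks for the data frame to be stored in one cell, while using `AsTable` as the target asks for it to be expanded into columns.

```julia
using DataFrames

# Hypothetical example data; the column names are illustrative only.
df = DataFrame(a = [1, 1, 2], b = [10, 20, 30], c = ["x", "y", "z"])
gdf = groupby(df, :a)

# Reading 1: one data frame per group, stored in a single cell of column :bc.
one_cell = combine(gdf, AsTable([:b, :c]) => Ref∘DataFrame => :bc)

# Reading 2: the returned data frame is expanded back into columns :b and :c.
expanded = combine(gdf, AsTable([:b, :c]) => DataFrame => AsTable)
```

Because each intent has to be spelled out explicitly, a function that sometimes returns a table and sometimes a scalar cannot silently change the shape of the result.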
This is a good point. I will make a PR changing this.
Yeah That said, if we add |
This is what I had in mind.
This is something we do not need to add as we already have it. As What we need is a function designed to handle cases when each row potentially has a different schema.
Currently dicts would not work as they do not have a defined column order (but we could change this; but then also a change in |
I'm missing something, how does GroupedDataFrame substitute for df nesting? My uninformed impression so far is that GroupedDataFrame partially substitutes for Pandas's row labels but that hierarchical column labels have no equivalent in DFjl and nested dataframes is my workaround for that missing feature.
Indeed nested column labels are not supported and a work-around for them is needed. However, my question is why do you use data frame for this. Normally a
That is why I have said that normally I would expect
What we mean is that you can easily index into a The point is that if you nest whole |
In my dataframe, each row specifies a regression model and the |
You mean that "per group" you want to store different objects:
Then indeed nesting a data frame (or any other object) makes sense. |
@jariji - given my last comment. Now I realized that one cannot create such an object by nesting. I.e. the use case when nesting seems to be indeed needed is when you add different objects sequentially (if you did it in one-shot they would have to have the same number of columns). Is this indeed your case, i.e. you nest only columns needed for estimation of the model but the other columns, that are also nested, are only added later? |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
OK, I have thought about how we should move forward. Although the functions are simple I think we can keep them for user-friendliness. The functions are:
For now I gave tentative implementations (to show how things internally would work).
The schema isn't fixed either when splitting string columns into one or more columns: in some cases you might have no occurrence of the separator, in some cases one or more occurrences, giving one, two or more columns. That's why it seems to make sense to be able to support both strings and collections in the same function. Why call it Otherwise your proposal sounds good to me. |
I meant Now regarding:
This is not supported anyway, as currently there is no way to specify that the string should be split in the syntax. What we provide is the following:
with the restriction that every string must be split into the same number of groups. Maybe then - instead of adding |
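As an illustration of splitting under that restriction, a sketch with the existing `transform` and `ByRow` could look as follows. The data, the separator, and the target names `:left` and `:right` are assumptions, and this may differ from the snippet originally posted in the comment.

```julia
using DataFrames

# Hypothetical example data: every string splits into exactly two pieces.
df = DataFrame(id = 1:2, s = ["a-b", "c-d"])

# Split each string row by row and spread the pieces over two new columns.
# Every row has to yield the same number of pieces for this to work.
out = transform(df, :s => ByRow(x -> split(x, '-')) => [:left, :right])
```

A row such as `"a-b-c"` would break the fixed schema, which is the case a dedicated function would have to handle.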
Related PR for string-splitting with fixed number of splitpoints: JuliaLang/julia#43557
OK - |
Self-note. Investigate:
pattern |
Just learning Julia, so apologies if this is redundant, but I'd love to have nest/unnest (and convenience functions for mapping pipelined functions over columns that contain dataframes) for something like Jennifer Bryan's "row-oriented" workflow, which really helps to keep a project organized (in a real data structure, rather than with ad-hoc naming conventions, etc) when repeating multiple analyses over e.g. different geographic units. https://github.com/jennybc/row-oriented-workflows
Fixes #3005, #2890, #3116, #2767
The PR adds `nest` and `unnest` and introduces the `scalar` kwarg to `flatten` (which is needed in `unnest`).
`flatten` is ready for review.
`nest` and `unnest` require discussion on whether we like the proposed API (they work, but maybe we will decide to change the API).
Some important decisions I propose:
`nest` only works on `GroupedDataFrame` (the reason is to avoid the complexity of group order specification); nesting is always done to `DataFrame` (to keep things simple); another decision that is not easy is the syntax I proposed, `[:x, :y] => :z`, which means that columns `:x` and `:y` should be nested and stored in column `:z` (but I would like to confirm that we find it intuitive, as the syntax `:z => [:x, :y]` could also be advocated for).
`unnest` supports both tables (e.g. `DataFrame`) and rows (e.g. `Tables.AbstractRow`) and has two options: `flatten=true`, when rows of the nested columns are flattened, and `flatten=false` (when they are left as is - this is probably useful if we work with rows).
TODO:
CC @nalimilan @pdeffebach @jariji
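For readers who want to see what flattening nested data amounts to, below is a rough sketch built only from existing DataFrames.jl functions. It does not call the proposed `nest` or `unnest` and is not the PR's implementation; the data and column names are hypothetical.

```julia
using DataFrames

# A column :z holding one nested data frame per row (hypothetical data).
nested = DataFrame(a = [1, 2],
                   z = [DataFrame(x = [10, 20], y = ["p", "q"]),
                        DataFrame(x = [30], y = ["r"])])

# Manual unnesting with flattening: prepend the key column to each nested
# table (the scalar key is repeated for every nested row), then stack them.
flat = reduce(vcat, [insertcols!(copy(r.z), 1, :a => r.a) for r in eachrow(nested)])

# For columns holding vectors, the existing flatten already does the job.
vecs = DataFrame(a = [1, 2], b = [[10, 20], [30]])
flattened = flatten(vecs, :b)
```

In both cases the key column `:a` is repeated once per nested row, which is the behavior the `flatten=true` option describes.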