[WIP] complete and expand df #1864

versipellis · 2019-06-28T21:37:52Z

First PR I'm ever making to a big project, so bear with my noviceness :)

This PR stems from a conversation on the Slack #data channel. It takes a data frame and a vector of columns from that data frame as input, and turns implicit missing values into explicit ones.

expanddf(df, indexcols) returns a data frame containing every possible combination of the values in the index columns. Its number of rows is the product of the numbers of unique values in each index column.

completedf(df, indexcols, fill=missing) wraps expanddf, but joins the original dataframe back, and provides the utility to fill the newly-missing values.
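As a rough illustration of the idea (not the PR's implementation), the expansion can be sketched with Base Julia alone: take the unique values of each index column, form their Cartesian product, and look each combination up in the original rows. The helper names `expand_sketch` and `complete_sketch`, and the representation of rows as NamedTuples, are assumptions made for this example only.

```julia
# Rows are plain NamedTuples here; a real implementation would work on data frame columns.
rows = [(A="05", B="a", C=1), (A="05", B="c", C=10), (A="06", B="a", C=43)]

# Every combination of the unique values found in each index column.
# The first index column varies fastest, as in the outputs shown in this PR.
function expand_sketch(rows, indexcols)
    uniques = [unique(getproperty(r, c) for r in rows) for c in indexcols]
    combos = Iterators.product(uniques...)
    return vec([NamedTuple{Tuple(indexcols)}(combo) for combo in combos])
end

# Attach the value column to every combination, using `fill` where no row exists.
function complete_sketch(rows, indexcols, valuecol; fill=missing)
    lookup = Dict(Tuple(getproperty(r, c) for c in indexcols) => getproperty(r, valuecol)
                  for r in rows)
    return [merge(k, NamedTuple{(valuecol,)}((get(lookup, values(k), fill),)))
            for k in expand_sketch(rows, indexcols)]
end

expanded = expand_sketch(rows, [:A, :B])
completed = complete_sketch(rows, [:A, :B], :C; fill=0)
```

Here the combination `("06", "c")` does not occur in `rows`, so it is the one entry in `completed` that receives the fill value.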

I haven't written the documentation in anticipation that feedback might cause a number of changes to the implementation. See below for examples in use.

The equivalent in...
R Tidyverse: https://tidyr.tidyverse.org/reference/complete.html
Python Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Stata: https://www.stata.com/manuals13/dfillin.pdf

Example:

julia> df = DataFrame(A=["05","05","05","06","06","07","07",missing],B=["a",missing,"c","a",missing,"c","d","a"],C=[1,1,10,43,5,10,3,10])

julia> indexcols = [:A,:B]

julia> df
8×3 DataFrame
│ Row │ A       │ B       │ C     │
│     │ String⍰ │ String⍰ │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 05      │ a       │ 1     │
│ 2   │ 05      │ missing │ 1     │
│ 3   │ 05      │ c       │ 10    │
│ 4   │ 06      │ a       │ 43    │
│ 5   │ 06      │ missing │ 5     │
│ 6   │ 07      │ c       │ 10    │
│ 7   │ 07      │ d       │ 3     │
│ 8   │ missing │ a       │ 10    │

julia> expanddf(df, indexcols)
16×2 DataFrame
│ Row │ A       │ B       │
│     │ String⍰ │ String⍰ │
├─────┼─────────┼─────────┤
│ 1   │ 05      │ a       │
│ 2   │ 06      │ a       │
│ 3   │ 07      │ a       │
│ 4   │ missing │ a       │
│ 5   │ 05      │ missing │
⋮
│ 11  │ 07      │ c       │
│ 12  │ missing │ c       │
│ 13  │ 05      │ d       │
│ 14  │ 06      │ d       │
│ 15  │ 07      │ d       │
│ 16  │ missing │ d       │

julia> completedf(df, indexcols)
16×3 DataFrame
│ Row │ A       │ B       │ C       │
│     │ String⍰ │ String⍰ │ Int64⍰  │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 05      │ a       │ 1       │
│ 2   │ 06      │ a       │ 43      │
│ 3   │ 07      │ a       │ missing │
│ 4   │ missing │ a       │ 10      │
│ 5   │ 05      │ missing │ 1       │
⋮
│ 11  │ 07      │ c       │ 10      │
│ 12  │ missing │ c       │ missing │
│ 13  │ 05      │ d       │ missing │
│ 14  │ 06      │ d       │ missing │
│ 15  │ 07      │ d       │ 3       │
│ 16  │ missing │ d       │ missing │

julia> completedf(df, indexcols, fill=10000)
16×3 DataFrame
│ Row │ A       │ B       │ C     │
│     │ String⍰ │ String⍰ │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 05      │ a       │ 1     │
│ 2   │ 06      │ a       │ 43    │
│ 3   │ 07      │ a       │ 10000 │
│ 4   │ missing │ a       │ 10    │
│ 5   │ 05      │ missing │ 1     │
⋮
│ 11  │ 07      │ c       │ 10    │
│ 12  │ missing │ c       │ 10000 │
│ 13  │ 05      │ d       │ 10000 │
│ 14  │ 06      │ d       │ 10000 │
│ 15  │ 07      │ d       │ 3     │
│ 16  │ missing │ d       │ 10000 │

bkamins

Thank you for an excellent idea and contribution. I have left some suggestions in comments.

versipellis · 2019-06-28T23:02:41Z

Thank you for the review! I'll go through them in the next few days and make the changes/see if there's a sane approach for some of them.

bkamins · 2019-06-29T07:13:42Z

Thank you. Also one more thing. If a column is categorical you probably should not use unique but levels - is this what you have intended? Probably @nalimilan can comment on what the best strategy is to check if a column is categorical given the planned changes in CategoricalArrays.jl (@nalimilan - note that df in this PR can be a SubDataFrame in particular).
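To illustrate why the distinction matters, here is a sketch in Base Julia in which a hypothetical `declared_levels` vector stands in for what `levels` would report for a categorical column. No CategoricalArrays.jl API is used or assumed here.

```julia
# Hypothetical stand-in for a categorical column: the declared level set
# includes "b", which never occurs in the observed data.
declared_levels = ["a", "b", "c"]
observed = ["a", "c", "a"]

from_unique = unique(observed)   # only the values that actually occur
from_levels = declared_levels    # every declared level, observed or not

# Expanding on levels yields combinations for "b"; expanding on unique values does not.
only_in_levels = setdiff(from_levels, from_unique)
```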

nalimilan · 2019-06-29T19:07:47Z

Thanks. Regarding the API, I wonder whether we need both expand and complete (I would drop the "df" suffix anyway): wouldn't it make sense to have a single function, with an argument to determine whether the other columns should be kept (complete) or not (expand)?

versipellis · 2019-07-01T20:16:41Z

> Thanks. Regarding the API, I wonder whether we need both expand and complete (I would drop the "df" suffix anyway): wouldn't it make sense to have a single function, with an argument to determine whether the other columns should be kept (complete) or not (expand)?

That would certainly be more elegant. I kept the different functions mostly to keep in mind users transitioning from R.

> Thank you. Also one more thing. If a column is categorical you probably should not use unique but levels - is this what you have intended? Probably @nalimilan can comment on what the best strategy is to check if a column is categorical given the planned changes in CategoricalArrays.jl (@nalimilan - note that df in this PR can be a SubDataFrame in particular).

Quite honestly, I forgot about categorical as a column type, and haven't used it much myself. I can go take a look at how to implement this though :)

versipellis · 2019-07-22T20:30:53Z

@bkamins @nalimilan I think I've addressed all the comments that were raised, except for the one about nesting. I'm not sure if that discussion is completely within the scope of this PR however - would you both be OK if I resolve that comment for the time being?

bkamins · 2019-07-23T01:00:06Z

@nalimilan - I am OK to leave out nesting. The only question is if we can later introduce it without breaking things (I do not know this functionality of dplyr in detail).

versipellis · 2020-03-09T00:46:15Z

Hi @bkamins, my apologies, real life has been an absolute nightmare. I'll be working on this PR in the coming week or two.

I'm not very good with writing tests, however, which is where I've mostly been stuck with this up till now. I had been planning to write tests to make sure each of the args works, but that's all I got to. Do you have anything else you'd like to see covered in the testing?

bkamins · 2020-03-09T07:49:54Z

src/dataframe/dataframe.jl

+    if complete == false
+        return dummydf
+    else
+        joined = join(dummydf, df; on=_names(df)[colind], kind=:left, indicator=:source)


now it should be leftjoin
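A minimal sketch of the suggested call, assuming a DataFrames.jl version that provides `leftjoin`. The tables `grid` and `observed` are made-up data, and the indicator keyword from the diff above is left out because its name in the newer API is not stated in this thread.

```julia
using DataFrames

# All index combinations (what dummydf holds in the diff) and the observed rows.
grid = DataFrame(A=["05", "06", "05", "06"], B=["a", "a", "c", "c"])
observed = DataFrame(A=["05", "06", "05"], B=["a", "a", "c"], C=[1, 43, 10])

# Keep every row of grid; combinations absent from observed get missing in C.
joined = leftjoin(grid, observed; on=[:A, :B])

# The row order of a join is not relied on here, so sort explicitly.
sort!(joined, [:B, :A])
```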

bkamins · 2020-03-09T07:53:10Z

src/dataframe/dataframe.jl

+    end
+end
+
+function expand(df::AbstractDataFrame, indexcols; error::Bool=true, complete::Bool=false, fill=missing, replaceallmissing::Bool=false)


Why do you think we need the replaceallmissing kwarg? I think that fill is enough. If someone later wants to replace the remaining missing values that were originally in the data frame in non-indexcols, that is easily done.
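For instance, replacing the missing values that remain in a non-index column afterwards is a one-liner with Base's `coalesce`. The vector `c` below is made-up data standing in for such a column.

```julia
# A value column that still contains missing values from the original data.
c = [1, missing, 10, missing]

# Broadcasting coalesce replaces each missing with the given default.
c_filled = coalesce.(c, 0)
```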

bkamins · 2020-03-09T07:56:31Z

Excellent - thank you. This time frame is OK (as you probably know, we are pushing to finalize DataFrames.jl for the 1.0 release, and that is why I was asking about this feature).

So the steps I would recommend you take are:

- do not do anything until you have time to work on the PR (we are actively adding new things now, so this way you will avoid rebasing the PR several times)
- rebase the PR (Resolve conflicts button above)
- address all the unresolved comments
- add a docstring
- write tests that cover a Cartesian product of the following things:
  - data frame type: DataFrame, SubDataFrame
  - number of rows: 0, 1, many
  - idxcols: many columns, one column, zero columns (also testing the case when what idxcols selects is not present in the data frame)
  - all combinations of kwarg values (for fill use two values: the default and some other value, to check that it is properly used to fill the columns)
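One way to enumerate such a test matrix is `Iterators.product` from Base. The sketch below only builds the list of cases. The symbols are labels rather than real objects, the concrete values (5 rows for "many", `:not_a_column` for a column that is absent) are illustrative assumptions, and only the `complete` and `fill` kwargs from the signature under review are varied.

```julia
using Test

frame_types = (:DataFrame, :SubDataFrame)                      # labels for the frame type
row_counts = (0, 1, 5)                                         # zero, one, many rows
idxcol_cases = (Symbol[], [:A], [:A, :B], [:not_a_column])     # zero, one, many, absent
complete_flags = (false, true)
fill_values = (missing, 10000)                                 # default and a non-default value

# Every combination of the settings above, flattened into a vector of tuples.
cases = vec(collect(Iterators.product(frame_types, row_counts, idxcol_cases,
                                      complete_flags, fill_values)))

@test length(cases) == 2 * 3 * 4 * 2 * 2
```

A real test suite would loop over `cases`, build the corresponding data frame or view for each tuple, and compare the function's result against the expected output.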

bkamins · 2020-03-27T20:56:32Z

> rebase the PR

Hi @versipellis - I recommended rebasing the PR, but you have merged the branches instead. Now it is impossible to review (you have added 8,500 lines and removed 5,000). In the current state of this PR, I think it is best to take the changes you have made, rebase them onto master, add a single commit with the changes, and force push.

versipellis · 2020-03-27T20:59:51Z

> rebase the PR
>
> Hi @versipellis - I recommended rebasing the PR, but you have merged the branches instead. Now it is impossible to review (you have added 8,500 lines and removed 5,000). In the current state of this PR, I think it is best to take the changes you have made, rebase them onto master, add a single commit with the changes, and force push.

Sorry about that - I had switched laptops in the middle of the covid-19 stuff and working on this, and my own local repos were a giant mess; rebase was doing something funky. I'll do that.

bkamins · 2020-03-27T21:03:15Z

That is what I assumed. That is why I recommend using "force" in git - to prune old stuff and just make one commit on top of current master. Ideally, include the changes recommended in the comments (but they can be added later if that would be simpler for you).

Fortunately this should be simple, as you are just adding a function (and not changing code in many places).

bkamins · 2022-01-31T10:48:51Z

I will implement this functionality in a separate PR for 1.4 release.

bkamins · 2022-02-19T15:48:32Z

Closing in favor of #3012

Added complete and expand df

ebce2e4

bkamins reviewed Jun 28, 2019

bkamins requested changes Jun 28, 2019

nalimilan reviewed Jun 29, 2019

versipellis added 4 commits July 22, 2019 12:34

Latest dev, implementing some changes proposed from Github

de12ee7

Merge remote-tracking branch 'origin/master' into bt/completedf

3e09a50

Added warning messages for duplicate rows

c9676a1

clarified two lines of comments

ebbfcec

bkamins reviewed Jul 23, 2019

bkamins added 3 commits March 8, 2020 22:44

Make the Jupyter Notebook documentation more precise

b81df82

add haskey to GroupedDataFrame and GroupKey

6de7361

fix eltype in stack with view=true

8bfa59e

bkamins reviewed Mar 9, 2020

non-Jedi and others added 11 commits March 9, 2020 09:00

Give ErrorException when trying to iterate AbstractDataFrame

3681550

change ⍰ to ? when showing a DataFrame and type display improvements

7456a4c

add eltypes kwag to show

61ecf44

finalize adding eltypes to show

20353f4

Add transformation and renaming to select and select!

d98b9be

Fix rename docstring

1e86fd2

update to Juia 1.4 and add tests of "begin"

54b77c3

[BREAKING] make id_vars go first in stack

d95eed7

Update src/dataframe/dataframe.jl

2ad3460

Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>

Update src/dataframe/dataframe.jl

0179d8e

Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>

Merge branch 'bt/completedf' of https://github.com/versipellis/DataFr…

66bb72e

…ames.jl into bt/completedf

bkamins mentioned this pull request Apr 23, 2020

Kwarg to choose missing values for unstack #2205

Closed

bkamins modified the milestones: 1.x, 1.4 Jan 31, 2022

bkamins mentioned this pull request Feb 19, 2022

Add fillcombinations function #3012

Merged

bkamins closed this Feb 19, 2022
