Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] complete and expand df #1864

Closed
wants to merge 156 commits into from

Conversation

versipellis
Copy link

First PR I'm ever making to a big project, so bear with my noviceness :)

This PR spawns from a conversation on the Slack #data channel. It takes a dataframe as an input, and some vector of columns from that dataframe, and turns implicit into explicit missing values.

expanddf(df, indexcols) returns a dataframe of every possible combination of index cols, the length of which would be the product of the length of every column's unique values.

completedf(df, indexcols, fill=missing) wraps expanddf, but joins the original dataframe back, and provides the utility to fill the newly-missing values.

I haven't written the documentation in anticipation that feedback might cause a number of changes to the implementation. See below for examples in use.

The equivalent in...
R Tidyverse: https://tidyr.tidyverse.org/reference/complete.html
Python Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Stata: https://www.stata.com/manuals13/dfillin.pdf

Example:

julia> df = DataFrame(A=["05","05","05","06","06","07","07",missing],B=["a",missing,"c","a",missing,"c","d","a"],C=[1,1,10,43,5,10,3,10])

julia> indexcols = [:A,:B]

julia> df
8×3 DataFrame
│ Row │ A       │ B       │ C     │
│     │ String⍰ │ String⍰ │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 105      │ a       │ 1     │
│ 205missing1     │
│ 305      │ c       │ 10    │
│ 406      │ a       │ 43    │
│ 506missing5     │
│ 607      │ c       │ 10    │
│ 707      │ d       │ 3     │
│ 8missing │ a       │ 10    │

julia> expanddf(df, indexcols)
16×2 DataFrame
│ Row │ A       │ B       │
│     │ String⍰ │ String⍰ │
├─────┼─────────┼─────────┤
│ 105      │ a       │
│ 206      │ a       │
│ 307      │ a       │
│ 4missing │ a       │
│ 505missing1107      │ c       │
│ 12missing │ c       │
│ 1305      │ d       │
│ 1406      │ d       │
│ 1507      │ d       │
│ 16missing │ d       │

julia> completedf(df, indexcols)
16×3 DataFrame
│ Row │ A       │ B       │ C       │
│     │ String⍰ │ String⍰ │ Int64⍰  │
├─────┼─────────┼─────────┼─────────┤
│ 105      │ a       │ 1       │
│ 206      │ a       │ 43      │
│ 307      │ a       │ missing │
│ 4missing │ a       │ 10      │
│ 505missing11107      │ c       │ 10      │
│ 12missing │ c       │ missing │
│ 1305      │ d       │ missing │
│ 1406      │ d       │ missing │
│ 1507      │ d       │ 3       │
│ 16missing │ d       │ missing │

julia> completedf(df, indexcols, fill=10000)
16×3 DataFrame
│ Row │ A       │ B       │ C     │
│     │ String⍰ │ String⍰ │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 105      │ a       │ 1     │
│ 206      │ a       │ 43    │
│ 307      │ a       │ 10000 │
│ 4missing │ a       │ 10    │
│ 505missing11107      │ c       │ 10    │
│ 12missing │ c       │ 10000 │
│ 1305      │ d       │ 10000 │
│ 1406      │ d       │ 10000 │
│ 1507      │ d       │ 3     │
│ 16missing │ d       │ 10000

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for an excellent idea and contribution. I have left some suggestions in comments.

@versipellis
Copy link
Author

Thank you for the review! I'll go through them in the next few days and make the changes/see if there's a sane approach for some of them.

@bkamins
Copy link
Member

bkamins commented Jun 29, 2019

Thank you. Also one more thing. If a column is categorical you probably should not use unique but levels - is this what you have indented? Probably @nalimilan can comment what is the best strategy to check if a column is categorcal given the planned changes in CategoricalArrays.jl (@nalimilan - note that df in this PR can be SubDataFrame in particular).

@nalimilan
Copy link
Member

Thanks. Regarding the API, I wonder whether we need both expand and complete (I would drop the "df" suffix anyway): wouldn't it make sense to have a single function, with an argument to determine whether the other columns should be kept (complete) or not (expand)?

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
@versipellis
Copy link
Author

Thanks. Regarding the API, I wonder whether we need both expand and complete (I would drop the "df" suffix anyway): wouldn't it make sense to have a single function, with an argument to determine whether the other columns should be kept (complete) or not (expand)?

That would certainly be more elegant. I kept the different functions mostly to keep in mind users transitioning from R.

Thank you. Also one more thing. If a column is categorical you probably should not use unique but levels - is this what you have indented? Probably @nalimilan can comment what is the best strategy to check if a column is categorcal given the planned changes in CategoricalArrays.jl (@nalimilan - note that df in this PR can be SubDataFrame in particular).

Quite honestly, I forgot about categorical as a column type, and haven't used it much myself. I can go take a look at how to implement this though :)

@versipellis
Copy link
Author

@bkamins @nalimilan I think I've addressed all the comments that were raised, except for the one about nesting. I'm not sure if that discussion is completely within the scope of this PR however - would you both be OK if I resolve that comment for the time being?

@bkamins
Copy link
Member

bkamins commented Jul 23, 2019

@nalimilan - I am OK to leave out nesting. The only question is if we can later introduce it without breaking things (I do not know this functionality of dplyr in detail).

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
@versipellis
Copy link
Author

Hi @bkamins, my apologies, real life has been absolutely a nightmare. I'll be working on this PR in the coming week or two.

I'm not very good with writing tests, however, which is where I've mostly been stuck with this up till now. I had been planning to write tests to make sure each of the args works, but that's all I got to. Do you have anything else you'd like to see covered in the testing?

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
if complete == false
return dummydf
else
joined = join(dummydf, df; on=_names(df)[colind], kind=:left, indicator=:source)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now it should be leftjoin

end
end

function expand(df::AbstractDataFrame, indexcols; error::Bool=true, complete::Bool=false, fill=missing, replaceallmissing::Bool=false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think we need replaceallmissing kwarg? I think that fill is enough. If later someone wants to replace the remaining missing values that were originally in the data frame in non-indexcols then it is easily done.

@bkamins
Copy link
Member

bkamins commented Mar 9, 2020

Excellent - thank you. This time frame is OK (as you probably know we are pushing to finalize DataFrames.jl to 1.0 release and that is why I was asking about this feature).

So the steps I would recommend you to take are:

  1. do not do anything until you have time to work on PR (we are actively adding new things now, so this way you will avoid rebasing the PR several times)
  2. rebase the PR (Resolve conflicts button above)
  3. address all the unresolved comments
  4. add docstring
  5. Write tests that will cover a Cartesian product of the following things:
    • data frame type: DataFrame, SubDataFrame
    • number of rows: 0, 1, many
    • idxcols: many columns, one column, zero columns (also testing for the case when what idxcols slects is not present in the data frame)
    • all combinations of kwarg values (for fill use two values: the default and some other value to check it is probably used to fill the columns)

@bkamins
Copy link
Member

bkamins commented Mar 27, 2020

rebase the PR

Hi @versipellis - I recommended to rebase the PR, you have merged branches. Now it is impossible to make a review (you have added 8,500 lines and removed 5,000). I think that in the current state of this PR it is best if you just take the changes you have made, rebase it to master, add only one commit with changes and force push.

@versipellis
Copy link
Author

rebase the PR

Hi @versipellis - I recommended to rebase the PR, you have merged branches. Now it is impossible to make a review (you have added 8,500 lines and removed 5,000). I think that in the current state of this PR it is best if you just take the changes you have made, rebase it to master, add only one commit with changes and force push.

Sorry about that - I had switched laptops in the middle of the covid-19 stuff and working on this, and my own local repos were a giant mess; rebase was doing something funky. I'll do that.

@bkamins
Copy link
Member

bkamins commented Mar 27, 2020

That is what I assumed. That is why I recommend to use "force" in git - to prune old stuff and just make one commit on top of current master. Ideally - with the changes recommended in the comments (but they can be added later if that would be simpler for you.

Fortunately this should be simple as you just add a function (and not change code in many places).

@bkamins
Copy link
Member

bkamins commented Jan 31, 2022

I will implement this functionality in a separate PR for 1.4 release.

@bkamins bkamins modified the milestones: 1.x, 1.4 Jan 31, 2022
@bkamins
Copy link
Member

bkamins commented Feb 19, 2022

Closing in favor of #3012

@bkamins bkamins closed this Feb 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.