[WIP] complete and expand df #1864
Thank you for an excellent idea and contribution. I have left some suggestions in comments.
Thank you for the review! I'll go through them in the next few days and make the changes, or see if there's a sane approach for some of them.
Thank you. Also one more thing. If a column is categorical you probably should not use |
Thanks. Regarding the API, I wonder whether we need both |
That would certainly be more elegant. I kept the different functions mostly to keep in mind users transitioning from R.
Quite honestly, I forgot about categorical as a column type, and haven't used it much myself. I can go take a look at how to implement this though :)
@bkamins @nalimilan I think I've addressed all the comments that were raised, except for the one about nesting. I'm not sure that discussion is completely within the scope of this PR, however. Would you both be OK if I resolve that comment for the time being?
@nalimilan - I am OK to leave out nesting. The only question is whether we can introduce it later without breaking things (I do not know this functionality of dplyr in detail).
Hi @bkamins, my apologies, real life has been an absolute nightmare. I'll be working on this PR in the coming week or two. I'm not very good at writing tests, however, which is where I've mostly been stuck up till now. I had been planning to write tests to make sure each of the args works, but that's as far as I got. Is there anything else you'd like to see covered in the testing?
    if complete == false
        return dummydf
    else
        joined = join(dummydf, df; on=_names(df)[colind], kind=:left, indicator=:source)
This should now be `leftjoin`.
    end
end

function expand(df::AbstractDataFrame, indexcols; error::Bool=true, complete::Bool=false, fill=missing, replaceallmissing::Bool=false)
Why do you think we need the `replaceallmissing` kwarg? I think that `fill` is enough. If someone later wants to replace the remaining missing values that were originally in the data frame in non-`indexcols` columns, that is easily done.
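To illustrate why a separate `replaceallmissing` kwarg may be unnecessary, here is a minimal sketch of replacing leftover missing values after the fact. It assumes DataFrames.jl 1.x; the data frame and its column names are made up for illustration and are not taken from the PR.

```julia
using DataFrames

# Hypothetical data frame that still holds a missing value in a
# non-index column after completion.
df = DataFrame(year=[2019, 2019, 2020], sales=[10, missing, 30])

# Replace every remaining missing value in that column with 0.
df.sales = coalesce.(df.sales, 0)
```

Because this is a one-liner per column, callers can apply it only where they want it.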
Excellent - thank you. This time frame is OK (as you probably know, we are pushing to finalize the DataFrames.jl 1.0 release, which is why I was asking about this feature). So the steps I would recommend you take are:
Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>
Hi @versipellis - I recommended rebasing the PR, but you have merged branches instead. It is now impossible to review (you have added 8,500 lines and removed 5,000). In the current state of this PR, I think it is best to take the changes you have made, rebase them onto master, add a single commit with the changes, and force push.
Sorry about that - I switched laptops in the middle of the covid-19 stuff while working on this, and my local repos were a giant mess; rebase was doing something funky. I'll do that.
That is what I assumed. That is why I recommend using "force" in git: to prune the old stuff and make just one commit on top of current master, ideally with the changes recommended in the comments (but they can be added later if that would be simpler for you). Fortunately this should be simple, as you are just adding a function (and not changing code in many places).
I will implement this functionality in a separate PR for the 1.4 release.
Closing in favor of #3012
First PR I'm ever making to a big project, so bear with my noviceness :)
This PR stems from a conversation on the Slack #data channel. It takes a dataframe and a vector of columns from that dataframe as input, and turns implicit missing values into explicit ones.
`expanddf(df, indexcols)` returns a dataframe of every possible combination of the index columns; its number of rows is the product of the number of unique values in each index column.
`completedf(df, indexcols, fill=missing)` wraps `expanddf`, but joins the original dataframe back and provides the option to fill the newly missing values.

I haven't written the documentation in anticipation that feedback might cause a number of changes to the implementation. See below for examples in use.
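As a rough sketch of the behaviour described above, the same result can be approximated with functions that exist in DataFrames.jl 1.x (`crossjoin`, `leftjoin`, `coalesce`). This is not the PR's implementation; the data and the names `grid` and `completed` are hypothetical.

```julia
using DataFrames

# Hypothetical input: the (2020, "S") combination is implicitly missing.
df = DataFrame(year=[2019, 2019, 2020],
               region=["N", "S", "N"],
               sales=[10, 20, 30])

# "Expand": every combination of the unique values of the index columns.
# 2 unique years x 2 unique regions gives 4 rows.
grid = crossjoin(unique(select(df, :year)), unique(select(df, :region)))

# "Complete": join the original data back onto the grid, so the
# combination absent from df appears with sales == missing.
completed = leftjoin(grid, df; on=[:year, :region])

# Fill the newly missing values (the role of the fill argument).
completed.sales = coalesce.(completed.sales, 0)
```

The row order of the joined result is not relied on here; sort explicitly if a particular order matters.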
The equivalent in...
R Tidyverse: https://tidyr.tidyverse.org/reference/complete.html
Python Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Stata: https://www.stata.com/manuals13/dfillin.pdf
Example: