Proposal to separate Nullable backend into DataFrames2 #1154

Closed
ararslan opened this issue Jan 26, 2017 · 12 comments

@ararslan
Member

After much thought and conferring with others, I'd like to formally propose that we separate the current master branch of DataFrames into a separate package, tentatively called DataFrames2.

Advantages:

  • The Base representation of Nullable may change nontrivially (see [WIP] Nullable Redesign JuliaLang/Juleps#21)
  • The Nullable-based backend is more difficult to work with at the moment (see Working with Nullable DataFrames #1148)
  • This gives us more time to make a rock-solid release with a comprehensive document on how to port existing DataFrames code to DataFrames2, Jacob Quinn's I/O packages, StatsModels, etc.
  • While we develop the Nullable-backend version, the current version will continue to be more actively maintained
  • Backporting bugfixes is easier than cherry-picking commits from master into the release-0.8 branch

Disadvantages:

  • DataArrays, the current DataFrames backend, doesn't work on 0.6 yet
  • using DataFrames2 is kind of unfortunate as a long-term name unless we re-merge the packages at some point

Fixing DataArrays is definitely not trivial, but I think overall this is the best course of action for both the developers and the users. I'd love to hear your thoughts, including further advantages or disadvantages not covered here.

@nalimilan
Member

That sounds fine to me but I would wait until DataArrays works on Julia 0.6. That's really the prerequisite before we can consider the old DataFrames framework as viable for at least one more release.

@nalimilan
Member

Also, in practical terms, I guess it would be better to rename this repo to DataFrames2 so that we don't lose open issues/PR, which contain lots of useful discussions and still valid points. Then we can create a new DataFrames package from the release-0.8 branch (including git history).

@amellnik
Contributor

This seems like a more-workable approach.

@tkelman
Contributor

tkelman commented Jan 28, 2017

> I guess it would be better to rename this repo to DataFrames2 so that we don't lose open issues/PR, which contain lots of useful discussions and still valid points. Then we can create a new DataFrames package from the release-0.8 branch (including git history).

I'd recommend the other way around, leave this repo in place since the majority of the issues are w.r.t. the DataArrays backend. Can switch around the github default branch in the short term, and (optionally) roll back master once there's a separate new repo for the NullableArrays backend?

@nalimilan
Member

I don't really like this idea since in the end we'll have two projects with useful history: this one, plus DataFrames2 and issues/PR filed during the transition. The way forward is DataFrames2, so it should retain the history; DataFrames 0.8 is the dead-end and we won't care about it in a few months.

@tshort
Contributor

tshort commented Jan 28, 2017

I don't like the idea that the long-term name will be DataFrames2. If we make the split, can we agree that DataFrames2 is planned to be merged back into DataFrames at some point?

@tkelman
Contributor

tkelman commented Jan 28, 2017

> The way forward is DataFrames2

Uncertainty about this is why this issue was filed. Isn't it still unclear whether the implementation that's on master is going to be the long term usable solution? This is taking time and the original February plan is nearly here, without the picture on the ground having changed much.

@nalimilan
Member

> I don't like the idea that the long-term name will be DataFrames2. If we make the split, can we agree that DataFrames2 is planned to be merged back into DataFrames at some point?

Yes, of course my idea would be to deprecate DataFrames at some point, and later replace it with DataFrames2.

> Uncertainty about this is why this issue was filed. Isn't it still unclear whether the implementation that's on master is going to be the long term usable solution? This is taking time and the original February plan is nearly here, without the picture on the ground having changed much.

Even if the implementation based on Nullable and NullableArray has to change significantly (which isn't certain yet), it is clear that the final implementation will be closer to DataFrames2 than to DataFrames. So IMO it makes sense to consider the current master as the way forward, even if we end up releasing DataFrames3.

@davidanthoff
Contributor

I'm strongly in favor of this proposal.

In terms of naming, what about naming the new version DataTable?

@ararslan
Member Author

ararslan commented Feb 1, 2017

I'd be fine with a name like DataTable. If we did something like that then perhaps we wouldn't need to merge it back with DataFrames at all; the Nullable-based version could just live under a new name. That said, I'd also be fine having a DataFrames2 and merging it back with DataFrames at some point.

@ararslan
Member Author

ararslan commented Feb 1, 2017

Glad to see there's support for this. Action items (which as I write them are snowballing) are as follows:

  • Get DataArrays working on Julia 0.6 and tag a new version -- PRIORITY
  • Split DataFrames into two repos
  • Move the appropriate issues/discussions to DataTables.jl
  • Bump the DataArrays version requirement in the DataArrays-backed DataFrames and tag it
  • Upper bound all registered versions of DataFrames in METADATA to Julia 0.6(-dev?)
  • Port some of the functionality from StatsModels to the DataArrays-backed DataFrames (Replace bare tilde with formula macro #1170)
    • StatsModels is currently set up assuming the Nullable-based backend
    • We need @formula to handle the @~ change in both repos

Longer term items:

  • Provide comprehensive documentation on how to migrate code from DataArrays-backed DataFrames to the NullableArrays-backed DataFrames ecosystem
    • The ecosystem includes StatsModels, querying macros, Jacob Quinn's I/O packages, etc.

Seems simple enough... maybe?! 😵

@ararslan
Member Author

This issue seems to have served its purpose, so I'm going to go ahead and close it. Thanks everyone for your help and feedback!
