Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add DataTables.jl compatibility #63

Closed
wants to merge 5 commits into from
Closed

add DataTables.jl compatibility #63

wants to merge 5 commits into from

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Mar 1, 2017

See JuliaData/DataStreams.jl#27

julia> using DataTables

julia> using CSV

julia> dt = CSV.read("$(pwd())/Desktop/testdata.tsv", delim='\t')
18964×6 DataTables.DataTable
│ Row   │ chr    │ start     │ stop      │ score │ source     │ transcript_id       │
├───────┼────────┼───────────┼───────────┼───────┼────────────┼─────────────────────┤
│ 1"chr1"9063339073180.033"putative""MSTRG.314.1"       │
│ 2"chr1"9418459457640.041"putative""ENST00000466827.1" │
│ 3"chr1"9418459457640.041"putative""MSTRG.317.21"      │
│ 4"chr1"9418459457640.041"putative""ENST00000464948.1"

@cjprybol
Copy link
Contributor Author

cjprybol commented Mar 7, 2017

@ararslan
Copy link
Member

ararslan commented Mar 7, 2017

This is another case where I'm hesitant to do this until we have a generic table abstraction. Otherwise DataFrames will be SOL.

@cjprybol
Copy link
Contributor Author

cjprybol commented Mar 7, 2017

Similar sentiments were shared in JuliaData/DataStreams.jl#27 by myself and others. I'm not sure what the best way forward is myself. Ideas so far are:

  • Make the hardcoded swap and cap the DataFrames REQUIRE on the current release version (that still uses DataFrames) and also continue to use readtable. Should be fine until Julia v0.6 release, but then it might be tricky. That's why I proposed opening a second branch (not pushing this to master). Could we tag releases on a non-master branch for DataTables?
  • Have CSV.jl autodetect which package is loaded and write to that version of data table package. This is a pseudo-solution until AbstractTables is ready, at which point I guess we could use that interface? I think that's what you're referring to by

until we have a generic table abstraction

  • Get AbstractTables ready. Not sure what this would entail.

@nalimilan
Copy link
Member

Actually, the abstraction we need here already exists: it's Data.Source from DataStreams.jl. We don't need AbstractTable at all. As can be seen in this PR, the only place where the DataFrame type was used (apart from tests) is for the default sink type to create when reading a file. DataFrames are still supported even after this PR, and the only exported DataTables function is DataTable, which isn't an issue for DataFrames users.

CSV.jl doesn't work very well for DataFrame anyway, since it creates NullableArray columns which DataFrame users do not know how to work with. There have been several threads about this. So the best solution seems to be to recommend using readtable when working with DataFrames, and CSV.jl when working with DataTables. People who know what they are doing can use CSV.jl with DataFrames by passing it as sink type, but they need to be prepared to handle nullables (which DataFrame by definition isn't made to handle well).

Ideally at some point the default sink type will depend on which package is loaded, but for now specifying that you want a DataFrame isn't that painful and it allows using a more natural default.

@quinnj
Copy link
Member

quinnj commented Mar 8, 2017

Yes, @nalimilan is pretty dead on here. A big part of my "next phase" plans for DataStreams was taking most of Data.Source and moving it to AbstractTables. I think it's fine to leave the default sink as a DataFrame. At some point, we can make the switch of default. Sorry for the slowness in participating in all these discussions, but I'll try to dive in and actually review and do some coding very soon.

@quinnj
Copy link
Member

quinnj commented Mar 18, 2017

Hey @cjprybol, why change the default weakrefstrings=false? That seems unrelated to supporting DataTables.

@cjprybol
Copy link
Contributor Author

I had issues getting WeakRefStrings read into DataFrames. After flipping the DataFrames code to read into DataArrays rather than NullableArrays I hit errors I couldn't figure out how to resolve. If I could get some help reading WeakRefStrings into DataArrays I'm happy to flip that back

@quinnj
Copy link
Member

quinnj commented Mar 18, 2017

I'm confused. The whole idea here is we're switching to DataTables, which uses NullableArrays by default, right? WeakRefStrings will certainly have problems w/ DataArrays, but that should only be a DataFrames problem (not DAtaTables)

@cjprybol
Copy link
Contributor Author

Yes, sorry, I wasn't sure how to best communicate these changes as they're now fragmented across 4 PRs. I wrote a brief summary here in DataStreams JuliaData/DataStreams.jl#28 (comment) but I should summarize everything again to clarify.

I've removed the DataFrames specific code from DataStreams here JuliaData/DataStreams.jl#28. Because the DataStreams code that's currently in master reads very nicely into NullableArrays, CategoricalArrays, WeakRefStrings, that code from DataStreams has been pushed to DataTables https://github.com/JuliaData/DataTables.jl/pull/35/files. A subset of that code is also in DataFrames https://github.com/JuliaStats/DataFrames.jl/pull/1174/files. Now those packages each depend on and implement their respective DataStreams code.

I first pushed the code to DataFrames with NullableArrays included, but was asked to remove the NullableArrays addition and convert the behavior to return DataArrays. I could no longer get WeakRefStrings to work after that when using CSV.read and DataFrames, so I changed the default weakrefstring behavior to false and made the requested changes in DataFrames. We can keep weakrefstrings=true, but then every call to CSV.read by DataFrames users would require that keyword to be set to false. Now DataFrames only supports a subset of the full CSV.read behavior and so I thought it would be better to keep this PR in CSV pointing towards DataTables rather than DataFrames, because DataTables supports the full range of features and DataFrames doesn't. We can keep weakrefstrings=true and ask DataFrames users to always call CSV.read with weakrefstrings=false?

@ararslan
Copy link
Member

Or we could just fix whatever is preventing DataArrays and WeakRefStrings from playing nice together.

@quinnj
Copy link
Member

quinnj commented Mar 19, 2017

True, it was just involve adding an extra field to the DataArray type, like we did for NullableArrays

@quinnj
Copy link
Member

quinnj commented Mar 19, 2017

I think I'm going to hold off on this for a while. It's not strictly necessary in the DataStreams -> DataTables/DataFrames code migration and I want to minimize as much changes at one time as possible. CSV can continue to work w/ DataFrames by default and we can change to DataTables later.

@nalimilan
Copy link
Member

Actually, the issue is not with adding support for DataTables: it's with changing DataFrames to using DataArrays. We could apply all changes, except the switch to DataArrays. At least that wouldn't introduce any regression. We could keep @cjprybol's JuliaData/DataFrames.jl#1174 as it is now to merge it when we have sorted out the WeakRefString issues.

@quinnj
Copy link
Member

quinnj commented Sep 7, 2017

Implemented in #95

@quinnj quinnj closed this Sep 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants