add DataTables.jl compatibility #63

cjprybol · 2017-03-01T19:00:20Z

julia> using DataTables

julia> using CSV

julia> dt = CSV.read("$(pwd())/Desktop/testdata.tsv", delim='\t')
18964×6 DataTables.DataTable
│ Row   │ chr    │ start     │ stop      │ score │ source     │ transcript_id       │
├───────┼────────┼───────────┼───────────┼───────┼────────────┼─────────────────────┤
│ 1     │ "chr1" │ 906333    │ 907318    │ 0.033 │ "putative" │ "MSTRG.314.1"       │
│ 2     │ "chr1" │ 941845    │ 945764    │ 0.041 │ "putative" │ "ENST00000466827.1" │
│ 3     │ "chr1" │ 941845    │ 945764    │ 0.041 │ "putative" │ "MSTRG.317.21"      │
│ 4     │ "chr1" │ 941845    │ 945764    │ 0.041 │ "putative" │ "ENST00000464948.1" │

cjprybol · 2017-03-07T22:16:14Z

bump re: JuliaData/DataTables.jl#26
@ararslan @nalimilan @quinnj

ararslan · 2017-03-07T22:21:08Z

This is another case where I'm hesitant to do this until we have a generic table abstraction. Otherwise DataFrames will be SOL.

cjprybol · 2017-03-07T22:37:06Z

Similar sentiments were shared in JuliaData/DataStreams.jl#27 by myself and others. I'm not sure what the best way forward is myself. Ideas so far are:

1. Make the hardcoded swap and cap the DataFrames REQUIRE on the current release version (that still uses DataFrames) and also continue to use readtable. Should be fine until the Julia v0.6 release, but then it might be tricky. That's why I proposed opening a second branch (not pushing this to master). Could we tag releases on a non-master branch for DataTables?
2. Have CSV.jl autodetect which package is loaded and write to that package's table type. This is a pseudo-solution until AbstractTables is ready, at which point I guess we could use that interface? I think that's what you're referring to by "until we have a generic table abstraction".
3. Get AbstractTables ready. Not sure what this would entail.
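
The autodetection idea in the second option could look roughly like the sketch below. It is only an illustration in plain Julia, not code from CSV.jl: `default_sink_name`, `FakeSession`, and `EmptySession` are hypothetical names, and checking a module with `isdefined` is just one assumed way of detecting a loaded package.

```julia
# Hypothetical helper: choose the name of the default table type based on
# which package name is visible in the given module (Main by default).
# If DataTables is visible it wins; otherwise fall back to DataFrame.
function default_sink_name(session::Module=Main)
    if isdefined(session, :DataTables)
        return :DataTable
    else
        return :DataFrame
    end
end

# Stand-in for a session in which `using DataTables` has been run.
module FakeSession
    module DataTables end
end

# Stand-in for a session in which it has not.
module EmptySession end

default_sink_name(FakeSession)   # :DataTable
default_sink_name(EmptySession)  # :DataFrame
```

A real implementation would have to return the type itself rather than a symbol, which is one reason this remains a stopgap compared with a shared table interface.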

nalimilan · 2017-03-08T09:50:37Z

Actually, the abstraction we need here already exists: it's Data.Source from DataStreams.jl. We don't need AbstractTable at all. As can be seen in this PR, the only place where the DataFrame type was used (apart from tests) is for the default sink type to create when reading a file. DataFrames are still supported even after this PR, and the only exported DataTables function is DataTable, which isn't an issue for DataFrames users.

CSV.jl doesn't work very well for DataFrame anyway, since it creates NullableArray columns which DataFrame users do not know how to work with. There have been several threads about this. So the best solution seems to be to recommend using readtable when working with DataFrames, and CSV.jl when working with DataTables. People who know what they are doing can use CSV.jl with DataFrames by passing it as sink type, but they need to be prepared to handle nullables (which DataFrame by definition isn't made to handle well).

Ideally at some point the default sink type will depend on which package is loaded, but for now specifying that you want a DataFrame isn't that painful and it allows using a more natural default.
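
The default-sink arrangement described above can be sketched without either package. This is a minimal illustration in plain Julia, not CSV.jl's or DataStreams.jl's actual implementation: `read_rows`, `parse_delimited`, `build`, `ColumnSink`, and `RowSink` are hypothetical names standing in for the reader and the two table types.

```julia
# Hypothetical stand-ins for two table types (think DataFrame / DataTable).
struct ColumnSink
    columns::Dict{String,Vector{String}}
end

struct RowSink
    rows::Vector{Dict{String,String}}
end

# Parse delimited text into a header and a list of rows (all values as strings).
function parse_delimited(text::AbstractString, delim::Char)
    lines = split(strip(text), '\n')
    header = String.(split(lines[1], delim))
    rows = [String.(split(line, delim)) for line in lines[2:end]]
    return header, rows
end

# One method per sink type: each sink decides how to store the parsed data.
function build(::Type{ColumnSink}, header, rows)
    ColumnSink(Dict(name => [row[i] for row in rows] for (i, name) in enumerate(header)))
end

function build(::Type{RowSink}, header, rows)
    RowSink([Dict(zip(header, row)) for row in rows])
end

# The reader names a table type only in its default argument;
# callers who want a different type pass it explicitly.
function read_rows(text::AbstractString, sink::Type=ColumnSink; delim::Char='\t')
    header, rows = parse_delimited(text, delim)
    return build(sink, header, rows)
end

data = "chr\tstart\nchr1\t906333\nchr1\t941845"
default_result = read_rows(data)             # uses the default sink
explicit_result = read_rows(data, RowSink)   # caller overrides the sink
```

In this arrangement, changing the default table type is a one-line change to the default argument, and users of the other type keep working by naming it explicitly.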

quinnj · 2017-03-08T15:32:18Z

Yes, @nalimilan is pretty dead on here. A big part of my "next phase" plans for DataStreams was taking most of Data.Source and moving it to AbstractTables. I think it's fine to leave the default sink as a DataFrame. At some point, we can make the switch of default. Sorry for the slowness in participating in all these discussions, but I'll try to dive in and actually review and do some coding very soon.

quinnj · 2017-03-18T22:11:03Z

Hey @cjprybol, why change the default weakrefstrings=false? That seems unrelated to supporting DataTables.

cjprybol · 2017-03-18T22:17:35Z

I had issues getting WeakRefStrings read into DataFrames. After flipping the DataFrames code to read into DataArrays rather than NullableArrays, I hit errors I couldn't figure out how to resolve. If I could get some help reading WeakRefStrings into DataArrays, I'm happy to flip that back.

quinnj · 2017-03-18T22:24:16Z

I'm confused. The whole idea here is we're switching to DataTables, which uses NullableArrays by default, right? WeakRefStrings will certainly have problems w/ DataArrays, but that should only be a DataFrames problem (not DataTables).

cjprybol · 2017-03-18T22:38:06Z

Yes, sorry, I wasn't sure how to best communicate these changes as they're now fragmented across 4 PRs. I wrote a brief summary here in DataStreams JuliaData/DataStreams.jl#28 (comment) but I should summarize everything again to clarify.

I've removed the DataFrames specific code from DataStreams here JuliaData/DataStreams.jl#28. Because the DataStreams code that's currently in master reads very nicely into NullableArrays, CategoricalArrays, WeakRefStrings, that code from DataStreams has been pushed to DataTables https://github.com/JuliaData/DataTables.jl/pull/35/files. A subset of that code is also in DataFrames https://github.com/JuliaStats/DataFrames.jl/pull/1174/files. Now those packages each depend on and implement their respective DataStreams code.

I first pushed the code to DataFrames with NullableArrays included, but was asked to remove the NullableArrays addition and convert the behavior to return DataArrays. I could no longer get WeakRefStrings to work after that when using CSV.read and DataFrames, so I changed the weakrefstrings default to false and made the requested changes in DataFrames. We can keep weakrefstrings=true, but then every call to CSV.read by DataFrames users would require that keyword to be set to false. DataFrames now supports only a subset of the full CSV.read behavior, so I thought it would be better to keep this PR in CSV pointing towards DataTables rather than DataFrames, because DataTables supports the full range of features and DataFrames doesn't. Alternatively, we could keep weakrefstrings=true and ask DataFrames users to always call CSV.read with weakrefstrings=false?

ararslan · 2017-03-18T22:43:12Z

Or we could just fix whatever is preventing DataArrays and WeakRefStrings from playing nice together.

quinnj · 2017-03-19T02:38:13Z

True, it would just involve adding an extra field to the DataArray type, like we did for NullableArrays.

quinnj · 2017-03-19T04:51:21Z

I think I'm going to hold off on this for a while. It's not strictly necessary in the DataStreams -> DataTables/DataFrames code migration and I want to minimize the number of changes made at one time. CSV can continue to work w/ DataFrames by default and we can change to DataTables later.

nalimilan · 2017-03-19T18:59:56Z

Actually, the issue is not with adding support for DataTables: it's with changing DataFrames to use DataArrays. We could apply all changes except the switch to DataArrays. At least that wouldn't introduce any regression. We could keep @cjprybol's JuliaData/DataFrames.jl#1174 as it is now and merge it once we have sorted out the WeakRefString issues.

quinnj · 2017-09-07T05:58:53Z

Implemented in #95

cjprybol added 2 commits February 13, 2017 21:42

DataFrame -> DataTable

16d19c4

df -> dt

d826f3b

cjprybol mentioned this pull request Mar 3, 2017

deprecate io functions for CSV equivalents JuliaData/DataTables.jl#26

Merged

nalimilan mentioned this pull request Mar 8, 2017

Move data tables dependencies to respective packages JuliaData/DataStreams.jl#28

Closed

cjprybol added 3 commits March 16, 2017 20:19

updates

7a1fba2

Merge branch 'cjp/tableio' into cjp/DataTables

c7fa638

modify docstring with new default

4411334

nalimilan mentioned this pull request Apr 3, 2017

How do I use CSV.read to return a DataTable instead of a DataFrame? JuliaData/DataTables.jl#47

Closed

quinnj closed this Sep 7, 2017
