
Move DataFrame sink for CSV/DataStreams to DataFrames #1174

Closed
wants to merge 12 commits into from

Conversation

cjprybol
Contributor

No description provided.

@cjprybol cjprybol changed the title from "Move DataFrame sink for CSV/DataStreams to DataTable" to "Move DataFrame sink for CSV/DataStreams to DataFrames" on Mar 17, 2017
@@ -14,7 +14,7 @@ using Reexport
@reexport using DataArrays
using GZip
using SortingAlgorithms

using NullableArrays
Member

DataFrames should not depend on NullableArrays...

@nalimilan nalimilan requested a review from quinnj March 17, 2017 07:55
@andreasnoack
Member

What is the motivation for this PR? I'm not against it, I'm just not sure what the motivation is.

@nalimilan
Member

nalimilan commented Mar 17, 2017

Let's review https://github.com/JuliaData/DataTables.jl/pull/35/files at the same time, since these are very similar. Why doesn't this one include any DataArray/PooledDataArray-specific code?

allocate{T}(::Type{T}, rows, ref) = Array{T}(rows)
allocate{T}(::Type{Vector{T}}, rows, ref) = Array{T}(rows)

if isdefined(Main, :DataArray)
Member

Remove this condition.
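For illustration only, a hedged sketch of the unconditional method this review asks for; the actual body isn't shown in this hunk, so the signature and constructor below are assumptions:

# Hypothetical: allocate a DataArray column directly, with no isdefined guard,
# since DataFrames already depends on DataArrays.
allocate{T}(::Type{DataVector{T}}, rows, ref) = DataArray(T, rows)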

@nalimilan
Member

@andreasnoack The goal is to have CSV support loading data into either a DataFrame or a DataTable. That way all DataStream sources will support both without depending on either package.
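A minimal sketch of the usage this enables, with the caller picking the sink type (mirroring the REPL session posted further down; "data.csv" is just a placeholder file):

julia> using CSV, DataFrames, DataTables

julia> df = CSV.read("data.csv", DataFrame)  # materialize into a DataFrame sink

julia> dt = CSV.read("data.csv")             # default sink is still a DataTable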

@andreasnoack
Member

So the dependency is reversed? DataStreams used to depend on DataFrames and now it will be the other way around?

@cjprybol
Contributor Author

cjprybol commented Mar 17, 2017

Yes, @andreasnoack.

Did a little more work on the types here, so this should now support reading directly to DataArrays. Neither this nor the DataTables PR supports reading directly to Categoricals, but I'd like to save that for another PR.

julia> using DataFrames, DataTables, CSV
WARNING: Method definition describe(AbstractArray) in module DataFrames at /Users/Cameron/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407 overwritten in module DataTables at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/abstractdatatable.jl:381.
WARNING: Method definition describe(Any, AbstractArray{#T<:Number, N<:Any}) in module DataFrames at /Users/Cameron/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:409 overwritten in module DataTables at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/abstractdatatable.jl:383.
WARNING: Method definition describe(Any, AbstractArray{#T<:Any, N<:Any}) in module DataFrames at /Users/Cameron/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:426 overwritten in module DataTables at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/abstractdatatable.jl:400.

julia> a = DataFrames.head(CSV.read("test.tsv", DataFrame, delim='\t'))
6×6 DataFrames.DataFrame
│ Row │ chr    │ start  │ stop   │ score │ source     │ transcript_id       │
├─────┼────────┼────────┼────────┼───────┼────────────┼─────────────────────┤
│ 1   │ "chr1" │ 906333 │ 907318 │ 0.033 │ "putative" │ "MSTRG.336.1"       │
│ 2   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000466827.1" │
│ 3   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.339.21"      │
│ 4   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000464948.1" │
│ 5   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.340.6"       │
│ 6   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000496938.1" │

julia> DataFrames.eltypes(a)
6-element Array{Type,1}:
 String
 Int64
 Int64
 Float64
 String
 String

julia> a[:chr]
6-element DataArrays.DataArray{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

julia> DataFrames.head(CSV.read("testnull.tsv", DataFrame, delim='\t'))
2×3 DataFrames.DataFrame
│ Row │ c1 │ c2 │ c3 │
├─────┼────┼────┼────┤
│ 1   │ NA │ 2  │ 3  │
│ 2   │ 1  │ 2  │ 3  │

julia> b = DataTables.head(CSV.read("test.tsv", delim='\t'))
6×6 DataTables.DataTable
│ Row │ chr  │ start  │ stop   │ score │ source   │ transcript_id     │
├─────┼──────┼────────┼────────┼───────┼──────────┼───────────────────┤
│ 1   │ chr1 │ 906333 │ 907318 │ 0.033 │ putative │ MSTRG.336.1       │
│ 2   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000466827.1 │
│ 3   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.339.21      │
│ 4   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000464948.1 │
│ 5   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.340.6       │
│ 6   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000496938.1 │

julia> DataTables.eltypes(b)
6-element Array{Type,1}:
 Nullable{String}
 Nullable{Int64}
 Nullable{Int64}
 Nullable{Float64}
 Nullable{String}
 Nullable{String}

julia> b[:chr]
6-element NullableArrays.NullableArray{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

julia> DataTables.head(CSV.read("testnull.tsv", delim='\t'))
2×3 DataTables.DataTable
│ Row │ c1    │ c2 │ c3 │
├─────┼───────┼────┼────┤
│ 1   │ #NULL │ 2  │ 3  │
│ 2   │ 1     │ 2  │ 3  │

EDIT: sanity check that null-free reading works too

julia> a = DataFrames.head(CSV.read("test.tsv", DataFrame, delim='\t', nullable=false))
6×6 DataFrames.DataFrame
│ Row │ chr    │ start  │ stop   │ score │ source     │ transcript_id       │
├─────┼────────┼────────┼────────┼───────┼────────────┼─────────────────────┤
│ 1   │ "chr1" │ 906333 │ 907318 │ 0.033 │ "putative" │ "MSTRG.336.1"       │
│ 2   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000466827.1" │
│ 3   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.339.21"      │
│ 4   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000464948.1" │
│ 5   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.340.6"       │
│ 6   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000496938.1" │

julia> DataFrames.eltypes(a)
6-element Array{Type,1}:
 String
 Int64
 Int64
 Float64
 String
 String

julia> a[:chr]
6-element Array{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

julia> b = DataTables.head(CSV.read("test.tsv", delim='\t', nullable=false))
6×6 DataTables.DataTable
│ Row │ chr  │ start  │ stop   │ score │ source   │ transcript_id     │
├─────┼──────┼────────┼────────┼───────┼──────────┼───────────────────┤
│ 1   │ chr1 │ 906333 │ 907318 │ 0.033 │ putative │ MSTRG.336.1       │
│ 2   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000466827.1 │
│ 3   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.339.21      │
│ 4   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000464948.1 │
│ 5   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.340.6       │
│ 6   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000496938.1 │

julia> DataTables.eltypes(b)
6-element Array{Type,1}:
 String
 Int64
 Int64
 Float64
 String
 String

julia> b[:chr]
6-element Array{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

@@ -229,7 +229,7 @@ importall DataStreams
# DataFrames DataStreams implementation
function Data.schema(df::DataFrame, ::Type{Data.Column})
return Data.Schema(map(string, names(df)),
DataType[typeof(A) for A in df.columns], size(df, 1))
DataType[typeof(A) for A in df.columns], size(df, 1))
Member

No, this should align with map since it's the continuation of the argument list.
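For reference, the layout being requested, with the continuation line aligned under map:

return Data.Schema(map(string, names(df)),
                   DataType[typeof(A) for A in df.columns], size(df, 1))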

Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::T, row, col, sch::Data.Schema{false}) =
push!(sink.columns[col]::Vector{T}, val)
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::Nullable{T}, row, col, sch::Data.Schema{false}) =
push!(sink.columns[col]::DataVector{T}, isnull(val) ? NA : get(val))
Member

get(val, NA) should work and be more efficient (since internally it uses ifelse to avoid a branch when possible). Same below.
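Applied to the method quoted above, the suggestion would read roughly:

# Two-argument get supplies NA as the default, avoiding the explicit isnull branch.
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::Nullable{T}, row, col, sch::Data.Schema{false}) =
    push!(sink.columns[col]::DataVector{T}, get(val, NA))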

@quinnj
Member

quinnj commented Mar 19, 2017

This looks good at first glance. I have a dedicated package for DataStreams testing (DataStreamIntegrationTests.jl) that we could tweak to ensure both DataTables and DataFrames get tested.

Note that this will require some package tagging work. We'll want to put DataStreams and DataFrames upper bounds on all existing DataStreams-implementation packages (CSV, SQLite, ODBC, Feather), and do a new DataStreams tag along with a DataFrames tag that all those packages' new tags can then depend on.
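For illustration, the REQUIRE entries in a sink package such as CSV.jl would gain upper bounds along these lines (the version numbers here are placeholders, not the actual bounds):

DataStreams 0.1.0 0.2.0-
DataFrames 0.9.0 0.10.0-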

@quinnj
Member

quinnj commented Apr 13, 2017

I'm playing with this locally now; this isn't quite right yet, because we're not taking care of the mmapped ref argument that the DataFrame constructor takes for DataStreams. I opened a PR for DataArrays to allow storing that ref vector in a DataArray in the same way we do for NullableArrays. Once that's merged, we can utilize that functionality here to properly implement DataStreams.

@quinnj
Member

quinnj commented Sep 7, 2017

No longer relevant.

@quinnj quinnj closed this Sep 7, 2017
@cjprybol cjprybol deleted the cjp/addcsv branch September 10, 2017 00:24