
Move DataFrame sink for CSV/DataStreams to DataFrames #1174

Closed
wants to merge 12 commits into from

Conversation

cjprybol
Contributor

No description provided.

@cjprybol cjprybol changed the title from "Move DataFrame sink for CSV/DataStreams to DataTable" to "Move DataFrame sink for CSV/DataStreams to DataFrames" on Mar 17, 2017
@@ -14,7 +14,7 @@ using Reexport
@reexport using DataArrays
using GZip
using SortingAlgorithms

using NullableArrays
Member

DataFrames should not depend on NullableArrays...

@nalimilan nalimilan requested a review from quinnj March 17, 2017 07:55
@andreasnoack
Member

What is the motivation for this PR? I'm not against it, I'm just not sure what the motivation is.

@nalimilan
Member

nalimilan commented Mar 17, 2017

Let's review https://github.com/JuliaData/DataTables.jl/pull/35/files at the same time, since these are very similar. Why doesn't this one include any DataArray/PooledDataArray-specific code?

allocate{T}(::Type{T}, rows, ref) = Array{T}(rows)
allocate{T}(::Type{Vector{T}}, rows, ref) = Array{T}(rows)

if isdefined(Main, :DataArray)
Member

Remove this condition.
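For illustration only, a hedged sketch of the unconditional method this review asks for; the actual body isn't shown in this hunk, so the signature and constructor below are assumptions:

# Hypothetical: allocate a DataArray column directly, with no isdefined guard,
# since DataFrames already depends on DataArrays.
allocate{T}(::Type{DataVector{T}}, rows, ref) = DataArray(T, rows)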

@nalimilan
Member

@andreasnoack The goal is to have CSV support loading data into either a DataFrame or a DataTable. That way all DataStream sources will support both without depending on either package.
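A minimal sketch of the usage this enables, with the caller picking the sink type (mirroring the REPL session posted further down; "data.csv" is just a placeholder file):

julia> using CSV, DataFrames, DataTables

julia> df = CSV.read("data.csv", DataFrame)  # materialize into a DataFrame sink

julia> dt = CSV.read("data.csv")             # default sink is still a DataTable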

@andreasnoack
Member

So the dependency is reversed? DataStreams used to depend on DataFrames and now it will be the other way around?

@cjprybol
Contributor Author

cjprybol commented Mar 17, 2017

Yes, @andreasnoack.

Did a little more work on the types here, so this should now support reading directly to DataArrays. Neither this nor the DataTables PR supports reading directly to Categoricals, but I'd like to save that for another PR.

julia> using DataFrames, DataTables, CSV
WARNING: Method definition describe(AbstractArray) in module DataFrames at /Users/Cameron/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:407 overwritten in module DataTables at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/abstractdatatable.jl:381.
WARNING: Method definition describe(Any, AbstractArray{#T<:Number, N<:Any}) in module DataFrames at /Users/Cameron/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:409 overwritten in module DataTables at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/abstractdatatable.jl:383.
WARNING: Method definition describe(Any, AbstractArray{#T<:Any, N<:Any}) in module DataFrames at /Users/Cameron/.julia/v0.5/DataFrames/src/abstractdataframe/abstractdataframe.jl:426 overwritten in module DataTables at /Users/Cameron/.julia/v0.5/DataTables/src/abstractdatatable/abstractdatatable.jl:400.

julia> a = DataFrames.head(CSV.read("test.tsv", DataFrame, delim='\t'))
6×6 DataFrames.DataFrame
│ Row │ chr    │ start  │ stop   │ score │ source     │ transcript_id       │
├─────┼────────┼────────┼────────┼───────┼────────────┼─────────────────────┤
│ 1   │ "chr1" │ 906333 │ 907318 │ 0.033 │ "putative" │ "MSTRG.336.1"       │
│ 2   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000466827.1" │
│ 3   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.339.21"      │
│ 4   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000464948.1" │
│ 5   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.340.6"       │
│ 6   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000496938.1" │

julia> DataFrames.eltypes(a)
6-element Array{Type,1}:
 String
 Int64
 Int64
 Float64
 String
 String

julia> a[:chr]
6-element DataArrays.DataArray{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

julia> DataFrames.head(CSV.read("testnull.tsv", DataFrame, delim='\t'))
2×3 DataFrames.DataFrame
│ Row │ c1 │ c2 │ c3 │
├─────┼────┼────┼────┤
│ 1   │ NA │ 2  │ 3  │
│ 2   │ 1  │ 2  │ 3  │

julia> b = DataTables.head(CSV.read("test.tsv", delim='\t'))
6×6 DataTables.DataTable
│ Row │ chr  │ start  │ stop   │ score │ source   │ transcript_id     │
├─────┼──────┼────────┼────────┼───────┼──────────┼───────────────────┤
│ 1   │ chr1 │ 906333 │ 907318 │ 0.033 │ putative │ MSTRG.336.1       │
│ 2   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000466827.1 │
│ 3   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.339.21      │
│ 4   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000464948.1 │
│ 5   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.340.6       │
│ 6   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000496938.1 │

julia> DataTables.eltypes(b)
6-element Array{Type,1}:
 Nullable{String}
 Nullable{Int64}
 Nullable{Int64}
 Nullable{Float64}
 Nullable{String}
 Nullable{String}

julia> b[:chr]
6-element NullableArrays.NullableArray{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

julia> DataTables.head(CSV.read("testnull.tsv", delim='\t'))
2×3 DataTables.DataTable
│ Row │ c1    │ c2 │ c3 │
├─────┼───────┼────┼────┤
│ 1   │ #NULL │ 2  │ 3  │
│ 2   │ 1     │ 2  │ 3  │

EDIT: sanity check that null-free reading works too

julia> a = DataFrames.head(CSV.read("test.tsv", DataFrame, delim='\t', nullable=false))
6×6 DataFrames.DataFrame
│ Row │ chr    │ start  │ stop   │ score │ source     │ transcript_id       │
├─────┼────────┼────────┼────────┼───────┼────────────┼─────────────────────┤
│ 1   │ "chr1" │ 906333 │ 907318 │ 0.033 │ "putative" │ "MSTRG.336.1"       │
│ 2   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000466827.1" │
│ 3   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.339.21"      │
│ 4   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000464948.1" │
│ 5   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.340.6"       │
│ 6   │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000496938.1" │

julia> DataFrames.eltypes(a)
6-element Array{Type,1}:
 String
 Int64
 Int64
 Float64
 String
 String

julia> a[:chr]
6-element Array{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

julia> b = DataTables.head(CSV.read("test.tsv", delim='\t', nullable=false))
6×6 DataTables.DataTable
│ Row │ chr  │ start  │ stop   │ score │ source   │ transcript_id     │
├─────┼──────┼────────┼────────┼───────┼──────────┼───────────────────┤
│ 1   │ chr1 │ 906333 │ 907318 │ 0.033 │ putative │ MSTRG.336.1       │
│ 2   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000466827.1 │
│ 3   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.339.21      │
│ 4   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000464948.1 │
│ 5   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.340.6       │
│ 6   │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000496938.1 │

julia> DataTables.eltypes(b)
6-element Array{Type,1}:
 String
 Int64
 Int64
 Float64
 String
 String

julia> b[:chr]
6-element Array{String,1}:
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"
 "chr1"

@@ -229,7 +229,7 @@ importall DataStreams
# DataFrames DataStreams implementation
function Data.schema(df::DataFrame, ::Type{Data.Column})
return Data.Schema(map(string, names(df)),
DataType[typeof(A) for A in df.columns], size(df, 1))
DataType[typeof(A) for A in df.columns], size(df, 1))
Member

No, this should align with map since it's the continuation of the argument list.
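For reference, the layout being requested, with the continuation line aligned under map:

return Data.Schema(map(string, names(df)),
                   DataType[typeof(A) for A in df.columns], size(df, 1))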

Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::T, row, col, sch::Data.Schema{false}) =
push!(sink.columns[col]::Vector{T}, val)
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::Nullable{T}, row, col, sch::Data.Schema{false}) =
push!(sink.columns[col]::DataVector{T}, isnull(val) ? NA : get(val))
Member

get(val, NA) should work and be more efficient (since internally it uses ifelse to avoid a branch when possible). Same below.
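Applied to the method quoted above, the suggestion would read roughly:

# Two-argument get supplies NA as the default, avoiding the explicit isnull branch.
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::Nullable{T}, row, col, sch::Data.Schema{false}) =
    push!(sink.columns[col]::DataVector{T}, get(val, NA))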

@quinnj
Member

quinnj commented Mar 19, 2017

This looks good at first glance. I have a dedicated package for DataStreams testing (DataStreamIntegrationTests.jl) that we could tweak to ensure both DataTables and DataFrames get tested.

Note that this will require some package tagging work. We'll want to put DataStreams and DataFrames upper bounds on all existing DataStreams-implementation packages (CSV, SQLite, ODBC, Feather), and do a new DataStreams tag along with a DataFrames tag that all those packages' new tags can then depend on.
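For illustration, the REQUIRE entries in a sink package such as CSV.jl would gain upper bounds along these lines (the version numbers here are placeholders, not the actual bounds):

DataStreams 0.1.0 0.2.0-
DataFrames 0.9.0 0.10.0-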

@quinnj
Member

quinnj commented Apr 13, 2017

I'm playing with this locally now; this isn't quite right yet, because we're not taking care of the mmapped ref argument that the DataFrame constructor takes for DataStreams. I opened a PR for DataArrays to allow storing that ref vector in a DataArray in the same way we do for NullableArrays. Once that's merged, we can utilize that functionality here to properly implement DataStreams.

@quinnj
Member

quinnj commented Sep 7, 2017

No longer relevant.

@quinnj quinnj closed this Sep 7, 2017
@cjprybol cjprybol deleted the cjp/addcsv branch September 10, 2017 00:24