Move DataFrame sink for CSV/DataStreams to DataFrames #1174
Conversation
src/DataFrames.jl
Outdated
```diff
@@ -14,7 +14,7 @@ using Reexport
 @reexport using DataArrays
 using GZip
 using SortingAlgorithms
+using NullableArrays
```
DataFrames should not depend on NullableArrays...
What is the motivation for this PR? I'm not against it; I'm just not sure what the motivation is.
Let's review https://github.com/JuliaData/DataTables.jl/pull/35/files at the same time since these are very similar. Why doesn't this one include any
src/abstractdataframe/io.jl
Outdated
```julia
allocate{T}(::Type{T}, rows, ref) = Array{T}(rows)
allocate{T}(::Type{Vector{T}}, rows, ref) = Array{T}(rows)

if isdefined(Main, :DataArray)
```
Remove this condition.
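A minimal sketch of the unconditional alternative, assuming DataFrames' existing DataArrays dependency; the `DataVector` method signature and the `DataArray(T, rows)` call below are illustrative, not taken from this diff:

```julia
# Sketch: DataFrames already does `@reexport using DataArrays`, so the
# DataArray-backed allocation can be an ordinary method chosen by dispatch,
# with no isdefined(Main, :DataArray) guard.
allocate{T}(::Type{DataVector{T}}, rows, ref) = DataArray(T, rows)
```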
@andreasnoack The goal is to have CSV support loading data into either a DataFrame or a DataTable.
So the dependency is reversed?
Yes @andreasnoack. Did a little more work on the types here, so this should now support reading directly to DataArrays. Neither this nor the DataTables PR supports reading directly to Categoricals, but I'd like to save that for another PR.
EDIT: Sanity check that null-free reading works too:

```julia
julia> a = DataFrames.head(CSV.read("test.tsv", DataFrame, delim='\t', nullable=false))
6×6 DataFrames.DataFrame
│ Row │ chr │ start │ stop │ score │ source │ transcript_id │
├─────┼────────┼────────┼────────┼───────┼────────────┼─────────────────────┤
│ 1 │ "chr1" │ 906333 │ 907318 │ 0.033 │ "putative" │ "MSTRG.336.1" │
│ 2 │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000466827.1" │
│ 3 │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.339.21" │
│ 4 │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000464948.1" │
│ 5 │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "MSTRG.340.6" │
│ 6 │ "chr1" │ 941845 │ 945764 │ 0.041 │ "putative" │ "ENST00000496938.1" │
julia> DataFrames.eltypes(a)
6-element Array{Type,1}:
String
Int64
Int64
Float64
String
String
julia> a[:chr]
6-element Array{String,1}:
"chr1"
"chr1"
"chr1"
"chr1"
"chr1"
"chr1"
julia> b = DataTables.head(CSV.read("test.tsv", delim='\t', nullable=false))
6×6 DataTables.DataTable
│ Row │ chr │ start │ stop │ score │ source │ transcript_id │
├─────┼──────┼────────┼────────┼───────┼──────────┼───────────────────┤
│ 1 │ chr1 │ 906333 │ 907318 │ 0.033 │ putative │ MSTRG.336.1 │
│ 2 │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000466827.1 │
│ 3 │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.339.21 │
│ 4 │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000464948.1 │
│ 5 │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ MSTRG.340.6 │
│ 6 │ chr1 │ 941845 │ 945764 │ 0.041 │ putative │ ENST00000496938.1 │
julia> DataTables.eltypes(b)
6-element Array{Type,1}:
String
Int64
Int64
Float64
String
String
julia> b[:chr]
6-element Array{String,1}:
"chr1"
"chr1"
"chr1"
"chr1"
"chr1"
"chr1" |
src/abstractdataframe/io.jl
Outdated
```diff
@@ -229,7 +229,7 @@ importall DataStreams
 # DataFrames DataStreams implementation
 function Data.schema(df::DataFrame, ::Type{Data.Column})
     return Data.Schema(map(string, names(df)),
-        DataType[typeof(A) for A in df.columns], size(df, 1))
+        DataType[typeof(A) for A in df.columns], size(df, 1))
```
No, this should align with `map`, since it's the continuation of the argument list.
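For reference, the continuation aligned under `map` as suggested (a whitespace-only change) would read:

```julia
    return Data.Schema(map(string, names(df)),
                       DataType[typeof(A) for A in df.columns], size(df, 1))
```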
src/abstractdataframe/io.jl
Outdated
```julia
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::T, row, col, sch::Data.Schema{false}) =
    push!(sink.columns[col]::Vector{T}, val)
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::Nullable{T}, row, col, sch::Data.Schema{false}) =
    push!(sink.columns[col]::DataVector{T}, isnull(val) ? NA : get(val))
```
`get(val, NA)` should work and be more efficient (since internally it uses `ifelse` to avoid a branch when possible). Same below.
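Applying that suggestion, the second method would become (same signature as above; only the pushed value changes):

```julia
Data.streamto!{T}(sink::DataFrame, ::Type{Data.Field}, val::Nullable{T}, row, col, sch::Data.Schema{false}) =
    push!(sink.columns[col]::DataVector{T}, get(val, NA))  # unwraps val, or yields NA when null
```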
This looks good at first glance. I have a dedicated package for DataStreams testing (DataStreamIntegrationTests.jl) that we could tweak to ensure both DataTables and DataFrames get tested. Note that this will require some package tagging work. We'll want to put DataStreams and DataFrames upper bounds on all existing DataStreams-implementation packages (CSV, SQLite, ODBC, Feather), and do a new DataStreams tag along with a DataFrames tag that all those packages' new tags can then depend on.
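As a hypothetical sketch of the kind of REQUIRE upper bounds meant here (the version numbers are placeholders, not the bounds that will actually be tagged), an existing implementation package such as CSV could pin:

```
julia 0.5
DataStreams 0.1.0 0.2.0-
DataFrames 0.9.0 0.10.0-
```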
I'm playing with this locally now; this isn't quite right yet, because we're not taking care of the mmapped
No longer relevant.
No description provided.