-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update CSV.read docs about WeakRefString conversions #118
Conversation
Closes JuliaData#114
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@quinnj Maybe the docs should even recommend using |
src/Source.jl
Outdated
@@ -244,7 +244,7 @@ Keyword Arguments: | |||
* `categorical::Bool=true`: read string column as a `CategoricalArray` ([ref](https://github.com/JuliaData/CategoricalArrays.jl)), as long as the % of unique values seen during type detection is less than 67%. This will dramatically reduce memory use in cases where the number of unique values is small. | |||
|
|||
Note by default, "string" or text columns will be parsed as the [`WeakRefString`](https://github.com/quinnj/WeakRefStrings.jl) type. This is a custom type that only stores a pointer to the actual byte data + the number of bytes. | |||
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)`, this also works on an entire column. | |||
To convert a `String` to a standard Julia string type, just call `string(::WeakRefString)` for an individual observation, or `string.(::WeakRefString)` on an entire column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This looks like a nice small clarification, so I feel sorry for my lengthy comment ;) :
- I think
String
has to be fixed too since it is a "standard Julia string type". string.(::WeakRefString)
will work, but::WeakRefString
is not a type of the column
An alternative:
"
To convert wstr::WeakRefString
to a standard Julia string type, just call string(wstr)
, or string.(df[:wstr])
to convert an entire wstr
column of df
table.
"
However, the problem with this approach is that missing values in df[:wstr]
would be converted into "missing" strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, good catch. So I guess the only correct approach is convert(Union{String, Missing}, ::WeakRefString)
and convert(Array{Union{String, Missing}}, ::AbstractArray{WeakRefString})
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the problem with this approach is that missing values in df[:wstr] would be converted into "missing" strings.
Not sure I follow... is that a problem? Or do you mean they'd become ""
instead of Missings.missing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only correct approach is
convert(Union{String, Missing}, ::WeakRefString)
andconvert(Array{Union{String, Missing}}, ::AbstractArray{WeakRefString})
?
I also think so. More precisely, the scalar form convert(Union{String, Missing}, wstr)
would give the correct result both for wstr::WeakRefString
and wstr === missing
.
But convert(Array{Union{String, Missing}}, wstr)
should only be called for wstr::AbstractArray{Union{WeakRefString, Missing}}
unless we want to allow missingness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
julia> string(Missings.missing)
"missing"
string.(wstr::AbstractArray{Union{WeakRefString, Missing}})
doesn't preserve missingness: your Missings.missing
missing values in the array would become non-missing strings "missing"::String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately
julia> using CSV, Missings
julia> df = CSV.read(joinpath(Pkg.dir("CSV"), "test/test_files/attenu.csv"), null="NA", rows_for_type_detect=200);
julia> eltype(df[3])
Union{Missings.Missing, WeakRefString{UInt8}}
julia> eltype(convert.(Union{String, Missing}, df[3]))
Any
Maybe it's somehow related to this
julia> typeof(df[1,3])
String
julia> collect(skipmissing(df[3]))
ERROR: TypeError: _collect: in typeassert, expected WeakRefString{UInt8}, got String
Stacktrace:
[1] next at /home/astukalov/.julia/v0.6/Missings/src/Missings.jl:265 [inlined]
[2] _collect(::UnitRange{Int64}, ::Missings.EachSkipMissing{WeakRefStrings.WeakRefStringArray{Union{Missings.Missing, WeakRefString{UInt8}},1}}, ::Base.HasEltype, ::Base.SizeUnknown) at ./array.jl:442
[3] collect(::Missings.EachSkipMissing{WeakRefStrings.WeakRefStringArray{Union{Missings.Missing, WeakRefString{UInt8}},1}}) at ./array.jl:431
[4] macro expansion at ./REPL.jl:97 [inlined]
[5] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like WeakRefStringArray
is lying about its eltype: it should be String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If WeakRefString
to String
conversion is now supposed to be handled automatically, maybe we don't need this docstring section at all :)
I guess we need @quinnj assistance to clarify the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And regarding the original problem that the eltype is Any
, see JuliaLang/julia#24332.
So I guess the only solution for now is to recommend using either convert(Array{String}, x)
or convert(Array{Union{String, Missing}}, x)
depending on whether you want to support missing values or not.
Sorry, the conversation was more complex than anticipated. Would you update the PR? |
@nalimilan Yup, sorry -- lost track of things. Submitting now. |
Codecov Report
@@ Coverage Diff @@
## master #118 +/- ##
=======================================
Coverage 86.71% 86.71%
=======================================
Files 8 8
Lines 858 858
=======================================
Hits 744 744
Misses 114 114
Continue to review full report at Codecov.
|
Closing in favor of doc updates in #143 |
Closes #114