-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify WeakRefStrings purpose #21
Conversation
…RefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors
Codecov Report
@@ Coverage Diff @@
## master #21 +/- ##
==========================================
+ Coverage 78.72% 79.59% +0.86%
==========================================
Files 1 1
Lines 47 49 +2
==========================================
+ Hits 37 39 +2
Misses 10 10
Continue to review full report at Codecov.
|
The key difference w.r.t #20 is that here |
Yeah, I think this is mostly orthogonal to #20. I wonder whether it's a good idea to remove features, though. As long as CSV doesn't export BTW, I wonder whether we could find a type which would be easier to understand, especially for |
src/WeakRefStrings.jl
Outdated
WeakRefStringArray(data::Vector{UInt8}, ::Type{Union{Missing, T}}, rows::Integer) where {T} = WeakRefStringArray(Any[data], Vector{Union{Missing, T}}(rows)) | ||
WeakRefStringArray(data::Vector{UInt8}, A::Array{T}) where {T <: Union{WeakRefString, Missing}} = WeakRefStringArray(Any[data], A) | ||
init(::Type{T}, rows) where {T} = fill(zero(T), rows) | ||
if !isdefined(Base, :uninitialized) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I though Compat provided this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only compat definition needed currently, so easier to just provide the single definition.
src/WeakRefStrings.jl
Outdated
|
||
wk(w::WeakRefString) = string(w) | ||
wk(::Missing) = missing | ||
|
||
Base.size(A::WeakRefStringArray) = size(A.elements) | ||
Base.eltype(A::WeakRefStringArray{T}) where {T <: WeakRefString} = String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These shouldn't be needed if WeakRefStringArray
correctly inherited from AbstractArray{String}
or AbstractArray{Union{String, Missing}}
.
What's worrysome to me with the current implementation is that upon any string-like operation struct LazyStringArray{T <: LazyString, N} <: AbstractArray{String, N}
data::Vector{Any}
elements::Array{Union{String, T}, N}
end The idea is that if Actually, this could be generalized to other lazy element allocation usecases, and I wonder whether something like that already exist (ala MappedArrays.jl, Lazy.jl) |
The other question I had is how much struct WeakRefString
ptr::Ptr{UInt8}
len::Int
end is faster/smaller than struct LazyString
data::Array{Uint8}
start::Int
len::Int
end that really references the data. |
I think we could also have the Though that |
If all columns/rows of the dataframe created by |
On the question of storing I'm also not entirely sure on the need to inherit from |
AFAIK the And why would you want dispatch not to accept a |
It will be considered as |
"Good enough" doesn't seem enough to me when we can easily implement the correct definition (and already have code by @alyst for that). We've been that way with |
But what if we end up implementing a proper AbstractString interface for WeakRefString directly? And all the sudden we actually want the element type to be WeakRefString? All that changes in my current definition here is we change what If there are actual cases where subtyping |
👍
Nah, it was not a "type parameter gymnastics". We can do much more overcomplicated things than that ;) Actually, I agree with @nalimilan. I think the issue of
I think that leaves us with |
See what the manual says:
Anyway, moving from |
Thanks for making these changes
I don't think this is so simple anymore. @yuyichao has improved the compiler such that some of these allocations can be avoided. I also think that |
Alright, thanks for the encouragement @alyst and @nalimilan; I've updated the type parameters here to better reflect the I've also done a few rounds of benchmarking w/ parsing a single string up to reading a million strings w/ CSV; while the improvements are minor on the small scale, WeakRefStrings still really provide huge performance benefits when reading lots and lots of string values. |
Great! Do you mean CSV.jl benchmarks or some internal ones? Would be nice to have some benchmark scripts in WeakRefString.jl. Also CI is green on 0.7, but I guess if |
@alyst, I didn't see any issues adding the eltypes from the other PR 🤷♂️ . |
Remove a few pitfall functions that allowed direct processing of WeakRefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors
cc: @nalimilan @alyst: I see this as step one to the plan in JuliaData/CSV.jl#140
Should resolve #19 (@andreasnoack).