Intern strings by default instead of using WeakRefString #204

nalimilan · 2018-05-10T15:31:23Z

WeakRefStrings are dangerous in interaction with use_mmap=true since crashes can happen if the file is modified during the lifetime of the resulting DataFrame (#180). String interning can also be more efficient when the proportion of unique strings is small. Finally, returning plain Vector{String} columns is more user-friendly.
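
To illustrate why interning pays off when few values are unique, here is a minimal sketch. The `POOL` dictionary and the `intern_sketch` function are hypothetical names made up for this example; this is not how CSV.jl or InternedStrings.jl implement interning, only the general idea of keeping one shared copy per distinct value.

```julia
# Hypothetical interning pool, for illustration only.
const POOL = Dict{String,String}()

# Return the pooled copy of `s`; the first string seen with a given
# content becomes the shared copy for every later equal string.
intern_sketch(s::String) = get!(POOL, s, s)

# Six freshly built strings, but only two distinct values.
values = ["NYA", "BOS", "NYA", "NYA", "BOS", "NYA"]
raw = [String(collect(codeunits(v))) for v in values]

# After interning, the column holds references to the pooled copies.
col = map(intern_sketch, raw)

# Number of distinct string buffers referenced by the column.
n_buffers = length(unique(pointer.(col)))
println("rows: ", length(col), ", distinct buffers: ", n_buffers)
```

The column is still a plain Vector{String}, so it does not depend on the input file staying unchanged, and repeated values share storage instead of each row keeping its own copy.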

Ideally, DataFrames could take advantage of the fact that strings are interned to speed up grouping, which would allow using categorical=false by default.

A basic benchmark on 0.6 appears to indicate that interning is the fastest approach (not sure why the number of allocations is higher with WeakRefString):

julia> @btime CSV.read("test/test_files/Fielding.csv", strings=:intern, categorical=false, types=Dict("GS"=>Union{Int, Missing},"PO"=>Union{Int, Missing},"A"=>Union{Int, Missing},"E"=>Union{Int, Missing},"DP"=>Union{Int, Missing},"PB"=>Union{Int, Missing},"InnOuts"=>Union{Int, Missing},"WP"=>Union{Int, Missing},"SB"=>Union{Int, Missing},"CS"=>Union{Int, Missing},"ZR"=>Union{Int, Missing}), rows_for_type_detect=1);
  615.896 ms (10589390 allocations: 195.60 MiB)

julia> @btime CSV.read("CSV/test/test_files/Fielding.csv", strings=:weakref, categorical=false, types=Dict("GS"=>Union{Int, Missing},"PO"=>Union{Int, Missing},"A"=>Union{Int, Missing},"E"=>Union{Int, Missing},"DP"=>Union{Int, Missing},"PB"=>Union{Int, Missing},"InnOuts"=>Union{Int, Missing},"WP"=>Union{Int, Missing},"SB"=>Union{Int, Missing},"CS"=>Union{Int, Missing},"ZR"=>Union{Int, Missing}), rows_for_type_detect=1);
  675.738 ms (12563641 allocations: 255.85 MiB)

julia> @btime CSV.read("test/test_files/Fielding.csv", strings=:raw, categorical=false, types=Dict("GS"=>Union{Int, Missing},"PO"=>Union{Int, Missing},"A"=>Union{Int, Missing},"E"=>Union{Int, Missing},"DP"=>Union{Int, Missing},"PB"=>Union{Int, Missing},"InnOuts"=>Union{Int, Missing},"WP"=>Union{Int, Missing},"SB"=>Union{Int, Missing},"CS"=>Union{Int, Missing},"ZR"=>Union{Int, Missing}), rows_for_type_detect=1);
  725.674 ms (10589390 allocations: 195.60 MiB)

quinnj

This looks awesome. I love how simple it is. Looks like InternedStrings has a 0.7 error? Have you tested/benchmarked on 0.7 at all? It'd be nice to make sure there's nothing drastic there.

Also, it seems like we could just make this the default pretty soon (after weakrefstrings deprecation) and get rid of all the keyword arguments.

The only other thing that comes to mind is some of the recent work that @shashi and @andreasnoack have done on the new StringArray type in WeakRefStrings.jl. The performance story could be really awesome, but one advantage of StringArray is the ability to be mmapped and shared between processes. I think we could figure out a world where both approaches can live together though.

nalimilan · 2018-05-11T13:10:35Z

Yes, it's really cool how simple this approach is. I agree WeakRefStringArray/StringArray are useful too, but they are better kept as options given that they only work well when you can ensure the backing file is preserved.

Fixing the 0.7 failures doesn't seem to be hard, but while doing that I stumbled upon a strange fact:

julia> x = UInt8['a'];
julia> y = UInt8['a'];

julia> String(x) === String(y)
true

julia> x === y
false

So it looks like Julia 0.7 interns strings by default? That would mean most of this PR can be dropped once 0.7 is out.

EDIT: actually you need to use pointer(String(x)) === pointer(String(y)) now to check whether two strings share the same storage. I'll update the PR.

quinnj · 2018-05-11T13:18:28Z

Yeah, not "interned by default", but it's more the result of all the alloc-elim pass work; i.e. the compiler is smarter at detecting when things are the same and avoiding unnecessary allocations when objects are "the same" (===). Interning strings will still be useful for the foreseeable future.

codecov · 2018-05-11T13:35:56Z

Codecov Report

Merging #204 into master will decrease coverage by 0.1%.
The diff coverage is 64.7%.

@@            Coverage Diff            @@
##           master    #204      +/-   ##
=========================================
- Coverage    84.5%   84.4%   -0.11%     
=========================================
  Files           8       8              
  Lines         897     904       +7     
=========================================
+ Hits          758     763       +5     
- Misses        139     141       +2

Impacted Files	Coverage Δ
src/parsefields.jl	`91.17% <100%> (+0.49%)`	⬆️
src/CSV.jl	`55.17% <33.33%> (-1.98%)`	⬇️
src/Source.jl	`91.3% <50%> (-0.55%)`	⬇️
src/TransposedSource.jl	`66.27% <60%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 29d1c52...bf9b4ba.

nalimilan · 2018-05-11T13:43:20Z

> Yeah, not "interned by default", but it's more the result of all the alloc-elim pass work; i.e. the compiler is smarter at detecting when things are the same and avoiding unnecessary allocations when objects are "the same" (===). Interning strings will still be useful for the foreseeable future.

IIUC what Stefan said on Slack (at #strings), the new behavior of === is just a semantic change; it doesn't actually mean that the compiler avoids allocations when possible. And indeed pointer still shows that the objects are actually different.
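
As a rough sketch of the difference between the two checks (variable names are chosen for this example only), comparing with === looks at string contents on Julia 0.7 and later, while comparing pointers looks at the underlying buffers:

```julia
# Two byte vectors with the same contents, allocated separately.
x = UInt8['a']
y = UInt8['a']

# Each call builds its own String object.
s = String(x)
t = String(y)

# Content-based comparison on Julia 0.7 and later.
same_value = (s === t)

# Address comparison: tells whether both strings share one buffer.
same_storage = (pointer(s) == pointer(t))

println("same value: ", same_value, ", same storage: ", same_storage)
```

Only the pointer comparison can tell whether interning actually took place, which is why the tests need to use it.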

I've added a commit which fixes tests on 0.7 with JuliaString/InternedStrings.jl#15 (except for a failure which is already present on master).

nalimilan · 2018-05-12T14:59:46Z

Tests pass locally on 0.7 with DataStreams master. Merging.

nalimilan mentioned this pull request May 10, 2018

Add intern(::Type, ::AbstractString) to choose custom return type JuliaString/InternedStrings.jl#10

Merged

quinnj approved these changes May 11, 2018

Julia 0.7 fixes

b3dbf47

nalimilan mentioned this pull request May 11, 2018

Fix most deprecations on Julia 0.7 JuliaString/InternedStrings.jl#15

Merged

nalimilan closed this May 12, 2018

nalimilan reopened this May 12, 2018

Fix failing test on Julia 0.7

bf9b4ba

nalimilan merged commit aa99e5c into master May 12, 2018

nalimilan deleted the nl/intern branch May 12, 2018 15:00

nalimilan mentioned this pull request May 12, 2018

Julia crashes with "Bus error: 10" after CSV.write call #180

Closed

nalimilan mentioned this pull request Jul 3, 2018

CategoricalArrays without CategoricalValue JuliaData/CategoricalArrays.jl#151

Closed
