Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intern strings by default instead of using WeakRefString #204

Merged
merged 3 commits into from
May 12, 2018
Merged

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented May 10, 2018

WeakRefStrings are dangerous in interaction with use_mmap=true since crashes can happen
if the file is modified during the lifetime of the resulting DataFrame (#180). String interning can also be more efficient when the proportion of unique strings is small. Finally, returning plain Vector{String} columns is more user-friendly.

Ideally, DataFrames could take advantage of the fact that strings are interned to speed up grouping, which would allow using categorical=false by default.

A basic benchmark on 0.6 appears to indicate that interning is the fastest approach (not sure why the number of allocations is higher with WeakRefString):

julia> @btime CSV.read("test/test_files/Fielding.csv", strings=:intern, categorical=false, types=Dict("GS"=>Union{Int, Missing},"PO"=>Union{Int, Missing},"A"=>Union{Int, Missing},"E"=>Union{Int, Missing},"DP"=>Union{Int, Missing},"PB"=>Union{Int, Missing},"InnOuts"=>Union{Int, Missing},"WP"=>Union{Int, Missing},"SB"=>Union{Int, Missing},"CS"=>Union{Int, Missing},"ZR"=>Union{Int, Missing}), rows_for_type_detect=1);
  615.896 ms (10589390 allocations: 195.60 MiB)

julia> @btime CSV.read("CSV/test/test_files/Fielding.csv", strings=:weakref, categorical=false, types=Dict("GS"=>Union{Int, Missing},"PO"=>Union{Int, Missing},"A"=>Union{Int, Missing},"E"=>Union{Int, Missing},"DP"=>Union{Int, Missing},"PB"=>Union{Int, Missing},"InnOuts"=>Union{Int, Missing},"WP"=>Union{Int, Missing},"SB"=>Union{Int, Missing},"CS"=>Union{Int, Missing},"ZR"=>Union{Int, Missing}), rows_for_type_detect=1);
  675.738 ms (12563641 allocations: 255.85 MiB)

julia> @btime CSV.read("test/test_files/Fielding.csv", strings=:raw, categorical=false, types=Dict("GS"=>Union{Int, Missing},"PO"=>Union{Int, Missing},"A"=>Union{Int, Missing},"E"=>Union{Int, Missing},"DP"=>Union{Int, Missing},"PB"=>Union{Int, Missing},"InnOuts"=>Union{Int, Missing},"WP"=>Union{Int, Missing},"SB"=>Union{Int, Missing},"CS"=>Union{Int, Missing},"ZR"=>Union{Int, Missing}), rows_for_type_detect=1);
  725.674 ms (10589390 allocations: 195.60 MiB)

WeakRefStrings are dangerous in interaction with use_mmap=true since crashes can happen
if the file is modified during the lifetime of the resulting DataFrame. String interning
can also be more efficient when the proportion of unique strings is small. Finally,
returning plain Vector{String} columns is more user-friendly.
Copy link
Member

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. I love how simple it is. Looks like InternedStrings has a 0.7 error? Have you tested/benchmarked on 0.7 at all? It'd be nice to make sure there's nothing drastic there.

Also, it seems like we could just make this the default pretty soon (after weakrefstrings deprecation) and get rid of all the keyword arguments.

The only other thing that comes to mind is some of the recent work that @shashi and @andreasnoack have done on the new StringArray type in WeakRefStrings.jl. The performance story could be really awesome, but one advantage of StringArray is the ability to be mmapped and shared between processes. I think we could figure out a world where both approaches can live together though.

@nalimilan
Copy link
Member Author

nalimilan commented May 11, 2018

Yes, it's really cool how simple this approach is. I agree WeakRefStringArray/StringArray are useful too, but they are better kept as options given that they only work well when you can ensure the backing file is preserved.

Fixing the 0.7 failures doesn't seem to be hard, but while doing that I bumped on a strange fact:

julia> x = UInt8['a'];
julia> y = UInt8['a'];

julia> String(x) === String(y)
true

julia> x === y
false

So it looks like Julia 0.7 interns strings by default? That would mean most of that PR can be dropped once 0.7 is out.

EDIT: actually you need to use pointer(String(x)) === pointer(String(y)) now to check whether two strings share the same storage. I'll update the PR.

@quinnj
Copy link
Member

quinnj commented May 11, 2018

Yeah, not "interned by default", but it's more the result of all the alloc-elim pass work; i.e. the compiler is smarter at detecting when things are the same and avoiding unnecessary allocations when objects are "the same" (===). Interning strings will still be useful for the foreseeable future.

@codecov
Copy link

codecov bot commented May 11, 2018

Codecov Report

Merging #204 into master will decrease coverage by 0.1%.
The diff coverage is 64.7%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #204      +/-   ##
=========================================
- Coverage    84.5%   84.4%   -0.11%     
=========================================
  Files           8       8              
  Lines         897     904       +7     
=========================================
+ Hits          758     763       +5     
- Misses        139     141       +2
Impacted Files Coverage Δ
src/parsefields.jl 91.17% <100%> (+0.49%) ⬆️
src/CSV.jl 55.17% <33.33%> (-1.98%) ⬇️
src/Source.jl 91.3% <50%> (-0.55%) ⬇️
src/TransposedSource.jl 66.27% <60%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29d1c52...bf9b4ba. Read the comment docs.

@nalimilan
Copy link
Member Author

Yeah, not "interned by default", but it's more the result of all the alloc-elim pass work; i.e. the compiler is smarter at detecting when things are the same and avoiding unnecessary allocations when objects are "the same" (===). Interning strings will still be useful for the foreseeable future.

IIUC what Stefan said on Slack (at #strings), the new behavior of === is just a semantic change, it doesn't actually mean that the compiler avoids allocations when possible. And indeed pointer still shows that objects are actually different.

I've added a commit which fixes tests on 0.7 with JuliaString/InternedStrings.jl#15 (except for a failure which is already present on master).

@nalimilan
Copy link
Member Author

Tests pass locally on 0.7 with DataStreams master. Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants