Intern strings by default instead of using WeakRefString #204
WeakRefStrings are dangerous in interaction with use_mmap=true since crashes can happen if the file is modified during the lifetime of the resulting DataFrame. String interning can also be more efficient when the proportion of unique strings is small. Finally, returning plain Vector{String} columns is more user-friendly.
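To make the interning idea concrete, here is a minimal sketch of pooling repeated strings so that each distinct value is stored once. The names `intern!` and `intern_column` are hypothetical helpers written for this illustration; they are not the InternedStrings.jl API nor the code in this PR.

```julia
# Hypothetical helper (an assumption for illustration, not a library function):
# return the pooled copy of `s`, adding `s` to the pool the first time it is seen.
intern!(pool::Dict{String,String}, s::String) = get!(pool, s, s)

# Replace every element of `raw` by its pooled copy.
function intern_column(raw::Vector{String})
    pool = Dict{String,String}()
    col = [intern!(pool, s) for s in raw]
    return col, pool
end

# Each call to `string` allocates a fresh String, as a parser would.
raw = [string("city", i % 3) for i in 1:9]
col, pool = intern_column(raw)

println("rows: ", length(col), ", distinct strings kept: ", length(pool))
```

The result is still a plain Vector{String}; repeated values simply point at the same object, which is where the savings come from when few values are unique.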
This looks awesome. I love how simple it is. Looks like InternedStrings has a 0.7 error? Have you tested/benchmarked on 0.7 at all? It'd be nice to make sure there's nothing drastic there.
Also, it seems like we could just make this the default pretty soon (after weakrefstrings deprecation) and get rid of all the keyword arguments.
The only other thing that comes to mind is some of the recent work that @shashi and @andreasnoack have done on the new StringArray type in WeakRefStrings.jl. The performance story could be really awesome, but one advantage of StringArray is the ability to be mmapped and shared between processes. I think we could figure out a world where both approaches can live together though.
Yes, it's really cool how simple this approach is. I agree. Fixing the 0.7 failures doesn't seem to be hard, but while doing that I stumbled on a strange fact:
julia> x = UInt8['a'];
julia> y = UInt8['a'];
julia> String(x) === String(y)
true
julia> x === y
false
So it looks like Julia 0.7 interns strings by default? That would mean most of that PR can be dropped once 0.7 is out. EDIT: actually you need to use |
Yeah, not "interned by default", but it's more the result of all the alloc-elim pass work; i.e. the compiler is smarter at detecting when things are the same and avoiding unnecessary allocations when objects are "the same" ( |
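For readers trying this on a later Julia version, the sketch below separates "equal according to `===`" from "stored in the same memory". It assumes Julia 1.x semantics, where `===` on two String values compares their contents; comparing `pointer` values is one way to check for shared storage. This is only an illustration and not necessarily what the truncated EDIT above refers to.

```julia
# Build two separately allocated strings with identical contents.
bytes = UInt8[0x61, 0x62, 0x63]   # the bytes of "abc"
a = String(copy(bytes))           # copy first: String(::Vector{UInt8}) takes ownership of its argument
b = String(copy(bytes))

same_value  = a === b                    # compares contents for String values (Julia 1.x assumption)
same_memory = pointer(a) == pointer(b)   # compares the addresses of the underlying data

println("a === b: ", same_value)
println("same memory: ", same_memory)
```

If `same_value` is true while `same_memory` is false, the strings are indistinguishable to `===` but are not actually interned.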
Codecov Report
@@            Coverage Diff            @@
##           master    #204     +/-   ##
=========================================
- Coverage    84.5%   84.4%   -0.11%
=========================================
  Files           8       8
  Lines         897     904      +7
=========================================
+ Hits          758     763      +5
- Misses        139     141      +2
IIUC what Stefan said on Slack (at #strings), the new behavior of I've added a commit which fixes tests on 0.7 with JuliaString/InternedStrings.jl#15 (except for a failure which is already present on master). |
Tests pass locally on 0.7 with DataStreams master. Merging.
WeakRefStrings are dangerous in interaction with use_mmap=true since crashes can happen if the file is modified during the lifetime of the resulting DataFrame (#180). String interning can also be more efficient when the proportion of unique strings is small. Finally, returning plain Vector{String} columns is more user-friendly.
Ideally, DataFrames could take advantage of the fact that strings are interned to speed up grouping, which would allow using categorical=false by default.
A basic benchmark on 0.6 appears to indicate that interning is the fastest approach (not sure why the number of allocations is higher with WeakRefString):