Simplify hash of VariableTimeZone #281
Conversation
I agree we can probably make this more performant, but this approach is definitely on the reckless side. If anything we should actually ignore the name and just look at the contents. |
How often do you expect there to be multiple different time zones with the same name? |
TimeZone names used in practice are almost always from the standard set. Equal content will have far more collisions, as there are only a bit more than 24 basic offsets. |
There are a few tzdata-named time zones, like "MDT", that could be defined by users. There are also some plans to experiment with time zones without pre-computed transitions, which would definitely have hash collisions in that case. At a minimum we need to include the type in the hash here. I think a less reckless way would be to precompute the hashes of the tzdata time zones, which would eliminate the overhead and still support custom time zones. |
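For illustration, the precomputation idea might look like the sketch below. TZDATA_HASHES and hash_sketch are hypothetical names, not part of TimeZones.jl, and the content hash assumes the fields hash as expected:

# Hypothetical table, filled once from the compiled tzdata set.
const TZDATA_HASHES = Dict{String,UInt}()

function hash_sketch(tz::VariableTimeZone, h::UInt)
    # Known tzdata names hit the precomputed table; anything else
    # (a user-defined zone) pays the full content hash.
    precomputed = get(TZDATA_HASHES, tz.name, nothing)
    precomputed !== nothing && return hash(precomputed, h)
    return hash(tz.transitions, hash(tz.name, h))
end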
Hash collisions are okay as long as they're not extremely common. It's one additional equality check. |
Why? |
Codecov Report
@@ Coverage Diff @@
## master #281 +/- ##
==========================================
+ Coverage 92.41% 93.58% +1.16%
==========================================
Files 30 30
Lines 1398 1527 +129
==========================================
+ Hits 1292 1429 +137
+ Misses 106 98 -8
Continue to review full report at Codecov.
|
I'll explain some of my points in greater detail. Let's start off with the rule which under no circumstances should we break: isequal(x, y) implies hash(x) == hash(y). This is taken directly from the help for hash. Next, hash(x) != hash(y) implies !isequal(x, y). Finally, definitely an unofficial rule: values that are not the same should not hash the same. With this particular PR we're only hashing the name string, which currently results in identical hashes for time zones that share a name but have different transitions. |
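Presumably the hashing in question is a one-liner along these lines; this is an assumption about the shape of the diff, not a quote of it:

# Assumed shape of the PR's change: hash only the name, ignoring
# the transitions entirely.
Base.hash(tz::VariableTimeZone, h::UInt) = hash(tz.name, h)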
Under this rule it is perfectly okay to have hash(x) == hash(y) when !isequal(x, y).
Not sure where you found this, as it's not in the help for hash.
Putting the type in the hash seems fine to me then. |
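One way to fold the type in would be the sketch below; this is illustrative, not the merged code:

function Base.hash(tz::VariableTimeZone, h::UInt)
    h = hash(VariableTimeZone, h)  # distinguish e.g. a FixedTimeZone also named "MDT"
    return hash(tz.name, h)
end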
They are not the same value. Ranges and vectors with the same value are considered equal by Julia:
julia> isequal(1:4, [1, 2, 3, 4])
true
Your second rule is just another way of saying the first.
Yeah, I didn't consider it a problem, but if you do, fair enough. |
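As a concrete consequence of the official rule: since those two values are isequal, Base must (and does) give them identical hashes:

julia> hash(1:4) == hash([1, 2, 3, 4])
true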
You are correct here. I was thinking about why I thought
What about time zones with the same name but different transitions being stored in the same dictionary? |
Those can be stored as different keys no problem, since Dict falls back to isequal when hashes collide. |
This is a serious bug if it exists, but I don't believe it does (anymore?). Looking at their code, they seem to use isequal in addition to hash. |
TIL. I definitely did not know that hash collisions had a disambiguation check. This does change things:
julia> using TimeZones
julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)
julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)
julia> Dict(tz1 => 'a', tz2 => 'b')
Dict{VariableTimeZone,Char} with 2 entries:
tz"America/Winnipeg" => 'a'
VariableTimeZone("America/Winnipeg", ...) => 'b'
julia> hash(tz1) == hash(tz2)
true |
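To make the disambiguation mechanism explicit, here is a minimal sketch with a deliberately constant hash. The type K is made up for illustration; Dict still keeps the keys separate because lookup compares keys with isequal after the hash matches:

julia> struct K
           x::Int
       end

julia> Base.hash(::K, h::UInt) = h  # force every K to collide

julia> d = Dict(K(1) => 'a', K(2) => 'b');

julia> d[K(2)]  # still found: isequal breaks the tie
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)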
As we haven't done so yet, I'll post a benchmark. Setup:
julia> using TimeZones, BenchmarkTools
julia> tz = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)
Current master:
julia> @btime hash($tz)
14.333 μs (187 allocations: 5.84 KiB)
0xe223c497ca74d779
This PR:
julia> @btime hash($tz)
17.395 ns (0 allocations: 0 bytes)
0x99df18304ac43999 |
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
I'll fiddle with this implementation and DataFrames to try and ensure we're not missing anything. If those tests work out I think we can proceed with this. |
Nothing strange when working with DataFrames:
julia> using DataFrames, TimeZones
julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)
julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)
julia> tz3 = deepcopy(tz1)
America/Winnipeg (UTC-6/UTC-5)
julia> df1 = DataFrame(:tz => tz1, :val => 1);
julia> df2 = DataFrame(:tz => tz2, :val => 2);
julia> df3 = DataFrame(:tz => tz3, :val => 3);
julia> innerjoin(df1, df2, on=:tz, makeunique=true)
0×3 DataFrame
julia> leftjoin(df1, df2, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼─────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ missing │
julia> outerjoin(df1, df2, on=:tz, makeunique=true)
2×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64? │ Int64? │
├─────┼─────────────────────────────────┼─────────┼─────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ missing │
│ 2 │ America/Winnipeg (UTC-06:28:36) │ missing │ 2 │
julia> innerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64 │ Int64 │
├─────┼────────────────────────────────┼───────┼───────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ 3 │
julia> leftjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ 3 │
julia> outerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz │ val │ val_1 │
│ │ VariableTimeZone? │ Int64? │ Int64? │
├─────┼────────────────────────────────┼────────┼────────┤
│ 1 │ America/Winnipeg (UTC-6/UTC-5) │ 1 │ 3 │
julia> leftjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true
julia> innerjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true |
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Will merge after CI completes |
Co-authored-by: Eric Davies <iamed2@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
I think the gain we get out of reducing collisions for the case of two time zones with the same name but different transitions and cutoffs is not worth the extra time that hashing takes.
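For contrast, the collision-free alternative being rejected here would have to fold the contents in as well. A sketch, assuming the fields name, transitions, and cutoff, and not claimed to be the merged code:

# Slower but collision-free across contents: mix in everything.
hash_by_content(tz::VariableTimeZone, h::UInt) =
    hash(tz.cutoff, hash(tz.transitions, hash(tz.name, hash(VariableTimeZone, h))))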