Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify hash of VariableTimeZone #281

Merged
merged 5 commits into from
Sep 4, 2020
Merged

Conversation

oxinabox
Copy link
Contributor

@oxinabox oxinabox commented Sep 3, 2020

I think the gain we get out of reducing collisions for the case of two timezones with same name but different transitions and cutoffs is not worth the extra time that hashing takes.

@omus
Copy link
Member

omus commented Sep 3, 2020

I agree we can probably make this more performant but this is approach is definitely on the reckless side. If anything we should actually ignore the name and just look at the contents

@iamed2
Copy link
Member

iamed2 commented Sep 3, 2020

this is approach is definitely on the reckless side

How often do you expect there to be multiple different time zones with the same name?

@oxinabox
Copy link
Contributor Author

oxinabox commented Sep 3, 2020

TimeZone names used in practice are almost always from the standard set

For isequal we are requiring equal name, + equal content.
so for hash equal of any of the above is enough.

Equal content will have far more collisions as there is only a bit more than 24 basic offsets
Might be 100 i guess with tthe different transitions

@omus
Copy link
Member

omus commented Sep 3, 2020

There are a few tzdata named timezones like "MDT" that could defined by users. There are some plans to experiment time zones without pre-computed transitions which would definitely have hash collisions in that case. At a minimum we need to include the type in the hash here.

I think a less reckless way would be to precompute the hashes of the tzdata timezones which would eliminate the overhead and still support custom VariableTimeZones

@iamed2
Copy link
Member

iamed2 commented Sep 3, 2020

Hash collisions are okay as long as they're not extremely common. It's one additional isequal call.

At a minimum we need to include the type in the hash here.

Why?

@codecov-commenter
Copy link

codecov-commenter commented Sep 3, 2020

Codecov Report

Merging #281 into master will increase coverage by 1.16%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
+ Coverage   92.41%   93.58%   +1.16%     
==========================================
  Files          30       30              
  Lines        1398     1527     +129     
==========================================
+ Hits         1292     1429     +137     
+ Misses        106       98       -8     
Impacted Files Coverage Δ
src/types/variabletimezone.jl 100.00% <100.00%> (+9.09%) ⬆️
src/build.jl 100.00% <0.00%> (ø)
src/rounding.jl 100.00% <0.00%> (ø)
src/TimeZones.jl 100.00% <0.00%> (ø)
src/adjusters.jl 100.00% <0.00%> (ø)
src/interpret.jl 100.00% <0.00%> (ø)
src/conversions.jl 100.00% <0.00%> (ø)
src/tzdata/archive.jl 100.00% <0.00%> (ø)
src/types/fixedtimezone.jl 100.00% <0.00%> (ø)
src/utils.jl 97.56% <0.00%> (+0.06%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf53361...5e05f2b. Read the comment docs.

@omus
Copy link
Member

omus commented Sep 4, 2020

I'll explain some of my points in greater detail. Let's start off with the rule which under no circumstances should we break:

isequal(x,y) must imply that hash(x) == hash(y)

This is taken directly from the help for isequal. We shouldn't violate this rule (in reference to: #281 (comment))

Next,

Ideally, hashing is value-based and different representations have the same hash (e.g. a ranges and vectors hash to be the same if the values they include are identical)

For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same. I bring this point up just as a consideration. Using strictly the name would work here for different types but two time zones that are the same will differing names would fail this check. To me this makes me thing that names should probably be ignored for the purposes of hashing (including names in the transitions).

Finally, definitely an unofficial rule:

Dissimilar types should not hash to be same value (e.g. Int vs. Char)

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.

@iamed2
Copy link
Member

iamed2 commented Sep 4, 2020

isequal(x,y) must imply that hash(x) == hash(y)

Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y). In fact, this is inevitable if there are more unique values than there are UInts.

Ideally, hashing is value-based and different representations have the same hash (e.g. a ranges and vectors hash to be the same if the values they include are identical)

Not sure where you found this, as it's not in the help for isequal or hash in my Julia. This only applies when isequal(x, y) is true; this exists so that they are recorded as identical as keys a dictionary. I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.

Putting the type in the hash seems fine to me then.

@oxinabox
Copy link
Contributor Author

oxinabox commented Sep 4, 2020

For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same.

They are not the same value. isequals defines what we consider to be the same value.
and isequal says that you need to have the same name.
I can imagine that we might want to redefine isequal that way, but it is out of scope for this PR.

Ranges and vectors with the same value are considered equal by julia.

julia> isequal(1:4, [1,2,3,4])
true

Your second rule is just another way of saying the first.

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't idea

Yeah, I didn't consider it a problem, but if you do, fair enough.

@omus
Copy link
Member

omus commented Sep 4, 2020

Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y)

You are correct here. I was thinking about why I thought (hash(x) == hash(y)) == isequal(x, y) and I believe it is because DataFrames used to (possibly still) uses hash for joining and which they internally label as isequal functions. I usually make have made (hash(x) == hash(y)) == isequal(x, y) because of this. As VariableTimeZone isn't likely to be added as a column in a DataFrame it probably doesn't matter in this case.

I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.

What about time zones with the same name but different transitions being stored in the same dictionary?

@iamed2
Copy link
Member

iamed2 commented Sep 4, 2020

What about time zones with the same name but different transitions being stored in the same dictionary?

Those can be stored as different keys no problem, since isequal will be false. Collisions are fine, they are disambiguated by a cheap isequal check.

@iamed2
Copy link
Member

iamed2 commented Sep 4, 2020

I believe it is because DataFrames used to (possibly still) uses hash for joining and which they internally label as isequal functions

This is a serious bug if it exists, but I don't believe it does (anymore?). Looking at their code, they seem to use isequal to disambiguate, which is correct: https://github.com/JuliaData/DataFrames.jl/blob/fb4e184f3b3997da8680668fbf7fcc789811eb18/src/dataframerow/utils.jl#L126

@omus
Copy link
Member

omus commented Sep 4, 2020

Collisions are fine, they are disambiguated by a cheap isequal check.

TIL. I definitely did not know this that hash collisions had a disambiguation check. This does change things:

julia> using TimeZones

julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)

julia> Dict(tz1 => 'a', tz2 => 'b')
Dict{VariableTimeZone,Char} with 2 entries:
  tz"America/Winnipeg"                      => 'a'
  VariableTimeZone("America/Winnipeg", ...) => 'b'

julia> hash(tz1) == hash(tz2)
true

@omus
Copy link
Member

omus commented Sep 4, 2020

As we haven't done so yet I'll post a benchmark. Setup:

julia> using TimeZones, BenchmarkTools^C

julia> tz = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5

Current master

julia> @btime hash($tz)
  14.333 μs (187 allocations: 5.84 KiB)
0xe223c497ca74d779

This PR

julia> @btime hash($tz)
  17.395 ns (0 allocations: 0 bytes)
0x99df18304ac43999

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
@omus
Copy link
Member

omus commented Sep 4, 2020

I'll fiddle with this implementation and DataFrames to try and ensure we're not missing anything. If those tests work out I think we can proceed with this.

@omus
Copy link
Member

omus commented Sep 4, 2020

Nothing strange when working with DataFrames

julia> using DataFrames, TimeZones

julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)

julia> tz3 = deepcopy(tz1)
America/Winnipeg (UTC-6/UTC-5)

julia> df1 = DataFrame(:tz => tz1, :val => 1);

julia> df2 = DataFrame(:tz => tz2, :val => 2);

julia> df3 = DataFrame(:tz => tz3, :val => 3)

julia> innerjoin(df1, df2, on=:tz, makeunique=true)
0×3 DataFrame


julia> leftjoin(df1, df2, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1   │
│     │ VariableTimeZone               │ Int64 │ Int64?  │
├─────┼────────────────────────────────┼───────┼─────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1missing │

julia> outerjoin(df1, df2, on=:tz, makeunique=true)
2×3 DataFrame
│ Row │ tz                              │ val     │ val_1   │
│     │ VariableTimeZone                │ Int64?  │ Int64?  │
├─────┼─────────────────────────────────┼─────────┼─────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5)  │ 1missing │
│ 2   │ America/Winnipeg (UTC-06:28:36) │ missing2       │

julia> innerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1 │
│     │ VariableTimeZone               │ Int64 │ Int64 │
├─────┼────────────────────────────────┼───────┼───────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 13     │

julia> leftjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1  │
│     │ VariableTimeZone               │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 13      │

julia> outerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val    │ val_1  │
│     │ VariableTimeZone?              │ Int64? │ Int64? │
├─────┼────────────────────────────────┼────────┼────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 13      │

julia> leftjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true

julia> innerjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
@omus
Copy link
Member

omus commented Sep 4, 2020

Will merge after CI completes

@omus omus merged commit f2e0d46 into JuliaTime:master Sep 4, 2020
@oxinabox oxinabox deleted the patch-3 branch September 4, 2020 17:57
@omus omus added the performance Changes that impact code performance label Sep 23, 2020
kpamnany pushed a commit to RelationalAI-oss/TimeZones.jl that referenced this pull request May 5, 2023
Co-authored-by: Eric Davies <iamed2@gmail.com>
Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Changes that impact code performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants