Simplify hash of VariableTimeZone #281

oxinabox · 2020-09-03T20:55:14Z

I think the gain we get out of reducing collisions for the case of two timezones with same name but different transitions and cutoffs is not worth the extra time that hashing takes.

omus · 2020-09-03T21:00:58Z

I agree we can probably make this more performant but this is approach is definitely on the reckless side. If anything we should actually ignore the name and just look at the contents

iamed2 · 2020-09-03T21:02:55Z

this is approach is definitely on the reckless side

How often do you expect there to be multiple different time zones with the same name?

oxinabox · 2020-09-03T21:03:37Z

TimeZone names used in practice are almost always from the standard set

For isequal we are requiring equal name, + equal content.
so for hash equal of any of the above is enough.

Equal content will have far more collisions as there is only a bit more than 24 basic offsets
Might be 100 i guess with tthe different transitions

omus · 2020-09-03T21:10:35Z

There are a few tzdata named timezones like "MDT" that could defined by users. There are some plans to experiment time zones without pre-computed transitions which would definitely have hash collisions in that case. At a minimum we need to include the type in the hash here.

I think a less reckless way would be to precompute the hashes of the tzdata timezones which would eliminate the overhead and still support custom VariableTimeZones

iamed2 · 2020-09-03T21:16:26Z

Hash collisions are okay as long as they're not extremely common. It's one additional isequal call.

At a minimum we need to include the type in the hash here.

Why?

codecov-commenter · 2020-09-03T21:16:29Z

Codecov Report

Merging #281 into master will increase coverage by 1.16%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
+ Coverage   92.41%   93.58%   +1.16%     
==========================================
  Files          30       30              
  Lines        1398     1527     +129     
==========================================
+ Hits         1292     1429     +137     
+ Misses        106       98       -8

Impacted Files	Coverage Δ
src/types/variabletimezone.jl	`100.00% <100.00%> (+9.09%)`	⬆️
src/build.jl	`100.00% <0.00%> (ø)`
src/rounding.jl	`100.00% <0.00%> (ø)`
src/TimeZones.jl	`100.00% <0.00%> (ø)`
src/adjusters.jl	`100.00% <0.00%> (ø)`
src/interpret.jl	`100.00% <0.00%> (ø)`
src/conversions.jl	`100.00% <0.00%> (ø)`
src/tzdata/archive.jl	`100.00% <0.00%> (ø)`
src/types/fixedtimezone.jl	`100.00% <0.00%> (ø)`
src/utils.jl	`97.56% <0.00%> (+0.06%)`	⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf53361...5e05f2b. Read the comment docs.

omus · 2020-09-04T13:34:27Z

I'll explain some of my points in greater detail. Let's start off with the rule which under no circumstances should we break:

isequal(x,y) must imply that hash(x) == hash(y)

This is taken directly from the help for isequal. We shouldn't violate this rule (in reference to: #281 (comment))

Next,

Ideally, hashing is value-based and different representations have the same hash (e.g. a ranges and vectors hash to be the same if the values they include are identical)

For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same. I bring this point up just as a consideration. Using strictly the name would work here for different types but two time zones that are the same will differing names would fail this check. To me this makes me thing that names should probably be ignored for the purposes of hashing (including names in the transitions).

Finally, definitely an unofficial rule:

Dissimilar types should not hash to be same value (e.g. Int vs. Char)

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.

iamed2 · 2020-09-04T13:51:27Z

isequal(x,y) must imply that hash(x) == hash(y)

Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y). In fact, this is inevitable if there are more unique values than there are UInts.

Ideally, hashing is value-based and different representations have the same hash (e.g. a ranges and vectors hash to be the same if the values they include are identical)

Not sure where you found this, as it's not in the help for isequal or hash in my Julia. This only applies when isequal(x, y) is true; this exists so that they are recorded as identical as keys a dictionary. I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't ideal. We had a similar problem with hashing of ZonedDateTime and there we added a seed to avoid collisions between hash(::DateTime) and hash(::ZonedDateTime). This is why I brought up including the type in the hash.

Putting the type in the hash seems fine to me then.

oxinabox · 2020-09-04T13:56:09Z

For VariableTimeZone this would mean another representation of the same time zone (possibly one that dynamically computes transitions) should ideally hash to be the same.

They are not the same value. isequals defines what we consider to be the same value.
and isequal says that you need to have the same name.
I can imagine that we might want to redefine isequal that way, but it is out of scope for this PR.

Ranges and vectors with the same value are considered equal by julia.

julia> isequal(1:4, [1,2,3,4])
true

Your second rule is just another way of saying the first.

With this particular PR we're only hashing the string which currently results in hash("UTC") == hash(tz"UTC") which isn't idea

Yeah, I didn't consider it a problem, but if you do, fair enough.

omus · 2020-09-04T14:23:34Z

Under this rule it is perfectly okay to have hash(x) == hash(y) and !isequal(x, y)

You are correct here. I was thinking about why I thought (hash(x) == hash(y)) == isequal(x, y) and I believe it is because DataFrames used to (possibly still) uses hash for joining and which they internally label as isequal functions. I usually make have made (hash(x) == hash(y)) == isequal(x, y) because of this. As VariableTimeZone isn't likely to be added as a column in a DataFrame it probably doesn't matter in this case.

I definitely think we should be able to store time zones with different names but the same transitions as different keys in a dictionary.

What about time zones with the same name but different transitions being stored in the same dictionary?

iamed2 · 2020-09-04T14:36:53Z

What about time zones with the same name but different transitions being stored in the same dictionary?

Those can be stored as different keys no problem, since isequal will be false. Collisions are fine, they are disambiguated by a cheap isequal check.

iamed2 · 2020-09-04T14:46:32Z

I believe it is because DataFrames used to (possibly still) uses hash for joining and which they internally label as isequal functions

This is a serious bug if it exists, but I don't believe it does (anymore?). Looking at their code, they seem to use isequal to disambiguate, which is correct: https://github.com/JuliaData/DataFrames.jl/blob/fb4e184f3b3997da8680668fbf7fcc789811eb18/src/dataframerow/utils.jl#L126

omus · 2020-09-04T14:48:11Z

Collisions are fine, they are disambiguated by a cheap isequal check.

TIL. I definitely did not know this that hash collisions had a disambiguation check. This does change things:

julia> using TimeZones

julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)

julia> Dict(tz1 => 'a', tz2 => 'b')
Dict{VariableTimeZone,Char} with 2 entries:
  tz"America/Winnipeg"                      => 'a'
  VariableTimeZone("America/Winnipeg", ...) => 'b'

julia> hash(tz1) == hash(tz2)
true

omus · 2020-09-04T14:53:13Z

As we haven't done so yet I'll post a benchmark. Setup:

julia> using TimeZones, BenchmarkTools^C

julia> tz = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5

Current master

julia> @btime hash($tz)
  14.333 μs (187 allocations: 5.84 KiB)
0xe223c497ca74d779

This PR

julia> @btime hash($tz)
  17.395 ns (0 allocations: 0 bytes)
0x99df18304ac43999

src/types/variabletimezone.jl

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

omus · 2020-09-04T15:29:38Z

I'll fiddle with this implementation and DataFrames to try and ensure we're not missing anything. If those tests work out I think we can proceed with this.

omus · 2020-09-04T15:45:20Z

Nothing strange when working with DataFrames

julia> using DataFrames, TimeZones

julia> tz1 = tz"America/Winnipeg"
America/Winnipeg (UTC-6/UTC-5)

julia> tz2 = VariableTimeZone("America/Winnipeg", tz1.transitions[1:1], nothing)
America/Winnipeg (UTC-06:28:36)

julia> tz3 = deepcopy(tz1)
America/Winnipeg (UTC-6/UTC-5)

julia> df1 = DataFrame(:tz => tz1, :val => 1);

julia> df2 = DataFrame(:tz => tz2, :val => 2);

julia> df3 = DataFrame(:tz => tz3, :val => 3)

julia> innerjoin(df1, df2, on=:tz, makeunique=true)
0×3 DataFrame


julia> leftjoin(df1, df2, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1   │
│     │ VariableTimeZone               │ Int64 │ Int64?  │
├─────┼────────────────────────────────┼───────┼─────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1     │ missing │

julia> outerjoin(df1, df2, on=:tz, makeunique=true)
2×3 DataFrame
│ Row │ tz                              │ val     │ val_1   │
│     │ VariableTimeZone                │ Int64?  │ Int64?  │
├─────┼─────────────────────────────────┼─────────┼─────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5)  │ 1       │ missing │
│ 2   │ America/Winnipeg (UTC-06:28:36) │ missing │ 2       │

julia> innerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1 │
│     │ VariableTimeZone               │ Int64 │ Int64 │
├─────┼────────────────────────────────┼───────┼───────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1     │ 3     │

julia> leftjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val   │ val_1  │
│     │ VariableTimeZone               │ Int64 │ Int64? │
├─────┼────────────────────────────────┼───────┼────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1     │ 3      │

julia> outerjoin(df1, df3, on=:tz, makeunique=true)
1×3 DataFrame
│ Row │ tz                             │ val    │ val_1  │
│     │ VariableTimeZone?              │ Int64? │ Int64? │
├─────┼────────────────────────────────┼────────┼────────┤
│ 1   │ America/Winnipeg (UTC-6/UTC-5) │ 1      │ 3      │

julia> leftjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true

julia> innerjoin(df1, df3, on=:tz, makeunique=true).tz[1] === tz1 !== tz3
true

test/types/variabletimezone.jl

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

omus · 2020-09-04T16:07:17Z

Will merge after CI completes

Co-authored-by: Eric Davies <iamed2@gmail.com> Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Simplify hash of VariableTimeZone

fb8e1b1

Fix hash test for name-based hashing

622a54f

Include type in VariableTimeZone hash

4d2fa3d

omus reviewed Sep 4, 2020

View reviewed changes

src/types/variabletimezone.jl Outdated Show resolved Hide resolved

Update src/types/variabletimezone.jl

94fb4b7

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

omus reviewed Sep 4, 2020

View reviewed changes

test/types/variabletimezone.jl Show resolved Hide resolved

Update test/types/variabletimezone.jl

5e05f2b

Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

omus approved these changes Sep 4, 2020

View reviewed changes

omus merged commit f2e0d46 into JuliaTime:master Sep 4, 2020

oxinabox deleted the patch-3 branch September 4, 2020 17:57

omus mentioned this pull request Sep 4, 2020

POC: Make ZonedDateTime an isbits type #287

Open

omus added the performance Changes that impact code performance label Sep 23, 2020

omus mentioned this pull request May 5, 2021

why does isequal and == have different definitions for VariableTimeZone #334

Open

kpamnany pushed a commit to RelationalAI-oss/TimeZones.jl that referenced this pull request May 5, 2023

Simplify hash of VariableTimeZone (JuliaTime#281)

a218132

Co-authored-by: Eric Davies <iamed2@gmail.com> Co-authored-by: Curtis Vogt <curtis.vogt@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify hash of VariableTimeZone #281

Simplify hash of VariableTimeZone #281

oxinabox commented Sep 3, 2020

omus commented Sep 3, 2020

iamed2 commented Sep 3, 2020

oxinabox commented Sep 3, 2020 •

edited

Loading

omus commented Sep 3, 2020 •

edited

Loading

iamed2 commented Sep 3, 2020

codecov-commenter commented Sep 3, 2020 •

edited

Loading

omus commented Sep 4, 2020

iamed2 commented Sep 4, 2020

oxinabox commented Sep 4, 2020

omus commented Sep 4, 2020

iamed2 commented Sep 4, 2020

iamed2 commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

Simplify hash of VariableTimeZone #281

Simplify hash of VariableTimeZone #281

Conversation

oxinabox commented Sep 3, 2020

omus commented Sep 3, 2020

iamed2 commented Sep 3, 2020

oxinabox commented Sep 3, 2020 • edited Loading

omus commented Sep 3, 2020 • edited Loading

iamed2 commented Sep 3, 2020

codecov-commenter commented Sep 3, 2020 • edited Loading

Codecov Report

omus commented Sep 4, 2020

iamed2 commented Sep 4, 2020

oxinabox commented Sep 4, 2020

omus commented Sep 4, 2020

iamed2 commented Sep 4, 2020

iamed2 commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

omus commented Sep 4, 2020

oxinabox commented Sep 3, 2020 •

edited

Loading

omus commented Sep 3, 2020 •

edited

Loading

codecov-commenter commented Sep 3, 2020 •

edited

Loading