Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZeroMap2d::from_iter seems to be broken #4161

Closed
sffc opened this issue Oct 14, 2023 · 4 comments · Fixed by #4160
Closed

ZeroMap2d::from_iter seems to be broken #4161

sffc opened this issue Oct 14, 2023 · 4 comments · Fixed by #4160
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake T-bug Type: Bad behavior, security, privacy

Comments

@sffc
Copy link
Member

sffc commented Oct 14, 2023

When constructing a ZeroMap2d with the following data in the following order:

[
(tinystr!(8, "aedxb"), 0, Some(tinystr!(4, "gulf"))),
(tinystr!(8, "afkbl"), 0, Some(tinystr!(4, "afgh"))),
(tinystr!(8, "ushnl"), 0, None),
(tinystr!(8, "ushnl"), 7272660, Some(tinystr!(4, "haal"))),
(tinystr!(8, "ushnl"), 0, None),
(tinystr!(8, "ushnl"), 7272660, Some(tinystr!(4, "haal"))),
]

ZeroMap2d::from_iter produces the following ZeroMap2d:

ZeroMap2d {
  keys0: ZeroVec(["aedxb", "afkbl", "ushnl"]),
  joiner: ZeroVec([1, 2, 4]),
  keys1: ZeroVec([0, 0, 0, 7272660]),
  values: ZeroVec([None, Some("haal"), None, Some("haal")])
}

which is... wrong.

See test case in #4160

@sffc sffc added T-bug Type: Bad behavior, security, privacy C-zerovec Component: Yoke, ZeroVec, DataBake labels Oct 14, 2023
@sffc
Copy link
Member Author

sffc commented Oct 14, 2023

It is most likely due to the duplicate entries for "ushnl" in the input data.

I think this is new in CLDR 44 because they fixed up some data involving merged time zones. For example, there is the following data:

          "Pacific": {
            "Honolulu": [
              {
                "usesMetazone": {
                  "_mzone": "Alaska_Hawaii",
                  "_to": "1983-10-30 11:00"
                }
              },
              {
                "usesMetazone": {
                  "_mzone": "Hawaii_Aleutian",
                  "_from": "1983-10-30 11:00"
                }
              }
            ],
            "Johnston": [
              {
                "usesMetazone": {
                  "_mzone": "Alaska_Hawaii",
                  "_to": "1983-10-30 11:00"
                }
              },
              {
                "usesMetazone": {
                  "_mzone": "Hawaii_Aleutian",
                  "_from": "1983-10-30 11:00"
                }
              }
            ],

However, both Pacific/Honolulu and Pacific/Johnston map to ushnl:

        "ushnl": {
          "_description": "Honolulu, United States",
          "_alias": "Pacific/Honolulu US/Hawaii Pacific/Johnston"
        },

The old BCP-47 alias for Pacific/Johnston is deprecated in CLDR 44:

        "umjon": {
          "_deprecated": true,
          "_description": "Johnston Atoll, U.S. Minor Outlying Islands",
          "_preferred": "ushnl"
        },

@Manishearth
Copy link
Member

Manishearth commented Oct 16, 2023

Yeah, I don't know that much about ZM2D's code. from_iter is a hard function to write efficiently so if we optimized it at some point I bet we messed up a bit.

@sffc
Copy link
Member Author

sffc commented Oct 16, 2023

So the solution is to prevent the addition of duplicate keys, but what should we do if we encounter one in from_iter?

  1. Keep the old value (silently drop the new value)
  2. Keep the new value (silently drop the old value)
  3. Panic if the values are different
  4. Debug-panic if the values are different
    1. ...and keep the old value
    2. ...and keep the new value
  5. Log a warning if the values are different
    1. ...and keep the old value
    2. ...and keep the new value

Until the standard library adds a fallible version of from_iter, I lean toward option 3, but would be okay with option 4 or 5. Note that option 5 requires plumbing the log crate into zerovec which we may or may not want to do.

Thoughts? @Manishearth

@Manishearth
Copy link
Member

Option 3 works for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake T-bug Type: Bad behavior, security, privacy
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants