Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethinking identifiers #3995

Closed
bhousel opened this issue Jun 19, 2020 · 14 comments
Closed

Rethinking identifiers #3995

bhousel opened this issue Jun 19, 2020 · 14 comments
Labels
enhancement Actionable - add an enhancement to the source code

Comments

@bhousel
Copy link
Member

bhousel commented Jun 19, 2020

I'd like to make some changes in the index to better support more kinds of named POIs and more granular locations. This is just a brain dump of some thoughts on how we track brands and the limitations with our current approach.

a long time ago

When this project started many years ago, each entry in NSI was just a unique name that we picked out of the OSM planet file.
For example, here's what NSI looked like in 2017:

"Target": {
"count": 1104
},

This says that Target is an entry that with the name tag set to Target, and it has been used about 1000 times and sits under the shop/department_store hierarchy. iD would turn this into a preset that assigns the tags name=Target and shop=department_store.

Limitations of this:

  • couldn't track different things named "Target" (the USA and Australia ones are different)
  • couldn't add more OSM tags like brand:wikidata
  • support for alt versions of the tag and name was just - adding them to ignore lists
  • no countryCodes or any concept of where the brand was valid
  • all the names were in a single file that was starting to get too big

So we did a lot of work on the NSI in 2018-2019 to arrive at the current format.
I did a talk about it!

today

Each entry in NSI represents a unique "brand" identified by an string like:
key/value|name~(disambiguator)
The disambiguator part is optional and used in situations where distinct entities use the same literal name.

"shop/department_store|Target~(Australia)": {
"locationSet": {"include": ["au"]},
"matchTags": ["shop/supermarket"],
"nomatch": [
"shop/department_store|Target~(USA)"
],
"tags": {
"brand": "Target",
"brand:wikidata": "Q7685854",
"brand:wikipedia": "en:Target Australia",
"name": "Target",
"shop": "department_store"
}
},
"shop/department_store|Target~(USA)": {
"locationSet": {"include": ["us"]},
"matchNames": ["super target"],
"matchTags": ["shop/supermarket"],
"nomatch": [
"shop/department_store|Target~(Australia)"
],
"tags": {
"brand": "Target",
"brand:wikidata": "Q1046951",
"brand:wikipedia": "en:Target Corporation",
"name": "Target",
"shop": "department_store"
}
},

We solved the limitations from long ago, and NSI has really grown!

  • we can track any number of things with the same name (by using disambiguators)
  • we can attach whatever OSM tags we want to each entry
  • we can link each entry to brand:wikidata and fetch a wealth of related data from the Wikidata project (like logos)
  • we can support brands with alternate tagging and naming
  • we used countryCodes for a while, now locationSet which is even more flexible
  • we wrote code to fuzzy match the names and their alternates, and exported that code into a library for iD's validator to use
  • we made https://nsi.guide to browse it all, which is pretty rad

But now we have a new set of limitations:

So I'd like to think through how to rework the NSI entries to solve these limitations. More to come later...

@UKChris-osm
Copy link
Collaborator

When it comes to the name, I think it would be handy to have a way to mark this up as being "unique", after the brand has been added!

For entries like JD Wetherspoon (#3009) and Harvester (#4034) where the brand is strong, but the pub / restaurant itself is often uniquely named, rather than removing JD Wetherspoon or Harvester, keep them as they are, because they are strong brands that I feel people will type in first, but allow the NSI to flag to iD that the name is likely not "JD Wetherspoon" and has its own name, and this should be checked by the mapper.

I personally think it's better for a pub to be incorrectly named with the brand name as it's still how many people might refer to the pub, rather than have the brand not be suggested when someone searches, as they then may just add the brand them themselves, but may add it wrong, if that makes sense.

Meaning that even if the name is wrong, because it's named after the brand, at least that incorrect naming would be consistent, and so easier to find and correct, such as with a flag on the NSI guide web site that sees a branded entry with "JD Wetherspoon" as the name, and can flag it, rather than someone adding the name "Spoons", for example, and having that missed.

@1ec5
Copy link
Member

1ec5 commented Jul 29, 2020

FYI, there are quite a few Wikidata items now linking to NSI using the NSI identifier property. But the items can be bulk-updated (along with the property constraints) based on whatever decision we arrive at here.

bhousel added a commit that referenced this issue Aug 20, 2020
(re: #3995)

- currently using the format `simplename-hash`  (e.g. "starbucks-f83d44")
- where hash is MD5 fragment of `${tree} ${key} ${value} ${locationID}`

This should generate a reasonable identifier that stays stable until one of those changes.
Also we can eliminate disambiguators as long as same-named brands differ in one of these.
@bhousel
Copy link
Member Author

bhousel commented Aug 31, 2020

Quick update on some identifiers work that I did last Friday:

  • Each entry gets a generated id like "simplename-shorthash".
  • displayName is a separate thing, so we aren't relying so much on names and disambiguators. It can contain anything and can be used as the preset name in JOSM / iD presets.
  • The short hashes are stable until the key, value, or location change. This means that we can now just use the locationSet to disambiguate between same-named brands.
  • I'll keep the old ids around until nothing references them anymore (like Wikidata NSI identifiers)
{
  "brands/shop/craft": [
    {
      "displayName": "A.C. Moore",
      "id": "acmoore-286374",
      "locationSet": {"include": ["us"]},
      "oldid": "shop/craft|A.C. Moore",
      "tags": {
        "brand": "A.C. Moore",
        "brand:wikidata": "Q4647066",
        "brand:wikipedia": "en:A.C. Moore",
        "name": "A.C. Moore",
        "shop": "craft"
      }
    },
    {
      "displayName": "Hobby Lobby",
      "id": "hobbylobby-e90acf",
      "locationSet": {"include": ["in", "us"]},
      "oldid": "shop/craft|Hobby Lobby",
      "tags": {
        "brand": "Hobby Lobby",
        "brand:wikidata": "Q5874938",
        "brand:wikipedia": "en:Hobby Lobby",
        "name": "Hobby Lobby",
        "shop": "craft"
      }
    },
    {
      "displayName": "Hobbycraft",
      "id": "hobbycraft-ed2283",
      "locationSet": {"include": ["gb"]},
      "matchTags": ["shop/art"],
      "oldid": "shop/craft|Hobbycraft",
      "tags": {
        "brand": "Hobbycraft",
        "brand:wikidata": "Q16984508",
        "brand:wikipedia": "en:Hobbycraft",
        "name": "Hobbycraft",
        "shop": "craft"
      }
    },
...

@Adamant36 Adamant36 added the enhancement Actionable - add an enhancement to the source code label Sep 11, 2020
@bhousel
Copy link
Member Author

bhousel commented Sep 15, 2020

I'm almost finished with this work. 🎉

There will be a bunch of conflicts to resolve, so I've disabled merging to master temporarily until I can reconcile all the recent PRs that have been merged with the new file structure and code. Wish me luck.. 😅

@bhousel
Copy link
Member Author

bhousel commented Sep 16, 2020

OK - the new files are merged in..

  • I removed the branch protection roles so we can merge PRs again.
  • I also took this opportunity to switch the default branch from master to main - update your local forks!
  • I still need to fix some things with the nsi.guide site, and generating the files in dist/*, so this will remain open until those are settled.

bhousel added a commit that referenced this issue Sep 18, 2020
to avoid unicode or right-to-left surprises.
Also split out the id generation into its own function.
(re: #3995)

This now exposed a bunch of brands that were duplicate in the index
so the duplicates have been removed.
(e.g. an item in English and an item in Thai for the same thing, but
both with same `name:en` and same locationSet)

The duplicate removal just happened to close #4106 also.
@bhousel
Copy link
Member Author

bhousel commented Sep 18, 2020

The new code seems to be working pretty ok! This unblocks a bunch of other things that I'll tackle soon.

I'd like to leave this open until all those P8253 Name Suggestion Identifier properties on Wikidata have been updated. I'd ideally like to make this an automatic thing that the build_wikidata.js script does for us.

@camelCaseNick
Copy link
Contributor

I'd like to leave this open until all those P8253 Name Suggestion Identifier properties on Wikidata have been updated. I'd ideally like to make this an automatic thing that the build_wikidata.js script does for us.

So not bulk uploading it now, to not prevent the testing of such an automation?

@bhousel
Copy link
Member Author

bhousel commented Sep 19, 2020

So not bulk uploading it now, to not prevent the testing of such an automation?

I don't understand your question, sorry.. Mostly I just haven't bulk updated the P8253's yet because I ran out of time yesterday to implement it.

@camelCaseNick
Copy link
Contributor

camelCaseNick commented Sep 19, 2020

I don't understand your question, sorry..

I meant, that if I, or anybody else, would upload it in bulk now, it might interfere with your idea to update them automatically in the build_wikidata.js as you couldn't test it if every identifier was set correctly to the new one already?

@bhousel
Copy link
Member Author

bhousel commented Sep 19, 2020

it might interfere with your idea to update them automatically in the build_wikidata.js as you couldn't test it if every identifier was set correctly to the new one already?

I don't think it would interfere - it's a one way update from NSI -> Wikidata. If you update them all now, the script would just have less to do later.

@bhousel
Copy link
Member Author

bhousel commented Sep 21, 2020

I updated the build_wikidata.js script to push our new ids to wikidata and removed the legacy NSI ids.
This is done 🎊

@UKChris-osm
Copy link
Collaborator

Great work @bhousel 👍

Can the new updates integrate into iD easily? Can a new release of the NSI be pushed to it anytime soon?

@bhousel
Copy link
Member Author

bhousel commented Sep 21, 2020

Can the new updates integrate into iD easily? Can a new release of the NSI be pushed to it anytime soon?

I don't know, sorry. I was removed from the iD project and no longer maintain it.

@Identitaet
Copy link
Collaborator

Probably the best thing to do is just open an issue on the iD page and see what's the answer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Actionable - add an enhancement to the source code
Projects
None yet
Development

No branches or pull requests

6 participants