Rethinking identifiers #3995

bhousel · 2020-06-19T15:57:07Z

I'd like to make some changes in the index to better support more kinds of named POIs and more granular locations. This is just a brain dump of some thoughts on how we track brands and the limitations with our current approach.

a long time ago

When this project started many years ago, each entry in NSI was just a unique name that we picked out of the OSM planet file.
For example, here's what NSI looked like in 2017:

name-suggestion-index/name-suggestions.json

Lines 8621 to 8623 in 50993d4

    
           "Target": { 
        
               "count": 1104 
        
           },

This says that Target is an entry that with the name tag set to Target, and it has been used about 1000 times and sits under the shop/department_store hierarchy. iD would turn this into a preset that assigns the tags name=Target and shop=department_store.

Limitations of this:

couldn't track different things named "Target" (the USA and Australia ones are different)
couldn't add more OSM tags like brand:wikidata
support for alt versions of the tag and name was just - adding them to ignore lists
no countryCodes or any concept of where the brand was valid
all the names were in a single file that was starting to get too big

So we did a lot of work on the NSI in 2018-2019 to arrive at the current format.
I did a talk about it!

today

Each entry in NSI represents a unique "brand" identified by an string like:
key/value|name~(disambiguator)
The disambiguator part is optional and used in situations where distinct entities use the same literal name.

name-suggestion-index/brands/shop/department_store.json

Lines 730 to 758 in 50278e5

    
           "shop/department_store|Target~(Australia)": { 
        
             "locationSet": {"include": ["au"]}, 
        
             "matchTags": ["shop/supermarket"], 
        
             "nomatch": [ 
        
               "shop/department_store|Target~(USA)" 
        
             ], 
        
             "tags": { 
        
               "brand": "Target", 
        
               "brand:wikidata": "Q7685854", 
        
               "brand:wikipedia": "en:Target Australia", 
        
               "name": "Target", 
        
               "shop": "department_store" 
        
             } 
        
           }, 
        
           "shop/department_store|Target~(USA)": { 
        
             "locationSet": {"include": ["us"]}, 
        
             "matchNames": ["super target"], 
        
             "matchTags": ["shop/supermarket"], 
        
             "nomatch": [ 
        
               "shop/department_store|Target~(Australia)" 
        
             ], 
        
             "tags": { 
        
               "brand": "Target", 
        
               "brand:wikidata": "Q1046951", 
        
               "brand:wikipedia": "en:Target Corporation", 
        
               "name": "Target", 
        
               "shop": "department_store" 
        
             } 
        
           },

We solved the limitations from long ago, and NSI has really grown!

we can track any number of things with the same name (by using disambiguators)
we can attach whatever OSM tags we want to each entry
we can link each entry to brand:wikidata and fetch a wealth of related data from the Wikidata project (like logos)
we can support brands with alternate tagging and naming
we used countryCodes for a while, now locationSet which is even more flexible
we wrote code to fuzzy match the names and their alternates, and exported that code into a library for iD's validator to use
we made https://nsi.guide to browse it all, which is pretty rad

But now we have a new set of limitations:

we're limited to using brand and brand:wikidata as the "key" even though other feature types might better be keyed off of operator or network
internationalization / localization is still pretty clumsy. We just create duplicate entries for each location where a brand uses different name tags (see what I did yesterday in c7373e4, and even Add Subway in China #3993 and Add Burger King in China #3994 which showed up as I was writing this)
disambiguation strings are kind of unnecessary now that locationSets can disambiguate between geographically distinct brands
- for example Target USA vs Target Australia
- In fact the matching code (and consequently iD's validator) basically ignores any brands with a disambiguator.
- downstream Warn if a brand tag is used in a non-matching country/region openstreetmap/iD#6989
We are still limited to assigning the same name tag on everything, which causes issues for some brands (see Wetherspoons Pub Add Wetherspoons Pub #3009, but also this has caused issues with things like hotels and auto dealerships)
- downstream problems with trying to add a name for post boxes that are tagged with brand openstreetmap/iD#6890 and Names with brands and qualifiers are not offered brand tags openstreetmap/iD#6561

So I'd like to think through how to rework the NSI entries to solve these limitations. More to come later...

The text was updated successfully, but these errors were encountered:

UKChris-osm · 2020-07-25T11:38:14Z

When it comes to the name, I think it would be handy to have a way to mark this up as being "unique", after the brand has been added!

For entries like JD Wetherspoon (#3009) and Harvester (#4034) where the brand is strong, but the pub / restaurant itself is often uniquely named, rather than removing JD Wetherspoon or Harvester, keep them as they are, because they are strong brands that I feel people will type in first, but allow the NSI to flag to iD that the name is likely not "JD Wetherspoon" and has its own name, and this should be checked by the mapper.

I personally think it's better for a pub to be incorrectly named with the brand name as it's still how many people might refer to the pub, rather than have the brand not be suggested when someone searches, as they then may just add the brand them themselves, but may add it wrong, if that makes sense.

Meaning that even if the name is wrong, because it's named after the brand, at least that incorrect naming would be consistent, and so easier to find and correct, such as with a flag on the NSI guide web site that sees a branded entry with "JD Wetherspoon" as the name, and can flag it, rather than someone adding the name "Spoons", for example, and having that missed.

1ec5 · 2020-07-29T18:26:53Z

FYI, there are quite a few Wikidata items now linking to NSI using the NSI identifier property. But the items can be bulk-updated (along with the property constraints) based on whatever decision we arrive at here.

(re: #3995) - currently using the format `simplename-hash` (e.g. "starbucks-f83d44") - where hash is MD5 fragment of `${tree} ${key} ${value} ${locationID}` This should generate a reasonable identifier that stays stable until one of those changes. Also we can eliminate disambiguators as long as same-named brands differ in one of these.

bhousel · 2020-08-31T18:15:10Z

Quick update on some identifiers work that I did last Friday:

Each entry gets a generated id like "simplename-shorthash".
displayName is a separate thing, so we aren't relying so much on names and disambiguators. It can contain anything and can be used as the preset name in JOSM / iD presets.
The short hashes are stable until the key, value, or location change. This means that we can now just use the locationSet to disambiguate between same-named brands.
I'll keep the old ids around until nothing references them anymore (like Wikidata NSI identifiers)

{
  "brands/shop/craft": [
    {
      "displayName": "A.C. Moore",
      "id": "acmoore-286374",
      "locationSet": {"include": ["us"]},
      "oldid": "shop/craft|A.C. Moore",
      "tags": {
        "brand": "A.C. Moore",
        "brand:wikidata": "Q4647066",
        "brand:wikipedia": "en:A.C. Moore",
        "name": "A.C. Moore",
        "shop": "craft"
      }
    },
    {
      "displayName": "Hobby Lobby",
      "id": "hobbylobby-e90acf",
      "locationSet": {"include": ["in", "us"]},
      "oldid": "shop/craft|Hobby Lobby",
      "tags": {
        "brand": "Hobby Lobby",
        "brand:wikidata": "Q5874938",
        "brand:wikipedia": "en:Hobby Lobby",
        "name": "Hobby Lobby",
        "shop": "craft"
      }
    },
    {
      "displayName": "Hobbycraft",
      "id": "hobbycraft-ed2283",
      "locationSet": {"include": ["gb"]},
      "matchTags": ["shop/art"],
      "oldid": "shop/craft|Hobbycraft",
      "tags": {
        "brand": "Hobbycraft",
        "brand:wikidata": "Q16984508",
        "brand:wikipedia": "en:Hobbycraft",
        "name": "Hobbycraft",
        "shop": "craft"
      }
    },
...

bhousel · 2020-09-15T21:20:30Z

I'm almost finished with this work. 🎉

There will be a bunch of conflicts to resolve, so I've disabled merging to master temporarily until I can reconcile all the recent PRs that have been merged with the new file structure and code. Wish me luck.. 😅

bhousel · 2020-09-16T15:39:54Z

OK - the new files are merged in..

I removed the branch protection roles so we can merge PRs again.
I also took this opportunity to switch the default branch from master to main - update your local forks!
I still need to fix some things with the nsi.guide site, and generating the files in dist/*, so this will remain open until those are settled.

to avoid unicode or right-to-left surprises. Also split out the id generation into its own function. (re: #3995) This now exposed a bunch of brands that were duplicate in the index so the duplicates have been removed. (e.g. an item in English and an item in Thai for the same thing, but both with same `name:en` and same locationSet) The duplicate removal just happened to close #4106 also.

bhousel · 2020-09-18T20:54:00Z

The new code seems to be working pretty ok! This unblocks a bunch of other things that I'll tackle soon.

I'd like to leave this open until all those P8253 Name Suggestion Identifier properties on Wikidata have been updated. I'd ideally like to make this an automatic thing that the build_wikidata.js script does for us.

camelCaseNick · 2020-09-19T01:58:07Z

I'd like to leave this open until all those P8253 Name Suggestion Identifier properties on Wikidata have been updated. I'd ideally like to make this an automatic thing that the build_wikidata.js script does for us.

So not bulk uploading it now, to not prevent the testing of such an automation?

bhousel · 2020-09-19T14:16:54Z

So not bulk uploading it now, to not prevent the testing of such an automation?

I don't understand your question, sorry.. Mostly I just haven't bulk updated the P8253's yet because I ran out of time yesterday to implement it.

camelCaseNick · 2020-09-19T16:46:08Z

I don't understand your question, sorry..

I meant, that if I, or anybody else, would upload it in bulk now, it might interfere with your idea to update them automatically in the build_wikidata.js as you couldn't test it if every identifier was set correctly to the new one already?

bhousel · 2020-09-19T18:55:56Z

it might interfere with your idea to update them automatically in the build_wikidata.js as you couldn't test it if every identifier was set correctly to the new one already?

I don't think it would interfere - it's a one way update from NSI -> Wikidata. If you update them all now, the script would just have less to do later.

…iers (re: #3995)

(re: #3995)

bhousel · 2020-09-21T20:22:53Z

I updated the build_wikidata.js script to push our new ids to wikidata and removed the legacy NSI ids.
This is done 🎊

UKChris-osm · 2020-09-21T20:37:37Z

Great work @bhousel 👍

Can the new updates integrate into iD easily? Can a new release of the NSI be pushed to it anytime soon?

bhousel · 2020-09-21T20:39:48Z

Can the new updates integrate into iD easily? Can a new release of the NSI be pushed to it anytime soon?

I don't know, sorry. I was removed from the iD project and no longer maintain it.

Identitaet · 2020-09-22T06:27:16Z

Probably the best thing to do is just open an issue on the iD page and see what's the answer

…"network" (re: #4034, #3009, #3995)

bhousel mentioned this issue Jun 22, 2020

IT: Sella Bank / Banca Sella #3967

Closed

bhousel mentioned this issue Jul 10, 2020

Vue cinemas #3420

Closed

kymckay pinned this issue Jul 10, 2020

bhousel mentioned this issue Jul 16, 2020

Is name required? #4034

Closed

bhousel mentioned this issue Aug 18, 2020

Add Quirónsalud hospital chain #4086

Closed

This was referenced Aug 25, 2020

Presets: Implicit ordering of presets from name suggestion list openstreetmap/iD#7927

Closed

Duplicate National Bank of Canada? #4099

Closed

Bump @ideditor/location-conflation from 0.4.0 to 0.5.0 #4101

Closed

Adamant36 added the enhancement Actionable - add an enhancement to the source code label Sep 11, 2020

bhousel added a commit that referenced this issue Sep 21, 2020

Add code to build_wikidata.js to automatically update the NSI identif…

d50f4e7

…iers (re: #3995)

bhousel added a commit that referenced this issue Sep 21, 2020

Perform wikidata updates deterministically, npm run wikidata

c8c2469

(re: #3995)

bhousel closed this as completed in cd930aa Sep 21, 2020

bhousel added a commit that referenced this issue Sep 22, 2020

Relax the "name" requirement, but require oneOf "brand", "operator", …

124521a

…"network" (re: #4034, #3009, #3995)

bhousel mentioned this issue Sep 22, 2020

Add Wetherspoons Pub #3009

Closed

bhousel unpinned this issue Sep 29, 2020

bhousel mentioned this issue Nov 2, 2020

Created "post_depot.json" #4638

Closed

bhousel mentioned this issue Dec 5, 2020

Generate iD presets #4718

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rethinking identifiers #3995

Rethinking identifiers #3995

bhousel commented Jun 19, 2020

UKChris-osm commented Jul 25, 2020

1ec5 commented Jul 29, 2020

bhousel commented Aug 31, 2020

bhousel commented Sep 15, 2020

bhousel commented Sep 16, 2020

bhousel commented Sep 18, 2020

camelCaseNick commented Sep 19, 2020

bhousel commented Sep 19, 2020

camelCaseNick commented Sep 19, 2020 •

edited

Loading

bhousel commented Sep 19, 2020

bhousel commented Sep 21, 2020

UKChris-osm commented Sep 21, 2020

bhousel commented Sep 21, 2020

Identitaet commented Sep 22, 2020

Rethinking identifiers #3995

Rethinking identifiers #3995

Comments

bhousel commented Jun 19, 2020

a long time ago

today

UKChris-osm commented Jul 25, 2020

1ec5 commented Jul 29, 2020

bhousel commented Aug 31, 2020

bhousel commented Sep 15, 2020

bhousel commented Sep 16, 2020

bhousel commented Sep 18, 2020

camelCaseNick commented Sep 19, 2020

bhousel commented Sep 19, 2020

camelCaseNick commented Sep 19, 2020 • edited Loading

bhousel commented Sep 19, 2020

bhousel commented Sep 21, 2020

UKChris-osm commented Sep 21, 2020

bhousel commented Sep 21, 2020

Identitaet commented Sep 22, 2020

camelCaseNick commented Sep 19, 2020 •

edited

Loading