Deduplicate Geonames 'City of' prefixes #1609
Closed
A common cause of missed deduplication is Geonames locality/localadmin records that start with 'City of'.
Our name comparison logic is fairly conservative: it only normalizes things like punctuation, diacriticals, etc. Beyond that, we have to assume that names which differ mean the underlying records represent genuinely different places.
Getting too far away from this general stance could be dangerous, but we can handle specific exceptions just fine.
Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames locality record with just the name (like 'New York'), and then a Geonames localadmin record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606.
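
As a rough illustration of the exception described above, here is a minimal sketch in Python. It is not the code from this pull request: the helper names (`normalize_name`, `strip_city_of`, `names_match`) and the record shape (`name`, `source`, `layer` keys) are assumptions made for the example. It keeps the conservative comparison (case, punctuation, diacritics) and only drops a leading 'City of' for Geonames locality/localadmin records.

```python
import re
import unicodedata

# Hypothetical helpers for illustration only; names and record shape are assumed.
CITY_OF_PREFIX = re.compile(r"^city of\s+", re.IGNORECASE)


def normalize_name(name: str) -> str:
    """Conservative normalization: fold case, strip diacritics and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    without_punct = re.sub(r"[^\w\s]", "", without_marks)
    return " ".join(without_punct.lower().split())


def strip_city_of(name: str, source: str, layer: str) -> str:
    """Drop a leading 'City of' only for Geonames locality/localadmin records."""
    if source == "geonames" and layer in ("locality", "localadmin"):
        return CITY_OF_PREFIX.sub("", name.strip())
    return name


def names_match(a: dict, b: dict) -> bool:
    """Compare two records by name after the targeted prefix exception."""
    name_a = normalize_name(strip_city_of(a["name"], a["source"], a["layer"]))
    name_b = normalize_name(strip_city_of(b["name"], b["source"], b["layer"]))
    return name_a == name_b


locality = {"name": "New York", "source": "geonames", "layer": "locality"}
localadmin = {"name": "City of New York", "source": "geonames", "layer": "localadmin"}
print(names_match(locality, localadmin))
```

Scoping the prefix removal to one source and two layers keeps the general stance intact: a record from another source named 'City of Industry' would still be compared using its full name.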