-
-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dedupe Geonames records with WOF concordances #1606
Conversation
A common cause of deduplication errors is Geonames locality/localadmin records that start with 'City of'. Our name comparison logic is fairly conservative: it only looks at things like punctuation, diacriticals, etc. Otherwise, we have to consider names that are different meaning the underlying records represent genuinely different places. Getting too far away from this general stance could be dangerous, but we can handle specific outliers just fine. Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames `locality` record with just the name, (like 'New York'), and then a Geonames `localadmin` record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606
A common cause of deduplication errors is Geonames locality/localadmin records that start with 'City of'. Our name comparison logic is fairly conservative: it only looks at things like punctuation, diacriticals, etc. Otherwise, we have to consider names that are different meaning the underlying records represent genuinely different places. Getting too far away from this general stance could be dangerous, but we can handle specific outliers just fine. Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames `locality` record with just the name, (like 'New York'), and then a Geonames `localadmin` record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually this is great 👍
I'm not in love with the code style for some reason, maybe we can use lodash
to clean this up a bit? Something like this might be 'nicer'?
I'll add a couple in-line comments but yeah, we should add some tests and merge this at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth mentioning that it would be great it we could make this check more generalized.
For instance, if an OSM record, a WOF record and a GEONAMES record all shared a Wikidata concordance then we should be able to conflate all three using a third-party concordance.
In a more advanced strategy we could 'merge' concordances from the results so that 2nd and 3rd degree conflations adopted all synonyms regardless of which record they were defined on.
Had a shower thought about this, it might be easier & more generic to do it like this:
At a later stage we could add some logic to virally merge concordances to make them work across 3+ degree connections but that's probably out-of-scope for now. |
A common cause of deduplication errors is Geonames locality/localadmin records that start with 'City of'. Our name comparison logic is fairly conservative: it only looks at things like punctuation, diacriticals, etc. Otherwise, we have to consider names that are different meaning the underlying records represent genuinely different places. Getting too far away from this general stance could be dangerous, but we can handle specific outliers just fine. Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames `locality` record with just the name, (like 'New York'), and then a Geonames `localadmin` record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606
2dd4180
to
e305b87
Compare
153dc70
to
760106d
Compare
I've finished this off with a test, use of the I like the strategy you've described for how we might generalize concordance checks in the future. It's something to keep in mind for sure. I don't think it really solves a problem we see today: the only possible deduplication between GN/OSM would be venues, and so few of them have matching Wikidata concordances it might not be worth it. It also looks like Geonames records do not themselves have Wikidata concordances. So we would have to look up the data from Wikidata. For example, see La Sagrada Familia in OSM, Geonames, and Wikidata. |
These concordances can be trusted over any other signals and really help us remove lots of bad Geonames data.
760106d
to
533884c
Compare
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited.
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited.
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited. That said, there are no regressions in our acceptance tests, and quite a few improvements.
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited. That said, there are no regressions in our acceptance tests, and quite a few improvements.
This PR implements deduplication between WOF and Geonames when there's a Geonames concordance ID on the WOF record.
These concordances mean we can be fairly certain that the two records are the same, and that we don't even have to look at the name, or any other properties
This should be able to replace #1580, since the concordance method will work even for localities (there is special logic for locality/localadmin parent IDs for Geonames records from pelias/geonames#93 that made #1580 less effective then it should be).