Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedupe Geonames records with WOF concordances #1606

Merged
merged 2 commits into from
Mar 10, 2022
Merged

Conversation

orangejulius
Copy link
Member

@orangejulius orangejulius commented Feb 28, 2022

This PR implements deduplication between WOF and Geonames when there's a Geonames concordance ID on the WOF record.

These concordances mean we can be fairly certain that the two records are the same, and that we don't even have to look at the name, or any other properties

This should be able to replace #1580, since the concordance method will work even for localities (there is special logic for locality/localadmin parent IDs for Geonames records from pelias/geonames#93 that made #1580 less effective then it should be).

orangejulius added a commit that referenced this pull request Feb 28, 2022
A common cause of deduplication errors is Geonames locality/localadmin
records that start with 'City of'.

Our name comparison logic is fairly conservative: it only looks at
things like punctuation, diacriticals, etc. Otherwise, we have to
consider names that are different meaning the underlying records
represent genuinely different places.

Getting too far away from this general stance could be dangerous, but we
can handle specific outliers just fine.

Geonames records that start with 'City of' are one of these cases.
Often, there is a Geonames `locality` record with just the name, (like
'New York'), and then a Geonames `localadmin` record with the 'City of'
prefix. Usually only one of those records will have a WOF concordance,
so this is still helpful even combined with
#1606
orangejulius added a commit that referenced this pull request Feb 28, 2022
A common cause of deduplication errors is Geonames locality/localadmin
records that start with 'City of'.

Our name comparison logic is fairly conservative: it only looks at
things like punctuation, diacriticals, etc. Otherwise, we have to
consider names that are different meaning the underlying records
represent genuinely different places.

Getting too far away from this general stance could be dangerous, but we
can handle specific outliers just fine.

Geonames records that start with 'City of' are one of these cases.
Often, there is a Geonames `locality` record with just the name, (like
'New York'), and then a Geonames `localadmin` record with the 'City of'
prefix. Usually only one of those records will have a WOF concordance,
so this is still helpful even combined with
#1606
Copy link
Member

@missinglink missinglink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually this is great 👍

I'm not in love with the code style for some reason, maybe we can use lodash to clean this up a bit? Something like this might be 'nicer'?

I'll add a couple in-line comments but yeah, we should add some tests and merge this at some point.

helper/diffPlaces.js Outdated Show resolved Hide resolved
Copy link
Member

@missinglink missinglink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also worth mentioning that it would be great it we could make this check more generalized.

For instance, if an OSM record, a WOF record and a GEONAMES record all shared a Wikidata concordance then we should be able to conflate all three using a third-party concordance.

In a more advanced strategy we could 'merge' concordances from the results so that 2nd and 3rd degree conflations adopted all synonyms regardless of which record they were defined on.

@missinglink
Copy link
Member

Had a shower thought about this, it might be easier & more generic to do it like this:

  • decode the concordances, or create an empty {} for each record
  • append self-references, such as setting gn:id for geonames records within it's own concordances.
  • intersect all the concordance maps with each other, anything with an intersection size > 0 is a match

At a later stage we could add some logic to virally merge concordances to make them work across 3+ degree connections but that's probably out-of-scope for now.

orangejulius added a commit that referenced this pull request Mar 2, 2022
A common cause of deduplication errors is Geonames locality/localadmin
records that start with 'City of'.

Our name comparison logic is fairly conservative: it only looks at
things like punctuation, diacriticals, etc. Otherwise, we have to
consider names that are different meaning the underlying records
represent genuinely different places.

Getting too far away from this general stance could be dangerous, but we
can handle specific outliers just fine.

Geonames records that start with 'City of' are one of these cases.
Often, there is a Geonames `locality` record with just the name, (like
'New York'), and then a Geonames `localadmin` record with the 'City of'
prefix. Usually only one of those records will have a WOF concordance,
so this is still helpful even combined with
#1606
@orangejulius orangejulius marked this pull request as ready for review March 4, 2022 16:46
@orangejulius
Copy link
Member Author

I've finished this off with a test, use of the pelias-model codec, and a different code style. It now avoids any use of let, or typeof. The structure is also simpler and not as deeply nested as all the checks now bail out early.

I like the strategy you've described for how we might generalize concordance checks in the future. It's something to keep in mind for sure. I don't think it really solves a problem we see today: the only possible deduplication between GN/OSM would be venues, and so few of them have matching Wikidata concordances it might not be worth it.

It also looks like Geonames records do not themselves have Wikidata concordances. So we would have to look up the data from Wikidata. For example, see La Sagrada Familia in OSM, Geonames, and Wikidata.

These concordances can be trusted over any other signals and really help
us remove lots of bad Geonames data.
orangejulius added a commit that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.
orangejulius added a commit that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.
orangejulius added a commit that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
@orangejulius orangejulius merged commit 5d4617b into master Mar 10, 2022
@orangejulius orangejulius deleted the concordance-dedupe branch March 10, 2022 17:19
orangejulius added a commit that referenced this pull request Mar 10, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants