Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dedupe placetype in name #1371

Merged
merged 1 commit into from
Mar 7, 2022
Merged

dedupe placetype in name #1371

merged 1 commit into from
Mar 7, 2022

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Oct 9, 2019

note: branched off #1370, view diff

this PR adds some logic which allows strings such as City of Philadelphia and Philadelphia to match for the sake of deduplication.

the normalization function is only applied for the applicable layers, so in the case above the City of Philadelphia must be on the locality layer in order to be considered for normalization.

I've added a bunch of tests and also ensured that the original data is not mutated so the display labels will remain unaffected by the deduplication normalization.

Closes pelias/geonames#372
Resolves pelias/geonames#395

@missinglink missinglink force-pushed the dedupe-placetype-in-name branch 3 times, most recently from 212e206 to a573174 Compare October 9, 2019 13:24
@missinglink
Copy link
Member Author

This looks quite nice:

             master                                                         this branch

 1) Philadelphia, PA, USA					 1) Philadelphia, PA, USA
 2) Philadelphia, PA, USA				       	 2) Philadelphia, Suffolk, VA, USA
 3) Philadelphia County, PA, USA			       	 3) New Philadelphia, OH, USA
 4) City of Philadelphia, PA, USA			       	 4) City of Philadelphia, MS, USA
 5) Philadelphia, Suffolk, VA, USA			       	 5) Philadelphia, MS, USA
 6) New Philadelphia, OH, USA				       	 6) Philadelphia, Jamaica
 7) City of New Philadelphia, OH, USA			       	 7) Philadelphia, NY, USA
 8) City of Philadelphia, MS, USA			       	 8) Village of Philadelphia, NY, USA
 9) Philadelphia, MS, USA				       	 9) Borough of New Philadelphia, PA, USA
10) Philadelphia, Jamaica				       	10) New Philadelphia, PA, USA

@missinglink missinglink force-pushed the dedupe-placetype-in-name branch 2 times, most recently from 0de2b03 to b4b6d94 Compare October 9, 2019 13:40
@missinglink
Copy link
Member Author

I added an additional commit to treat the localadmin layer with the same normalization we use for locality which helps a bit.

Looking at the acceptance tests there are some changes, mostly positive I believe (this includes some changes from the base branch too):

Aggregate test results
Pass: 604
Improvements: 1
Fail: 115
Placeholders: 0
Regressions: 8
Took 109862ms
Test success rate 98.9%

The only one which stood out to me as wrong was:

/v1/autocomplete?text=california

master:
 1) California, USA

this branch:
1) California City, CA, USA

... although this is likely being caused by some other code and is being triggered by improved normalization

@orangejulius
Copy link
Member

Anything left to do here?

@missinglink
Copy link
Member Author

This PR got stale and I don't remember anything about it any longer!
In order to merge, it just requires acceptance testing.

@missinglink
Copy link
Member Author

merged #1370 and rebased against origin/master

@orangejulius
Copy link
Member

I just rebased this, and despite being 3 years old it applies just fine.

We can test this out and hopefully it can be merged to help with Geonames related deduplication issues.

@orangejulius orangejulius force-pushed the dedupe-placetype-in-name branch 2 times, most recently from 3a053bc to 51c0a7e Compare March 4, 2022 16:50
orangejulius added a commit that referenced this pull request Mar 7, 2022
It's common for US states to have either a county or city within them
that shares a name (with minor possible differences).

For example:
Arksansas City in Arkansas
Hawaii County in Hawaii
Iowa City in Iowa
California City in California
New York City in New York (of course!)

Our general deduplication logic considers an admin record that is
parented by a record sharing its name to be the same. This works well
for places like Singapore, Berlin, and Tokyo, which all have a city
(locality) that is conceptually the same as a region or country.

US states, however, are _not_ conceptually the same as any of these
cities, and they should pretty much always show up in results.

This PR adds a check that the record is _not_ a US state before
performing the general heierarchy checks. There's one exception: US
states can dedupe against other US states, so that Geonams and WOF
records can deduplicate themselves.

This PR allows us to merge #1371
@orangejulius
Copy link
Member

This is pretty much good to go. We want to add an exception for US States (#1614).

On the whole, this PR will now make quite a few places considered duplicates of each other. The 'Township' checks especially will deduplicate a lot of places. For example, in Ohio, there are 46 townships named 'Washington Township' and many of them also contain a city named 'Washington'. So we may eventually need to follow up with some targeted exceptions.

@orangejulius orangejulius merged commit ad93fc4 into master Mar 7, 2022
@orangejulius orangejulius deleted the dedupe-placetype-in-name branch March 7, 2022 19:24
orangejulius added a commit that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.
orangejulius added a commit that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.
orangejulius added a commit that referenced this pull request Mar 8, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
orangejulius added a commit that referenced this pull request Mar 10, 2022
Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicate results caused by Geonames records with names like 'City of X' or 'Town of Y'
2 participants