-
-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
deduplication - add isEquivalentIdentity() method #1580
base: master
Are you sure you want to change the base?
Conversation
@@ -124,20 +124,69 @@ function isNameDifferent(item1, item2, requestLanguage){ | |||
// iterate over all the languages in item2, comparing them to the | |||
// 'default' name of item1 and also against the language requested by the user. | |||
for( let lang in names2 ){ | |||
if (!names2.hasOwnProperty(lang)) { continue; } // do not iterate over inherited properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these hasOwnProperty
checks snuck in to the PR, they are not strictly related but considered good practice
this PR replaces #1554 which was much more complicated, this method is both easier to implement/test and also more accurate than string checking the various name fields. |
@@ -174,6 +223,7 @@ function isAddressDifferent(item1, item2){ | |||
* Optionally provide $requestLanguage (req.clean.lang.iso6393) to improve name deduplication. | |||
*/ | |||
function isDifferent(item1, item2, requestLanguage){ | |||
if( isEquivalentIdentity( item1, item2 ) ){ return false; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth pointing out here that isEquivalentIdentity
is different (no pun intended) than all the other checks in the isDifferent
function.
All the others are each sufficient to prove that items are different. isEquivalentIdentity
, on the other hand, is sufficient to prove that the items are the same.
I'm honestly surprised it took us this long to have such a check but we might want to divide the functions within isDifferent
into two appropriate sections with a comment or something to help us keep track of them.
Cool, this looks pretty easy to read and understand. And it definitely looks like a sound approach. We might want to add a test with some real world data to our deduplication tests to make sure we handle it, right? Also, do you have any compare links handy that show off queries that should be improved by this PR? |
Ah, bad news. The acceptance tests show a bunch of regressions from this PR, looks like we need to restrict the comparison to records at the same layer. Otherwise, for example, a city called Asia in the continent of Asia can replace the record for the continent: |
hmm.. it should already be doing that, I'll have to pick this up again next week 🤔 |
added via rebase diff --git a/helper/diffPlaces.js b/helper/diffPlaces.js
index 840f0203..d11ef0ba 100644
--- a/helper/diffPlaces.js
+++ b/helper/diffPlaces.js
@@ -168,6 +168,9 @@ const parentGID = (item, layer) => {
*/
function isEquivalentIdentity(item1, item2) {
+ // Both items must be on the same layer
+ if (item1.layer !== item2.layer) { return false; }
+
// Generate a GID value for each item
const gid1 = GID(item1);
const gid2 = GID(item2); |
d63073e
to
ec7def0
Compare
This PR adds a new method to
helper/diffPlaces.js
calledisEquivalentIdentity()
.Detecting Equal Identity is trivial, if two records have the same GID they are the same thing...
but there is also a case where record A has an equivalent identity to the parent of record B at the same layer.
An example is something like this, these two records are distinct:
... however with this PR the following two records are considered equivalent:
This is particularly relevant for inter-source deduplication as, for example, a geonames 'locality' will have a WOF 'locality' in its parent hierarchy (from PIP). In that case we can dedupe the WOF record with the GN record containing the WOF record as a parent.
note: this is backwards compatible with indices prior to pelias/schema#459 since an empty or absent
*_source
will be considered to have the value'whosonfirst'
.