improved string normalization for deduplication #1370

missinglink · 2019-10-09T10:20:45Z

This PR improves the normalization function used for deduplication:

- remove diacritics
- improve test coverage
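
To illustrate the kind of normalization described above, here is a minimal sketch of diacritic removal. It is not the code from this PR: the function name `normalizeForDedupe` is hypothetical, and the lowercasing and trimming steps are assumptions added for illustration. It relies only on the standard `String.prototype.normalize` and a Unicode property escape.

```javascript
// Hypothetical helper for illustration only; not the Pelias implementation.
function normalizeForDedupe(input) {
  return String(input)
    .normalize('NFKD')       // decompose, e.g. "ç" becomes "c" + U+0327
    .replace(/\p{M}/gu, '')  // drop the combining marks left by decomposition
    .toLowerCase()           // assumption: comparison ignores case
    .trim();                 // assumption: surrounding whitespace is irrelevant
}

// Names that differ only by diacritics now compare equal.
console.log(normalizeForDedupe('Xaçmaz') === normalizeForDedupe('Xacmaz')); // true
console.log(normalizeForDedupe('Schöneberg')); // "schoneberg"
```

Letters that have no Unicode decomposition, such as "ø" or "ß", pass through this sketch unchanged, so a real implementation may need extra handling for them.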

missinglink · 2019-10-09T11:12:25Z

I staged this to take a look and there were three notable changes:

Aggregate test results
Pass: 610
Improvements: 1
Fail: 115
Placeholders: 0
Regressions: 2
Took 114607ms
Test success rate 99.72%

On inspection, all three appear to be positive changes; the ones marked as regressions fail only because the test cases are brittle.

General Placeholder requests
  ✘ regression [17] "{"text":"Schöneberg, DE"}": score 2 out of 3
  diff:
    gid
      expected: whosonfirst:locality:101907543
      actual:   whosonfirst:locality:1126037037

search city/country
 ✔ improvement [15] "{"text":"Akmenė, lithuania"}"

Who's on First regions
  ✘ regression [8] "{"text":"Xaçmaz, Azerbaijan","sources":"wof"}": score 3 out of 5
  diff:
    layer
      expected: region
      actual:   locality
    name
      expected: Xacmaz
      actual:   Xaçmaz

orangejulius · 2019-11-07T19:37:14Z

Looks good to me. Is there anything else to do here before merging?

Joxit

LGTM

test/unit/helper/diffPlaces.js

missinglink · 2019-11-14T09:04:27Z

I fixed some test issues, and the code now also uses our fancy new unicode.normalize() method to guard against potential Unicode issues 🎉

This is good to merge once the failing test cases I mentioned above have been investigated.

missinglink · 2021-03-14T21:26:57Z

Force-pushed to rebase onto origin/master.

missinglink · 2021-03-14T22:13:51Z

before

Aggregate test results
Pass: 615
Improvements: 10
Fail: 123

after

Aggregate test results
Pass: 645
Improvements: 5
Fail: 128

There is one notable test failure, but I switched the dev API to the same hash as prod and the failure is there too, so it's not caused by this work 👍

search focus.point
  ✘ regression [1] "/v1/search?focus.point.lat=42.64&focus.point.lon=-84.7&text=10010": score 3 out of 4
  diff:
    priorityThresh is 1 but found at position 2

https://pelias.github.io/compare/#/v1/search?text=10010
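
For readers unfamiliar with the failure message above, here is a rough sketch of what a priority-threshold check does: the expected result must appear within the first `priorityThresh` results. The function `checkPriorityThresh`, matching on `gid`, and the sample data are all assumptions made for illustration; this is not the test runner's actual code.

```javascript
// Hypothetical check modelled on the failure message; not the real test runner.
function checkPriorityThresh(results, expectedGid, priorityThresh) {
  // 1-based position of the expected result, or 0 when it is absent.
  const position = results.findIndex((r) => r.gid === expectedGid) + 1;
  if (position === 0) {
    return { pass: false, message: 'expected result not found' };
  }
  if (position > priorityThresh) {
    return {
      pass: false,
      message: `priorityThresh is ${priorityThresh} but found at position ${position}`,
    };
  }
  return { pass: true, message: 'ok' };
}

// Made-up results: the expected record is ranked second, so a threshold of 1 fails.
const sample = [{ gid: 'example:a' }, { gid: 'example:b' }];
console.log(checkPriorityThresh(sample, 'example:b', 1).message);
// "priorityThresh is 1 but found at position 2"
```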

missinglink · 2021-03-14T22:15:14Z

The three failing test cases mentioned in the previous comment are no longer failing.

missinglink requested a review from orangejulius October 9, 2019 11:13

missinglink mentioned this pull request Oct 9, 2019

dedupe placetype in name #1371

Merged

Joxit approved these changes Nov 8, 2019

missinglink commented Nov 14, 2019

test/unit/helper/diffPlaces.js (two review threads, outdated and resolved)

missinglink force-pushed the dedupe-normalization branch from 54d1c34 to 8d34ffa Compare November 14, 2019 09:02

orangejulius force-pushed the dedupe-normalization branch from 8d34ffa to d7f7b16 Compare March 12, 2021 22:50

missinglink added 2 commits March 15, 2021 10:25

fix(dedupe): improved string normalization for deduplication

98f2452

test(dedupe): additional test coverage

8b5875f

missinglink force-pushed the dedupe-normalization branch from d7f7b16 to 8b5875f Compare March 14, 2021 21:26

missinglink merged commit 45ec00d into master Mar 14, 2021

missinglink deleted the dedupe-normalization branch March 14, 2021 22:15

missinglink added a commit to pelias/acceptance-tests that referenced this pull request Mar 15, 2021

Update test statuses from pelias/api#1370

94b12f7

missinglink mentioned this pull request Mar 15, 2021

Update test statuses from https://github.com/pelias/api/pull/1370 pelias/acceptance-tests#548

Closed
