Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synonyms for french addresses #301

Closed
adefarge opened this issue Jun 8, 2018 · 23 comments
Closed

Synonyms for french addresses #301

adefarge opened this issue Jun 8, 2018 · 23 comments

Comments

@adefarge
Copy link

adefarge commented Jun 8, 2018

Hi,
French addresses don't work when using the search feature with abbreviations.
For example:

  • boulevard saint-martin works
  • bd saint-martin doesn't work
  • boulevard saint martin doesn't work either

Ideally, we would like to find the address with the input bd st martin.

I see two problems here:

  1. The synonyms in custom_street.txt don't work because the peliasStreetTokenizer doesn't split on whitespace. But I see you are fixing this in Portland synonyms #291
  2. saint martin doesn't match saint-martin because of the hyphen. This can be solved by adding - to the name and street tokenizer.
  3. Libpostal sometimes parses martin as the neighbourhood or the city instead of the street. But that is another issue.

A more complete list of localized synonyms is also needed but that is less a problem because we can just add our own in our local install of pelias.

What do you think ?

@missinglink
Copy link
Member

This all sounds good.

I am always nervous about changing the tokenizers because it changes the system so much.

But, yes; in this case, I think this change would be for the best.

In order to do this change safely we will need to run a build and test to ensure we haven't missed any edge cases.

I would suggest we start a new branch which is clean, from there we make the minimal amount of changes required to improve peliasStreetTokenizer and then go from there.

@orangejulius
Copy link
Member

On that note, I'm going to split out just the change to tokenize on whitespace from #291, so that we can test that change individually as well.

@missinglink
Copy link
Member

This would also be a good opportunity to start putting together a French test suite.

@adefarge
Copy link
Author

Okay,
For the test suite, were you talking about acceptance tests, or tests in the schema module ?
I can try to put together a few test cases for french addresses, but I will need to know what kind of data you need (just the name of a street, a full address, an OSM link ?).

I suppose you'll want to do the change to the tokenizer yourself ? Please tell me if I can help with anything.

@loicortola
Copy link

@missinglink, following up on this topic, this might be a game-changer for the adoption of Pelias in the french community :) .

@orangejulius any update on the split of #291 ?

@adefarge is available to work for a few weeks on the project, are you guys open to getting organized so that we take advantage of the momentum to safely make and test those schema changes?

If yes, please let us know how :)

Cheers!

@orangejulius
Copy link
Member

orangejulius commented Jun 26, 2018

@adefarge, we are looking for primarily french acceptance-tests. we need your local knowledge :) of course we will not say no to unit tests

@loicortola yes, if you want to do a video/phone chat sometime soon to organize, that would be great. Feel free also to join the community call later this week.

I have now split out just the whitespace tokenizing change from #291 in #307 and we can follow up with more parts of that PR.

I'd also very much like to investigate tokenizing on - and other similar characters.

@Joxit
Copy link
Member

Joxit commented Jun 26, 2018

That's cool :D
We will try to be with you Wednesday!

@missinglink
Copy link
Member

Yes, very interested in this, apologies if I haven't been giving it the attention it deserves.

Let's set it as an agenda item for the community call :)

@adefarge
Copy link
Author

adefarge commented Jul 3, 2018

Just to summarize, we need to :

  • make a build with just France.
  • make an acceptance test suite for France.
  • run our acceptance tests and some performance tests with and without Portland synonyms #291
  • report to you with the results and the good news that it all works well ! :)

We will do that on one of our servers on the next couple of weeks.

@orangejulius
Copy link
Member

@adefarge that would be great. it's only important that the build includes france. so if you have a europe or even global build already, that's fine too

@adefarge
Copy link
Author

adefarge commented Jul 3, 2018

Good point, but the full planet build we have is the one we use in production which is a big machine.
The machine we're going to use for the tests won't have the same performance, so the performance comparison will probably not be very accurate...

@orangejulius
Copy link
Member

Relative performance comparison should be ok to get us to the point where we can merge to master :)

@adefarge
Copy link
Author

adefarge commented Jul 16, 2018

I added a new test suite here pelias/acceptance-tests#477.

Here is the result with a build based on pelias/schema:master with data from france.

French addresses
[...]
  ✔ [1] "{"text":"7 Rue Pointeau du Ronceray Rennes"}"
  ✔ [2] "{"text":"20 Boulevard Saint-Martin, Paris"}"
  ✘ [2-abbrev] "{"text":"20 bd saint-martin paris"}": score 2 out of 4
  diff:
    housenumber
      expected: 20
      actual:   
    street
      expected: Boulevard Saint-martin
      actual:   
  ✘ [2-abbrev-hyphen] "{"text":"20 bd st martin paris"}": score 2 out of 4
  diff:
    housenumber
      expected: 20
      actual:   
    street
      expected: Boulevard Saint-martin
      actual:   
  ✔ [3] "{"text":"Avenue du Maine Paris"}"
  ✘ [3-abbrev] "{"text":"av du maine paris"}": score 2 out of 3
  diff:
    street
      expected: Avenue du Maine
      actual:   
  ✔ [4] "{"text":"Boulevard de la Liberte, Rennes"}"
  ✔ [5] "{"text":"st germain les arpajon"}"
  ✔ [6] "{"text":"14 impasse du parc st germain les arpajon "}"
  ✘ [7] "{"text":"6 r de bellevue mur de bretagne"}": score 2 out of 4
  diff:
    housenumber
      expected: 6
      actual:   
    street
      expected: Rue De Bellevue
      actual:   
  ✘ [8] "{"text":"r gay lussac paris"}": score 2 out of 3
  diff:
    street
      expected: Rue Gay-Lussac
      actual:   
[...]

Aggregate test results
Pass: 6
Improvements: 0
Fail: 5
Placeholders: 0
Regressions: 0
Took 810ms
Test success rate 100%

0 regressions detected. All good.

And here is the result with a build based on pelias/schema:portland_synonyms.

[...]
French addresses
  ✔ [1] "{"text":"7 Rue Pointeau du Ronceray Rennes"}"
  ✔ [2] "{"text":"20 Boulevard Saint-Martin, Paris"}"
  ✔ improvement [2-abbrev] "{"text":"20 bd saint-martin paris"}"
  ✘ [2-abbrev-hyphen] "{"text":"20 bd st martin paris"}": score 2 out of 4
  diff:
    housenumber
      expected: 20
      actual:   
    street
      expected: Boulevard Saint-martin
      actual:   
  ✔ [3] "{"text":"Avenue du Maine Paris"}"
  ✔ improvement [3-abbrev] "{"text":"av du maine paris"}"
  ✔ [4] "{"text":"Boulevard de la Liberte, Rennes"}"
  ✔ [5] "{"text":"st germain les arpajon"}"
  ✔ [6] "{"text":"14 impasse du parc st germain les arpajon "}"
  ✔ improvement [7] "{"text":"6 r de bellevue mur de bretagne"}"
  ✘ [8] "{"text":"r gay lussac paris"}": score 2 out of 3
  diff:
    street
      expected: Rue Gay-Lussac
      actual:   
[...]

Aggregate test results
Pass: 6
Improvements: 3
Fail: 2
Placeholders: 0
Regressions: 0
Took 887ms
Test success rate 100%

0 regressions detected. All good.

The execution times vary a lot but are roughly identical.
Do you need me to do more in depth performance tests (like with gatling) ? Or is this enough ?

Both builds are available on:

Both have data from France, NYC and London (except for polylines which is only for France as the import script doesn't work for multiple osm files). Feel free to test them as you like. We will keep them online for one or two weeks.

@orangejulius
Copy link
Member

This is great.

Against the current geocode.earth build, the first two tests fail, as it looks like the OSM and OA records are not returned in the same order as your build (this sort of thing often happens as the order of two equally-scoring documents is randomly determined depending on the build)
image

http://pelias.github.io/compare/#/v1/search%3Ftext=20%20Boulevard%20Saint-Martin,%20Paris

This is actually great and I think it exposes two separate bugs:
1.) The API deduplication logic should handle minor capitalization differences
2.) The OA importer should (possibly) capitalize letters after a hyphen. However, this one's complicated.

@adefarge
Copy link
Author

After looking at the source code, the dedupe logic should already handle capitalization differences.
https://github.com/pelias/api/blob/6bff132e695130a3c6998ca8a501575e1d23ae97/helper/diffPlaces.js#L174

This is the result on our build :
image
With this in the logs:

info: [api] [dupe][skipping] query=20 Boulevard Saint-Martin, Paris, previous=openaddresses, hit=20 Boulevard Saint-Martin openstreetmap:node:921648356

Apparently there is 1 result from OSM and 2 from OA. But there is only one deduplication between the result from OSM and one of the result from OA.

Anyway, this should not make the tests fail.
Another solution would maybe be to capitalize letters as a post processing controller ? To ensure minor things like syntax and capitalization are standardized.
Or we could just make the acceptance tests not care about capitalization in the first place. In the end, the user doesn't care if there is a capital letter after the hyphen or not so maybe it should just not be tested.

@Joxit
Copy link
Member

Joxit commented Jul 17, 2018

Hello there,

Here is a link with the compare for the geocode queries on our two version: https://pelias.jawg.io/.

@orangejulius
Copy link
Member

Here is a link with the compare for the geocode queries on our two version: https://pelias.jawg.io/.

Wow very nice. I see you are using your own tiles too :)

@Joxit
Copy link
Member

Joxit commented Jul 17, 2018

Thanks, yes, I changed the links to use our tiles. :)

If you need tiles for pelias demos, we're here and it's free 😉

@adefarge
Copy link
Author

Hi @missinglink @orangejulius
How is the pull request #291 going ? Do you need anything else ?

@orangejulius
Copy link
Member

Hey @adefarge. We're going to kick off a full planet build this week with these changes. Stay tuned :)

@orangejulius
Copy link
Member

Hi everyone!
It's been a while since we've updated progress. Here's where things currently stand. We have #310 in progress which merges a change to split our street field on whitespace (in addition to adding a new synonym system). We found that PR has very bad performance issues (at least partially because of common tokens like st now being generated).

However, we've found that there's an Elasticsearch query parameter called cutoff_frequency, that was previously unknown to us, and seems to essentially completely solve the issue of performance with such common tokens. We're testing that out in pelias/api#1213, and since it's merely an API change, it should be easy to test. Feedback on the performance or result quality of that PR would be appreciated.

Once #310 is merged, a change to tokenize on hyphens as well would hopefully be straightforward.

orangejulius added a commit to pelias/acceptance-tests that referenced this issue Apr 26, 2019
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Sep 12, 2019
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Sep 18, 2019
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Sep 18, 2019
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Jun 30, 2020
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Jun 30, 2020
French addresses are relevant here:
pelias/schema#301
@Joxit
Copy link
Member

Joxit commented Jul 13, 2020

This should be fixed with #453

In fact only boulevard saint martin on search is not working, but this is a libpostal issue....

{
  "parser": "libpostal",
  "parsed_text": {
    "street": "boulevard saint",
    "city": "martin"
  }
}
0) Martin, Slovaquie
1) Martil, Maroc
2) Martin, TN, USA
3) Martil, Maroc
4) comté de Martin, FL, USA
5) District de Martin, Slovaquie
6) Martin, NC, USA
7) comté de Martin, MN, USA
8) comté de Martin, KY, USA
9) comté de Martin, IN, USA

The result on autocomplete is

{
 "parser": "pelias",
  "parsed_text": {
    "subject": "boulevard",
    "locality": "boulevard",
    "region": "saint martin",
    "admin": "saint martin"
  }
}
0) Boulevard Saint-Martin, Paris, France
1) St. Martin b.L. Grubhof, Lofer, Autriche
2) St. Martin b.L. Lamprechtshöhlen, Sankt Martin bei Lofer, Autriche
3) St. Martin b.L. Vorderkaserklamm, Sankt Martin bei Lofer, Autriche
4) Boulevard Saint-Martin, Villers-Bretonneux, France
5) Boulevard Saint-Martin, St.-Martin-des-Champs, France
6) Boulevard Saint-Martin, Laval, QC, Canada
7) Boulevard Saint-Martin, Vitré, France
8) Boulevard Saint-Martin, Pessac, France
9) Boulevard Saint-Martin, Givry, France

So I close this 🎉

@Joxit Joxit closed this as completed Jul 13, 2020
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Jul 13, 2020
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Jul 13, 2020
French addresses are relevant here:
pelias/schema#301
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Jul 13, 2020
orangejulius added a commit to pelias/acceptance-tests that referenced this issue Jul 13, 2020
French addresses are relevant here:
pelias/schema#301
@missinglink
Copy link
Member

related: openvenues/libpostal#499

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants