address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet #77
Conversation
[update] unfortunately I had to change the index type for […]

merged with upstream and re-deployed to dev cluster
testing notes: peliasStreet - more emphasis on the exact street name

this analysis gives more priority to exact matches of the street name. In the example below, the old behaviour for […]. I think it would be better to give more emphasis to the correctness of the street name rather than the number; with the new analysis it still returns the same few results.

for further testing you can try any address; have a look at the long-tail (usually records 3-20), they should be different houses on the same street rather than completely different streets.

the next evolution of this strategy could be stricter enforcement of the country-code or regional segment of the query. there are also some options we have about completely removing results which don't match a minimum threshold of street and name, possibly removing anything that doesn't match both? this would reduce the results for this query to the top 3 only, a discussion for another day :)
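To make that "minimum threshold" idea concrete, here is a minimal hedged sketch of an Elasticsearch bool query that only keeps results matching both a street clause and a name clause. It is not the query pelias/api actually builds; the field names (`address.street`, `name.default`) and the example values are assumptions for illustration.

```js
// Hedged sketch only: NOT the query pelias/api generates.
// Illustrates dropping results that don't match both clauses,
// using minimum_should_match on a bool query. Field names are assumptions.
const strictAddressQuery = {
  query: {
    bool: {
      should: [
        { match: { 'address.street': 'hackney road' } }, // assumed field name
        { match: { 'name.default': 'hackney cycles' } }  // assumed field name
      ],
      // require both clauses to match, so partial hits are removed entirely
      minimum_should_match: 2
    }
  }
};
```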
testing notes: peliasStreet - better understanding of how street names are formatted

this analysis supports synonyms for street suffixes (such as 'street' -> 'st'), some compass abbreviations (such as 'west' -> 'w'), and removing "ordinals" from numbers (such as '26th' -> '26'). I think this is a no-brainer really, it makes surfacing street names much easier ;)

for further testing you can try any address in short and long form, with or without compass directions and numeric ordinals. the goal is for all variants of these address compositions to return equivalent results.
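As a rough illustration of the behaviours described above (suffix synonyms, compass abbreviations, ordinal stripping), here is a minimal hedged sketch in Elasticsearch analysis-settings form. The filter names, synonym lists and ordering are illustrative assumptions; the real definitions live in pelias/schema and may differ (for example, the PR description shows the output as a combined token "w 26 st").

```js
// Hedged sketch of a street-name analyzer; not the actual peliasStreet definition.
const streetAnalysisSketch = {
  analysis: {
    filter: {
      street_suffix_synonyms: {
        type: 'synonym',
        synonyms: ['street => st', 'road => rd', 'avenue => ave'] // sample subset only
      },
      directional_synonyms: {
        type: 'synonym',
        synonyms: ['north => n', 'south => s', 'east => e', 'west => w']
      },
      remove_ordinals: {
        type: 'pattern_replace',
        pattern: '([0-9]+)(st|nd|rd|th)\\b', // "26th" -> "26"
        replacement: '$1'
      }
    },
    analyzer: {
      street_sketch: {
        type: 'custom',
        tokenizer: 'whitespace',
        // "West 26th Street" -> ["w", "26", "st"] with this sketch;
        // the real analyzer may emit the combined form "w 26 st" instead.
        filter: ['lowercase', 'remove_ordinals', 'street_suffix_synonyms', 'directional_synonyms']
      }
    }
  }
};
```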
testing notes: peliasHouseNumber - ignore apartment numbers

this is still not ideal; this analysis is more of an interim solution, which I could take-or-leave. the idea is that if we remove non-numeric parts of the housenumber we can get better matches, eg. if the housenumber is entered as '100a' we only index the numeric part. I tested for ~15 mins and it doesn't seem to have much effect, be it positive or negative.

for further testing you can try any address with or without non-numeric sections. I may have missed some cases, please try some other combinations like "100/1" and "100 apt 2" etc. (more inspiration here). note: I logged this issue which makes testing this feature more difficult: pelias/api#355
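For readers unfamiliar with how the non-numeric parts get dropped, here is a minimal hedged sketch modelled on the `numeric` pattern_replace filter quoted from the diff further down; the analyzer name and exact wiring (char filter vs. token filter) are assumptions rather than the project's actual settings.

```js
// Hedged sketch of a housenumber analyzer; not the actual peliasHousenumber definition.
// Non-digit characters become spaces, then a whitespace tokenizer splits what's left.
const housenumberAnalysisSketch = {
  analysis: {
    char_filter: {
      numeric: {
        type: 'pattern_replace',
        pattern: '[^0-9]',   // same pattern as the "numeric" filter in the diff below
        replacement: ' '
      }
    },
    analyzer: {
      housenumber_sketch: {
        type: 'custom',
        char_filter: ['numeric'],
        tokenizer: 'whitespace'
        // "100a"      -> ["100"]
        // "100/1"     -> ["100", "1"]
        // "100 apt 2" -> ["100", "2"]
      }
    }
  }
};
```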
testing notes: peliasZip - handle different forms, remove punctuation

as with the previous notes, I can't get this to work; it's likely due to the way the address parser is configured. I tried to surface this record:

```
"name": "Hackney Cycles",
"housenumber": "507",
"street": "Hackney Road",
"postalcode": "E2 9ED",
```

... but searching for the postcode doesn't return it: http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=E29ED

this analysis will certainly produce better results for fully specified address queries, such as: "507 hackney rd, e29ed, london", if not for the issues noted below. the query created for the text "507 hackney rd, E29ED" is:

```
"query": {
  "text": "507 hackney rd, E29ED",
  "parsed_text": {
    "name": "507 hackney rd",
    "number": 507,
    "street": "hackney rd",
    "regions": [
      "E29ED"
    ],
    "admin_parts": "E29ED"
  }
}
```

some options for fixing this would be: […]

to be discussed: […]
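Separately from the open questions above, here is a minimal hedged sketch of what the "handle different forms, remove punctuation" behaviour amounts to, written in Elasticsearch analysis-settings form; the names are illustrative and not the project's actual definitions.

```js
// Hedged sketch of a postal-code analyzer; not the actual peliasZip definition.
// Punctuation and whitespace are stripped so "E2 9ED", "e2-9ed" and "E29ED"
// all index as the same token.
const zipAnalysisSketch = {
  analysis: {
    char_filter: {
      alphanumeric: {
        type: 'pattern_replace',
        pattern: '[^a-zA-Z0-9]', // drop anything that's not a letter or digit
        replacement: ''
      }
    },
    analyzer: {
      zip_sketch: {
        type: 'custom',
        char_filter: ['alphanumeric'],
        tokenizer: 'keyword',     // keep the whole postcode as one token
        filter: ['lowercase']
        // "E2 9ED" -> ["e29ed"], "E24-DN" -> ["e24dn"], "10 100" -> ["10100"]
      }
    }
  }
};
```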
testing notes: overall

I think that it's an improvement in all the analysis techniques; I didn't find any regressions or negative impacts, so I'd be happy to merge it as-is and then keep improving from there.

one interesting side-effect of this is that we started surfacing POIs solely based on their address; this is something @dianashk and I have been discussing for some time. eg. in the query below you can see 'Hackney Cycles' is now being returned by its address.

... this isn't working for surfacing […]
one acceptance test showed regression:

```
{
  "id": 3,
  "status": "pass",
  "user": "Harish",
  "type": "dev",
  "in": {
    "text": "450 w 37th st, new york, ny 11232"
  },
  "expected": {
    "properties": [
      {
        "name": "450 37th Street",
        "country_a": "USA",
        "country": "United States",
        "region": "New York",
        "region_a": "NY",
        "county": "Kings County",
        "localadmin": "Brooklyn",
        "locality": "New York",
        "neighbourhood": "Windsor Teraace",
        "postalcode": "11232",
        "housenumber": "450",
        "street": "37th Street",
        "label": "450 37th Street, Brooklyn, NY"
      }
    ]
  }
},
```

TBD if this is better or worse: http://pelias.github.io/compare/#/v1/search%3Ftext=450%20w%2037th%20st,%20new%20york,%20ny%2011232
I saw a [thread](https://trac.openstreetmap.org/ticket/5363) about a somewhat difficult address to parse in OSM (turned out to be because of a bad relation for the Las Vegas boundary), and figured I'd add the test cases they used, particularly to check out the new address schema changes from pelias/schema#77.
"numeric" : { | ||
"type" : "pattern_replace", | ||
"pattern": "[^0-9]", | ||
"replacement": " " |
Why do we remove non-alphanumeric characters (by replacing them with an empty string), but replace non-numeric characters with a space? (I'm probably missing something, not actually saying we should change how it's done)
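One way to read the difference, inferred from the examples in the PR description rather than an authoritative answer: postal codes are meant to collapse into a single token, while house numbers are meant to split into separate numeric tokens. A tiny hedged illustration of the two replacement strategies:

```js
// Illustrative only: plain-string approximations of the two replacement strategies.
const collapseForZip = (s) => s.toLowerCase().replace(/[^a-z0-9]/g, '');          // "" joins fragments
const splitForHousenumber = (s) => s.replace(/[^0-9]/g, ' ').trim().split(/\s+/); // " " keeps parts apart

console.log(collapseForZip('10 100'));     // "10100"  - one token, as in the PR description
console.log(collapseForZip('E24-DN'));     // "e24dn"
console.log(splitForHousenumber('100/1')); // [ '100', '1' ] - separate tokens
console.log(splitForHousenumber('100a'));  // [ '100' ]
```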
So I took a look at this using the fuzzy tests. They aren't the highest quality tests, but there are a lot of them, and indeed I found a few differences. In terms of pass/fail counts the two environments are equivalent, but there are some interesting changes.

Improvements
http://pelias.github.io/compare/#/v1/search%3Ftext=65c%20dana%20st

Interesting Regressions
http://pelias.github.io/compare/#/v1/search%3Ftext=1900%20chapala%20st (since there is no city specified this might not be a good search test, but assuming santa barbara, CA is the intended result then we fail weirdly for this one. Overall the matches favor exactly matching on the street name, which is probably good, yet the address match is not in the results, kinda confusing, we should figure out why this happens)

Probably Not Interesting Regressions
http://pelias.github.io/compare/#/v1/search%3Ftext=dfw

Not sure which (mostly minor changes)
http://pelias.github.io/compare/#/v1/search%3Ftext=louvre

It looks like, overall, we've improved address matching (with a few exceptions we can probably fix), at the expense mostly only of airports, which we should fix for real by adding ICAO and FAA codes and properly boosting. Unless someone can find other places where we regress it does seem like we should merge away :)
@orangejulius, that fuzzy test analysis is super helpful. Would love to hear your thoughts on the process of going through them. How can it be improved? If you find improvements, are we identifying them as fixed in the tests? Can we automate any of it? Not saying this would necessarily be addressed right away, but we should create some actionable issues to tackle later.
Looks like the reason we fail the search for 339 W Main St, Cheshire, CT 06410 as well as 301 Commons Park S, Stamford, CT 06902 is because we strip out leading […]. Create a separate issue.
http://pelias.github.io/compare/#/v1/search%3Ftext=1900%20chapala%20st and http://pelias.github.io/compare/#/v1/search%3Ftext=301%20Commons%20Park%20S,%20Stamford,%20CT%2006902 are interesting; they are due to the […]. For the first one it might be better to use […].

http://pelias.github.io/compare/#/v1/search%3Ftext=339%20W%20Main%20St,%20Cheshire,%20CT%2006410 is just completely parsed wrong.
@dianashk the process for going through them is super manual right now: I basically opened two terminals right next to each other, ran the tests against dev in one, prod in the other, and manually looked at all the differences (I tried calling diff on the actual output but it's so jumbled it's not possible to visualize the actual differences).
Here's a "funny" regression from @stephenkhess's awesome post office tests: For 208 1st Avenue Southwest, Ardmore, OK, 73401, we used to get the right result first, but weren't very consistent: other results were from all over the country. Now, we match very consistently on one street: the housenumbers all start with 208 and the postalcode is 73401, but it's in the Czech Republic! Update: the zip code data for this area is in OA but wasn't being used. I submitted openaddresses/openaddresses#1386 to fix it |
Added acceptance tests: pelias/acceptance-tests#159
@missinglink, if you don't have the double-space fix done yet it's cool. Let's merge this as-is and create a separate PR for the double-space fix.
I just added a new token filter to remove duplicate whitespace from street names, happy to merge this now.
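For reference, a minimal hedged sketch of what such a whitespace-deduplication filter could look like; the actual filter added in the commit may be named and wired differently in pelias/schema.

```js
// Hedged sketch only; the real filter in the commit may differ.
// Collapses runs of whitespace inside a token (useful when earlier filters
// leave double spaces behind, e.g. after removing an ordinal or a suffix).
const whitespaceFilterSketch = {
  analysis: {
    filter: {
      collapse_whitespace: {
        type: 'pattern_replace',
        pattern: '\\s{2,}',  // two or more whitespace characters...
        replacement: ' '     // ...become a single space
      }
    }
  }
};
```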
address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet
Add test cases for pelias/schema/pull/77
ok! so this is a big one, it fixes up all the analyzers we are using for the address fields. In each case it required building a whole new analyzer, so this PR contains 3x new types of token analysis.

note: the analysis only reflects the tokens in the inverted index, the value returned to the user is verbatim what was entered. The idea is that by homogenizing tokens we get better matching.

peliasZip: "E24-DN" -> ["e24dn"], "10 100" -> ["10100"]

peliasHousenumber: "100a" -> [100], "100/1" -> [100, 1]. It also sets the index type to 'integer' which opens it up for using numeric ranges for interpolation. This will probably need more work to remove the apartment numbers.

peliasStreet: "West 26th Street" -> ["w 26th st"]. This should result in much better street matching and avoid matching other records which also contain a similar token, such as 'street' or 'union' etc. This could be further improved by removing the ordinal suffixes (2nd, 3rd, etc.)

[edit] I pushed another commit to remove the ordinals, so now "West 26th Street" -> ["w 26 st"] 🍷

```
$ npm test
$ npm run integration
```

This should be 100% backwards compatible with the v0 api.

Fixes pelias/pelias#172
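As a closing aside, a hedged way to spot-check the token mappings listed above is Elasticsearch's _analyze API; the expected outputs below are copied from this description, while the index name and exact request shape (query parameters vs. JSON body, depending on the Elasticsearch version) are assumptions.

```js
// Hedged sketch: spot-checks for the token mappings claimed above, to be run
// against the _analyze API (e.g. GET /pelias/_analyze - index name assumed).
const analyzeChecks = [
  { analyzer: 'peliasZip',         text: 'E24-DN',           expected: ['e24dn'] },
  { analyzer: 'peliasZip',         text: '10 100',           expected: ['10100'] },
  { analyzer: 'peliasHousenumber', text: '100/1',            expected: ['100', '1'] },
  { analyzer: 'peliasStreet',      text: 'West 26th Street', expected: ['w 26 st'] }
];
```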