address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet #77

Merged
merged 10 commits into from
Nov 6, 2015


missinglink
Member

ok! so this is a big one, it fixes up all the analyzers we are using for the address fields, in each case it required building a whole new analyzer, so this PR contains 3x new types of token analysis:

note the analysis only reflects the tokens in the inverted index, the value returned to the user is verbatim what was entered. The idea is that by homogenizing tokens we get better matching.

  • peliasZip: handles both numeric and alphanumeric postcodes; it lowercases them and removes punctuation and spaces. eg: "E24-DN" -> ["e24dn"], "10 100" -> ["10100"]
  • peliasHousenumber: still not ideal; for now it removes non-numeric parts. eg: "100a" -> [100], "100/1" -> [100,1]. It also sets the index type to 'integer', which opens it up to numeric ranges for interpolation. This will probably need more work to remove the apartment numbers.
  • peliasStreet: the most complex; it lowercases street names and then stems compass prefixes and street suffixes, producing a single token. eg: "West 26th Street" -> ["w 26th st"]. This should result in much better street matching and avoid matching other records that merely share a token such as 'street' or 'union'. It could be further improved by removing the ordinal suffixes (2nd, 3rd, etc.)

[edit] I pushed another commit to remove the ordinals so now "West 26th Street" -> ["w 26 st"] 🍷
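
For readers who want to poke at the street behaviour outside Elasticsearch, here is a rough Python sketch of the normalisation described above. It is not the analyzer from this PR: the function name `street_token` and the abbreviated synonym tables are assumptions made up for illustration, and the real lists live in the schema config.

```python
import re

# Hypothetical, abbreviated synonym tables (the real lists are not shown in this thread).
COMPASS = {"north": "n", "south": "s", "east": "e", "west": "w",
           "northeast": "ne", "northwest": "nw",
           "southeast": "se", "southwest": "sw"}
SUFFIXES = {"street": "st", "avenue": "ave", "road": "rd", "crescent": "cres"}

# Matches a number followed by an ordinal suffix, e.g. "26th", "2nd".
ORDINAL = re.compile(r"^(\d+)(st|nd|rd|th)$")


def street_token(name: str) -> list[str]:
    """Lowercase, abbreviate compass prefix and street suffix, drop ordinals."""
    words = name.lower().split()
    out = []
    for i, word in enumerate(words):
        match = ORDINAL.match(word)
        if match:
            word = match.group(1)             # "26th" -> "26"
        elif i == 0 and len(words) > 1:
            word = COMPASS.get(word, word)    # compass prefix
        elif i == len(words) - 1 and len(words) > 1:
            word = SUFFIXES.get(word, word)   # street suffix
        out.append(word)
    # The whole street name stays a single token.
    return [" ".join(out)]


print(street_token("West 26th Street"))  # ['w 26 st']
print(street_token("Main Avenue"))       # ['main ave']
```

Keeping the result as one token is what stops a query for "w 26 st" from matching every other record that happens to contain "st".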

$ npm test
$ npm run integration

This should be 100% backwards compatible with the v0 api

Fixes pelias/pelias#172

@missinglink
Member Author

[update] unfortunately I had to change the index type for peliasHousenumber from integer to string.
more info: https://discuss.elastic.co/t/analyzer-unassigned-when-using-integer-type/32007

@missinglink
Member Author

merged with upstream and re-deployed to dev cluster

@missinglink
Member Author

testing notes:

peliasStreet - more emphasis on the exact street name

this analysis gives more priority to exact matches of the street name.

In the example below, the old behaviour for text="40 west 26th street" was to return a few W 26th st records, then it starts returning things like "40 West 26th Circle" and "40 Northwest 26th Street".

I think it would be better to give more emphasis to the correctness of the street name rather than the number.

so with the new analysis it still returns the same few W 26th st records at the top; then the long-tail is more like "4010 West 26th Street", "404 West 26th Street" etc.

/v1/search?size=40&text=40 west 26th street
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=40%20west%2026th%20street

note: relevant results marked with **
input: "40 west 26th street"

before:
 1) 40 West 26th Street, Merced, CA
 2) 40 West 26th Street, Merced, CA
 3) 40 West 26th Street, Merced, CA
 4) 40 West 26th **Circle**, Fayetteville, AR
 5) 40 West 26th **Circle**, Fayetteville, AR
 6) 40 West 26th **Avenue**, Eugene, OR

after:
 1) 40 West 26th Street, Merced, CA
 2) 40 West 26th Street, Merced, CA
 3) 40 West 26th Street, Merced, CA
 4) 4010 West 26th Street, Chicago, IL
 5) 4020-4026 West 26th Street, Chicago, IL
 6) 4037 West 26th Street, Chicago, IL

for further testing you can try any address, have a look at the long-tail (usually records 3-20), they should be different houses on the same street rather than completely different streets.

the next evolution of this strategy could be stricter enforcement of the country-code or regional segment of the query.

there are also some options for completely removing results which don't match a minimum threshold of street and name, possibly removing anything that doesn't match both?

this would reduce the results for this query to the top 3 only, a discussion for another day :)

@missinglink
Member Author

testing notes:

peliasStreet - better understanding of how street names are formatted

this analysis supports synonyms for street suffixes such as cres == crescent

it also supports some compass abbreviations, such as north == n and southeast == se

.. and it also supports removing "ordinals" from numbers, such as 26th == 26

I think this is a no-brainer really, it makes surfacing street names much easier ;)

/v1/search?size=40&text=main ave
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=main%20ave

note: relevant results marked with **
input: "main ave"

before:
 1) 302 Main Ave W, Alberta, Canada
 2) 427 East Main Ave, Puyallup, WA
 3) 726 Main Ave Sourth, Brookings, SD
 4) 1412 East Main Ave, Puyallup, WA
 5) 556 Main Ave W, Alberta, Canada

after:
 1) Main Ave. Drugstore, Ebaiu, Philippines
 2) 2 Main Avenue, Ilfracombe, Australia
 3) 0 Main Avenue, Wareham, MA
 4) 5 Main Avenue, Wareham, MA
 5) 9 Main Avenue, INANDA, South Africa

/v1/search?size=40&text=30 w 26 st
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=30%20w%2026%20st

note: relevant results marked with **
input: "30 w 26 st"

before:
 1) 30 West 26 Street, Manhattan, NY
 2) 30 W 2nd St, National City, CA
 3) 30 26, Ad Dawhah, Qatar
 4) 30 26, Ad Dawhah, Qatar
 5) 30 26, Ciudad Nezahualcóyotl, Mexico

after:
 1) 30 West 26 Street, Manhattan, NY
 2) 30 West 26th Street, Manhattan, NY
 3) 30 West 26th Street, Merced, CA
 4) 30 West 26th Street, Merced, CA
 5) 30 West 26th Street, Manhattan, NY

for further testing you can try any address in short and long form, with or without compass directions and numeric ordinals.

the goal is for all variants of these address compositions to return equivalent results.

@missinglink
Member Author

testing notes:

peliasHousenumber - ignore apartment numbers

this is still not ideal, this analysis is more of an interim solution, which I could take-or-leave. the idea is that if we remove non-numeric parts of the housenumber we can get better matches.

eg. if the housenumber is entered as 100a it should be searchable as simply 100
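
A minimal sketch of that idea in Python, assuming the analysis boils down to replacing every non-digit with a space and splitting on whitespace (the `pattern_replace` filter quoted later in this thread points that way). The function name `housenumber_tokens` is hypothetical, and tokens are kept as strings since the index type ended up as string.

```python
import re


def housenumber_tokens(value: str) -> list[str]:
    """Replace every non-digit with a space, then split into tokens."""
    return re.sub(r"[^0-9]", " ", value).split()


print(housenumber_tokens("100a"))   # ['100']
print(housenumber_tokens("100/1"))  # ['100', '1']
```

Under this assumption "100a" and "100" index to the same token, which is why either form should surface the other.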

I tested for ~15 mins and it doesn't seem to have much effect, be it positive or negative:

/v1/search?size=40&text=100 cliff road, nantucket
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=100%20cliff%20road,%20nantucket

note: relevant results marked with **
input: "100 cliff road, nantucket"

before:
 1) 100 Cliff Road, Nantucket, MA
 2) 100A Cliff Road, Nantucket, MA

after:
 1) 100 Cliff Road, Nantucket, MA
 2) 100A Cliff Road, Nantucket, MA

/v1/search?size=40&text=100c cliff road, nantucket
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=100c%20cliff%20road,%20nantucket

note: relevant results marked with **
input: "100c cliff road, nantucket"

before:
 1) 100 Cliff Road, Nantucket, MA
 2) 100A Cliff Road, Nantucket, MA

after:
 1) 100A Cliff Road, Nantucket, MA
 2) 100 Cliff Road, Nantucket, MA

for further testing you can try any address with or without non-numeric sections.

I may have missed out some cases, please try some other combinations like "100/1" and "100 apt 2" etc. more inspiration here

note: I logged this issue which makes testing this feature more difficult: pelias/api#355

@missinglink
Member Author

testing notes:

peliasZip - handle different forms, remove punctuation

as with the previous notes, I can't get this to work; it's likely due to the way in which the address parser is configured.

I tried to surface Hackney Cycles with the postcode E2 9ED from this record:

"name": "Hackney Cycles",
"housenumber": "507",
"street": "Hackney Road",
"postalcode": "E2 9ED",

... but searching for E29ED yielded terrible results, I think we have to rethink the query for postcodes to de-emphasise the name before we are going to be able to show results solely on the postcode.

http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=E29ED
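
To make the intended token-level behaviour concrete, here is a small Python sketch of the postcode normalisation. It only mirrors the examples given in this PR; `zip_token` is a made-up name and the real work is done by an Elasticsearch analyzer, not by application code.

```python
import re


def zip_token(postcode: str) -> list[str]:
    """Lowercase, then drop everything that is not a letter or digit."""
    return [re.sub(r"[^a-z0-9]", "", postcode.lower())]


print(zip_token("E2 9ED"))  # ['e29ed']
print(zip_token("E29ED"))   # ['e29ed']
```

Both spellings reduce to the same token, so the mismatch seen in the search results above would have to come from the query side rather than from the index.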


this analysis will certainly produce better results for fully specified address queries, such as: "507 hackney rd, e29ed, london" if not for the issues noted below:


the query created for the text 507 hackney rd, E2 9ED is missing a section to match against the postalcode.

"query": {
  "text": "507 hackney rd, E29ED",
  "parsed_text": {
    "name": "507 hackney rd",
    "number": 507,
    "street": "hackney rd",
    "regions": [
      "E29ED"
    ],
    "admin_parts": "E29ED"
  }
}

some options for fixing this would be:

  • try to improve the address_parser to be better at parsing out postal codes
  • include postcode as one of the leftovers fields we query when we have parts of the query left over that we don't know where to match.

to be discussed:

> full query here

@missinglink
Member Author

testing notes:

overall I think that it's an improvement in all the analysis techniques, I didn't find any regressions or negative impacts, so I'd be happy to merge it as-is and then work to improve the address_parser so we can benefit more from it.

one interesting side-effect of this is that we started surfacing POIs solely based on their address, this is something @dianashk and I have been discussing for some time.

eg. in the query below you can see 'Hackney Cycles' is now being returned by its address
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=507%20hackney%20rd

"507 hackney rd"

 1) 507 Hackney Road, Cambridge Heath, Greater London
 2) Hackney Cycles, Cambridge Heath, Greater London

... this isn't working for surfacing samsung accelerator for 40 w 26th st, ny but that's only a question of tuning the scoring. to be discussed.

@missinglink
Member Author

one acceptance test showed a regression:

    {
      "id": 3,
      "status": "pass",
      "user": "Harish",
      "type": "dev",
      "in": {
        "text": "450 w 37th st, new york, ny 11232"
      },
      "expected": {
        "properties": [
          {
            "name": "450 37th Street",
            "country_a": "USA",
            "country": "United States",
            "region": "New York",
            "region_a": "NY",
            "county": "Kings County",
            "localadmin": "Brooklyn",
            "locality": "New York",
            "neighbourhood": "Windsor Teraace",
            "postalcode": "11232",
            "housenumber": "450",
            "street": "37th Street",
            "label": "450 37th Street, Brooklyn, NY"
          }
        ]
      }
    },

TBD if this is better or worse: http://pelias.github.io/compare/#/v1/search%3Ftext=450%20w%2037th%20st,%20new%20york,%20ny%2011232

orangejulius added a commit to pelias/fuzzy-tests that referenced this pull request Nov 2, 2015
I saw a [thread](https://trac.openstreetmap.org/ticket/5363) about a
somewhat difficult address to parse in OSM (turned out to be because of
a bad relation for the Las Vegas boundary), and figured I'd add the test
cases they used, particularly to check out the new address schema
changes from pelias/schema#77

"numeric" : {
  "type" : "pattern_replace",
  "pattern": "[^0-9]",
  "replacement": " "
Member


Why do we remove non-alphanumeric characters (by replacing them with an empty string), but replace non-numeric characters with a space? (I'm probably missing something, not actually saying we should change how it's done)
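
One possible reading, offered as an assumption rather than the author's stated reason: the two replacements produce different token boundaries. The Python sketch below uses a hypothetical helper, `replace_and_split`, to show the effect on the examples from the PR description.

```python
import re


def replace_and_split(value: str, pattern: str, replacement: str) -> list[str]:
    """Apply a pattern replacement, then tokenise on whitespace."""
    return re.sub(pattern, replacement, value).split()


# Postcode: an empty replacement joins the parts into one token.
print(replace_and_split("e24-dn", r"[^a-z0-9]", ""))  # ['e24dn']

# House number: a space keeps the numeric parts apart ...
print(replace_and_split("100/1", r"[^0-9]", " "))     # ['100', '1']

# ... whereas an empty replacement would fuse them into a different number.
print(replace_and_split("100/1", r"[^0-9]", ""))      # ['1001']
```

If that reading is right, the space is what lets "100/1" stay searchable as 100 instead of turning into 1001.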

@orangejulius
Member

So I took a look at this using the fuzzy tests. They aren't the highest quality tests, but there are a lot of them, and indeed I found a few differences. In terms of pass/fail counts the two environments are equivalent but there are some interesting changes.

Improvements

http://pelias.github.io/compare/#/v1/search%3Ftext=65c%20dana%20st
http://pelias.github.io/compare/#/v1/search%3Ftext=london%20bridge (still not the first result though)
http://pelias.github.io/compare/#/v1/search%3Ftext=1000%20flower%20street%20glendale%20ca%2091201 (we still fail to find 1000 flower st, which might not be in the data, but with the new analyzers we only find results on the correct street)
http://pelias.github.io/compare/#/v1/search%3Ftext=207%20s%2042nd%20st%20Philadelphia,%20PA%2019104 (we used to find the correct address, now we both find the correct address AND only return results very nearby. excellent!)

Interesting Regressions

http://pelias.github.io/compare/#/v1/search%3Ftext=1900%20chapala%20st (since there is no city specified this might not be a good search test, but assuming santa barbara, CA is the intended result then we fail weirdly for this one. Overall the matches favor exactly matching on the street name, which is probably good, yet the address match is not in the results, kinda confusing, we should figure out why this happens)
http://pelias.github.io/compare/#/v1/search%3Ftext=339%20W%20Main%20St,%20Cheshire,%20CT%2006410 (this is a pure regression for a perfectly entered address that we used to match. let's figure this one out too)
http://pelias.github.io/compare/#/v1/search%3Ftext=301%20Commons%20Park%20S,%20Stamford,%20CT%2006902 (another missed address to figure out)

Probably Not Interesting Regressions

http://pelias.github.io/compare/#/v1/search%3Ftext=dfw
http://pelias.github.io/compare/#/v1/search%3Ftext=UIC (this is not a very good test)
http://pelias.github.io/compare/#/v1/search%3Ftext=BWI (similar to DFW, we weren't really returning good results before either, but at least they were related to the airport, now they aren't. probably not valid tests at this point)
http://pelias.github.io/compare/#/v1/search%3Ftext=IAH (same as above)
there were more airport code related regressions but I don't think that matters

Not sure which (mostly minor changes)

http://pelias.github.io/compare/#/v1/search%3Ftext=louvre
http://pelias.github.io/compare/#/v1/search%3Ftext=SF%20Ferry%20Building
http://pelias.github.io/compare/#/v1/search%3Ftext=dulles%20airport (looks mostly like ordering changes)
http://pelias.github.io/compare/#/v1/search%3Ftext=11%20times%20square (probably a regression assuming we are looking for the times square in NYC)

It looks like, overall, we've improved address matching (with a few exceptions we can probably fix), at the expense mostly only of airports, which we should fix for real by adding ICAO and FAA codes and properly boosting. Unless someone can find other places where we regress it does seem like we should merge away :)

@dianashk
Contributor

dianashk commented Nov 3, 2015

@orangejulius, that fuzzy test analysis is super helpful. Would love to hear your thoughts on the process of going through them. How can it be improved? If you find improvements, are we identifying them as fixed in the tests? Can we automate any of it? Not saying this would necessarily be addressed right away, but we should create some actionable issues to tackle later.

@dianashk
Contributor

dianashk commented Nov 3, 2015

Looks like the reason we fail the search for 339 W Main St, Cheshire, CT, 06410 as well as 301 Commons Park S, Stamford, CT 06902 is that we strip out leading 0's from the postal code numbers. That's unfortunate because postal codes often start with 0, and stripping them out results in matches to the wrong ones. Shouldn't hold up this PR.
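
A tiny Python illustration of how leading zeros can get lost. Where exactly the stripping happens in the pipeline isn't stated in this thread, so the numeric coercion below is only an assumed, hypothetical cause used to show the effect.

```python
def as_number(postcode: str) -> str:
    """Hypothetical lossy path: coercing to int discards leading zeros."""
    return str(int(postcode))


def as_string(postcode: str) -> str:
    """Keeping the value as a string preserves every digit."""
    return postcode.strip()


print(as_number("06410"))  # '6410'
print(as_string("06410"))  # '06410'
```

Once "06410" has become "6410" it can no longer be told apart from a genuinely different four-digit code, which fits the wrong matches described above.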

Create a separate issue.

@missinglink
Member Author

http://pelias.github.io/compare/#/v1/search%3Ftext=1900%20chapala%20st and http://pelias.github.io/compare/#/v1/search%3Ftext=301%20Commons%20Park%20S,%20Stamford,%20CT%2006902 are interesting, they are due to the keyword analysis, which may actually need to be fixed.

the first one, 1900 chapala st, actually took me ages to figure out: the street name in the data is "Chapala  Street" (note the double-spaces), so the literal matching fails. the other test is the same because the input parser thinks the street name is Commons Park S Stamford CT.

it might be better to use shingles or another analysis, which would unfortunately put us back into a place where every input containing 'street' matches every other record containing 'street'

http://pelias.github.io/compare/#/v1/search%3Ftext=339%20W%20Main%20St,%20Cheshire,%20CT%2006410 is just completely parsed wrong.

@orangejulius
Member

@dianashk the process for going through them is super manual right now: I basically opened two terminals right next to each other, ran the tests against dev in one, prod in the other, and manually looked at all the differences (I tried calling diff on the actual output but it's so jumbled it's not possible to visualize the actual differences).
I'd love to build a tool to make viewing these differences easy (it was even mentioned long ago in the original fuzzy testing ticket). @missinglink's 2 stage test suite PR is definitely the first step towards this.
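
As a starting point for such a tool, here is a minimal Python sketch that compares two runs keyed by test id. The input shape (a plain mapping of test id to status) and the name `diff_results` are assumptions for illustration; the fuzzy tester's real output format isn't shown in this thread.

```python
def diff_results(prod: dict[str, str], dev: dict[str, str]) -> dict[str, tuple]:
    """Return only the tests whose status differs between two runs."""
    changed = {}
    for test_id in sorted(prod.keys() | dev.keys()):
        before, after = prod.get(test_id), dev.get(test_id)
        if before != after:
            # None means the test was missing from that run.
            changed[test_id] = (before, after)
    return changed


# Hypothetical data, not taken from an actual test run.
prod_run = {"a": "pass", "b": "pass", "c": "fail"}
dev_run = {"a": "pass", "b": "fail", "d": "pass"}
print(diff_results(prod_run, dev_run))
# {'b': ('pass', 'fail'), 'c': ('fail', None), 'd': (None, 'pass')}
```

Comparing by test id instead of diffing raw terminal output sidesteps the jumbled ordering problem mentioned above.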

@orangejulius
Member

Here's a "funny" regression from @stephenkhess's awesome post office tests:

For 208 1st Avenue Southwest, Ardmore, OK, 73401, we used to get the right result first, but weren't very consistent: other results were from all over the country. Now, we match very consistently on one street: the housenumbers all start with 208 and the postalcode is 73401, but it's in the Czech Republic!

Update: the zip code data for this area is in OA but wasn't being used. I submitted openaddresses/openaddresses#1386 to fix it

dianashk added a commit to pelias/acceptance-tests that referenced this pull request Nov 5, 2015
dianashk added a commit to pelias/acceptance-tests that referenced this pull request Nov 5, 2015
@dianashk
Contributor

dianashk commented Nov 5, 2015

Added acceptance tests: pelias/acceptance-tests#159

@dianashk
Contributor

dianashk commented Nov 5, 2015

@missinglink, if you don't have the double-space fix done yet it's cool. Let's merge this as-is and create a separate PR for the double-space fix.

@missinglink
Member Author

I just added a new token filter to remove duplicate whitespace from street names, happy to merge this now.
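
The filter definition itself isn't shown in this thread, so the following is only a Python approximation of the effect, using a hypothetical helper name `collapse_whitespace`:

```python
import re


def collapse_whitespace(street: str) -> str:
    """Collapse runs of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", street).strip()


print(collapse_whitespace("Chapala  Street"))  # 'Chapala Street'
```

With that applied before the rest of the street analysis, a record stored with a double space should produce the same single token as a cleanly typed query.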

missinglink added a commit that referenced this pull request Nov 6, 2015
address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet
@missinglink missinglink merged commit 9e299fe into master Nov 6, 2015
missinglink added a commit to pelias/acceptance-tests that referenced this pull request Nov 6, 2015
@orangejulius orangejulius deleted the improved_address_schema branch March 25, 2016 12:28
Successfully merging this pull request may close these issues.

Improve address matching
3 participants