address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet #77

missinglink · 2015-10-06T17:41:15Z

ok! so this is a big one: it fixes up all the analyzers we are using for the address fields. each case required building a whole new analyzer, so this PR contains 3 new types of token analysis.

note: the analysis only affects the tokens in the inverted index; the value returned to the user is verbatim what was entered. The idea is that by homogenizing tokens we get better matching.

peliasZip: this analyzer handles both numeric and alphanumeric postcodes; it lowercases them and removes punctuation and whitespace. eg: "E24-DN" -> ["e24dn"], "10 100" -> ["10100"]
peliasHousenumber: this one is still not ideal; at this time it removes non-numeric parts. eg: "100a" -> [100], "100/1" -> [100,1]. It also sets the index type to 'integer', which opens it up for using numeric ranges for interpolation. This will probably need more work to remove the apartment numbers.
peliasStreet: this is the most complex; it lowercases street names and then stems compass prefixes and street suffixes, and the result is a single token. eg: "West 26th Street" -> ["w 26th st"]. This should result in much better street matching and avoid matching other records which also contain a similar token, such as 'street' or 'union' etc. This could be further improved by removing the ordinal suffixes (2nd, 3rd, etc.)

[edit] I pushed another commit to remove the ordinals so now "West 26th Street" -> ["w 26 st"] 🍷
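
As a rough illustration of the peliasStreet steps described above, here is a plain JavaScript emulation. It is only a sketch and not the Elasticsearch analyzer itself: the `COMPASS` and `SUFFIXES` tables are a hypothetical subset of the real synonym lists, and `streetToken` is a made-up name. Unlike the description above, it abbreviates matching words wherever they appear rather than only as prefixes or suffixes.

```javascript
// Hypothetical, partial synonym tables; the real analyzer uses longer lists.
const COMPASS = {
  north: 'n', south: 's', east: 'e', west: 'w',
  northeast: 'ne', northwest: 'nw', southeast: 'se', southwest: 'sw'
};
const SUFFIXES = { street: 'st', avenue: 'ave', road: 'rd', crescent: 'cres' };

const has = (table, key) => Object.prototype.hasOwnProperty.call(table, key);

// Lowercase, split on whitespace, abbreviate compass words and street
// suffixes, strip ordinal endings, then emit ONE token for the whole name.
function streetToken(input) {
  const words = input.toLowerCase().trim().split(/\s+/);
  const normalized = words.map((word) => {
    if (has(COMPASS, word)) return COMPASS[word];
    if (has(SUFFIXES, word)) return SUFFIXES[word];
    // "26th" -> "26", "2nd" -> "2"; words without leading digits are untouched
    return word.replace(/^(\d+)(st|nd|rd|th)$/, '$1');
  });
  return [normalized.join(' ')];
}

console.log(streetToken('West 26th Street')); // [ 'w 26 st' ]
console.log(streetToken('w 26 st'));          // [ 'w 26 st' ]
```

Because the long and short forms collapse to the same single token, a query for either form can match the same indexed record.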

$ npm test
$ npm run integration

This should be 100% backwards compatible with the v0 api

Fixes pelias/pelias#172

missinglink · 2015-10-13T10:33:22Z

[update] unfortunately I had to change the index type for peliasHousenumber from integer to string.
more info: https://discuss.elastic.co/t/analyzer-unassigned-when-using-integer-type/32007

missinglink · 2015-10-30T12:33:56Z

merged with upstream and re-deployed to dev cluster

missinglink · 2015-11-02T10:34:19Z

testing notes:

peliasStreet - more emphasis on the exact street name

this analysis gives more priority to exact matches of the street name.

In the example below, the old behaviour for text="40 west 26th street" was to return a few W 26th st records, then it starts returning things like "40 West 26th Circle" and "40 Northwest 26th Street".

I think it would be better to give more emphasis to the correctness of the street name rather than the number.

so with the new analysis it still returns the same few W 26th st records at the top; then the long-tail is more like "4010 West 26th Street", "404 West 26th Street" etc.

/v1/search?size=40&text=40 west 26th street
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=40%20west%2026th%20street

note: relevant results marked with **
input: "40 west 26th street"

before:
 1) 40 West 26th Street, Merced, CA
 2) 40 West 26th Street, Merced, CA
 3) 40 West 26th Street, Merced, CA
 4) 40 West 26th **Circle**, Fayetteville, AR
 5) 40 West 26th **Circle**, Fayetteville, AR
 6) 40 West 26th **Avenue**, Eugene, OR

after:
 1) 40 West 26th Street, Merced, CA
 2) 40 West 26th Street, Merced, CA
 3) 40 West 26th Street, Merced, CA
 4) 4010 West 26th Street, Chicago, IL
 5) 4020-4026 West 26th Street, Chicago, IL
 6) 4037 West 26th Street, Chicago, IL

for further testing you can try any address, have a look at the long-tail (usually records 3-20), they should be different houses on the same street rather than completely different streets.

the next evolution of this strategy could be stricter enforcement of the country-code or regional segment of the query.

there are also some options for completely removing results which don't match a minimum threshold of street and name, possibly removing anything that doesn't match both?

this would reduce the results for this query to the top 3 only, a discussion for another day :)

missinglink · 2015-11-02T10:55:46Z

testing notes:

peliasStreet - better understanding of how street names are formatted

this analysis supports synonyms for street suffixes such as cres == crescent

it also supports some compass abbreviations, such as north == n and southeast == se

.. and it also supports removing "ordinals" from numbers, such as 26th == 26

I think this is a no-brainer really, it makes surfacing street names much easier ;)

/v1/search?size=40&text=main ave
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=main%20ave

note: relevant results marked with **
input: "main ave"

before:
 1) 302 Main Ave W, Alberta, Canada
 2) 427 East Main Ave, Puyallup, WA
 3) 726 Main Ave Sourth, Brookings, SD
 4) 1412 East Main Ave, Puyallup, WA
 5) 556 Main Ave W, Alberta, Canada

after:
 1) Main Ave. Drugstore, Ebaiu, Philippines
 2) 2 Main Avenue, Ilfracombe, Australia
 3) 0 Main Avenue, Wareham, MA
 4) 5 Main Avenue, Wareham, MA
 5) 9 Main Avenue, INANDA, South Africa

/v1/search?size=40&text=30 w 26 st
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=30%20w%2026%20st

note: relevant results marked with **
input: "30 w 26 st"

before:
 1) 30 West 26 Street, Manhattan, NY
 2) 30 W 2nd St, National City, CA
 3) 30 26, Ad Dawhah, Qatar
 4) 30 26, Ad Dawhah, Qatar
 5) 30 26, Ciudad Nezahualcóyotl, Mexico

after:
 1) 30 West 26 Street, Manhattan, NY
 2) 30 West 26th Street, Manhattan, NY
 3) 30 West 26th Street, Merced, CA
 4) 30 West 26th Street, Merced, CA
 5) 30 West 26th Street, Manhattan, NY

for further testing you can try any address in short and long form, with or without compass directions and numeric ordinals.

the goal is for all variants of these address compositions to return equivalent results.

missinglink · 2015-11-02T11:20:58Z

testing notes:

peliasHouseNumber - ignore apartment numbers

this is still not ideal, this analysis is more of an interim solution, which I could take-or-leave. the idea is that if we remove non-numeric parts of the housenumber we can get better matches.

eg. if the housenumber is entered as 100a it should be searchable as simply 100
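
A minimal sketch of that idea in plain JavaScript, assuming the analysis replaces every non-digit with a space and then splits on whitespace (the split step and both function names are assumptions for illustration, not the actual analyzer):

```javascript
// Every non-digit becomes a space, then the string is split on spaces.
function housenumberTokens(input) {
  return input.replace(/[^0-9]/g, ' ').split(' ').filter((token) => token !== '');
}

// Hypothetical helper: true when the query and the indexed value share a token.
function housenumberMatches(query, indexed) {
  const indexedTokens = new Set(housenumberTokens(indexed));
  return housenumberTokens(query).some((token) => indexedTokens.has(token));
}

console.log(housenumberTokens('100a'));          // [ '100' ]
console.log(housenumberTokens('100/1'));         // [ '100', '1' ]
console.log(housenumberTokens('100 apt 2'));     // [ '100', '2' ]
console.log(housenumberMatches('100c', '100A')); // true
```

One side effect worth keeping in mind: under this scheme "100", "100A" and "100c" all reduce to the same token, so the housenumber alone can no longer rank one above the other.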

I tested for ~15 mins and it doesn't seem to have much effect, be it positive or negative:

/v1/search?size=40&text=100 cliff road, nantucket
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=100%20cliff%20road,%20nantucket

note: relevant results marked with **
input: "100 cliff road, nantucket"

before:
 1) 100 Cliff Road, Nantucket, MA
 2) 100A Cliff Road, Nantucket, MA

after:
 1) 100 Cliff Road, Nantucket, MA
 2) 100A Cliff Road, Nantucket, MA

/v1/search?size=40&text=100c cliff road, nantucket
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=100c%20cliff%20road,%20nantucket

note: relevant results marked with **
input: "100c cliff road, nantucket"

before:
 1) 100 Cliff Road, Nantucket, MA
 2) 100A Cliff Road, Nantucket, MA

after:
 1) 100A Cliff Road, Nantucket, MA
 2) 100 Cliff Road, Nantucket, MA

for further testing you can try any address with or without non-numeric sections.

I may have missed out some cases, please try some other combinations like "100/1" and "100 apt 2" etc. more inspiration here

note: I logged this issue which makes testing this feature more difficult: pelias/api#355

missinglink · 2015-11-02T11:38:38Z

testing notes:

peliasZip - handle different forms, remove punctuation

as with the previous notes, I can't get this to work; it's likely due to the way in which the address parser is configured.

I tried to surface Hackney Cycles with the postcode E2 9ED from this record:

"name": "Hackney Cycles",
"housenumber": "507",
"street": "Hackney Road",
"postalcode": "E2 9ED",

... but searching for E29ED yielded terrible results, I think we have to rethink the query for postcodes to de-emphasise the name before we are going to be able to show results solely on the postcode.
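
For reference, a plain JavaScript emulation of the peliasZip normalization (a sketch only; `zipToken` is a made-up name and the real work happens inside Elasticsearch) suggests the stored and queried forms do reduce to the same token, which would be consistent with the problem sitting in the query rather than in the analysis:

```javascript
// Lowercase, then drop everything that is not a letter or a digit.
function zipToken(input) {
  return [input.toLowerCase().replace(/[^a-z0-9]/g, '')];
}

console.log(zipToken('E2 9ED')); // [ 'e29ed' ]  (as stored on the record)
console.log(zipToken('E29ED'));  // [ 'e29ed' ]  (as typed in the query)
console.log(zipToken('10 100')); // [ '10100' ]
```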

http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=E29ED

this analysis will certainly produce better results for fully specified address queries, such as: "507 hackney rd, e29ed, london" if not for the issues noted below:

the query created for the text 507 hackney rd, E2 9ED is missing a section to match against the postalcode.

"query": {
  "text": "507 hackney rd, E29ED",
  "parsed_text": {
    "name": "507 hackney rd",
    "number": 507,
    "street": "hackney rd",
    "regions": [
      "E29ED"
    ],
    "admin_parts": "E29ED"
  }
}

some options for fixing this would be:

try to improve the address_parser to be better at parsing out postal codes
include postcode as one of the leftovers fields we query when we have parts of the query left over that we don't know where to match.

to be discussed:

> full query here

missinglink · 2015-11-02T11:52:36Z

testing notes:

overall I think that it's an improvement in all the analysis techniques, I didn't find any regressions or negative impacts, so I'd be happy to merge it as-is and then work to improve the address_parser so we can benefit more from it.

one interesting side-effect of this is that we started surfacing POIs solely based on their address, this is something @dianashk and I have been discussing for some time.

eg. in the query below you can see 'Hackney Cycles' is now being returned by its address
http://pelias.github.io/compare/#/v1/search%3Fsize=40&text=507%20hackney%20rd

"507 hackney rd"

 1) 507 Hackney Road, Cambridge Heath, Greater London
 2) Hackney Cycles, Cambridge Heath, Greater London

... this isn't working for surfacing samsung accelerator for 40 w 26th st, ny but that's only a question of tuning the scoring. to be discussed.

missinglink · 2015-11-02T12:43:00Z

one acceptance test showed a regression:

    {
      "id": 3,
      "status": "pass",
      "user": "Harish",
      "type": "dev",
      "in": {
        "text": "450 w 37th st, new york, ny 11232"
      },
      "expected": {
        "properties": [
          {
            "name": "450 37th Street",
            "country_a": "USA",
            "country": "United States",
            "region": "New York",
            "region_a": "NY",
            "county": "Kings County",
            "localadmin": "Brooklyn",
            "locality": "New York",
            "neighbourhood": "Windsor Teraace",
            "postalcode": "11232",
            "housenumber": "450",
            "street": "37th Street",
            "label": "450 37th Street, Brooklyn, NY"
          }
        ]
      }
    },

TBD if this is better or worse: http://pelias.github.io/compare/#/v1/search%3Ftext=450%20w%2037th%20st,%20new%20york,%20ny%2011232

I saw a [thread](https://trac.openstreetmap.org/ticket/5363) about a somewhat difficult address to parse in OSM (turned out to be because of a bad relation for the Las Vegas boundary), and figured I'd add the test cases they used, particularly to check out the new address schema changes from pelias/schema#77

orangejulius · 2015-11-02T22:12:32Z

settings.js

+        "numeric" : {
+          "type" : "pattern_replace",
+          "pattern": "[^0-9]",
+          "replacement": " "


Why do we remove non alphanumeric characters (by replacing them with empty string), but replace non numeric characters with a space? (I'm probably missing something, not actually saying we should change how it's done)
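
One possible reading, not confirmed anywhere in this thread: the space keeps the numeric parts of a housenumber separable, while for postcodes the parts are meant to fuse. A small JavaScript sketch of the difference, assuming a whitespace split follows the replacement:

```javascript
const keepDigits = (value, replacement) => value.replace(/[^0-9]/g, replacement);

// Replacing with a space leaves two parts that a whitespace split can separate.
const spaced = keepDigits('100/1', ' ').split(' ').filter(Boolean);

// Replacing with an empty string would fuse the parts into a different number.
const fused = keepDigits('100/1', '');

// For postcodes, fusing is the stated goal: "E24-DN" -> "e24dn".
const zip = 'E24-DN'.toLowerCase().replace(/[^a-z0-9]/g, '');

console.log(spaced); // [ '100', '1' ]
console.log(fused);  // 1001
console.log(zip);    // e24dn
```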

orangejulius · 2015-11-02T23:09:37Z

So I took a look at this using the fuzzy tests. They aren't the highest quality tests, but there are a lot of them, and indeed I found a few differences. In terms of pass/fail counts the two environments are equivalent but there are some interesting changes.

Improvements

http://pelias.github.io/compare/#/v1/search%3Ftext=65c%20dana%20st
http://pelias.github.io/compare/#/v1/search%3Ftext=london%20bridge (still not the first result though)
http://pelias.github.io/compare/#/v1/search%3Ftext=1000%20flower%20street%20glendale%20ca%2091201 (we still fail to find 1000 flower st (it might not be in the data), but with the new analyzers we only find results on the correct street)
http://pelias.github.io/compare/#/v1/search%3Ftext=207%20s%2042nd%20st%20Philadelphia,%20PA%2019104 (we used to find the correct address, now we both find the correct address AND only return results very nearby. excellent!)

Interesting Regressions

http://pelias.github.io/compare/#/v1/search%3Ftext=1900%20chapala%20st (since there is no city specified this might not be a good search test, but assuming santa barbara, CA is the intended result then we fail weirdly for this one. Overall the matches favor exactly matching on the street name, which is probably good, yet the address match is not in the results, kinda confusing, we should figure out why this happens)
http://pelias.github.io/compare/#/v1/search%3Ftext=339%20W%20Main%20St,%20Cheshire,%20CT%2006410 (this is a pure regression for a perfectly entered address that we used to match. let's figure this one out too)
http://pelias.github.io/compare/#/v1/search%3Ftext=301%20Commons%20Park%20S,%20Stamford,%20CT%2006902 (another missed address to figure out)

Probably Not Interesting Regressions

http://pelias.github.io/compare/#/v1/search%3Ftext=dfw
http://pelias.github.io/compare/#/v1/search%3Ftext=UIC (this is not a very good test)
http://pelias.github.io/compare/#/v1/search%3Ftext=BWI (similar to DFW, we weren't really returning good results before either, but at least they were related to the airport, now they aren't. probably not valid tests at this point)
http://pelias.github.io/compare/#/v1/search%3Ftext=IAH (same as above)
there were more airport code related regressions but I don't think that matters

Not sure which (mostly minor changes)

http://pelias.github.io/compare/#/v1/search%3Ftext=louvre
http://pelias.github.io/compare/#/v1/search%3Ftext=SF%20Ferry%20Building
http://pelias.github.io/compare/#/v1/search%3Ftext=dulles%20airport (looks mostly like ordering changes)
http://pelias.github.io/compare/#/v1/search%3Ftext=11%20times%20square (probably a regression assuming we are looking for the times square in NYC)

It looks like, overall, we've improved address matching (with a few exceptions we can probably fix), mostly at the expense of airports, which we should fix for real by adding ICAO and FAA codes and properly boosting. Unless someone can find other places where we regress, it does seem like we should merge away :)

dianashk · 2015-11-03T00:21:47Z

@orangejulius, that fuzzy test analysis is super helpful. Would love to hear your thoughts on the process of going through them. How can it be improved? If you find improvements, are we identifying them as fixed in the tests? Can we automate any of it? Not saying this would necessarily be addressed right away, but we should create some actionable issues to tackle later.

dianashk · 2015-11-03T00:25:32Z

Looks like the reason we fail the search for 339 W Main St, Cheshire, CT, 06410 as well as 301 Commons Park S, Stamford, CT 06902 is that we strip out leading 0's from the postal code numbers. That's unfortunate, because it looks like postal codes will often start with 0, and stripping them out results in matches to the wrong ones. Shouldn't hold up this PR.
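
Where exactly the zeros get stripped isn't stated here. As a sketch of one way it can happen, assuming the postal code is coerced to a number somewhere along the way (`normalizePostalcode` is a hypothetical guard, not existing code):

```javascript
const postalcode = '06410';

// Coercing to a number drops the leading zero...
const coerced = String(Number(postalcode));

// ...so an exact comparison against the stored string no longer matches.
console.log(coerced);                // 6410
console.log(coerced === postalcode); // false

// Hypothetical guard: treat postal codes as opaque strings end to end.
function normalizePostalcode(value) {
  return String(value).trim();
}

console.log(normalizePostalcode(' 06410 ') === postalcode); // true
```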

Create a separate issue.

missinglink · 2015-11-03T15:41:09Z

http://pelias.github.io/compare/#/v1/search%3Ftext=1900%20chapala%20st and http://pelias.github.io/compare/#/v1/search%3Ftext=301%20Commons%20Park%20S,%20Stamford,%20CT%2006902 are interesting: they are due to the keyword analysis, which may actually need to be fixed.

the first one, 1900 chapala st, actually took me ages to figure out: the street name in the data is Chapala  Street (note the double-spaces), so the literal matching fails. the other test fails the same way, because the input parser thinks the street name is Commons Park S Stamford CT.

it might be better to use shingles or another analysis, which would unfortunately put us back into a place where every input containing 'street' matches every other record containing 'street'

http://pelias.github.io/compare/#/v1/search%3Ftext=339%20W%20Main%20St,%20Cheshire,%20CT%2006410 is just completely parsed wrong.

orangejulius · 2015-11-03T18:56:44Z

@dianashk the process for going through them is super manual right now: I basically opened two terminals right next to each other, ran the tests against dev in one, prod in the other, and manually looked at all the differences (I tried calling diff on the actual output but it's so jumbled it's not possible to visualize the actual differences).
I'd love to build a tool to make viewing these differences easy (it was even mentioned long ago in the original fuzzy testing ticket). @missinglink's 2 stage test suite PR is definitely the first step towards this
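
A minimal sketch of what such a tool could start from, assuming each run can be reduced to a list of `{ id, status }` objects with `status` being either `'pass'` or `'fail'` (this shape and the `diffRuns` name are hypothetical, not the actual test suite output):

```javascript
// Compare two runs of the same tests and report only the cases that changed.
function diffRuns(prod, dev) {
  const prodById = new Map(prod.map((result) => [result.id, result.status]));
  const changes = [];
  for (const { id, status } of dev) {
    const before = prodById.get(id);
    // Skip tests missing from prod and tests whose status did not change.
    if (before === undefined || before === status) continue;
    changes.push({
      id,
      before,
      after: status,
      kind: status === 'pass' ? 'improvement' : 'regression'
    });
  }
  return changes;
}

const prod = [{ id: 1, status: 'pass' }, { id: 2, status: 'fail' }, { id: 3, status: 'pass' }];
const dev = [{ id: 1, status: 'fail' }, { id: 2, status: 'pass' }, { id: 3, status: 'pass' }];
console.log(diffRuns(prod, dev));
```

Note that this only catches pass/fail flips; changes in result ordering within a passing test, like several of the cases listed above, would need a comparison of the returned results themselves.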

orangejulius · 2015-11-03T20:21:13Z

Here's a "funny" regression from @stephenkhess's awesome post office tests:

For 208 1st Avenue Southwest, Ardmore, OK, 73401, we used to get the right result first, but weren't very consistent: other results were from all over the country. Now, we match very consistently on one street: the housenumbers all start with 208 and the postalcode is 73401, but it's in the Czech Republic!

Update: the zip code data for this area is in OA but wasn't being used. I submitted openaddresses/openaddresses#1386 to fix it

dianashk · 2015-11-05T19:56:16Z

Added acceptance tests: pelias/acceptance-tests#159

dianashk · 2015-11-05T22:33:09Z

@missinglink, if you don't have the double-space fix done yet it's cool. Let's merge this as-is and create a separate PR for the double-space fix.

missinglink · 2015-11-06T15:20:48Z

I just added a new token filter to remove duplicate whitespace from street names, happy to merge this now.
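
The effect of such a filter can be sketched in plain JavaScript as below. This is an emulation under the assumption that runs of spaces are collapsed to a single space; the actual filter definition in settings.js is not shown in this thread, and `removeDuplicateSpaces` is a made-up name.

```javascript
// Collapse any run of two or more spaces into a single space.
function removeDuplicateSpaces(street) {
  return street.replace(/ {2,}/g, ' ').trim();
}

console.log(removeDuplicateSpaces('Chapala  Street')); // Chapala Street
```

With the extra space gone, the single-token street analysis sees the same string for the record and for a query typed with one space.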

address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet

f5c8a3e

missinglink added in progress in review and removed in progress labels Oct 6, 2015

missinglink self-assigned this Oct 6, 2015

missinglink added 2 commits October 6, 2015 20:12

remove ordinals

9403752

update dev dep

57d178d

missinglink mentioned this pull request Oct 6, 2015

Is there a "layer" for postcodes/ZIP codes? pelias/api#312

Closed

missinglink added 3 commits October 8, 2015 11:09

Merge branch 'master' of github.com:pelias/schema into improved_addre…

cb6a240

…ss_schema

update peliasStreet to make use of the new synonyms added by @stephen…

65cad00

…khess

changed peliasHousenumber to string, see PR notes on #77

91c9b4e

missinglink added 2 commits October 30, 2015 13:25

resolve merge conflicts

3cb93a7

fix end-to-end fixture

3b30e60

missinglink added a commit to pelias/api that referenced this pull request Nov 2, 2015

update address matching analyzers. related pelias/schema#77

ac01d72

missinglink mentioned this pull request Nov 2, 2015

update address matching analyzers. pelias/api#356

Merged

orangejulius mentioned this pull request Nov 2, 2015

Add tests for 7135 Decatur pelias/fuzzy-tests#6

Merged

orangejulius reviewed Nov 2, 2015

dianashk added a commit to pelias/acceptance-tests that referenced this pull request Nov 5, 2015

Add test cases for pelias/schema#77

206fe2e

dianashk added a commit to pelias/acceptance-tests that referenced this pull request Nov 5, 2015

Add test cases for pelias/schema#77

65a040e

dianashk mentioned this pull request Nov 5, 2015

Add test cases for pelias/schema/pull/77 pelias/acceptance-tests#159

Merged

missinglink added 2 commits November 6, 2015 16:04

add test for double-spaces in street name

e7126f0

add remove_duplicate_spaces filter, update tests

b5e793e

missinglink added a commit that referenced this pull request Nov 6, 2015

Merge pull request #77 from pelias/improved_address_schema

9e299fe

address-specific analyzers: peliasZip, peliasHousenumber & peliasStreet

missinglink merged commit 9e299fe into master Nov 6, 2015

missinglink removed the in review label Nov 6, 2015

missinglink added a commit to pelias/acceptance-tests that referenced this pull request Nov 6, 2015

Merge pull request #159 from pelias/imporved-address-schema

db0b2e4

Add test cases for pelias/schema/pull/77

riordan mentioned this pull request Nov 11, 2015

Searching for a venue by address doesn't find the venue pelias/api#283

Closed

orangejulius deleted the improved_address_schema branch March 25, 2016 12:28
