
Brooklyn 99

@albarrentine albarrentine released this 07 Apr 21:48
· 435 commits to master since this release

The great parser-data merge is complete. Libpostal 1.0 features a better-than-ever international address parser which achieves 99.45% full-parse accuracy on held-out addresses. The release title is a reference to the TV show (libpostal was also created in Brooklyn and this was the first version of the model to surpass 99% accuracy). Check out the blog post for the details. Here's a sample of what it can do in a GIF:

[GIF: animated demo of the parser]

Breaking API Changes

  • Every function, struct, constant, etc. defined in the public header (libpostal.h) now uses a "libpostal_" prefix. This affects all bindings that call the C API. The bindings that are part of this GitHub org all have 1.0 branches.

New tags

Sub-building tags

  • unit: an apartment, unit, office, lot, or other secondary unit designator
  • level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
  • staircase: numbered/lettered staircase
  • entrance: numbered/lettered entrance
  • po_box: post office box; typically found in non-physical (mail-only) addresses

Category tags

  • category: for category queries like "restaurants", etc.
  • near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"

New admin tags

  • island: named islands e.g. "Maui"
  • country_region: informal subdivision of a country without any political status
  • world_region: currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean e.g. "Jamaica, West Indies"

No more accent-stripping/transliteration of input

There's a new transliterator which makes only simple modifications to the input (HTML entity decoding, NFC Unicode normalization). Latin-ASCII transliteration is no longer applied at runtime. Instead, addresses are transliterated to multiple forms during training, so the parser learns to handle all the variants, rather than both training and runtime normalizing to a single variant (which previously was not even correct in cases like Finnish, Turkish, etc.).
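The runtime behavior described above can be sketched with Python's standard library — this is an illustration of the two normalization steps, not libpostal's C implementation:

```python
import html
import unicodedata

def normalize_input(s: str) -> str:
    """Minimal input normalization as described above: decode HTML
    entities, then apply NFC Unicode normalization. Accents are
    preserved -- there is no Latin-ASCII transliteration step."""
    s = html.unescape(s)                    # "&amp;" -> "&", "&eacute;" -> "é"
    return unicodedata.normalize("NFC", s)  # "e" + combining acute -> single "é"

# The decomposed "e" + U+0301 collapses into one code point, but the accent
# itself is kept rather than being stripped to plain ASCII "e".
print(normalize_input("Caf\u0065\u0301 &amp; Bar"))  # Café & Bar
```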

Trained on > 1 billion examples in every inhabited country on Earth

The training data for libpostal's parser has been greatly expanded to include every country and dependency in OpenStreetMap. We also train on a places-only data set where every city name from OSM gets some representation even if there are no addresses (higher-population cities get examples proportional to their population). A similar training set is constructed for streets, so even places which have very few addresses but do have a road network in OSM can be included.

1.0 also moves beyond OSM, training on most of the data sets in OpenAddresses, and postal codes + associated admins from Yahoo's GeoPlanet, which includes virtually every postcode in the UK, Canada, etc.

Almost 100GB of public training data

All files can be found under s3://libpostal/training_data/YYYY-MM-DD/parser/ as gzip'd tab-separated values (TSV) files formatted as: language\tcountry\taddress.

  • formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
  • formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
  • formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
  • geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
  • openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
  • uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK
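As a minimal illustration of that TSV format, each line splits on tabs into its three fields. The example line below is made up for illustration, not drawn from the real files:

```python
def parse_training_line(line: str) -> tuple:
    """Split one training example into (language, country, address).
    The address field may contain spaces, but never tabs."""
    language, country, address = line.rstrip("\n").split("\t")
    return language, country, address

# Hypothetical example line in the training-data format
line = "en\tus\t123 main street brooklyn ny 11216"
print(parse_training_line(line))  # ('en', 'us', '123 main street brooklyn ny 11216')
```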

If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.

Better feature extraction

  • n-grams for "unknown" words (those that occurred fewer than n times in the training set)
  • for unknown words that are hyphenated, each of the individual subwords if frequent enough, and their ngrams otherwise
  • an index of postcodes and their admin contexts built from the training data (the intuition is that something like "10001" could be a postcode or a house number, but if words like "New York", "NY", "United States", etc. are to its right or left, it's more likely to be a postcode).
  • for first words that are unknown (could be part of a venue/business name, could be a rare/misspelled street), a feature which finds the relative position of the next number and the next address phrase if present. Usually if the parser gets the first word in the string correct it will get the entire string correct.
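The unknown-word fallback in the first two bullets can be sketched as follows — an illustration only, where the function names and the choice of n are assumptions rather than libpostal's internals:

```python
def char_ngrams(word: str, n: int = 3):
    """Character n-grams used as fallback features for rare words.
    Words shorter than n become a single feature (the word itself)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def unknown_word_features(word: str, vocab: set, n: int = 3):
    """Sketch of the fallback described above: a known word is its own
    feature; an unknown hyphenated word falls back to its known subwords
    (and n-grams for the rest); any other unknown word falls back to its
    character n-grams."""
    if word in vocab:
        return [word]
    if "-" in word:
        feats = []
        for sub in word.split("-"):
            feats.extend([sub] if sub in vocab else char_ngrams(sub, n))
        return feats
    return char_ngrams(word, n)

print(unknown_word_features("xyzzy-street", {"street"}))  # ['xyz', 'yzz', 'zzy', 'street']
```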

More powerful machine learning model (CRF)

libpostal 1.0 uses a Conditional Random Field (CRF) instead of the greedy averaged perceptron. This more powerful machine learning method scores sequences rather than individual decisions, and can revise its previous decision if that would help a subsequent token score higher (Viterbi inference).

libpostal's implementation improves upon CRFsuite in terms of:

  1. performance: Viterbi inference sped up by 2x
  2. scalability: training set doesn't need to fit in memory
  3. model expressiveness: libpostal's CRF adds state-transition features which can make use of both the state of the current token and the previous tag. These act just like normal features except their weights are LxL matrices (tags we could have transitioned from by tags we could transition to) instead of L vectors.
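The sequence scoring and Viterbi decoding described above can be sketched like this — a toy pure-Python illustration with plain lists, not libpostal's actual C implementation:

```python
def viterbi(unary, transition):
    """Find the highest-scoring tag sequence over T tokens and L tags.
    unary[t][y]: feature score for assigning tag y to token t.
    transition[yp][y]: learned weight for moving from tag yp to tag y
    (the L x L state-transition weights described above)."""
    T, L = len(unary), len(unary[0])
    score = [unary[0][:]]   # best score of any path ending in each tag
    back = []               # backpointers for path recovery
    for t in range(1, T):
        prev, cur, bp = score[-1], [], []
        for y in range(L):
            best = max(range(L), key=lambda yp: prev[yp] + transition[yp][y])
            cur.append(prev[best] + transition[best][y] + unary[t][y])
            bp.append(best)
        score.append(cur)
        back.append(bp)
    # Trace the best path backwards -- this is where an earlier greedy
    # decision gets revised in favor of a better overall sequence.
    y = max(range(L), key=lambda yy: score[-1][yy])
    path = [y]
    for bp in reversed(back):
        y = bp[y]
        path.append(y)
    return path[::-1]

# Token 0 slightly prefers tag 0 in isolation, but tag 1 has a strong
# outgoing transition, so the best full sequence starts with tag 1 --
# exactly the kind of revision a greedy tagger cannot make.
unary = [[1.0, 0.9], [0.0, 0.0]]
transition = [[0.0, 0.0], [1.0, 1.0]]
print(viterbi(unary, transition))  # [1, 0]
```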

FTRL-Proximal optimization for the language classifier

The language classifier now uses a multinomial version of Google's FTRL-Proximal method, which uses a combination of L1 and L2 regularization, inducing sparsity while maintaining high accuracy. This results in a model that is more accurate than the previous classifier while being 1/10th the size. The runtime classifier is now able to load either sparse or dense weights depending on the file header.
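A per-coordinate sketch of the FTRL-Proximal update (following McMahan et al.) — a binary, pure-Python illustration rather than libpostal's multinomial C implementation, with the hyperparameter names assumed:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal update. L1 regularization zeroes out
    weights whose accumulated signal stays small (sparsity); L2 shrinks
    the rest. Binary version for brevity -- the actual classifier is
    multinomial."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, dim=4):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated (adjusted) gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        """Lazily compute the weight for coordinate i from z and n."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: the weight stays exactly zero
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def update(self, i, g):
        """Apply gradient g for coordinate i (per-coordinate step size)."""
        sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
        self.z[i] += g - sigma * self.weight(i)
        self.n[i] += g * g
```

The sparsity is what shrinks the model: any feature whose accumulated z stays inside the L1 band contributes a weight of exactly zero and need not be stored at all, which is why the runtime loader can choose between sparse and dense weight formats.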