
Brooklyn 99

@albarrentine albarrentine released this 07 Apr 21:48
· 435 commits to master since this release

The great parser-data merge is complete. Libpostal 1.0 features a better-than-ever international address parser which achieves 99.45% full-parse accuracy on held-out addresses. The release title is a reference to the TV show (libpostal was also created in Brooklyn and this was the first version of the model to surpass 99% accuracy). Check out the blog post for the details. Here's a sample of what it can do in a GIF:

[GIF: animated demo of the parser]

Breaking API Changes

  • Every function, struct, constant, etc. defined in the public header (libpostal.h) now uses a "libpostal_" prefix. This affects all bindings that call the C API. The bindings that are part of this GitHub org all have 1.0 branches.

New tags

Sub-building tags

  • unit: an apartment, unit, office, lot, or other secondary unit designator
  • level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
  • staircase: numbered/lettered staircase
  • entrance: numbered/lettered entrance
  • po_box: post office box; typically found in non-physical (mail-only) addresses

Category tags

  • category: for category queries like "restaurants", etc.
  • near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"

New admin tags

  • island: named islands e.g. "Maui"
  • country_region: informal subdivision of a country without any political status
  • world_region: currently only used for appending "West Indies" after the country name, a pattern frequently used in the English-speaking Caribbean e.g. "Jamaica, West Indies"

No more accent-stripping/transliteration of input

There's a new transliterator which makes only simple modifications to the input (HTML entity decoding, NFC Unicode normalization). Latin-ASCII transliteration is no longer applied at runtime. Instead, addresses are transliterated to multiple forms during training, so the parser learns to handle all the variants, rather than both training and runtime normalizing to a single variant (which previously was not even correct in cases like Finnish, Turkish, etc.).
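The runtime behavior described above can be sketched with Python's standard library — this is an illustration of the two normalization steps, not libpostal's C implementation:

```python
import html
import unicodedata

def normalize_input(s: str) -> str:
    """Minimal input normalization as described above: decode HTML
    entities, then apply NFC Unicode normalization. Accents are
    preserved -- there is no Latin-ASCII transliteration step."""
    s = html.unescape(s)                    # "&amp;" -> "&", "&eacute;" -> "é"
    return unicodedata.normalize("NFC", s)  # "e" + combining acute -> single "é"

# The decomposed "e" + U+0301 collapses into one code point, but the accent
# itself is kept rather than being stripped to plain ASCII "e".
print(normalize_input("Caf\u0065\u0301 &amp; Bar"))  # Café & Bar
```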

Trained on > 1 billion examples in every inhabited country on Earth

The training data for libpostal's parser has been greatly expanded to include every country and dependency in OpenStreetMap. We also train on a places-only data set where every city name from OSM gets some representation even if there are no addresses (higher-population cities get examples proportional to their population). A similar training set is constructed for streets, so even places which have very few addresses but do have a road network in OSM can be included.

1.0 also moves beyond OSM, training on most of the data sets in OpenAddresses, and postal codes + associated admins from Yahoo's GeoPlanet, which includes virtually every postcode in the UK, Canada, etc.

Almost 100GB of public training data

All files can be found under s3://libpostal/training_data/YYYY-MM-DD/parser/ as gzip'd tab-separated values (TSV) files formatted as: language\tcountry\taddress.

  • formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
  • formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
  • formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
  • geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
  • openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
  • uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK
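As a minimal illustration of that TSV format, each line splits on tabs into its three fields. The example line below is made up for illustration, not drawn from the real files:

```python
def parse_training_line(line: str) -> tuple:
    """Split one training example into (language, country, address).
    The address field may contain spaces, but never tabs."""
    language, country, address = line.rstrip("\n").split("\t")
    return language, country, address

# Hypothetical example line in the training-data format
line = "en\tus\t123 main street brooklyn ny 11216"
print(parse_training_line(line))  # ('en', 'us', '123 main street brooklyn ny 11216')
```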

If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.

Better feature extraction

  • n-grams for "unknown" words (those that occurred fewer than n times in the training set)
  • for unknown words that are hyphenated, each of the individual subwords if frequent enough, and their ngrams otherwise
  • an index of postcodes and their admin contexts built from the training data (the intuition is that something like "10001" could be a postcode or a house number, but if words like "New York", "NY", "United States", etc. are to its right or left, it's more likely to be a postcode).
  • for first words that are unknown (could be part of a venue/business name, could be a rare/misspelled street), a feature which finds the relative position of the next number and the next address phrase if present. Usually if the parser gets the first word in the string correct it will get the entire string correct.
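The unknown-word fallback in the first two bullets can be sketched as follows — an illustration only, where the function names and the choice of n are assumptions rather than libpostal's internals:

```python
def char_ngrams(word: str, n: int = 3):
    """Character n-grams used as fallback features for rare words.
    Words shorter than n become a single feature (the word itself)."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def unknown_word_features(word: str, vocab: set, n: int = 3):
    """Sketch of the fallback described above: a known word is its own
    feature; an unknown hyphenated word falls back to its known subwords
    (and n-grams for the rest); any other unknown word falls back to its
    character n-grams."""
    if word in vocab:
        return [word]
    if "-" in word:
        feats = []
        for sub in word.split("-"):
            feats.extend([sub] if sub in vocab else char_ngrams(sub, n))
        return feats
    return char_ngrams(word, n)

print(unknown_word_features("xyzzy-street", {"street"}))  # ['xyz', 'yzz', 'zzy', 'street']
```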

More powerful machine learning model (CRF)

libpostal 1.0 uses a Conditional Random Field (CRF) instead of the greedy averaged perceptron. This more powerful machine learning method scores sequences rather than individual decisions, and can revise its previous decision if that would help a subsequent token score higher (Viterbi inference).

libpostal's implementation improves upon CRFsuite in terms of:

  1. performance: Viterbi inference sped up by 2x
  2. scalability: training set doesn't need to fit in memory
  3. model expressiveness: libpostal's CRF adds state-transition features which can make use of both the state of the current token and the previous tag. These act just like normal features except their weights are LxL matrices (tags we could have transitioned from by tags we could transition to) instead of L vectors.
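The sequence scoring and Viterbi decoding described above can be sketched like this — a toy pure-Python illustration with plain lists, not libpostal's actual C implementation:

```python
def viterbi(unary, transition):
    """Find the highest-scoring tag sequence over T tokens and L tags.
    unary[t][y]: feature score for assigning tag y to token t.
    transition[yp][y]: learned weight for moving from tag yp to tag y
    (the L x L state-transition weights described above)."""
    T, L = len(unary), len(unary[0])
    score = [unary[0][:]]   # best score of any path ending in each tag
    back = []               # backpointers for path recovery
    for t in range(1, T):
        prev, cur, bp = score[-1], [], []
        for y in range(L):
            best = max(range(L), key=lambda yp: prev[yp] + transition[yp][y])
            cur.append(prev[best] + transition[best][y] + unary[t][y])
            bp.append(best)
        score.append(cur)
        back.append(bp)
    # Trace the best path backwards -- this is where an earlier greedy
    # decision gets revised in favor of a better overall sequence.
    y = max(range(L), key=lambda yy: score[-1][yy])
    path = [y]
    for bp in reversed(back):
        y = bp[y]
        path.append(y)
    return path[::-1]

# Token 0 slightly prefers tag 0 in isolation, but tag 1 has a strong
# outgoing transition, so the best full sequence starts with tag 1 --
# exactly the kind of revision a greedy tagger cannot make.
unary = [[1.0, 0.9], [0.0, 0.0]]
transition = [[0.0, 0.0], [1.0, 1.0]]
print(viterbi(unary, transition))  # [1, 0]
```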

FTRL-Proximal optimization for the language classifier

The language classifier now uses a multinomial version of Google's FTRL-Proximal method, which uses a combination of L1 and L2 regularization, inducing sparsity while maintaining high accuracy. This results in a model that is more accurate than the previous classifier while being 1/10th the size. The runtime classifier is now able to load either sparse or dense weights depending on the file header.
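A per-coordinate sketch of the FTRL-Proximal update (following McMahan et al.) — a binary, pure-Python illustration rather than libpostal's multinomial C implementation, with the hyperparameter names assumed:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal update. L1 regularization zeroes out
    weights whose accumulated signal stays small (sparsity); L2 shrinks
    the rest. Binary version for brevity -- the actual classifier is
    multinomial."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, dim=4):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated (adjusted) gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        """Lazily compute the weight for coordinate i from z and n."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold: the weight stays exactly zero
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def update(self, i, g):
        """Apply gradient g for coordinate i (per-coordinate step size)."""
        sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
        self.z[i] += g - sigma * self.weight(i)
        self.n[i] += g * g
```

The sparsity is what shrinks the model: any feature whose accumulated z stays inside the L1 band contributes a weight of exactly zero and need not be stored at all, which is why the runtime loader can choose between sparse and dense weight formats.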