Parsing addresses using Google Geocoding API
There are addresses input by user that has only street name and our current procedure is not able to correctly parse those. There must be algorithm that already exists to parse addresses with free text input.
Outsourcing the task could be a good option. We can send address to the Google(or Yahoo or 3rd party) geocoder.
The geocoders return rich parsing of the address with the information we need to parse the addresses.
- Western(US, France, German, etc.) -> (premise), <street_number> <route_> (subpremise) :
767(street_number) 5th Avenue(route) #45th Floor(subpremise)
- Korea / Japan -> (subpremise) <sublocality_level_3> <sublocality_level_4> <sublocality_level_2>, <sublocality_level_1>, (locality) :
203(premise) Hoehyeon-dong(sublocality_level_2), Jung-gu(sublocality_level_1)
Confident that about 90% of 85% of addresses are correctly parsed. (We currently hold total of 8618 addresses to parse in our system.)
Some challenges I had with:
1% of the addresses(113) were not accepted by google(zero_results / invalid request) : User put random values / wrong info / google not responding
- When zero_result / invalid request from google
-> Replace carriage return from database to single space
-> just put original address from database for unresponsive ones
- previous address overriding when next one is null
-> store xml in a temporary table
- few address(137) in 11 countries
(country in ('cn', 'jp', 'hk', 'ro','eg','vn','ru','bh','th','ma','ae')
contain native characters
-> Restructure @Address value type to nvarchar instead of varchar
- and more..
Do we want to return native characters in the address?
From Geocoding documentation:
The geocoder does its best to provide a street address that is readable for both the user and locals. To achieve that goal, it returns street addresses in the local language, transliterated to a script readable by the user if necessary, observing the preferred language. All other addresses are returned in the preferred language. Address components are all returned in the same language, which is chosen from the first component.
The preferred language has a small influence on the set of results that the API chooses to return, and the order in which they are returned. The geocoder interprets abbreviations differently depending on language, such as the abbreviations for street types, or synonyms that may be valid in one language but not in another. For example, utca and tér are synonyms for street in Hungarian.
Do we want to parse the addresses differently for every country?
It adds level of complexity to analyze addresses for each country and parse them differently..
- Western(US, France, German, etc.) -> (premise), <street_number> <route_> (subpremise) :
767(street_number) 5th Avenue(route) #45th Floor(subpremise)
- Korea / Japan -> (subpremise) <sublocality_level_3> <sublocality_level_4> <sublocality_level_2>, <sublocality_level_1>, (locality) :
203(premise) Hoehyeon-dong(sublocality_level_2), Jung-gu(sublocality_level_1)
- Russia?? Egypt??
We also have an option of getting formatted addresses from google but that will include city, state, zipcode and country in address field.
On the other hand, we cannot use edited formatted address as we are not allowed to manipulate formatted addresses according to the Google's policy.
This is a code block. Total of 8618 addresses.
https://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address=Ginza%20Kagayaki%20Bldg.%205F,%201-15-8,%20Ginza,%20Chuo-ku,%20104-0061%20Tokyo,%20Japan&key=AIzaSyAWWLT180oTl-EyZR3Wj7kwua7lfJ1jiR0&language=us