This repo contains the code, annotated data, and results for the IJGIS paper "Machine learning for cross-gazetteer matching of natural features".
Jupyter notebooks are at the top level of this repo, numbered in the order in which they should be run, and organized into three numbered subsets:
- 0_ : (00, 01, 02): preparation, preprocessing
- 1_ : (10, 11, 12, 13, 14): rule-based matching
- 2_ : (20, 21): machine learning based matching using random forests
Note that these notebooks rely heavily on code in the gazmatch folder.
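Because the notebooks are meant to be run in numeric order, a small helper can list and execute them in sequence. The following is only a sketch: it assumes each notebook file name starts with its two-digit number (the exact file names are not given here), and it assumes Jupyter with nbconvert is installed. The function names ordered_notebooks and run_all are illustrative and not part of this repo.

```python
import re
import subprocess
from pathlib import Path

# Assumption: notebook file names begin with their number (e.g. "00...", "13...").
_PREFIX = re.compile(r"^(\d+)")


def ordered_notebooks(repo_dir="."):
    """Return top-level notebooks that carry a numeric prefix, sorted by that prefix."""
    numbered = []
    for path in Path(repo_dir).glob("*.ipynb"):
        match = _PREFIX.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path.name, path))
    return [path for _, _, path in sorted(numbered)]


def run_all(repo_dir="."):
    """Execute each notebook in place, stopping at the first one that fails."""
    for notebook in ordered_notebooks(repo_dir):
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", str(notebook)],
            check=True,  # raise CalledProcessError if a notebook errors out
        )
```

Calling ordered_notebooks() from the repo root only lists the notebooks, which is a cheap way to check the order before calling run_all(). Some notebooks may take a long time or expect the raw datasets to be in place, so running them one by one in Jupyter remains the safer route.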
In /data/, we share our annotated data (annotated_sample.csv) as well as some serialized files, including test_set_ids.pkl for the feature-type-balanced test set used in a subset of experiments. The latest GeoNames and SwissNames3D data can be obtained online:
- GeoNames daily dumps: [http://download.geonames.org/export/dump/] (choose CH.zip for the Switzerland data)
- SwissNames3D latest version: [https://shop.swisstopo.admin.ch/en/products/landscape/names3D]
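As a quick way to inspect a downloaded GeoNames dump before running the preparation notebooks, the archive can be read with the standard library alone. This is a minimal sketch, not the loading code used in the notebooks. It assumes CH.zip contains a tab-separated CH.txt without a header row, with the 19 columns described in the GeoNames readme. The column labels below are our own spelling of those fields, and read_geonames_zip and count_by_feature_class are hypothetical helper names.

```python
import csv
import io
import zipfile
from collections import Counter

# Column order of the GeoNames main table, following the dump's readme.
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def read_geonames_zip(zip_path, member="CH.txt"):
    """Yield one dict per record from a GeoNames country dump (all values are strings)."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # The dump has no header and no quoting, so quote handling is disabled.
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                yield dict(zip(GEONAMES_COLUMNS, row))


def count_by_feature_class(records):
    """Count records per GeoNames feature class (e.g. "T" for mountains, hills, rocks)."""
    return Counter(record["feature_class"] for record in records)
```

For example, count_by_feature_class(read_geonames_zip("CH.zip")) gives a rough picture of how many records fall into each feature class. Which classes count as natural features for the experiments is decided in the notebooks, not here.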
Note that these datasets will not be identical to the ones used in the paper, which were downloaded in 2017. In particular, SwissNames3D may change UUIDs for certain records in newer versions. Data preparation involving the raw datasets is described and performed in the preparation notebooks (the 0_ subset). Contact the first author of the associated paper with any data requests.
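The shared files in /data/ can be opened without the raw datasets. The sketch below shows one way to load them and separate the test-set rows from the rest. Several things here are assumptions rather than documented facts: that annotated_sample.csv is comma-delimited with a header row, that test_set_ids.pkl unpickles to an iterable of record IDs, and that those IDs match the values of some ID column in the CSV as strings. The id_column argument and the function names are hypothetical, so check the actual header first.

```python
import csv
import pickle


def load_annotated_rows(csv_path):
    """Read the annotated sample into a list of dicts keyed by the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_test_ids(pkl_path):
    """Load the pickled test-set IDs as a set.

    Only unpickle files you trust: pickle can execute code on load.
    If the file was written from a pandas object, pandas must be installed.
    """
    with open(pkl_path, "rb") as handle:
        return set(pickle.load(handle))


def split_by_test_ids(rows, test_ids, id_column):
    """Split rows into (test, rest) by membership of rows[id_column] in test_ids."""
    test = [row for row in rows if row[id_column] in test_ids]
    rest = [row for row in rows if row[id_column] not in test_ids]
    return test, rest
```

A typical first step would be to print the keys of load_annotated_rows("data/annotated_sample.csv")[0] to see the real column names, then pass the appropriate one as id_column.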
The /results/ folder contains the TSV files used to plot the results in the paper. The /html_exports/ folder contains HTML exports of all the notebooks for easy viewing in a browser.