This is the official repository of my master's thesis in Applied Data Science at Utrecht University.
My research project presents two deep learning-based NER systems that extract geographic phenomena from geo-analytical questions and classify them into core concepts of spatial information, a set of concepts used to model and distinguish kinds of spatial information. The systems are built on BERT and Bi-LSTM models, trained on 278 geo-analytical questions and evaluated on 31 validation questions, from a corpus of 309 questions in total. In the evaluation, the BERT model achieved higher accuracy, precision, recall and F1-score than the Bi-LSTM model when recognizing core concepts in geo-analytical questions.
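To make the reported metrics concrete, here is a minimal sketch of how accuracy and per-tag precision, recall and F1-score can be computed at token level. It is an illustration only, not the evaluation code from the thesis: the function name `per_tag_scores`, the tag names `OBJ` and `EVENT`, and the toy label sequences are assumptions, and the thesis may score at entity level instead.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Token-level precision, recall and F1 for every tag except "O"."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            if g != "O":
                tp[g] += 1          # correct non-"O" prediction
        else:
            if p != "O":
                fp[p] += 1          # predicted a tag that is not in the gold data
            if g != "O":
                fn[g] += 1          # missed a gold tag
    scores = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        predicted_total = tp[tag] + fp[tag]
        gold_total = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_total if predicted_total else 0.0
        recall = tp[tag] / gold_total if gold_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy sequences (made up for illustration, not taken from the corpus).
gold = ["O", "OBJ", "OBJ", "O", "EVENT"]
predicted = ["O", "OBJ", "O", "O", "EVENT"]

scores = per_tag_scores(gold, predicted)
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

print(accuracy)
for tag, values in scores.items():
    print(tag, values)
```

In this toy case the model misses one of the two `OBJ` tokens, so `OBJ` keeps perfect precision while its recall drops to one half.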
The project's code is available to anyone who is interested, and in particular to those who want to develop NLP solutions in the geosciences domain.
For example, in a geo-analytical question answering system, the core concepts of a geo-analytical question can be recognized and annotated in the following way:
The answer to this geo-analytical question can be visualized in GIS software via an analytical GIS workflow, as shown in the following figure:
As this research has shown, the BERT model outperforms the Bi-LSTM model in core concept recognition tasks. To evaluate the BERT model's performance further, we randomly selected three geo-analytical questions from another geo-analytical question corpus, proposed by Xu et al. (2022). As shown in the first figure (left side), fire stations and schools are correctly recognized as objects. In the second figure (middle), the model correctly captures the vegetation areas as field nominal. Finally, in the third figure, the model accurately recognizes the number as content amount count and the traffic accidents as events.
This repository contains the documentation and the code for the master's thesis.
Specifically:
- The preprocessing code can be found in the PRE_PROCESSING folder of this repository. This code reads the necessary files (i.e. the geo-analytical question corpus and the core concepts dictionary) and creates the unified tags (e.g. object quality --> OBJQ) that the two deep learning models use for core concept recognition.
- The tokenization code can be found in the TOKENIZATION folder of this repository. It contains all the code for tokenizing each geo-analytical question and creating its POS (part-of-speech) tags. This code also automatically matches each token of each question to the corresponding unified tag generated by the preprocessing code. The final output is a .csv file that contains each geo-analytical question with its tokens, POS tags and unified tags.
- The two deep learning models can be found in the DEEP_LEARNING_MODELS folder of this repository. It contains all the code developed for the two models, including evaluation metrics and visualizations.
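As a rough illustration of the preprocessing and tokenization steps described above, the sketch below maps core-concept names to unified tags, assigns a tag to each token of a question, and writes the result as CSV. It is not the code from this repository. Only the mapping object quality --> OBJQ comes from this README; the other abbreviations, the example question, the column names and the helper names `tag_tokens` and `to_csv` are assumptions, and POS tagging is left out because it depends on the tagger used.

```python
import csv
import io

# Core-concept name -> unified tag.
# Only "object quality" -> "OBJQ" is taken from this README;
# the other entries are assumed abbreviations for illustration.
UNIFIED_TAGS = {
    "object quality": "OBJQ",
    "object": "OBJ",
    "event": "EVENT",
}


def tag_tokens(question, phrase_to_concept):
    """Split a question on whitespace and tag each token.

    Tokens that belong to an annotated phrase get that phrase's unified
    tag; all other tokens get "O".
    """
    tokens = question.rstrip("?").split()
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    for phrase, concept in phrase_to_concept.items():
        words = phrase.lower().split()
        size = len(words)
        for start in range(len(tokens) - size + 1):
            if lowered[start:start + size] == words:
                for index in range(start, start + size):
                    tags[index] = UNIFIED_TAGS[concept]
    return list(zip(tokens, tags))


def to_csv(question_id, rows):
    """Serialize (token, tag) pairs as CSV text with one token per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["question_id", "token", "tag"])  # assumed column names
    for token, tag in rows:
        writer.writerow([question_id, token, tag])
    return buffer.getvalue()


# Made-up example question; the real corpus is not public.
rows = tag_tokens(
    "Where are the fire stations in Utrecht?",
    {"fire stations": "object"},
)
csv_text = to_csv(1, rows)
print(csv_text)
```

Writing one token per row keeps the file easy to group by question when it is later loaded to train a sequence-labelling model.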
The dataset used in this research was provided by the QuAnGIS research group, and no consent was obtained to share it publicly. I am therefore unable to include the dataset of geo-analytical questions in this repository.
For questions about this repository, please contact the author, Aristoteles Kandylas, or open an issue or pull request in this repository.
This repository is licensed under the GNU General Public License v3.0. You can view the LICENSE here.