GIS-NER-ADS-Thesis-Code

📝 About the Project

This is the official repository of my master's thesis in Applied Data Science at Utrecht University.

My research project presents two deep learning-based NER systems that extract geographic phenomena from geo-analytical questions and classify them into core concepts of spatial information, the conceptual categories used to model and distinguish spatial information. The NER systems are built on BERT and Bi-LSTM models, trained on 278 geo-analytical questions and tested on 31 validation questions, from a corpus of 309 questions in total. In the evaluation, the BERT model achieved higher accuracy, precision, recall and F1-score than the Bi-LSTM model in recognizing core concepts in geo-analytical questions.
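The four reported metrics can be illustrated with a small token-level sketch. This is not the thesis's evaluation code: the macro averaging over tags, the function name `token_metrics`, and the example tags (`OBJ`, `EVENT`, `O`) are assumptions made for illustration only.

```python
from collections import Counter


def token_metrics(gold, pred):
    """Return accuracy and macro-averaged precision, recall and F1 over tags."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")

    # Per-tag counts of true positives, false positives and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted tag was wrong
            fn[g] += 1  # the gold tag was missed

    precisions, recalls, f1s = [], [], []
    for label in sorted(set(gold) | set(pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(precisions)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical gold and predicted tags for a four-token question.
gold = ["O", "OBJ", "O", "EVENT"]
pred = ["O", "OBJ", "OBJ", "EVENT"]
scores = token_metrics(gold, pred)
print(scores["accuracy"])  # 0.75
```

Note that this sketch includes the outside tag `O` in the macro average; an entity-level evaluation would give different numbers.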

The project's code is available to anyone interested, and in particular to those who want to develop NLP solutions in the geosciences domain.

🔬 Demonstration

For example, in a geo-analytical question answering system, the core concepts of a geo-analytical question can be recognized and annotated in the following way:

Annotation of core concepts in a geo-analytical question

The answer to this geo-analytical question can be visualized in GIS software via an analytical GIS workflow, as shown in the following figure:

Shortest network-based paths to a police station for specific PC4 areas

As this research shows, the BERT model outperforms the Bi-LSTM model in core concept recognition tasks. To evaluate the BERT model's performance further, we selected three random geo-analytical questions from another geo-analytical question corpus proposed by Xu et al. (2022). As shown in the first figure (left), fire stations and school are correctly recognized as objects. In the second figure (middle), the model correctly captures the vegetation areas as field nominal. Finally, in the third figure, the model accurately recognizes the number as content amount count and the traffic accidents as events.

💾 This repository

This repository contains the documentation and the code written for the master's thesis.

Specifically:

The preprocessing code can be found in the PRE_PROCESSING folder of this repository. This code reads the necessary files (i.e. the geo-analytical question corpus and the core concepts dictionary) and creates the unified tags (e.g. object quality --> OBJQ) that the two deep learning models use for core concept recognition.
The tokenization code can be found in the TOKENIZATION folder of this repository. It contains all the code for tokenizing each geo-analytical question and creating its POS (part-of-speech) tags. This code also automatically matches each token of each question to the corresponding unified tag generated by the preprocessing code. Its final output is a .csv file that contains each geo-analytical question with its corresponding tokens, POS tags and unified tags.
The two deep learning models can be found in the DEEP_LEARNING_MODELS folder of this repository. It contains all the code developed for the two models, including evaluation metrics and visualizations.
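The preprocessing and tokenization steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PRE_PROCESSING or TOKENIZATION folders: the regex tokenizer, the phrase-matching strategy, the CSV column names and the example question are all assumptions. Only the `object quality` to `OBJQ` mapping comes from this README; the other tag abbreviations are hypothetical. POS tagging is left out to keep the sketch free of external dependencies.

```python
import csv
import io
import re

# Only "object quality" -> "OBJQ" is taken from the README.
# The remaining abbreviations are hypothetical placeholders.
UNIFIED_TAGS = {
    "object quality": "OBJQ",
    "object": "OBJ",    # hypothetical
    "event": "EVENT",   # hypothetical
}


def unify(concept):
    """Map a core concept name to its unified tag."""
    return UNIFIED_TAGS[concept.lower().strip()]


def tokenize(question):
    """Split a question into word and punctuation tokens (simple regex)."""
    return re.findall(r"\w+|[^\w\s]", question)


def tag_tokens(tokens, phrase_concepts):
    """Tag each token; phrase_concepts maps a phrase to a core concept name."""
    tags = ["O"] * len(tokens)  # "O" marks tokens outside any core concept
    lowered = [t.lower() for t in tokens]
    # Match longer phrases first so they are not split by shorter ones.
    ordered = sorted(phrase_concepts.items(), key=lambda kv: -len(kv[0].split()))
    for phrase, concept in ordered:
        words = phrase.lower().split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            untagged = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and untagged:
                tags[i:i + n] = [unify(concept)] * n
    return tags


def to_csv(rows):
    """Serialize (question_id, token, tag) rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["question_id", "token", "tag"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up question; the real corpus is not part of this repository.
tokens = tokenize("Where are the fire stations in Utrecht?")
tags = tag_tokens(tokens, {"fire stations": "object"})
csv_text = to_csv([(1, tok, tag) for tok, tag in zip(tokens, tags)])
print(csv_text)
```

Writing one row per token keeps the output easy to load into a sequence-labelling pipeline, where each token needs exactly one tag.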

📋 Dataset

Because the dataset used in this research was provided by the QuAnGIS research group and no consent was obtained to share it publicly, I am unable to share the dataset of geo-analytical questions in this repository.
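Anyone who supplies their own question corpus can reproduce the 278/31 split sizes reported above with a few lines. The sketch below assumes a seeded random shuffle; how the thesis actually selected its 31 test questions is not stated here, and the function name `split_corpus`, the seed and the placeholder questions are illustrative only.

```python
import random


def split_corpus(questions, n_test=31, seed=42):
    """Shuffle a copy of the corpus and return (train, test) lists."""
    if not 0 < n_test < len(questions):
        raise ValueError("n_test must be between 1 and len(questions) - 1")
    shuffled = list(questions)             # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder strings stand in for the 309 questions that cannot be shared.
placeholder = [f"question {i}" for i in range(309)]
train, test = split_corpus(placeholder)
print(len(train), len(test))  # 278 31
```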

📩 Contact and contribution

For questions about this repository, please contact the author, Aristoteles Kandylas, or open an issue or pull request in this repository.

⚖️ License

This repository is licensed under the GNU General Public License v3.0. You can view the license in the LICENSE file.
