Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting

Abstract

Dialects are one of the main drivers of language variation, a major challenge for natural language processing tools. In most languages, dialects exist along a continuum, and are commonly discretized by combining the extent of several preselected linguistic variables. However, the selection of these variables is theory-driven and itself insensitive to change. We use Doc2Vec on a corpus of 16.8M anonymous online posts in the German-speaking area to learn continuous document representations of cities. These representations capture continuous regional linguistic distinctions, and can serve as input to downstream NLP tasks sensitive to regional variation. By incorporating geographic information via retrofitting and agglomerative clustering with structure, we recover dialect areas at various levels of granularity. Evaluating these clusters against an existing dialect map, we achieve a match of up to 0.77 V-score (harmonic mean of cluster completeness and homogeneity). Our results show that representation learning with retrofitting offers a robust general method to automatically expose dialectal differences and regional variation at a finer granularity than was previously possible.
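To make the evaluation metric concrete, here is a minimal sketch of the V-score as the abstract defines it: the harmonic mean of cluster homogeneity and completeness. It is a standard-library illustration, not the evaluation code used in the paper. The function names (`entropy`, `conditional_entropy`, `v_score`) and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_score(gold, predicted):
    """Harmonic mean of homogeneity and completeness.

    gold:      reference dialect area per city (assumed input format)
    predicted: cluster id per city, in the same order
    """
    h_gold = entropy(gold)
    h_pred = entropy(predicted)
    # Homogeneity: each cluster contains cities from a single gold area.
    homogeneity = (
        1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, predicted) / h_gold
    )
    # Completeness: each gold area falls into a single cluster.
    completeness = (
        1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, gold) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    gold_areas = ["north", "north", "south", "south"]
    clusters = [1, 1, 0, 0]  # cluster ids are arbitrary; only the grouping matters
    print(v_score(gold_areas, clusters))  # perfect agreement gives 1.0
```

The score does not depend on how clusters are named, so clusterings at different levels of granularity can be compared against the same reference map. If scikit-learn is available, its `v_measure_score` in `sklearn.metrics` computes the same quantity.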

Setup

Due to space constraints, neither the data nor the model is included in the repository. Before executing any other commands, please run

sh get_data.sh

and

sh train_model.sh

References

The paper appeared at EMNLP 2018:

Dirk Hovy and Christoph Purschke. 2018. Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting. In Proceedings of EMNLP.

@inproceedings{HovyPurschke2018capturing,
  title={{Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting}},
  author={Hovy, Dirk and Purschke, Christoph},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  year={2018}
}
