Simple word to frequency mappings for the german language based on text corpora and using CISTEM stemmer. May be useful for various purposes.
The source data already contains a frequency list, but still was preprocessed using the routine in the decow/
folder.
Word Frequencies
- decow_wordfreq_cistem.csv.7z (203MB, 672MB uncompressed)
- md5sum: 5b2797838221fbb9518f2800deee60d4
License & Attribution
The original corpus is licensed under Creative Commons Attribution 4.0.
Word Frequencies
- opensubtitles_cistem_freq.csv (13MB)
- md5sum: 7cceeaa18a8c519848ceff88350a9aef
License & Attribution
P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Download and extract one of the archives. Then use it like this (warning: this way it may use much memory):
import pandas as pd
import nltk
word = 'Onlineumfrage'
stemmer = nltk.stem.Cistem()
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])
df.at[stemmer.stem(word), 'freq'] # => 8490