olastor/german-word-frequencies


German Word Frequencies

Simple word-to-frequency mappings for the German language, built from text corpora and stemmed with the CISTEM stemmer. They may be useful for a variety of purposes.

Data

cow16 (~ 42 million unique stemmed words)

The source data already includes a frequency list, but it was still preprocessed using the routine in the decow/ folder.

Word Frequencies

License & Attribution

The original corpus is licensed under Creative Commons Attribution 4.0.

opensubtitles (~ 900k unique stemmed words)

Word Frequencies

License & Attribution

P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Example Usage

Download and extract one of the archives, then use it as shown here (warning: loading the whole file this way may use a lot of memory):

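import pandas as pd
import nltk

word = 'Onlineumfrage'

# The lists are keyed by CISTEM stems, so stem the query word first.
stemmer = nltk.stem.Cistem()
# Load the whole list into memory, indexed by the stemmed word.
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])
df.at[stemmer.stem(word), 'freq'] # => 8490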
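
If memory is a concern, one possible alternative is to stream the file row by row instead of loading it all at once. The following is only a sketch, not part of this repository: `lookup_freq` is a hypothetical helper, and it assumes the CSV is comma-separated, UTF-8 encoded, has a header row with `word` and `freq` columns (as the pandas example suggests), and stores frequencies as integers.

```python
import csv
import os
from typing import Optional


def lookup_freq(csv_path: str, stemmed_word: str) -> Optional[int]:
    """Return the frequency of an already-stemmed word, or None if absent.

    Hypothetical helper: reads one row at a time, so memory use stays
    small regardless of the file size. Assumes a header row with
    `word` and `freq` columns and integer frequency values.
    """
    # open() does not expand "~" on its own, so do it explicitly.
    path = os.path.expanduser(csv_path)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["word"] == stemmed_word:
                return int(row["freq"])
    return None
```

The word passed in should already be stemmed, for example `lookup_freq('~/decow_wordfreq_cistem.csv', stemmer.stem(word))` with the `stemmer` from the example above. Each call scans the file linearly, so this trades speed for memory; for many lookups, loading the data once is likely the better choice.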
