This project is a collection of two datasets of legal sentences from the tenancy law of the German civil law, as well as legal word2vec models.
If you use the data in a publication, please let us know. We may provide a paper to cite in the near future.
All three corpora are released under the CC BY-SA 3.0 license.
601 sentences from the tenancy law of the German Civil Code (BGB, §535-§597).
The dataset is annotated sentence by sentence according to three different taxonomies (3 semantic types, 6 semantic types, and 9 semantic types).
312 sentences, classified according to a semantic type system consisting of 9 different classes, from German rental agreements.
A word2vec model trained on the German JRC-Acquis corpus [1] in 10 iterations, using 300 dimensions and a window size of 5. The corpus was pre-processed with the following steps:
- Removing line breaks
- Removing duplicated whitespaces
- Replacing German umlauts
- Spelling out numbers
- Removing punctuation
- Removing tokens with fewer than 3 characters
After pre-processing, the corpus comprised 33,686,085 tokens.
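The pre-processing steps listed above can be sketched in Python roughly as follows. This is a minimal illustration, not the original pipeline: the function name `preprocess`, the umlaut mapping (for example `ä` to `ae`), and the digit-by-digit spelling of numbers are all assumptions, since the exact replacement rules are not documented here.

```python
import re

# Assumption: umlauts and ß are transliterated to ASCII digraphs.
UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

# Assumption: numbers are spelled digit by digit (a simplification).
DIGITS = ["null", "eins", "zwei", "drei", "vier",
          "fuenf", "sechs", "sieben", "acht", "neun"]


def spell_number(match):
    """Spell each digit of the matched number as a German word."""
    return " " + " ".join(DIGITS[int(d)] for d in match.group()) + " "


def preprocess(text):
    """Apply the six listed steps in order and return a list of tokens."""
    text = text.replace("\r", " ").replace("\n", " ")  # remove line breaks
    text = re.sub(r"\s+", " ", text)                   # remove duplicated whitespace
    text = text.translate(UMLAUTS)                     # replace German umlauts
    text = re.sub(r"[0-9]+", spell_number, text)       # spell out numbers
    text = re.sub(r"[^\w\s]|_", " ", text)             # remove punctuation
    # Remove tokens with fewer than 3 characters.
    return [token for token in text.split() if len(token) >= 3]
```

Calling `preprocess` on a raw document string yields a token list that could be fed to a word2vec trainer. Note that the sketch does not lowercase the text, because the list above does not mention that step.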
A word2vec model trained on a corpus of judgments from German fiscal law in 10 iterations, using 300 dimensions and a window size of 5. The corpus was pre-processed with the following steps:
- Removing line breaks
- Removing duplicated whitespaces
- Replacing German umlauts
- Spelling out numbers
- Removing punctuation
- Removing tokens with fewer than 3 characters
After pre-processing, the corpus comprised 33,686,085 tokens.
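For orientation, the training settings stated for both models (10 iterations, 300 dimensions, window size 5) could be expressed with the gensim library roughly as shown below. This is a sketch under assumptions: the toolkit actually used to train the models is not stated here, the keyword names follow gensim 4.x, and `W2V_PARAMS` and `train_word2vec` are hypothetical names chosen for illustration.

```python
# Hyperparameters as stated in the model descriptions.
# Assumption: keyword names follow gensim 4.x (`epochs` corresponds to iterations).
W2V_PARAMS = {"vector_size": 300, "window": 5, "epochs": 10}


def train_word2vec(sentences, **overrides):
    """Train a word2vec model on an iterable of token lists.

    `sentences` is expected to be pre-processed already, for example
    [["Der", "Mieter", "ist"], ["zwei", "Zahlungen"]].
    """
    # Imported inside the function so that the settings above can be
    # inspected without gensim installed (requires `pip install gensim`).
    from gensim.models import Word2Vec

    params = {**W2V_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

On a very small test corpus, passing `min_count=1` as an override keeps rare words in the vocabulary. The trained vectors are then available through the model's `wv` attribute.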
If you have any questions, please contact:
Ingo Glaser (Technical University of Munich) ingo.glaser@tum.de
[1] Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.