sebischair/Legal-Sentence-Classification-Datasets-and-Models

Legal-Sentence-Classification-Datasets-and-Models

This project collects two datasets of legal sentences from the tenancy law of German civil law, together with legal word2vec models.

If you use the data in a publication, please let us know. We may provide a paper to cite in the near future.

License

All three corpora are released under the CC BY-SA 3.0 license.

Content

Datasets

Statutory Texts

601 sentences from the tenancy law of the German Civil Code (BGB, §535-§597).

The dataset is annotated sentence by sentence according to three different taxonomies (with 3, 6, and 9 semantic types, respectively).

Rental Agreements

312 sentences from German rental agreements, classified according to a semantic type system of 9 classes.

Word2Vec Models

JRCAcquis Corpus

A word2vec model trained on the German JRCAcquis corpus [1] for 10 iterations, using 300 dimensions and a window size of 5. The corpus was pre-processed in the following steps:

  1. Removing line breaks
  2. Removing duplicate whitespace
  3. Replacing German umlauts
  4. Spelling out numbers
  5. Removing punctuation
  6. Removing tokens with fewer than 3 characters

After pre-processing, the corpus comprised 33,686,085 tokens.
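
The six pre-processing steps listed above could be sketched in Python roughly as follows. This is a minimal illustration, not the script that was used to build the corpus: the function name `preprocess`, the umlaut transliteration (ä to ae, ö to oe, ü to ue, ß to ss), the extra punctuation characters, and the digit-by-digit German spelling of numbers are all assumptions, since the README does not specify these details.

```python
import re
import string

# Assumed transliteration table; the README only says umlauts are "replaced".
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

# Simplified stand-in for "spelling out numbers": each digit is spelled on its own.
DIGITS = ["null", "eins", "zwei", "drei", "vier",
          "fuenf", "sechs", "sieben", "acht", "neun"]

# ASCII punctuation plus a few characters common in German legal text (assumption).
PUNCTUATION = string.punctuation + "§„“”–"
PUNCT_TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def spell_digits(match):
    """Spell every digit of the matched number as a separate German word."""
    words = " ".join(DIGITS[int(d)] for d in match.group())
    return " " + words + " "


def preprocess(text):
    """Apply the six steps in order and return the remaining tokens."""
    text = text.replace("\r", " ").replace("\n", " ")   # 1. remove line breaks
    text = re.sub(r"\s+", " ", text).strip()            # 2. collapse duplicate whitespace
    for umlaut, replacement in UMLAUTS.items():         # 3. replace German umlauts
        text = text.replace(umlaut, replacement)
    text = re.sub(r"[0-9]+", spell_digits, text)        # 4. spell out numbers
    text = text.translate(PUNCT_TABLE)                  # 5. remove punctuation
    return [t for t in text.split() if len(t) >= 3]     # 6. drop tokens shorter than 3 chars


print(preprocess("Gemäß §535 ist\nes so."))
# ['Gemaess', 'fuenf', 'drei', 'fuenf', 'ist']
```

A real pipeline would more likely spell whole numbers as words rather than digit by digit; the per-digit version is used here only to keep the sketch dependency-free.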

German Fiscal Law Judgments

A word2vec model trained on a corpus of German fiscal law judgments for 10 iterations, using 300 dimensions and a window size of 5. The corpus was pre-processed in the following steps:

  1. Removing line breaks
  2. Removing duplicate whitespace
  3. Replacing German umlauts
  4. Spelling out numbers
  5. Removing punctuation
  6. Removing tokens with fewer than 3 characters

After pre-processing, the corpus comprised 33,686,085 tokens.
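
Both models are described with the same hyperparameters (10 iterations, 300 dimensions, window size 5). The sketch below shows how a comparable model might be trained with the gensim library. The README does not say which toolkit was used, so gensim, the helper name `train_word2vec`, and the mapping of "iterations" to gensim's `epochs` parameter are assumptions. The keyword names follow gensim 4.x.

```python
# Hyperparameters taken from this README, expressed with gensim 4.x keyword names.
# Mapping "10 iterations" to `epochs` is an assumption.
WORD2VEC_PARAMS = {
    "vector_size": 300,  # 300 dimensions
    "window": 5,         # window size of 5
    "epochs": 10,        # 10 iterations over the corpus
}


def train_word2vec(sentences, **overrides):
    """Train a word2vec model on an iterable of token lists.

    gensim is imported lazily so that this module can be loaded
    without it; install it with `pip install gensim` before calling.
    """
    from gensim.models import Word2Vec

    params = {**WORD2VEC_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)


if __name__ == "__main__":
    # Toy corpus for illustration only; min_count=1 keeps rare words in the vocabulary.
    toy_corpus = [["der", "mieter", "zahlt", "die", "miete"],
                  ["der", "vermieter", "erhaelt", "die", "miete"]]
    model = train_word2vec(toy_corpus, min_count=1)
    print(model.wv["miete"].shape)
```

The `overrides` argument is there because gensim's default `min_count` of 5 would discard every word in a toy corpus this small; for a corpus of the size reported above the default is usually reasonable.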

Contact Information

If you have any questions, please contact:

Ingo Glaser (Technical University of Munich) ingo.glaser@tum.de

[1] Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.
