DOPA METER - A Tool Suite for the Metrical Document Profiling and Aggregation

This is DOPA METER, a tool suite that brings a wide range of established text-analysis metrics together under one roof. It is based on Python 3 and spaCy. It runs best on Linux (preferably Ubuntu) and is partly supported on Microsoft Windows.

The system has a modular architecture and follows a multilingual approach. It is designed in a decentralized manner: features can be computed for various sources at different places, and the partial results can be merged afterwards.
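The decentralized idea can be illustrated with a minimal sketch: counts are computed separately per source, and only the partial results are merged. This is not DOPA METER's actual implementation. The function names `count_tokens` and `merge_counts` and the whitespace tokenization are assumptions made for illustration (the tool itself relies on spaCy).

```python
from collections import Counter


def count_tokens(text: str) -> Counter:
    """Count lowercased, whitespace-separated tokens (a stand-in for real tokenization)."""
    return Counter(text.lower().split())


def merge_counts(partials: list[Counter]) -> Counter:
    """Merge partial count tables that were computed at different places."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts instead of replacing them
    return merged


# Two sites compute their partial results independently ...
site_a = count_tokens("the patient was discharged")
site_b = count_tokens("the patient was admitted")

# ... and only the partial results are merged.
total = merge_counts([site_a, site_b])
print(total["patient"])   # 2
print(total["admitted"])  # 1
```

Because only count tables are exchanged in this sketch, the source texts themselves never have to leave the place where they are stored.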

Three components form the basis:

  1. Text corpora as input, which can be grouped into collections, together with a preprocessing pipeline,
  2. Feature Hub: a set of features that compute counts and metrics for text corpora, and
  3. A three-part analytics section:
    1. Summarization: simple reports for whole corpora and single documents,
    2. Comparison: simple comparisons (e.g., vocabulary, $n$-grams) via intersections and differences,
    3. Aggregation: clustering by k-means and t-SNE with DBSCAN.
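The comparison mode described above can be pictured with plain set operations. The following is a minimal sketch and not the tool's own code: the helper `ngrams`, the whitespace tokenization, and the two toy corpora are assumptions for illustration.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of n-grams (as tuples) over a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


corpus_a = "the patient was discharged today".split()
corpus_b = "the patient was admitted today".split()

bigrams_a = ngrams(corpus_a, 2)
bigrams_b = ngrams(corpus_b, 2)

shared = bigrams_a & bigrams_b  # intersection: bigrams found in both corpora
only_a = bigrams_a - bigrams_b  # difference: bigrams found only in corpus A
only_b = bigrams_b - bigrams_a  # difference: bigrams found only in corpus B

print(sorted(shared))  # [('patient', 'was'), ('the', 'patient')]
print(sorted(only_a))  # [('discharged', 'today'), ('was', 'discharged')]
```

The same pattern applies to vocabulary comparison by setting `n` to 1.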

Functionality

[Figure: overview of DOPA METER's functionality]

Quick Introduction

  • Installation

    • Install Python 3
    • Install spaCy language modules and other external resources via `python install_languages.py lang_install.json`
      • Works for German and English and for all spaCy-compatible languages or language modules.
      • Warnings:
  • Starting DOPA METER

    • Set up your text corpora: each corpus is a directory containing individual text files
    • Configure your `config.json`:

```json
{
  "corpora": {
    "name_corpus": {
      "path_text_data": "/path/of/your/corpus/files/",
      "language":       "de",
      "collection":     "one"
    },
    "name_other_corpus": {
      "path_text_data": "/path/of/your/corpus/files/",
      "language":       "de",
      "collection":     "two"
    },
    "name_one_more_corpus": {
      "path_text_data": "/path/of/your/corpus/files/",
      "language":       "de",
      "collection":     "two"
    }
  },
  "settings": {
    "tasks": ["features", "counts", "corpus_characteristics"],
    "store_sources": false,
    "file_format_features": ["csv"],
    "file_format_dicts": "txt"
  },
  "output": {
    "path_features": "/define/a/path/of/your/features",
    "path_summary":  "/define/a/path/of/your/summary",
    "path_counts":   "/define/a/path/of/your/counts"
  },
  "features": {
    "token_characteristics": "default",
    "surface":               "default"
  }
}
```
    • Open a terminal, change to the root directory of DOPA METER, and run `python main.py config.json`
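If you prefer to generate the configuration from a script, the example configuration above can be built as a Python dictionary and serialized with the standard `json` module. The sketch below reuses the keys and placeholder paths from the example. The helpers `check_corpora` and `corpora_by_collection` and the set `REQUIRED_CORPUS_KEYS` are hypothetical and not part of DOPA METER; treating all three corpus keys as required is an assumption of this sketch.

```python
import json
from collections import defaultdict

# Keys mirror the example config.json; all paths are placeholders.
config = {
    "corpora": {
        "name_corpus": {
            "path_text_data": "/path/of/your/corpus/files/",
            "language": "de",
            "collection": "one",
        },
        "name_other_corpus": {
            "path_text_data": "/path/of/your/corpus/files/",
            "language": "de",
            "collection": "two",
        },
        "name_one_more_corpus": {
            "path_text_data": "/path/of/your/corpus/files/",
            "language": "de",
            "collection": "two",
        },
    },
    "settings": {
        "tasks": ["features", "counts", "corpus_characteristics"],
        "store_sources": False,
        "file_format_features": ["csv"],
        "file_format_dicts": "txt",
    },
    "output": {
        "path_features": "/define/a/path/of/your/features",
        "path_summary": "/define/a/path/of/your/summary",
        "path_counts": "/define/a/path/of/your/counts",
    },
    "features": {"token_characteristics": "default", "surface": "default"},
}

# Assumption of this sketch: every corpus entry carries these three keys.
REQUIRED_CORPUS_KEYS = {"path_text_data", "language", "collection"}


def check_corpora(cfg: dict) -> None:
    """Raise ValueError if a corpus entry lacks a key used in the example."""
    for name, corpus in cfg["corpora"].items():
        missing = REQUIRED_CORPUS_KEYS - set(corpus)
        if missing:
            raise ValueError(f"corpus {name!r} is missing keys: {sorted(missing)}")


def corpora_by_collection(cfg: dict) -> dict[str, list[str]]:
    """Group corpus names by the value of their "collection" field."""
    groups = defaultdict(list)
    for name, corpus in cfg["corpora"].items():
        groups[corpus["collection"]].append(name)
    return dict(groups)


check_corpora(config)
print(corpora_by_collection(config))
# {'one': ['name_corpus'], 'two': ['name_other_corpus', 'name_one_more_corpus']}

# Serialized configuration; save this text as config.json before running main.py.
config_text = json.dumps(config, indent=2)
```

The grouping shows how the `collection` field ties several corpora together: in the example, two of the three corpora share the collection `two`.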

Detailed Documentation

  1. Installation
  2. Input and Data Preparation
  3. Functionality and Definition of Tasks
  4. Feature Hub
  5. Analytics
  6. Configuration and Run

How to cite

DOPA METER was presented at EMNLP 2023 (System Demonstrations).

Please use the following citation:

@inproceedings{lohr-hahn-2023-dopa,
    title = "{DOPA} {METER} {--} A Tool Suite for Metrical Document Profiling and Aggregation",
    author = "Lohr, Christina and Hahn, Udo",
    editor = "Feng, Yansong and Lefever, Els",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-demo.18",
    pages = "218--228",
    abstract = "We present DOPA METER, a tool suite for the metrical investigation of written language, that provides diagnostic means for its division into discourse categories, such as registers, genres, and style. The quantitative basis of our system are 120 metrics covering a wide range of lexical, syntactic, and semantic features relevant for language profiling. The scores can be summarized, compared, and aggregated using visualization tools that can be tailored according to the users{'} needs. We also showcase an application scenario for DOPA METER.",
}

Licence

DOPA METER is provided as open source under the MIT License.

Funding and Support

This work was supported by the Friedrich Schiller University Jena (JULIE Lab and FUSION group) and Leipzig University (IMISE), as well as by the BMBF within the projects SMITH (grants 01ZZ1803G and 01ZZ1803A) and GeMTeX, as part of the Medical Informatics Initiative Germany.
