CorPy

Installation

$ python3 -m pip install corpy

Only recent versions of Python 3 (3.10+) are supported by design.
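If you're not sure whether your interpreter meets this requirement, a quick check along the following lines may help. This is only a sketch: `MIN_VERSION` and `is_supported` are illustrative names, not part of CorPy.

```python
import sys

# Minimum Python version stated above; the name is illustrative only.
MIN_VERSION = (3, 10)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is at least MIN_VERSION."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    status = "supported" if is_supported() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```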

Help and feedback

If you get stuck, it's always a good idea to start by searching the documentation, available at the short URL https://corpy.rtfd.io/.

The project is developed on GitHub. You can ask for help via GitHub discussions and report bugs and give other kinds of feedback via GitHub issues. Support is provided gladly, time and other engagements permitting, but cannot be guaranteed.

What is CorPy?

A fancy plural for corpus ;) Also, a collection of handy but not especially mutually integrated tools for dealing with linguistic data. It abstracts away functionality which is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.

Here's an idea of what you can do with CorPy:

- add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa
- easily generate word clouds
- run code in a sanitized global environment (useful for debugging in interactive sessions, e.g. with Jupyter notebooks in JupyterLab)
- generate phonetic transcripts of Czech texts
- wrangle corpora in the vertical format devised originally for CWB, used also by (No)SketchEngine
- plus some command line utilities
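
To give a rough idea of what the vertical format mentioned above looks like, here is a minimal standard-library sketch. It deliberately does not use CorPy's own API. The sample data and the choice of three positional attributes (word, lemma, tag) are assumptions made for illustration; real corpora define their own attributes and structures.

```python
from typing import Iterator

# Hand-written sample: one token per line, tab-separated positional
# attributes, and XML-like structure tags on lines of their own.
SAMPLE = (
    '<doc id="example">\n'
    "<s>\n"
    "The\tthe\tDET\n"
    "cats\tcat\tNOUN\n"
    "sleep\tsleep\tVERB\n"
    "</s>\n"
    "</doc>\n"
)


def iter_sentences(vertical: str) -> Iterator[list[tuple[str, ...]]]:
    """Yield each <s> structure as a list of attribute tuples."""
    sentence: list[tuple[str, ...]] = []
    for line in vertical.splitlines():
        if line == "<s>" or line.startswith("<s "):
            sentence = []  # opening sentence tag: start collecting
        elif line == "</s>":
            yield sentence  # closing sentence tag: emit what we have
            sentence = []
        elif line.startswith("<") and line.endswith(">"):
            continue  # other structure tags; a simplification
        elif line:
            sentence.append(tuple(line.split("\t")))


for sent in iter_sentences(SAMPLE):
    print([word for word, lemma, tag in sent])
```

Treating every line that starts with `<` and ends with `>` as a structure tag is a shortcut that could misfire on unusual token lines, so a proper parser should be preferred for real data.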

Note

Should I pick UDPipe or MorphoDiTa?

Both are developed at ÚFAL MFF UK. UDPipe has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.

By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward -- it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens, etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, using the CoNLL-U data format.
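
To make the point about additional and multi-word tokens more concrete, the sketch below reads a small hand-written CoNLL-U fragment using only the standard library. The fragment is not actual UDPipe output and its annotations are simplified. The function name `split_tokens` is made up for this example. In CoNLL-U, a multi-word token is a line whose ID is a range such as `1-2`, followed by one line per syntactic word it covers.

```python
# Hand-written CoNLL-U fragment (10 tab-separated columns per token line).
CONLLU = "\n".join([
    "# text = vámonos al mar",
    "1-2\tvámonos\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tvamos\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tnos\tnosotros\tPRON\t_\t_\t1\tobj\t_\t_",
    "3-4\tal\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\ta\ta\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tel\tel\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tmar\tmar\tNOUN\t_\t_\t1\tobl\t_\t_",
])


def split_tokens(conllu: str) -> tuple[list[str], list[str]]:
    """Return (surface tokens, syntactic words) from a CoNLL-U string."""
    surface: list[str] = []
    words: list[str] = []
    covered_until = 0  # last word ID covered by a multi-word token
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Multi-word token: the range covers the following words.
            covered_until = int(tok_id.split("-")[1])
            surface.append(form)
        elif "." in tok_id:
            continue  # empty nodes are ignored in this sketch
        else:
            words.append(form)
            if int(tok_id) > covered_until:
                surface.append(form)  # an ordinary single-word token
    return surface, words


surface, words = split_tokens(CONLLU)
print(surface)  # ['vámonos', 'al', 'mar']
print(words)    # ['vamos', 'nos', 'a', 'el', 'mar']
```

Three surface tokens correspond to five syntactic words here. If you only need one annotated unit per token of running text, this is the kind of extra bookkeeping that MorphoDiTa's simpler output avoids.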

MorphoDiTa can also help you if you just want to tokenize text and don't have a language model available.

Development

Dependencies and building the docs

corpy needs to be installed in the ReadTheDocs virtualenv for autodoc to work. The optional dependencies in the doc group are also needed. This is all configured in .readthedocs.yaml.

License

Distributed under the GNU General Public License v3.
