$ python3 -m pip install corpy
Only recent versions of Python 3 (3.10+) are supported by design.
If you get stuck, a good first step is to search the documentation, available at the short URL https://corpy.rtfd.io/.
The project is developed on GitHub. You can ask for help via GitHub discussions and report bugs and give other kinds of feedback via GitHub issues. Support is provided gladly, time and other engagements permitting, but cannot be guaranteed.
A fancy plural for corpus ;) Also, a collection of handy but not especially mutually integrated tools for dealing with linguistic data. It abstracts away functionality which is often needed in practice for teaching and/or day-to-day work at the Czech National Corpus, without aspiring to be a fully featured or consistent NLP framework.
Here's an idea of what you can do with CorPy:
- add linguistic annotation to raw textual data using either UDPipe or MorphoDiTa
- easily generate word clouds
- run code in a sanitized global environment (useful for debugging in interactive sessions, e.g. with Jupyter notebooks in JupyterLab)
- generate phonetic transcripts of Czech texts
- wrangle corpora in the vertical format originally devised for CWB and also used by (No)SketchEngine
- plus some command line utilities
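For readers who have not met the vertical format mentioned in the list above: it stores one token per line, with tab-separated attributes, and marks structures such as documents and sentences with XML-like tags on their own lines. The sketch below uses only the Python standard library and does not use CorPy's own API; the sample data, the attribute columns (word, lemma, tag) and the `iter_sentences` helper are all hypothetical and serve only to illustrate the layout.

```python
from typing import Iterator

# Hand-written sample in vertical format (an assumption for illustration):
# one token per line, attributes separated by tabs, structures as tags.
SAMPLE = """\
<doc id="example">
<s>
The\tthe\tDT
cat\tcat\tNN
sleeps\tsleep\tVBZ
.\t.\tSENT
</s>
</doc>
"""


def iter_sentences(vertical: str) -> Iterator[list[tuple[str, ...]]]:
    """Yield each <s>...</s> block as a list of attribute tuples.

    Hypothetical helper, not part of CorPy. It treats any line that starts
    with "<" and ends with ">" as a structural tag, which is a
    simplification that real corpora may violate.
    """
    sentence: list[tuple[str, ...]] = []
    for line in vertical.splitlines():
        if line == "</s>":
            # End of sentence: hand over the collected tokens.
            yield sentence
            sentence = []
        elif line.startswith("<") and line.endswith(">"):
            # Other structural tags (<doc ...>, <s>, </doc>) carry no tokens.
            continue
        elif line:
            # A token line: split it into its positional attributes.
            sentence.append(tuple(line.split("\t")))


for sent in iter_sentences(SAMPLE):
    print(" ".join(token[0] for token in sent))
```

Real corpora are usually too large to hold in a single string, so a practical version would iterate over an open file line by line instead.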
Note
Should I pick UDPipe or MorphoDiTa?
Both are developed at ÚFAL MFF UK. UDPipe has more features at the cost of being somewhat more complex: it does both morphological tagging (including lemmatization) and syntactic parsing, and it handles a number of different input and output formats. You can also download pre-trained models for many different languages.
By contrast, MorphoDiTa only has pre-trained models for Czech and English, and only performs morphological tagging (including lemmatization). However, its output is more straightforward -- it just splits your text into tokens and annotates them, whereas UDPipe can (depending on the model) introduce additional tokens necessary for a more explicit analysis, add multi-word tokens etc. This is because UDPipe is tailored to the type of linguistic analysis conducted within the Universal Dependencies project, using the CoNLL-U data format.
MorphoDiTa can also help you if you just want to tokenize text and don't have a language model available.
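To make the remark about multi-word tokens more concrete, here is a minimal sketch of how they look in CoNLL-U. A multi-word token occupies a line whose ID is a range such as `1-2`, followed by one line per syntactic word it covers. The sample is hand-written in the style of the Universal Dependencies format documentation, not actual UDPipe output; only the ID, FORM and LEMMA columns are filled in, and the `row` and `split_tokens` helpers are hypothetical, not part of CorPy or UDPipe.

```python
def row(token_id: str, form: str, lemma: str = "_") -> str:
    """Build a 10-column CoNLL-U line, leaving unused columns as "_"."""
    return "\t".join([token_id, form, lemma] + ["_"] * 7)


# Illustrative sample (an assumption, not real tagger output).
SAMPLE = "\n".join([
    "# text = vámonos al mar",
    row("1-2", "vámonos"),        # multi-word token covering words 1 and 2
    row("1", "vamos", "ir"),
    row("2", "nos", "nosotros"),
    row("3-4", "al"),             # multi-word token covering words 3 and 4
    row("3", "a", "a"),
    row("4", "el", "el"),
    row("5", "mar", "mar"),
])


def split_tokens(conllu: str) -> tuple[list[str], list[str]]:
    """Return (surface tokens, syntactic words) for CoNLL-U input."""
    surface: list[str] = []
    words: list[str] = []
    covered_until = 0  # last word ID covered by a multi-word token
    for line in conllu.splitlines():
        if not line:
            covered_until = 0  # blank line ends a sentence, IDs restart
            continue
        if line.startswith("#"):
            continue  # comment line
        token_id, form = line.split("\t")[:2]
        if "-" in token_id:
            # Range ID: the surface form as it appears in the text.
            surface.append(form)
            covered_until = int(token_id.split("-")[1])
        elif "." in token_id:
            continue  # empty nodes are neither surface tokens nor words
        else:
            words.append(form)
            if int(token_id) > covered_until:
                # Not part of a multi-word token, so also a surface token.
                surface.append(form)
    return surface, words


surface, words = split_tokens(SAMPLE)
print(surface)
print(words)
```

The two lists differ in length because the analysis splits some surface tokens into several words, which is the kind of extra tokens the note refers to.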
corpy needs to be installed in the ReadTheDocs virtualenv for autodoc to work. The optional dependencies in the doc group are also needed. This is all configured in .readthedocs.yml.
Copyright © 2016--present ÚČNK/David Lukeš
Distributed under the GNU General Public License v3.