Update Preprocess Text documentation
ajdapretnar committed Nov 27, 2020
1 parent 8ecb806 commit 50a65cb
Showing 3 changed files with 8 additions and 11 deletions.
Binary file removed doc/widgets/images/Preprocess-Text-stamped.png
Binary file added doc/widgets/images/PreprocessText.png
19 changes: 8 additions & 11 deletions doc/widgets/preprocesstext.md
@@ -11,14 +11,11 @@ Preprocesses corpus with selected methods.

- Corpus: Preprocessed corpus.

**Preprocess Text** splits your text into smaller units (tokens), filters them, runs [normalization](https://en.wikipedia.org/wiki/Stemming) (stemming, lemmatization), creates [n-grams](https://en.wikipedia.org/wiki/N-gram) and tags tokens with [part-of-speech](https://en.wikipedia.org/wiki/Part_of_speech) labels. Steps in the analysis are applied sequentially and can be turned on or off.
**Preprocess Text** splits your text into smaller units (tokens), filters them, runs [normalization](https://en.wikipedia.org/wiki/Stemming) (stemming, lemmatization), creates [n-grams](https://en.wikipedia.org/wiki/N-gram) and tags tokens with [part-of-speech](https://en.wikipedia.org/wiki/Part_of_speech) labels. Steps in the analysis are applied sequentially and can be reordered; click and drag a preprocessor to change the order.
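
Under the hood the widget relies heavily on NLTK, so the sequence above can be previewed outside Orange. Here is a minimal sketch of a comparable default pipeline in plain NLTK, assuming NLTK with its `stopwords` data downloaded; it mirrors the widget's steps, not its internal API:

```python
# A minimal sketch of a comparable pipeline in plain NLTK (illustrative
# only; it mirrors the widget's steps, not its internal API).
import re
from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.util import ngrams

text = "Preprocessing splits, filters and normalizes texts."
tokens = re.findall(r"\w+", text.lower())       # transformation + tokenization
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]   # filtering (stopwords)
stem = PorterStemmer().stem
tokens = [stem(t) for t in tokens]              # normalization (stemming)
print(tokens)                                   # one-grams
print(list(ngrams(tokens, 2)))                  # two-grams (N-grams Range)
```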

![](images/Preprocess-Text-stamped.png)
![](images/PreprocessText.png)

1. **Information on preprocessed data**.
*Document count* reports on the number of documents on the input.
*Total tokens* counts all the tokens in corpus.
*Unique tokens* excludes duplicate tokens and reports only on unique tokens in the corpus.
1. Available preprocessors.
2. **Transformation** transforms input data. It applies lowercase transformation by default.
- *Lowercase* will turn all text to lowercase.
- *Remove accents* will remove all diacritics/accents in text.
@@ -37,27 +34,27 @@ Preprocesses corpus with selected methods.
- [Regexp](https://en.wikipedia.org/wiki/Regular_expression) will split the text by the provided regex. By default it splits on words only (omitting punctuation).
- *Tweet* will split the text with a pre-trained Twitter model, which keeps hashtags, emoticons and other special symbols (see the tokenization sketch after this list).
This example. :-) #simple → (This), (example), (.), (:-)), (#simple)
4. **Normalization** applies stemming and lemmatization to words. (I've always loved cats. → I have alway love cat.) For languages other than English use Snowball Stemmer (offers languages available in its NLTK implementation).
4. **Normalization** applies stemming and lemmatization to words. (I've always loved cats. → I have alway love cat.) For languages other than English, use the Snowball Stemmer (which offers the languages available in its NLTK implementation) or UDPipe. See the normalization sketch after this list.
- [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) applies the original Porter stemmer.
- [Snowball Stemmer](http://snowballstem.org/) applies an improved version of the Porter stemmer (Porter2). Set the language for normalization; the default is English.
- [WordNet Lemmatizer](http://wordnet.princeton.edu/) applies a network of cognitive synonyms to tokens, based on a large lexical database of English.
- [UDPipe](http://ufal.mff.cuni.cz/udpipe/1) applies a [pre-trained model](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998?show=full) for normalizing data.
5. **Filtering** removes or keeps a selection of words.
- *Stopwords* removes stopwords from text (e.g. 'and', 'or', 'in'...). Select the language to filter by; English is the default. You can also load your own list of stopwords provided in a simple \*.txt file with one stopword per line.
![](images/stopwords.png)
Click the 'browse' icon to select the file containing stopwords. If the file is properly loaded, its name is displayed next to the pre-loaded stopwords. Change 'English' to 'None' if you wish to filter out only the provided stopwords. Click the 'reload' icon to reload the list of stopwords.
- *Lexicon* keeps only the words provided in the file. Load a \*.txt file with one word per line to use as the lexicon. Click the 'reload' icon to reload the lexicon.
- *Regexp* removes words that match the regular expression. The default removes punctuation.
- *Document frequency* keeps tokens that appear in not less than and not more than the specified number / percentage of documents. If you provide integers as parameters, it keeps only tokens that appear in the specified number of documents. E.g. DF = (3, 5) keeps only tokens that appear in 3 or more and 5 or less documents. If you provide floats as parameters, it keeps only tokens that appear in the specified percentage of documents. E.g. DF = (0.3, 0.5) keeps only tokens that appear in 30% to 50% of documents. Default returns all tokens.
- *Document frequency* keeps tokens that appear in no fewer and no more than the specified number / percentage of documents. Absolute bounds keep only tokens that appear in the specified number of documents; e.g. DF = (3, 5) keeps only tokens that appear in 3 to 5 documents. Relative bounds keep only tokens that appear in the specified percentage of documents; e.g. DF = (0.3, 0.5) keeps only tokens that appear in 30% to 50% of documents. See the document-frequency sketch after this list.
- *Most frequent tokens* keeps only the specified number of most frequent tokens. The default is the 100 most frequent tokens.
6. **N-grams Range** creates n-grams from tokens. The numbers specify the range of n-grams; the default returns one-grams and two-grams.
7. **POS Tagger** runs part-of-speech tagging on tokens.
- [Averaged Perceptron Tagger](https://spacy.io/blog/part-of-speech-pos-tagger-in-python) runs POS tagging with Matthew Honnibal's averaged perceptron tagger.
- [Treebank POS Tagger (MaxEnt)](http://web.mit.edu/6.863/www/fall2012/projects/writeups/max-entropy-nltk.pdf) runs POS tagging with a trained Penn Treebank model.
- [Stanford POS Tagger](http://nlp.stanford.edu/software/tagger.shtml#Download) runs a log-linear part-of-speech tagger designed by Toutanova et al. Please download it from the provided website and load it in Orange: load the language-specific model in the Model section and *stanford-postagger.jar* in the Tagger section.
8. Produce a report.
8. Preview of preprocessed data.
9. If *Commit Automatically* is on, changes are communicated automatically. Alternatively press *Commit*.
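
Since the tokenizers (3) and normalizers (4) above come from NLTK, their behaviour is easy to preview outside the widget. A small sketch reproducing the examples from the list, assuming NLTK with its `wordnet` data downloaded:

```python
from nltk.tokenize import RegexpTokenizer, TweetTokenizer
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

# Tokenization (step 3): word-only regexp vs. the Twitter-aware model.
text = "This example. :-) #simple"
print(RegexpTokenizer(r"\w+").tokenize(text))  # → ['This', 'example', 'simple']
print(TweetTokenizer().tokenize(text))         # → ['This', 'example', '.', ':-)', '#simple']

# Normalization (step 4): stemmers trim suffixes; the lemmatizer looks
# words up in WordNet (requires nltk.download("wordnet")).
words = ["always", "loved", "cats"]
print([PorterStemmer().stem(w) for w in words])             # → ['alway', 'love', 'cat']
print([SnowballStemmer("english").stem(w) for w in words])  # → ['alway', 'love', 'cat']
print([WordNetLemmatizer().lemmatize(w) for w in words])    # → ['always', 'loved', 'cat']
```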

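The absolute and relative bounds of the *Document frequency* filter (5) can be mimicked with a short helper; `df_filter` below is a hypothetical name written for illustration, not part of Orange's API:

```python
# Hypothetical helper mimicking the *Document frequency* filter
# (df_filter is illustrative; it is not part of Orange's API).
def df_filter(docs, lo, hi):
    """Keep tokens whose document frequency is within [lo, hi].
    Integer bounds count documents; float bounds are fractions of the corpus."""
    n = len(docs)
    counts = {}
    for doc in docs:
        for token in set(doc):            # count each token once per document
            counts[token] = counts.get(token, 0) + 1
    lo = lo * n if isinstance(lo, float) else lo
    hi = hi * n if isinstance(hi, float) else hi
    return [[t for t in doc if lo <= counts[t] <= hi] for doc in docs]

docs = [["cat", "dog"], ["cat", "fish"], ["cat", "dog", "bird"]]
print(df_filter(docs, 2, 3))      # absolute: keeps 'cat' (3 docs) and 'dog' (2 docs)
print(df_filter(docs, 0.5, 1.0))  # relative: keeps tokens in 50-100% of documents
```
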
**Note**! Preprocess Text applies preprocessing steps in the order they are listed. This means it will first transform the text, then apply tokenization, POS tags, normalization, filtering and finally constructs n-grams based on given tokens. This is especially important for WordNet Lemmatizer since it requires POS tags for proper normalization.
**Note**! Preprocess Text applies preprocessing steps in the order they are listed. A good order is to first transform the text, then tokenize, tag parts of speech, normalize, and filter, and finally construct n-grams from the resulting tokens. This is especially important for the WordNet Lemmatizer, since it requires POS tags for proper normalization (see the sketch below).
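
To see why POS tags matter for the WordNet Lemmatizer, compare lemmatizing with and without them. A sketch assuming NLTK with the `punkt`, `averaged_perceptron_tagger` and `wordnet` data downloaded; `wordnet_pos` is an illustrative helper:

```python
import nltk  # assumes punkt, averaged_perceptron_tagger and wordnet data
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS (illustrative helper)."""
    return {"J": wordnet.ADJ, "V": wordnet.VERB,
            "R": wordnet.ADV}.get(penn_tag[0], wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("I've always loved cats")
for word, tag in nltk.pos_tag(tokens):   # averaged perceptron tagger
    print(word,
          lemmatizer.lemmatize(word),                    # no POS: 'loved' stays 'loved'
          lemmatizer.lemmatize(word, wordnet_pos(tag)))  # with POS: 'loved' → 'love'
```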

Useful Regular Expressions
--------------------------
