PELITK: Pitt English Language Institute ToolKit

pelitk is a Python package that implements a range of lexical analysis tools useful for Second Language Acquisition (SLA) research. Its modules can be imported and used directly in Python. At present, two modules are available:

  1. conc.py - functions for creating concordances to show selected key words in context
  2. lex.py - functions measuring lexical sophistication and diversity using a range of indices

Folder contents

| File | File type | Description |
| --- | --- | --- |
| docs | folder | contains CONC.MD and LEX.MD |
| CONC.MD | markdown | describes the conc.py module |
| LEX.MD | markdown | describes the lex.py module |
| LICENSE.txt | text | GNU General Public License under which pelitk is distributed |
| pelitk | folder | contains the data/wordlists folder and the Python modules conc.py and lex.py |
| data/wordlists | folder | contains the wordlists required by lex.py |
| conc.py | Python script | module providing the concordancing functions |
| lex.py | Python script | module providing the lexical measurement functions |
| README.md | markdown | describes pelitk |
| requirements.txt | text | lists the Python modules that must be installed for pelitk to function |
| setup.py | Python script | contains package metadata and the code required for installation |

Installation

To install pelitk, enter the following at the command line:

pip install git+https://github.com/ELI-Data-Mining-Group/pelitk.git@master

In addition, lex.py requires the Python modules listed in requirements.txt.

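Once installed, the package's modules can be imported directly. As a quick check that the installation worked (importing lex alongside conc is assumed here to mirror the conc import shown later in this README):

>>> from pelitk import conc, lex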

conc.py

Essentially, a concordance is a list of words or phrases from a text, presented with their immediate contexts. Concordancing has long been an integral part of corpus investigations; as John Sinclair describes,

"The normal starting point for a corpus investigation is the concordance, which from early days in computing has used the [Key Word In Context (KWIC)] format, where instances of a chosen word or phrase (the NODE) are presented in a layout that aligns occurrences of the node vertically, but otherwise keeps them in the order in which they appear in the corpus."

Sinclair (2003, xiii)

conc.py creates a concordance list based on key words in a text, with options that allow for greater user flexibility. In the example below, a short two-sentence text has been tokenized (split into a list of strings) in order to analyze the key word platypus. The output (presented in two formats) demonstrates how concordance lines provide a useful format for quickly seeing how a word (or phrase) is used in different contexts.

>>> from pelitk import conc
>>> tok_text = ['The', 'key', 'word', 'in', 'this', 'text', 'is', 'the', 'noun', 'platypus', '.',
...             'I', 'want', 'to', 'see', 'the', 'cotext', 'every', 'time', 'the', 'word', 'platypus', 'occurs', '.']

>>> print(conc.concordance(tok_text,'platypus',5))
[('this text is the noun', 'platypus', '. I want to see'),
('cotext every time the word', 'platypus', 'occurs .   ')]

>>> print(conc.concordance(tok_text,'platypus',5,pretty=True))
['                   this text is the noun   platypus   . I want to see                         ',
 '              cotext every time the word   platypus   occurs .                                ']

Looking at the function more closely, we see that there are three required arguments and two optional arguments:

| Argument | Description |
| --- | --- |
| tok_text | a list of tokens or a list of tuples, e.g. ['the', 'word'] or [('the', 'DT'), ('word', 'NN')] |
| node | the node word or tuple that will be the focus of the concordance lines |
| num | the size of the collocation span, i.e. how many words on either side of the node |
| pos | optional True/False argument (default is False). Set to True if tok_text is a list of tuples with POS tags (see the tok_text example above) |
| pretty | optional True/False argument (default is False). If True, the output is formatted so that the node words are aligned vertically and each row is joined into a single string |

Returning to the example, we have selected a span of 5 words on either side of the key word (or node). This is a common span size, but it could be increased to provide greater context. The second output shows the difference when the pretty argument is set to True: the 'pretty' format is easier to scan visually, but more difficult to process further.

It is also possible to use conc.py with a list of key words rather than a single key word; a minimal sketch of this approach is shown below. For a fuller demonstration, see the PELIC_concordancing_tutorial, which compiles a concordance list for nine different verbs.
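For instance, concordance lines for several node words can be gathered by calling conc.concordance once per word and pooling the results (the key words below are illustrative, not those used in the tutorial):

>>> key_words = ['platypus', 'cotext', 'noun']   # illustrative node words
>>> all_lines = []
>>> for word in key_words:
...     all_lines.extend(conc.concordance(tok_text, word, 5))

Each call returns a list of (left context, node, right context) tuples, so extend simply pools them into a single concordance list.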

For more example code and a full description of the functions (including their arguments and sub-functions), see CONC.md and conc.py.


lex.py

There are a number of quantitative measures used for understanding and describing lexical proficiency and development. In particular, many researchers have focused on lexical sophistication (the extent to which 'advanced', lower-frequency words are used in a text rather than 'basic', high-frequency ones) and lexical diversity (the proportion of unique words in a text). For a complete discussion of lexical proficiency, see Leńko-Szymańska (2019). lex.py provides functions to calculate a number of the more commonly used metrics of sophistication and diversity, summarized briefly below.

For example code and a full description of the functions (including their arguments and sub-functions), see LEX.md and lex.py.


adv_guiraud
Calculates Advanced Guiraud (AG):

  • measure of lexical sophistication
  • formula = advanced types / sqrt(number of tokens); an illustration of the formula follows below
  • By default, the function uses the NGSL top 2k words as the frequency list of common types to ignore. Other lists can optionally be used instead.
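The following is only an illustration of the formula, not pelitk's implementation; the tiny 'common' set stands in for the NGSL top-2k list:

>>> import math
>>> tokens = ['the', 'platypus', 'is', 'a', 'monotreme', 'the', 'monotreme', 'lays', 'eggs']
>>> common = {'the', 'is', 'a'}                        # stand-in for the NGSL top-2k list
>>> advanced_types = set(tokens) - common              # types not found in the common list
>>> ag = len(advanced_types) / math.sqrt(len(tokens))  # advanced types / sqrt(tokens)
>>> round(ag, 2)
1.33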

vocd
Calculates vocD:

  • measure of lexical diversity
  • formula = TTR is calculated for a number of random sub-samples of the text, a curve is fitted to the resulting values, and the parameter of the fitted curve (D) is reported
  • by default, a minimum text length of 35 words (the default number of sub-samples) is required, though this can be adjusted

ttr
Calculates Type-Token Ratio (TTR):

  • simple measure of lexical diversity
  • formula = number of types / number of tokens in a text
  • practical to calculate but sensitive to text length (shorter texts tend to have higher TTR); a one-line illustration follows below
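Again, purely as an illustration of the formula rather than pelitk's code:

>>> tokens = ['the', 'cat', 'saw', 'the', 'dog']
>>> ttr = len(set(tokens)) / len(tokens)   # types / tokens
>>> ttr
0.8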

mtld
Calculates Measure of Textual Lexical Diversity (MTLD):

  • measure of lexical diversity
  • formula = the text is read sequentially and split into 'factors' (stretches over which the running TTR falls to a set threshold); the score reflects how many factors are needed to cover the text

maas
Calculates Maas (log 2):

  • measure of lexical diversity
  • formula = TTR with a logarithmic correction (one common formulation is sketched below)
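As a sketch of one common formulation of the Maas index (base-10 logs are assumed here; the exact variant pelitk uses is documented in LEX.md and lex.py):

>>> import math
>>> tokens = ['the', 'cat', 'saw', 'the', 'dog']
>>> n, v = len(tokens), len(set(tokens))                          # token and type counts
>>> maas = (math.log10(n) - math.log10(v)) / math.log10(n) ** 2   # lower values = greater diversity
>>> round(maas, 3)
0.198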

