Add and refactor text statistics functionality #350

bdewilde · 2021-11-22T02:25:11Z

Description

- implements a variety of text statistics measuring lexical diversity
  - ttr(): type-token ratio, in three variations (standard, root, corrected)
  - log_ttr(): logarithmically-scaled TTR, in three variations (herdan, summer, dugast)
  - segmented_ttr(): segmented TTR, in two variations (mean, moving average)
  - mtld(): Measure of Textual Lexical Diversity, a modern and text length-agnostic measure
  - hdd(): Hypergeometric Distribution Diversity, a(nother) modern and text length-agnostic measure
- implements functions for counting morphological, part-of-speech, and dependency annotations in a document
- updates all text statistics functions to accept a Doc object as their first positional argument (i.e. Callable[[Doc, ...], int | float]), so they're more directly usable and accessible
  - adds caching to some functions and improves performance of the .get_words() utils func to reduce overhead
- consolidates all TextStats readability properties under a single TextStats.readability() method, where individual statistics are specified by name; similarly, adds TextStats.diversity() and .counts() methods for accessing those statistics
- replaces the TextStatsComponent custom spaCy language component with a suite of Doc property and method extensions, settable via textacy.text_stats.set_doc_extensions(); the outcome is roughly the same -- various text stats accessible via spaCy's Doc._ -- but the process is different
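
For readers unfamiliar with the TTR family listed above, here is a minimal sketch of the textbook formulas behind the named variations. It works on a plain list of token strings rather than a spaCy Doc, and the function names `ttr_sketch` and `log_ttr_sketch` are hypothetical, chosen for illustration; they are not the signatures added in this PR, and the actual implementation may differ in details such as token filtering and error handling.

```python
import math
from collections.abc import Sequence


def ttr_sketch(tokens: Sequence[str], variant: str = "standard") -> float:
    """Type-token ratio over a list of token strings (illustrative helper)."""
    n_tokens = len(tokens)
    if n_tokens == 0:
        raise ValueError("need at least one token")
    n_types = len(set(tokens))  # number of distinct tokens
    if variant == "standard":
        return n_types / n_tokens
    if variant == "root":  # Guiraud's root TTR
        return n_types / math.sqrt(n_tokens)
    if variant == "corrected":  # Carroll's corrected TTR
        return n_types / math.sqrt(2 * n_tokens)
    raise ValueError(f"unknown variant: {variant!r}")


def log_ttr_sketch(tokens: Sequence[str], variant: str = "herdan") -> float:
    """Logarithmically-scaled TTR over a list of token strings (illustrative helper)."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_types < 2:
        # log(log(1)) is undefined, and log(1) == 0 makes Herdan's ratio degenerate
        raise ValueError("need at least two distinct tokens")
    log_n, log_v = math.log(n_tokens), math.log(n_types)
    if variant == "herdan":
        return log_v / log_n
    if variant == "summer":
        return math.log(log_v) / math.log(log_n)
    if variant == "dugast":
        if n_types == n_tokens:
            raise ValueError("dugast is undefined when every token is unique")
        return log_n ** 2 / (log_n - log_v)
    raise ValueError(f"unknown variant: {variant!r}")


tokens = "the cat sat on the mat the end".split()  # 8 tokens, 6 types
print(ttr_sketch(tokens), ttr_sketch(tokens, "root"), ttr_sketch(tokens, "corrected"))
print(log_ttr_sketch(tokens, "herdan"), log_ttr_sketch(tokens, "dugast"))
```

The plain TTR falls as a text gets longer, because types accumulate more slowly than tokens; the root, corrected, and log variants are attempts to reduce that length dependence, which is also the motivation for mtld() and hdd().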

Motivation and Context

There are lots of use cases and methods for quantifying aspects of a document, and I wanted textacy to capture more of them. I even used something like annotation counts for a research project. Also, the text_stats functionality was clunky in many ways, and I wanted to make it easier for people to use.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation, and I have updated it accordingly.

bdewilde added 26 commits November 20, 2021 20:20

Add morph labels set to constants

3dabfe4

Add base funcs for morph stats

13b29ee

Add morph counts to textstats class

190f942

Blacken textstats class

3f15567

Add functions for computing lexical diversity

6df70b2

Add docs, error msgs to lex diversity mod

36b289a

Tidy up text stats imports, formatting

44e3c77

Add textstats methods for readability, diversity

7a3b563

Remove readability textstats properties

a4a1324

Test ts readability method rather than attrs

bfcb67b

Fix MTLD calculation

d6f9aba

partial factors should now be computed correctly

Add tests for lexical diversity stats

0a5a72f

Update textstats documentation

23343a9

Fix quickstart and quickstart tests for textstats

04a139d

Simplify text stats component, update tests

8026bb1

Add DocOrTokens type to pkg

e6244fe

Update basic stats to always take doc input

ce70516

this lets us use these functions for Doc property/method extensions! to avoid re-computing potentially expensive operations, we cache a couple functions

Update textstats calls to basics funcs

0dd8936

Move funcs into text stats utils module

782fdaa

I know, I overuse utils modules... but I think it's a bit nicer than importing private funcs from all over the place, or plopping them down in strange places.

Handle too-short doc case in diversity funcs

d65462f

Compute all readability stats from doc objs

1864110

this allows us to use these functions as doc extensions, makes them easier to call directly a la carte, and moves lang-checking into the functions that need it. the only downside is a slight performance hit, but I'll take it.

Simplify TextStats calls to diversity+readability funcs

0ba4ff8

Fix readability tests for doc inputs

5f336b8

Remove text stats custom lang component

f8409d1

there's a better way

Add text stats funcs as doc extensions

3251241

Update text stats docs

6a230c6

bdewilde marked this pull request as ready for review November 30, 2021 04:01

bdewilde added 3 commits November 30, 2021 20:33

Add funcs for counting token annotations

8d9f1ce

Add annot counts to text stats api

2387ffd

Add tests for annot counts

ab43777

bdewilde added 4 commits November 30, 2021 20:34

Move morph func into spacier utils

ae17a5b

Delete superseded morph stats

617112e

Update text stats docs

9aa1f82

Delete superseded morph counts property

32c8f66

bdewilde merged commit ceeb925 into develop Dec 1, 2021

bdewilde deleted the add-morph-stats branch December 1, 2021 02:51
