
Recognize "@" in gender-neutral word endings as part of the token #60

Merged · 4 commits · Jul 24, 2018
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,5 @@
## Version 2.2

Contributor:

It looks like this section still needs to be written?

Owner Author:

Ah yeah. I started thinking I should write it, then changed my mind because I didn't know what date to put on it, and decided I'd write it after the PR was merged. But then I left it like this. I could just write the section and put tomorrow's date on it.

## Version 2.1 (2018-06-18)

Data changes:
23 changes: 17 additions & 6 deletions README.md
@@ -48,13 +48,13 @@ frequency as a decimal between 0 and 1.
1.07e-05

>>> word_frequency('café', 'en')
- 5.89e-06
+ 5.75e-06

>>> word_frequency('cafe', 'fr')
1.51e-06

>>> word_frequency('café', 'fr')
- 5.25e-05
+ 5.13e-05


`zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -78,10 +78,10 @@ one occurrence per billion words.
5.29

>>> zipf_frequency('frequency', 'en')
- 4.42
+ 4.43

>>> zipf_frequency('zipf', 'en')
- 1.55
+ 1.57

>>> zipf_frequency('zipf', 'en', wordlist='small')
0.0
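For intuition about the scale: a Zipf value is just the base-10 log of the word's frequency per billion words, so zipf_frequency can be sketched in terms of word_frequency. A minimal sketch under that definition (not wordfreq's internal code):

    import math
    from wordfreq import word_frequency, zipf_frequency

    def zipf_from_freq(freq):
        # Zipf value: log10 of the frequency per billion words
        return round(math.log10(freq) + 9.0, 2) if freq > 0 else 0.0

    print(zipf_frequency('frequency', 'en'))                   # 4.43
    print(zipf_from_freq(word_frequency('frequency', 'en')))   # should match, up to rounding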
@@ -276,7 +276,8 @@ produces tokens that follow the recommendations in [Unicode
Annex #29, Text Segmentation][uax29], including the optional rule that
splits words between apostrophes and vowels.

- There are language-specific exceptions:
+ There are exceptions where we change the tokenization to work better
+ with certain languages:

- In Arabic and Hebrew, it additionally normalizes ligatures and removes
combining marks.
@@ -288,19 +289,29 @@ There are language-specific exceptions:
- In Chinese, it uses the external Python library `jieba`, another optional
dependency.

- While the @ sign is usually considered a symbol and not part of a word,
wordfreq will allow a word to end with "@" or "@s". This is one way of
writing gender-neutral words in Spanish and Portuguese.

[uax29]: http://unicode.org/reports/tr29/
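As an aside, the new @ rule can be pictured as a small regex. This is a toy illustration only; wordfreq's real tokenizer implements the Unicode segmentation rules above, with cases this sketch omits:

    import re

    # Toy pattern: word characters, optionally ending in "@" or "@s", where
    # the "@" may not be followed by further word characters. The standalone
    # form "@s" and pairs like "all:all" need extra cases, omitted here.
    AT_WORD_RE = re.compile(r"\w+(?:@s?)?(?!\w)")

    print(AT_WORD_RE.findall("l@s niñ@s"))    # ['l@s', 'niñ@s']
    print(AT_WORD_RE.findall("all:all@all"))  # ['all', 'all', 'all']: a mid-word '@' does not attach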

When wordfreq's frequency lists are built in the first place, the words are
tokenized according to this function.

>>> from wordfreq import tokenize
>>> tokenize('l@s niñ@s', 'es')
['l@s', 'niñ@s']
>>> zipf_frequency('l@s', 'es')
2.8

Because tokenization in the real world is far from consistent, wordfreq will
also try to deal gracefully when you query it with texts that actually break
into multiple tokens:

>>> zipf_frequency('New York', 'en')
5.28
>>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
- 3.57
+ 3.61

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
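A sketch of that combination step, assuming the combined estimate is simply 1 / (1/f1 + ... + 1/fn) over the token frequencies (half the harmonic mean when there are two tokens; any extra penalties, such as the Chinese one mentioned above, are ignored here):

    from wordfreq import word_frequency

    def half_harmonic_mean(freqs):
        # For two equal frequencies f, this returns f / 2: half their harmonic mean.
        return 1.0 / sum(1.0 / f for f in freqs)

    estimate = half_harmonic_mean([
        word_frequency('new', 'en'),
        word_frequency('york', 'en'),
    ])
    print(estimate)  # comparable to word_frequency('New York', 'en')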
2 changes: 1 addition & 1 deletion setup.py
@@ -35,7 +35,7 @@

setup(
    name="wordfreq",
-     version='2.1.0',
+     version='2.2.0',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',
109 changes: 109 additions & 0 deletions tests/test_at_sign.py
@@ -0,0 +1,109 @@
from wordfreq import tokenize, lossy_tokenize, word_frequency


def test_gender_neutral_at():
    # Recognize the gender-neutral @ in Spanish as part of the word
    text = "La protección de los derechos de tod@s l@s trabajador@s migrantes"
    assert tokenize(text, "es") == [
        "la",
        "protección",
        "de",
        "los",
        "derechos",
        "de",
        "tod@s",
        "l@s",
        "trabajador@s",
        "migrantes"
    ]

    text = "el distrito 22@ de Barcelona"
    assert tokenize(text, 'es') == ["el", "distrito", "22@", "de", "barcelona"]
    assert lossy_tokenize(text, 'es') == ["el", "distrito", "00@", "de", "barcelona"]
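    # (The "00@" result suggests lossy_tokenize also maps digit sequences to
    # zeroes, so all two-digit numbers share one frequency estimate; that
    # reading is inferred from this assertion rather than stated here.)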

    # It also appears in Portuguese
    text = "direitos e deveres para @s membr@s da comunidade virtual"
    assert tokenize(text, "pt") == [
        "direitos",
        "e",
        "deveres",
        "para",
        "@s",
        "membr@s",
        "da",
        "comunidade",
        "virtual"
    ]

    # Because this is part of our tokenization, the language code doesn't
    # actually matter, as long as it's a language with Unicode tokenization
    text = "@s membr@s da comunidade virtual"
    assert tokenize(text, "en") == ["@s", "membr@s", "da", "comunidade", "virtual"]


def test_at_in_corpus():
    # We have a word frequency for "l@s"
    assert word_frequency('l@s', 'es') > 0

    # It's not just treated as a word break
    assert word_frequency('l@s', 'es') < word_frequency('l s', 'es')
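    # ("l s" breaks into the very common single-letter tokens "l" and "s",
    # whose combined estimate is high, so "l@s" scoring lower shows it was
    # counted as a word of its own rather than split apart.)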


def test_punctuation_at():
    # If the @ appears alone in a word, we consider it to be punctuation
    text = "operadores de canal, que são aqueles que têm um @ ao lado do nick"
    assert tokenize(text, "pt") == [
        "operadores",
        "de",
        "canal",
        "que",
        "são",
        "aqueles",
        "que",
        "têm",
        "um",
        "ao",
        "lado",
        "do",
        "nick"
    ]

    assert tokenize(text, "pt", include_punctuation=True) == [
        "operadores",
        "de",
        "canal",
        ",",
        "que",
        "são",
        "aqueles",
        "que",
        "têm",
        "um",
        "@",
        "ao",
        "lado",
        "do",
        "nick"
    ]

    # If the @ is not at the end of the word or part of the word ending '@s',
    # it is also punctuation
    text = "un archivo hosts.deny que contiene la línea ALL:ALL@ALL"
    assert tokenize(text, "es") == [
        "un",
        "archivo",
        "hosts.deny",
        "que",
        "contiene",
        "la",
        "línea",
        "all:all",
        "all"
    ]

    # Make sure not to catch e-mail addresses
    text = "info@something.example"
    assert tokenize(text, "en") == [
        "info",
        "something.example"
    ]
2 changes: 1 addition & 1 deletion tests/test_chinese.py
@@ -59,7 +59,7 @@ def test_tokens():

def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')  # "Thanks"
-     assert word_frequency('谢谢谢谢', 'zh') == pytest.approx(xiexie_freq / 20)
+     assert word_frequency('谢谢谢谢', 'zh') == pytest.approx(xiexie_freq / 20, rel=0.01)
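The arithmetic behind the / 20, as a sketch: the half-harmonic-mean of two equal frequencies f is f / 2, and the remaining factor of 10 is assumed here to be wordfreq's penalty for the word break it has to infer inside '谢谢谢谢'.

    from wordfreq import word_frequency

    f = word_frequency('谢谢', 'zh')
    # 1 / (1/f + 1/f) = f / 2; the extra factor of 10 is an assumed
    # penalty for the inferred word break (inferred from the test above).
    estimate = (1.0 / (1.0 / f + 1.0 / f)) / 10
    print(estimate)  # should be close to word_frequency('谢谢谢谢', 'zh')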


def test_alternate_codes():