Recognize "@" in gender-neutral word endings as part of the token #60

rspeer · 2018-07-03T17:28:51Z

This PR changes the big tokenization regex to handle cases where "@" or "@s" appears at the end of a word. The regex now works around Unicode's default word segmentation and treats this "@" as a letter, because it is a way of writing gender-neutral word endings in Spanish, Portuguese, and (particularly in far-left writing) Italian.

As an example, the text "l@s niñ@s" should be tokenized as ["l@s", "niñ@s"], not as ["l", "s", "niñ", "s"].
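To make the intended behavior concrete, here is a minimal sketch using only Python's standard `re` module. It is not the regex changed in this PR: `LETTERS`, `TOKEN_RE`, `tokenize`, and `tokenize_letters_only` are hypothetical names, and the pattern is a simplified stand-in that only shows how a word-final "@" or "@s" can be kept attached to the preceding letters.

```python
import re

# Hypothetical, simplified pattern (an assumption, not the regex in this PR):
# a run of letters, optionally followed by "@" or "@s" when that ends the word.
LETTERS = r"[^\W\d_]+"
TOKEN_RE = re.compile(LETTERS + r"(?:@s?(?!\w))?")


def tokenize(text):
    """Return lowercase tokens, keeping a word-final "@" or "@s" attached."""
    return TOKEN_RE.findall(text.lower())


def tokenize_letters_only(text):
    """Baseline for comparison: letters only, so "@" always splits a word."""
    return re.findall(LETTERS, text.lower())


print(tokenize("l@s niñ@s"))
print(tokenize_letters_only("l@s niñ@s"))
print(tokenize("user@example.com"))
```

The negative lookahead `(?!\w)` is what limits the special case to word endings. In this sketch, an "@" followed by more letters, as in an email address, still splits the text.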

The endings "x" and "xs" are becoming more common in Spanish for this purpose, but these are already tokenized correctly. On the other hand, only the "@" version is attested in Portuguese. This steered me away from my initial plan to replace "@" with "x" in these endings in a pre-processing step.
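For comparison, the pre-processing step I decided against might have looked roughly like the sketch below. This is an illustration under assumptions, not code from this PR: `ENDING_RE` and `replace_at_with_x` are hypothetical names.

```python
import re

# Hypothetical helper illustrating the rejected approach; not code from this PR.
# Matches "@" or "@s" that follows a word character and ends the word.
ENDING_RE = re.compile(r"(?<=\w)@(s?)(?!\w)")


def replace_at_with_x(text):
    """Rewrite word-final "@" / "@s" as "x" / "xs" before tokenizing."""
    return ENDING_RE.sub(r"x\1", text)


print(replace_at_with_x("l@s niñ@s"))
print(replace_at_with_x("tod@s"))
```

Applied to Portuguese text, a step like this would rewrite "tod@s" as "todxs", producing a spelling that is not the attested one there.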

This version now includes the new data from exquisite-corpus, so it has the words with @ in them, as well as some cleaner data from ParaCrawl.

Tahnan · 2018-07-23T19:35:36Z

CHANGELOG.md

@@ -1,3 +1,5 @@
+## Version 2.2
+


It looks like this section still needs to be written?

rspeer · 2018-07-23

Ah yeah. I started thinking I should write it, then changed my mind because I didn't know what date to put on it, and decided I'd write it after the PR was merged. But then I left it like this. I could just write the section and put tomorrow's date on it.

Rob Speer added 3 commits July 3, 2018 13:22

Recognize "@" in gender-neutral word endings as part of the token

b2d242e

include data from xc rebuild

d06a6a4

Update README to describe @ tokenization

0644c89

Tahnan reviewed Jul 23, 2018

update the changelog for version 2.2

d9fc6ec

Tahnan merged commit bc12599 into master Jul 24, 2018

Tahnan deleted the gender-neutral-at branch July 24, 2018 22:16

rspeer pushed a commit that referenced this pull request Jun 25, 2024

Merge pull request #60 from LuminosoInsight/gender-neutral-at

2f8600e

Recognize "@" in gender-neutral word endings as part of the token
