Recognize "@" in gender-neutral word endings as part of the token #60

Merged
merged 4 commits into from
Jul 24, 2018

rspeer
Owner

@rspeer rspeer commented Jul 3, 2018

This PR changes the big tokenization regex to handle cases where @ or @s appears at the end of a word. The regex now works around Unicode's default word segmentation and treats this @ as a letter, because it is a way of writing gender-neutral word endings in Spanish, Portuguese, and particularly far-left Italian.

As an example, the text "l@s niñ@s" should be tokenized as ["l@s", "niñ@s"], not as ["l", "s", "niñ", "s"].
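
To illustrate the intended behavior, here is a minimal sketch using Python's standard `re` module. It is not the regex from this PR: the pattern and the function names `tokenize_with_at` and `tokenize_naive` are hypothetical, and plain `\w+` matching only approximates Unicode's default segmentation.

```python
import re

# Hypothetical, simplified pattern (NOT the actual regex from this PR).
# First alternative: a word followed by "@" or "@s", as long as no further
# word character follows, so the "@" ending stays inside the token.
# Second alternative: an ordinary run of word characters.
AT_ENDING_TOKEN = re.compile(r"\w+@s?(?!\w)|\w+")

# Baseline that splits on "@", roughly like default word segmentation.
NAIVE_TOKEN = re.compile(r"\w+")


def tokenize_with_at(text):
    """Tokenize, treating a word-final "@" or "@s" as part of the word."""
    return AT_ENDING_TOKEN.findall(text)


def tokenize_naive(text):
    """Tokenize on runs of word characters only, so "@" splits the word."""
    return NAIVE_TOKEN.findall(text)
```

With this sketch, `tokenize_with_at("l@s niñ@s")` keeps the two words whole, while `tokenize_naive` breaks them at each "@". The negative lookahead `(?!\w)` is what stops an "@" in the middle of something like an email address from being absorbed into the preceding token.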

The endings "x" and "xs" are becoming more common in Spanish for this purpose, but these are already tokenized correctly. On the other hand, only the "@" version is attested in Portuguese. This steered me away from my initial plan to replace "@" with "x" in these endings in a pre-processing step.

This version now includes the new data from exquisite-corpus, so it has the words with @ in them, as well as some cleaner data from ParaCrawl.

CHANGELOG.md Outdated
@@ -1,3 +1,5 @@
## Version 2.2

Contributor

It looks like this section still needs to be written?

Owner Author

Ah yeah. I started thinking I should write it, then changed my mind because I didn't know what date to put on it, and decided I'd write it after the PR was merged. But then I left it like this. I could just write the section and put tomorrow's date on it.

@Tahnan Tahnan merged commit bc12599 into master Jul 24, 2018
@Tahnan Tahnan deleted the gender-neutral-at branch July 24, 2018 22:16
rspeer pushed a commit that referenced this pull request Jun 25, 2024
Recognize "@" in gender-neutral word endings as part of the token