Enhance text preprocessing functionality #253

Merged: bdewilde merged 20 commits into develop from update-preprocessing-options on Jun 27, 2019

@bdewilde (Collaborator) commented Jun 27, 2019

Description

  • moved existing text preprocessing functionality from a top-level preprocess module into a preprocessing sub-package, and reorganized it a bit
  • added new functionality
    • replace_hashtags() to replace hashtags like #FollowFriday or #spacyIRL2019 with _TAG_
    • replace_user_handles() to replace user handles like @bjdewilde or @spacy_io with _USER_
    • normalize_hyphenated_words() to join words split by hyphenation back together, like antici- pation => anticipation
    • normalize_quotation_marks() to replace "fancy" quotation marks with simple ASCII equivalents, like “the god particle” => "the god particle"
  • changed a couple of functions for clarity and consistency
    • replace_currency_symbols() now replaces all dedicated ASCII and Unicode currency symbols with _CUR_, rather than just a subset of them, and no longer supports replacement with the corresponding currency code (like $ => USD)
    • remove_punct() now takes a fast (bool) kwarg rather than method (str), which is simpler and makes the difference between the two options clearer
  • removed some bad/awkward functionality
    • normalize_contractions(): this was a clunky, slow, and very limited attempt; better to use a separate, dedicated package
    • preprocess_text(): this was an awkward attempt at user convenience; better to let users mix and match their preprocessing pipeline as needed
  • added more and better tests for all of the above
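
To make the behavior listed above concrete, here is a minimal standalone sketch of the new and changed functions. It borrows the function names from this PR, but the regex patterns, the `replace_with` keyword, the use of Unicode category `Sc` for currency symbols, and the `preprocess()` helper are all assumptions made for illustration; the code actually merged here may differ.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only; the package's own regexes may differ.
RE_HASHTAG = re.compile(r"(?<!\w)#\w+")
RE_USER_HANDLE = re.compile(r"(?<!\w)@\w+")
RE_HYPHENATED_WORD = re.compile(r"(\w{2,})-\s+(\w{2,})")
QUOTE_TRANSLATION = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
})


def replace_hashtags(text, replace_with="_TAG_"):
    """Replace hashtags such as #FollowFriday with a placeholder."""
    return RE_HASHTAG.sub(replace_with, text)


def replace_user_handles(text, replace_with="_USER_"):
    """Replace user handles such as @spacy_io with a placeholder."""
    return RE_USER_HANDLE.sub(replace_with, text)


def normalize_hyphenated_words(text):
    """Rejoin words split as 'antici- pation' (hyphen followed by whitespace)."""
    return RE_HYPHENATED_WORD.sub(r"\1\2", text)


def normalize_quotation_marks(text):
    """Map curly single and double quotes to their ASCII equivalents."""
    return text.translate(QUOTE_TRANSLATION)


def replace_currency_symbols(text, replace_with="_CUR_"):
    """Replace every character in Unicode category Sc (assumed definition)."""
    return "".join(
        replace_with if unicodedata.category(ch) == "Sc" else ch for ch in text
    )


def preprocess(text, *funcs):
    """Hypothetical helper: apply the chosen functions in order."""
    for func in funcs:
        text = func(text)
    return text


cleaned = preprocess(
    "\u201cGreat\u201d #spacyIRL2019 talk by @bjdewilde",
    normalize_quotation_marks,
    replace_hashtags,
    replace_user_handles,
)
print(cleaned)
```

With preprocess_text() removed, composing small single-purpose functions like this is one way to build the mix-and-match pipeline the description recommends: callers choose which steps to run and in what order.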

Motivation and Context

This part of the code base has acquired some cobwebs over the years, and Issue #250 reminded me that it needed more than a hotfix.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation, and I have updated it accordingly.

@bdewilde bdewilde marked this pull request as ready for review June 27, 2019 18:18
bdewilde added 4 commits June 27, 2019 13:30
since they've since been moved into preprocessing.resources
this is not the solution we want, and a better one is much more involved
@bdewilde bdewilde merged commit 8cf298e into develop Jun 27, 2019
@bdewilde bdewilde deleted the update-preprocessing-options branch June 27, 2019 21:03