-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: remove lower and nfc_normalization from default cleaners #482
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just finicky comments about re-installing your black to the agreed version.
everyvoice/text/text_processor.py
Outdated
) -> list[str]: | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is black 23 formatting, please re-install requirements.dev.txt
to get black 24 and revert this change.
everyvoice/wizard/dataset.py
Outdated
self.state[ | ||
"model_target_training_text_representation" | ||
] = apply_automatic_text_conversions( | ||
self.state["filelist_data"], | ||
self.state[StepNames.filelist_text_representation_step], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More black 23 formatting.
everyvoice/wizard/dataset.py
Outdated
self.state[ | ||
"model_target_training_text_representation" | ||
] = apply_automatic_text_conversions( | ||
self.state["filelist_data"], | ||
self.state[StepNames.filelist_text_representation_step], | ||
global_isocode=isocode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more black 23 formatting
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## dev.ap/data-attestation #482 +/- ##
===========================================================
+ Coverage 73.96% 73.98% +0.02%
===========================================================
Files 45 45
Lines 2865 2868 +3
Branches 444 445 +1
===========================================================
+ Hits 2119 2122 +3
Misses 661 661
Partials 85 85 ☔ View full report in Codecov by Sentry. |
this should be fixed at a later date by fixing #359
fixes #321
PR Goal?
This removes lowercasing and NFC normalization as default cleaners and allows them instead to be specified in the wizard as expected given the wizard's question.
Fixes?
#321
Feedback sought?
try it out. I think as discussed with @MENGZHEGENG we should actually allow for dataset-specific cleaners instead of these global cleaners. In this PR, if someone selects an NFC cleaner for one dataset, it will get added to the global cleaners list, and apply to all datasets, which isn't correct (see #359). However, this is better than applying lowercasing and NFC normalization to every dataset regardless of what the user selects in the wizard. Let me know if you agree. I think the quick alternative would be to both remove
lower
from the default cleaner and remove the text processing question altogether until we fix this issue.Priority?
alpha
Tests added?
adjusted unittests to fit new paradigm
How to test?
the SENCOTEN data should work now for example
Confidence?
medium
Version change?
normally yes, but N/A because pre-alpha
Related PRs?