fix: remove lower and nfc_normalization from default cleaners #482

roedoejet · 2024-06-21T17:45:48Z

fixes #321

PR Goal?

This removes lowercasing and NFC normalization from the default cleaners. They can instead be selected in the wizard, which is what the wizard's text processing question already leads users to expect.

Fixes?

#321

Feedback sought?

Try it out. As discussed with @MENGZHEGENG, I think we should actually allow dataset-specific cleaners instead of these global cleaners. In this PR, if someone selects an NFC cleaner for one dataset, it gets added to the global cleaners list and applies to all datasets, which isn't correct (see #359). However, this is better than applying lowercasing and NFC normalization to every dataset regardless of what the user selects in the wizard. Let me know if you agree. The quick alternative would be to remove lower from the default cleaners and also remove the text processing question altogether until we fix this issue.
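To make the concern concrete, here is a minimal sketch of how a single list of cleaners behaves, assuming cleaners are plain `str -> str` functions applied in order. The names `lower`, `nfc_normalize`, `apply_cleaners`, `old_defaults`, and `new_defaults` are hypothetical stand-ins for illustration and are not necessarily EveryVoice's actual API.

```python
import unicodedata
from typing import Callable

# Assumption: a cleaner is any function mapping text to text.
Cleaner = Callable[[str], str]


def lower(text: str) -> str:
    """Lowercase the text (destroys case distinctions)."""
    return text.lower()


def nfc_normalize(text: str) -> str:
    """Compose characters into Unicode NFC form."""
    return unicodedata.normalize("NFC", text)


def apply_cleaners(text: str, cleaners: list[Cleaner]) -> str:
    """Apply each cleaner in order and return the result."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


# Hypothetical "before" and "after" default cleaner lists.
old_defaults: list[Cleaner] = [lower, nfc_normalize]
new_defaults: list[Cleaner] = []

# 'C' followed by a combining acute accent, then 'AT'.
sample = "C\u0301AT"

# With the old defaults, case is lost and the accent is composed.
cleaned_old = apply_cleaners(sample, old_defaults)
# With the new defaults, the text is left exactly as the user supplied it.
cleaned_new = apply_cleaners(sample, new_defaults)
```

Under these assumptions, an orthography where case is meaningful would be silently altered by the old defaults, whereas the new defaults only change text when the user opts in.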

Priority?

alpha

Tests added?

Adjusted unit tests to fit the new paradigm.

How to test?

For example, the SENCOTEN data should work now.

Confidence?

medium

Version change?

normally yes, but N/A because pre-alpha

Related PRs?

joanise

Just finicky comments about re-installing your black to the agreed version.

joanise · 2024-06-21T19:09:07Z

everyvoice/text/text_processor.py

+    ) -> list[str]:
+        ...


This is black 23 formatting, please re-install requirements.dev.txt to get black 24 and revert this change.

joanise · 2024-06-21T19:09:30Z

everyvoice/wizard/dataset.py

+        self.state[
+            "model_target_training_text_representation"
+        ] = apply_automatic_text_conversions(
+            self.state["filelist_data"],
+            self.state[StepNames.filelist_text_representation_step],


More black 23 formatting.

joanise · 2024-06-21T19:09:46Z

everyvoice/wizard/dataset.py

+        self.state[
+            "model_target_training_text_representation"
+        ] = apply_automatic_text_conversions(
+            self.state["filelist_data"],
+            self.state[StepNames.filelist_text_representation_step],
+            global_isocode=isocode,


More black 23 formatting.


github-actions · 2024-06-21T21:13:38Z

CLI load time: 0:00.23
Pull Request HEAD: 9ef98dd2021aabefb8956e73bad5447c8a254c46
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

codecov · 2024-06-21T21:15:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.98%. Comparing base (7eb38e0) to head (9ef98dd).

Additional details and impacted files

@@                     Coverage Diff                     @@
##           dev.ap/data-attestation     #482      +/-   ##
===========================================================
+ Coverage                    73.96%   73.98%   +0.02%     
===========================================================
  Files                           45       45              
  Lines                         2865     2868       +3     
  Branches                       444      445       +1     
===========================================================
+ Hits                          2119     2122       +3     
  Misses                         661      661              
  Partials                        85       85



roedoejet requested review from SamuelLarkin, joanise and MENGZHEGENG June 21, 2024 17:45

roedoejet force-pushed the dev.ap/lower branch from b605a66 to 6f47f9c Compare June 21, 2024 17:50

joanise reviewed Jun 21, 2024


roedoejet force-pushed the dev.ap/lower branch from 6f47f9c to 0db8178 Compare June 21, 2024 19:39

roedoejet requested a review from joanise June 21, 2024 20:03

fix: remove lower and nfc_normalization from default cleaners

7c87773

fixes #321

roedoejet force-pushed the dev.ap/lower branch from 0db8178 to 7c87773 Compare June 21, 2024 21:10

joanise and others added 2 commits June 21, 2024 17:25

style: apply black

feba08f

fix: set global cleaners as the union of dataset cleaners in the wizard

9ef98dd

this should be fixed at a later date by fixing #359
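As a rough illustration of what "the union of dataset cleaners" could mean, the sketch below merges per-dataset cleaner selections into one global list, keeping first-seen order and dropping duplicates. The function `union_of_cleaners` and the use of cleaner names as strings are assumptions made for this example; the actual wizard code may differ.

```python
def union_of_cleaners(per_dataset: list[list[str]]) -> list[str]:
    """Merge per-dataset cleaner names into one ordered, de-duplicated list."""
    # A dict preserves insertion order, so it doubles as an ordered set.
    seen: dict[str, None] = {}
    for cleaners in per_dataset:
        for name in cleaners:
            seen.setdefault(name, None)
    return list(seen)


# Hypothetical wizard selections for three datasets.
dataset_a = ["nfc_normalize"]
dataset_b = ["lower", "nfc_normalize"]
dataset_c: list[str] = []  # this user selected no cleaners

global_cleaners = union_of_cleaners([dataset_a, dataset_b, dataset_c])
```

Note that `dataset_c` selected nothing, yet the resulting global list would still apply both cleaners to it. That is the limitation the commit message defers to #359.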

roedoejet merged commit 6eb551e into dev.ap/data-attestation Jun 21, 2024
4 checks passed

roedoejet deleted the dev.ap/lower branch June 21, 2024 21:52
