Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove lower and nfc_normalization from default cleaners #482

Merged
merged 3 commits into from
Jun 21, 2024

Conversation

roedoejet
Copy link
Member

@roedoejet roedoejet commented Jun 21, 2024

fixes #321

PR Goal?

This removes lowercasing and NFC normalization as default cleaners and allows them instead to be specified in the wizard as expected given the wizard's question.

Fixes?

#321

Feedback sought?

try it out. I think as discussed with @MENGZHEGENG we should actually allow for dataset-specific cleaners instead of these global cleaners. In this PR, if someone selects an NFC cleaner for one dataset, it will get added to the global cleaners list, and apply to all datasets, which isn't correct (see #359). However, this is better than applying lowercasing and NFC normalization to every dataset regardless of what the user selects in the wizard. Let me know if you agree. I think the quick alternative would be to both remove lower from the default cleaner and remove the text processing question altogether until we fix this issue.

Priority?

alpha

Tests added?

adjusted unittests to fit new paradigm

How to test?

the SENCOTEN data should work now for example

Confidence?

medium

Version change?

normally yes, but N/A because pre-alpha

Related PRs?

Copy link
Member

@joanise joanise left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just finicky comments about re-installing your black to the agreed version.

Comment on lines 422 to 423
) -> list[str]:
...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is black 23 formatting, please re-install requirements.dev.txt to get black 24 and revert this change.

Comment on lines 432 to 436
self.state[
"model_target_training_text_representation"
] = apply_automatic_text_conversions(
self.state["filelist_data"],
self.state[StepNames.filelist_text_representation_step],
Copy link
Member

@joanise joanise Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More black 23 formatting.

Comment on lines 579 to 584
self.state[
"model_target_training_text_representation"
] = apply_automatic_text_conversions(
self.state["filelist_data"],
self.state[StepNames.filelist_text_representation_step],
global_isocode=isocode,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more black 23 formatting

Copy link
Contributor

github-actions bot commented Jun 21, 2024

CLI load time: 0:00.23
Pull Request HEAD: 9ef98dd2021aabefb8956e73bad5447c8a254c46
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

Copy link

codecov bot commented Jun 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.98%. Comparing base (7eb38e0) to head (9ef98dd).

Additional details and impacted files
@@                     Coverage Diff                     @@
##           dev.ap/data-attestation     #482      +/-   ##
===========================================================
+ Coverage                    73.96%   73.98%   +0.02%     
===========================================================
  Files                           45       45              
  Lines                         2865     2868       +3     
  Branches                       444      445       +1     
===========================================================
+ Hits                          2119     2122       +3     
  Misses                         661      661              
  Partials                        85       85              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roedoejet roedoejet merged commit 6eb551e into dev.ap/data-attestation Jun 21, 2024
4 checks passed
@roedoejet roedoejet deleted the dev.ap/lower branch June 21, 2024 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants