Cleaners and to-replace should also be dataset-specific #359

Open
roedoejet opened this issue Mar 26, 2024 · 4 comments
Labels
enhancement New feature or request
Comments

@roedoejet
Member

I still think we should have cleaners defined on the everyvoice.config.text_config.TextConfig but we should rename them to global_cleaners and global_to_replace. There are some cleaners/to_replace rules that only apply to certain datasets, and those should be defined on everyvoice.config.preprocessing_config.Dataset.

In addition to adding the cleaners here, we also need to:

  • update the wizard to add the cleaners to the dataset instead of globally
  • use the wizard to ask about any global cleaners (suggest collapse whitespace for example)
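A rough sketch of the split described above, using plain dataclasses as simplified stand-ins for the real config classes. The `cleaners` and `to_replace` field names on `Dataset`, the `apply_rules` and `clean_text` helpers, the literal (non-regex) replacement, and the dataset-first ordering are all assumptions made for illustration, not the actual EveryVoice implementation:

```python
from dataclasses import dataclass, field
from typing import Callable

Cleaner = Callable[[str], str]


@dataclass
class TextConfig:
    """Simplified stand-in for everyvoice.config.text_config.TextConfig."""

    global_cleaners: list[Cleaner] = field(default_factory=list)
    global_to_replace: dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    """Simplified stand-in for everyvoice.config.preprocessing_config.Dataset."""

    label: str
    cleaners: list[Cleaner] = field(default_factory=list)  # hypothetical field
    to_replace: dict[str, str] = field(default_factory=dict)  # hypothetical field


def apply_rules(text: str, to_replace: dict[str, str], cleaners: list[Cleaner]) -> str:
    """Apply literal replacements first, then each cleaner in order."""
    for old, new in to_replace.items():
        text = text.replace(old, new)
    for cleaner in cleaners:
        text = cleaner(text)
    return text


def clean_text(text: str, text_config: TextConfig, dataset: Dataset) -> str:
    """Assumed order: dataset-specific rules run before the global ones."""
    text = apply_rules(text, dataset.to_replace, dataset.cleaners)
    return apply_rules(text, text_config.global_to_replace, text_config.global_cleaners)


# Only the "noisy" dataset strips its own annotation marker;
# every dataset still gets the global lowercasing.
text_config = TextConfig(global_cleaners=[str.lower])
noisy = Dataset(label="noisy", to_replace={"[laughter]": ""}, cleaners=[str.strip])
clean = Dataset(label="clean")
```

With this layout, a rule that only makes sense for one dataset never touches the others, while anything on `TextConfig` still applies everywhere.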
roedoejet changed the title from "Cleaners should be dataset-specific" to "Cleaners and to-replace should also be dataset-specific" on Mar 26, 2024
roedoejet added the enhancement (New feature or request) label on Mar 26, 2024
@MENGZHEGENG
Collaborator

It seems that we don't have to-replace supported in the wizard. Will take a look at this.

@roedoejet
Member Author

> It seems that we don't have to-replace supported in the wizard. Will take a look at this.

I think that's fine. It's a bit advanced, and there isn't an obvious way (to me) to create the interaction in the wizard. I think it's alright if we just document it and tell people to adjust the configuration file if necessary.

@MENGZHEGENG
Collaborator

Asking the user to set global and dataset-specific cleaners separately may be confusing, though I totally agree that we need both kinds of cleaners. How about we set global_cleaners to collapse_white_space by default (in everyvoice.config.text_config.TextConfig), and ask the user to set only the dataset-specific cleaners (in everyvoice.config.preprocessing_config.Dataset)?

@roedoejet
Member Author

> Asking the user to set global and dataset-specific cleaners separately may be confusing, though I totally agree that we need both kinds of cleaners. How about we set global_cleaners to collapse_white_space by default (in everyvoice.config.text_config.TextConfig), and ask the user to set only the dataset-specific cleaners (in everyvoice.config.preprocessing_config.Dataset)?

Good idea!
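
For reference, a minimal sketch of what that default could look like. The `collapse_whitespace` function below is a local illustration written for this example, and `TextConfig` and `Dataset` are simplified dataclass stand-ins with assumed field names, not the real EveryVoice classes:

```python
import re
from dataclasses import dataclass, field
from typing import Callable

Cleaner = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


@dataclass
class TextConfig:
    """Stand-in for TextConfig: whitespace collapsing is on by default."""

    global_cleaners: list[Cleaner] = field(
        default_factory=lambda: [collapse_whitespace]
    )


@dataclass
class Dataset:
    """Stand-in for Dataset: starts with no cleaners; the wizard would fill these in."""

    label: str
    cleaners: list[Cleaner] = field(default_factory=list)  # hypothetical field


# The user never has to answer a question about global cleaners...
default_config = TextConfig()
# ...and only picks cleaners per dataset.
dataset = Dataset(label="my-dataset", cleaners=[str.lower])
```

The default goes through `default_factory` so that each config instance gets its own list rather than sharing one mutable default.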

roedoejet added this to the beta milestone on May 1, 2024
roedoejet added a commit that referenced this issue Jun 21, 2024
roedoejet added a commit that referenced this issue Jun 21, 2024
roedoejet added a commit that referenced this issue Jun 21, 2024

2 participants