Changing parameter values to extremes in parameters_filtering.py doesn't change the no. of documents being removed #42

Open
dk-github-acc opened this issue Feb 14, 2024 · 0 comments

Hi! Kudos to the author for an end-to-end pipeline for cleaning and filtering a large corpus. I was working with main_filtering.py and tried changing the parameter values in parameters_filtering.py, hoping to increase or decrease the no. of documents being removed. But I observe no change.

  1. I have an English dataset, so I set parameters_filtering_en, and I have experimented with the given values and with modifications to one or more conditions and cutoffs.
  2. I have also tried parameters_filtering_default, where I do observe a change in the documents being filtered out. The no. was different from that with parameters_filtering_en.
  • parameters_filtering_default has some error. I modified languages_id.py to account for "default" as a language, but used the flagged_/stop_words of the English language.
  3. Within parameters_filtering_default or parameters_filtering_en, when parameter values are changed, no change is observed in the no. of documents removed or in which documents are removed.
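
To make the expectation concrete, here is a toy sketch of the behaviour I was hoping to see. It is not the repo's code: `count_removed` and the keys `check_word_count`, `min_words` and `max_words` are hypothetical names invented for illustration. The point is only that pushing a cutoff to an extreme should change how many documents get removed.

```python
# Toy illustration only: the function and parameter names below are
# hypothetical and are NOT taken from parameters_filtering.py.

def count_removed(docs, params):
    """Count documents that fail a simple word-count filter."""
    removed = 0
    for doc in docs:
        n_words = len(doc.split())
        out_of_range = not (params["min_words"] <= n_words <= params["max_words"])
        # A document is removed only if the check is enabled and it fails.
        if params["check_word_count"] and out_of_range:
            removed += 1
    return removed


docs = ["short text", "a somewhat longer piece of text here", "one"]

# Lenient cutoffs: every document passes.
lenient = {"check_word_count": True, "min_words": 1, "max_words": 100000}
# Stricter lower cutoff: the two short documents should now be removed.
extreme = {**lenient, "min_words": 5}

print(count_removed(docs, lenient))
print(count_removed(docs, extreme))
```

If the two counts come out identical after an edit like this, my guess is that the edited values are not the ones the filter actually reads, which is what I seem to be seeing.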

Kindly review the code and let me know of a possible fix, or whether I'm missing something.

Thank You!
