[LLM pipeline] Update text normalization component #335
Thanks @mrchtr!
Changes to the component look good. Left some comments on the testing strategy.
Will you re-add a component test in this PR as well?
Thanks @mrchtr!
For the tests, I would go for option 1. It is simple and readable. I don't think the limited value provided by the abstractions warrants the extra complexity / "magic".
Alright. I have reduced the code to option 1.
Thanks!
Add minor improvements to the text normalization component. Mainly based on the work of [Penedo et al.](https://arxiv.org/pdf/2306.01116.pdf)

Quality can be improved by removing specific patterns in single lines:

> We analyse documents line-by-line, and discard or edit the lines based on the following rules:
> - If it is mainly composed of uppercase characters (discard);
> - If it is only composed of numerical characters (discard);
> - If it is a counter (e.g. 3 likes) (discard);
> - If it only contains one word (discard);
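
For illustration, the four line-level rules quoted above could be sketched roughly as follows. This is not the code from this PR: the function names `should_discard` and `filter_lines`, the 50% threshold used to interpret "mainly uppercase", and the list of counter words are all assumptions made for the example.

```python
import re

# Hypothetical counter words; the quoted rules only give "3 likes" as an example.
COUNTER_PATTERN = re.compile(
    r"^\d+\s+(likes?|shares?|comments?|views?)$", re.IGNORECASE
)


def should_discard(line: str, uppercase_ratio: float = 0.5) -> bool:
    """Return True if a single line matches one of the four discard rules."""
    stripped = line.strip()
    if not stripped:
        # Assumption: blank lines are kept so paragraph breaks survive.
        return False

    # Rule 1: mainly uppercase (assumed: more than half of the letters).
    letters = [c for c in stripped if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > uppercase_ratio:
        return True

    # Rule 2: only numerical characters.
    if stripped.isdigit():
        return True

    # Rule 3: a counter such as "3 likes".
    if COUNTER_PATTERN.match(stripped):
        return True

    # Rule 4: only one word.
    if len(stripped.split()) == 1:
        return True

    return False


def filter_lines(text: str) -> str:
    """Drop every line of `text` that matches a discard rule."""
    return "\n".join(
        line for line in text.split("\n") if not should_discard(line)
    )
```

A quick usage check: given the input `"BREAKING NEWS TODAY\n12345\n3 likes\nHello\nThis is a normal sentence."`, only the last line survives `filter_lines`. Keeping each rule as a separate early return makes it easy to test the rules one at a time.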