[LLM pipeline] Update text normalization component #335

mrchtr · 2023-08-03T06:26:58Z

Add minor improvements to the text normalization component, mainly based on the work of [Penedo et al.](https://arxiv.org/pdf/2306.01116.pdf)

Text quality can be improved by removing individual lines that match specific patterns:

> We analyse documents line-by-line, and discard or edit the lines based on the following rules:
> • If it is mainly composed of uppercase characters (discard);
> • If it is only composed of numerical characters (discard);
> • If it is a counter (e.g. 3 likes) (discard);
> • If it only contains one word (discard);
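
The four rules quoted above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names, the 0.5 uppercase threshold (the quote only says "mainly"), and the counter regex are all assumptions made for the example.

```python
import re

# Assumption: "mainly uppercase" is read as "more than half of the letters".
UPPERCASE_RATIO = 0.5

# Assumption: a counter is a number followed by a single word, e.g. "3 likes".
COUNTER_PATTERN = re.compile(r"^\d+\s+[A-Za-z]+$")


def is_mainly_uppercase(line: str, threshold: float = UPPERCASE_RATIO) -> bool:
    """True if more than `threshold` of the alphabetic characters are uppercase."""
    letters = [c for c in line if c.isalpha()]
    if not letters:
        return False
    return sum(c.isupper() for c in letters) / len(letters) > threshold


def is_only_numerical(line: str) -> bool:
    """True if the line consists solely of digits once whitespace is removed."""
    return "".join(line.split()).isdigit()


def is_counter(line: str) -> bool:
    """True if the whole line looks like '<number> <word>'."""
    return bool(COUNTER_PATTERN.match(line.strip()))


def is_one_word(line: str) -> bool:
    """True if the line contains exactly one whitespace-separated token."""
    return len(line.split()) == 1


def should_discard(line: str) -> bool:
    """Apply all four rules; empty lines match none of them and are kept."""
    return (
        is_mainly_uppercase(line)
        or is_only_numerical(line)
        or is_counter(line)
        or is_one_word(line)
    )


def filter_lines(text: str) -> str:
    """Drop every line of `text` that matches one of the discard rules."""
    kept = [line for line in text.splitlines() if not should_discard(line)]
    return "\n".join(kept)
```

With these assumptions, `filter_lines("BREAKING NEWS TODAY\n12345\n3 likes\nHello\nThis is a normal sentence.")` returns only the last line. Keeping each rule in its own small predicate also makes it straightforward to unit-test the rules one by one.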

RobbeSneyders

Thanks @mrchtr!

Changes to the component look good. Left some comments on the testing strategy.

components/text_normalization/tests/utils_test.py

components/text_normalization/fondant_component.yaml

components/text_normalization/src/main.py

components/text_normalization/src/utils.py

components/text_normalization/tests/fixtures/apply_all.json

components/text_normalization/tests/component_test.py

components/text_normalization/tests/conftest.py

RobbeSneyders

Will you re-add a component test in this PR as well?

components/text_normalization/Dockerfile

RobbeSneyders

Thanks @mrchtr!

For the tests, I would go for option 1. It is simple and readable. I don't think the limited value provided by the abstractions warrants the extra complexity / "magic".

components/text_normalization/Dockerfile

mrchtr · 2023-08-15T05:47:05Z

> For the tests, I would go for option 1. It is simple and readable. I don't think the limited value provided by the abstractions warrants the extra complexity / "magic".

Alright. I have reduced the code to option 1.

RobbeSneyders

Thanks!

mrchtr added 4 commits August 1, 2023 13:16

Add readme and component cleaning

a2c6239

Refactor text normalization component

21e040b

Refactor text normalization component

8f0897b

Add component readme.md

efe5c49

mrchtr requested review from NielsRogge and RobbeSneyders August 3, 2023 06:26

RobbeSneyders reviewed Aug 7, 2023

mrchtr added 3 commits August 7, 2023 20:39

Addressing comments

1d35b0d

Update docsstrings, adapt component test to use the AbstractComponentTest

e0c0c8c

Update docker file

d5a508f

RobbeSneyders reviewed Aug 8, 2023

components/text_normalization/Dockerfile

Testing strategy drafts

2a7a733

RobbeSneyders reviewed Aug 14, 2023

components/text_normalization/Dockerfile

Refactor unit tests

e914376

shayorshay mentioned this pull request Aug 15, 2023

[Commoncrawl pipeline] Add metadata for target_language #357

Merged

RobbeSneyders approved these changes Aug 16, 2023

RobbeSneyders merged commit e3e078d into ml6team:main Aug 16, 2023
5 checks passed

This was referenced Aug 17, 2023

Make download_component concurrent #354

Merged

Remove Abstract test class and update tests #367

Closed
