feat: update script to correct wrong languages for tags #10711

Closed
benbenben2 wants to merge 4 commits from dq_correct_wrong_lang_for_tags_2

benbenben2
Collaborator

update script to correct wrong languages for tags

What

Reworks the previous script that fixes tags stored under the wrong language, for several tag types: categories, countries, labels, origins and traces. Allergens are excluded because updates do not seem to work for them, and ingredients are excluded because they are not simple tags like the other fields.

Updated to account for the number of results per page in API calls. The script now writes more intermediate files during the run so that it can be restarted if it fails.
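
The restart behaviour described above could look roughly like the sketch below. It is only an illustration, not the script's actual code: `load_or_compute`, the file path and the `compute_rows` callback are hypothetical names, and the semicolon delimiter is assumed from the sample output in this PR.

```python
import csv
import os


def load_or_compute(path, compute_rows):
    """Return the rows stored in `path`, or compute and save them if missing.

    Hypothetical helper: each expensive step writes its result to its own
    file, so a restarted run skips every step that already completed.
    """
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            return [tuple(row) for row in csv.reader(f, delimiter=";")]

    rows = [tuple(row) for row in compute_rows()]
    # Write to a temporary file first, then rename, so that a crash in the
    # middle of writing never leaves a half-written checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter=";").writerows(rows)
    os.replace(tmp_path, path)
    return rows
```

On a second run with the same path, `compute_rows` is not called again and the saved rows are read back as tuples of strings.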

Previous script:
#9581

@benbenben2 benbenben2 self-assigned this Aug 18, 2024
@benbenben2 benbenben2 requested a review from a team as a code owner August 18, 2024 19:07
@benbenben2 benbenben2 marked this pull request as draft August 19, 2024 17:17
@benbenben2
Collaborator Author

improvements

  • Sort and group by barcode, and update once per barcode. That way, a barcode with 3-4 wrong tags triggers only one POST request.
  • Ignore cases where xx: is among the results of searching for the unknown tag in the taxonomy.
  • Duplicate rows are removed from the output files after extracting all products for the tags we want to update.
  • Create a new file, 'products to update', for tags that are no longer unknown but are still reported as unknown because some products have not been updated for a long time.
  • Reworked the API call in the first function: https://world.openfoodfacts.org/countries?status=unknown
  • Handle the case where the tag found in the taxonomy is in the same language as the unknown tag: the tag is known and simply needs to be updated.
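
As a rough illustration of the first bullet, the grouping could be sketched as follows. The function names and the `(barcode, current_tag, new_tag)` row shape are assumptions made for this example, and the barcodes in the usage note are placeholders; the real script may structure this differently.

```python
from collections import defaultdict


def group_replacements_by_barcode(rows):
    """Group (barcode, current_tag, new_tag) rows into {barcode: {current: new}}.

    Hypothetical helper: duplicated rows collapse naturally because the
    inner mapping is keyed by the current tag.
    """
    grouped = defaultdict(dict)
    for barcode, current_tag, new_tag in rows:
        grouped[barcode][current_tag] = new_tag
    # Sort by barcode so the processing order is stable across runs.
    return dict(sorted(grouped.items()))


def apply_replacements(tags, replacements):
    """Return the product's tags with every wrong tag swapped for its fix."""
    return [replacements.get(tag, tag) for tag in tags]
```

With rows such as `("111", "en:francia", "an:francia")` and `("111", "en:deutschland", "de:deutschland")`, both fixes end up in one mapping for barcode `111`, so the corrected field can be sent in a single request for that product.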

example for countries
update_tags_per_languages_countries_exist -> 154 lines

current tag;new tag;products
en:francia;an:francia;162
en:deutschland;de:deutschland;135
en:belgique;fr:belgique;118
en:schweiz;da:schweiz;106
en:frankreich;de:frankreich;106
fr:francia;an:francia;68
en:nederland;af:nederland;36
en:espagne;fr:espagne;34
en:algerie;nb:algerie;31
en:belgica;co:belgica;24
...

update_tags_per_languages_countries_new -> 462 lines

current tag;products
en:turkiye;192
en:en;145
fr:angleterre;77
en:francia-espana;66
en:worldwide;37
en:england;30
ru:nyl;29
en:angleterre;21
en:indonesie;21
en:emirats-arabes-unis;21
...

update_tags_per_languages_countries_to_update -> 2 lines

current tag;products
en:republic-of-macedonia;96
en:xk;2
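
One possible reading of how a tag ends up in one of the three files above is sketched here. This is an interpretation of the improvement notes, not the script's confirmed logic: `classify_tag` is a hypothetical name, skipping the tag entirely when an `xx:` match exists is an assumption, and so is picking the first taxonomy match as the new tag.

```python
def classify_tag(unknown_tag, taxonomy_matches):
    """Decide which output file an unknown tag belongs to.

    unknown_tag: e.g. "en:francia".
    taxonomy_matches: taxonomy entries found for the tag's text,
        e.g. ["an:francia", "es:francia"] (hypothetical lookup result).
    Returns a (bucket, new_tag) pair.
    """
    lang = unknown_tag.split(":", 1)[0]
    match_langs = [match.split(":", 1)[0] for match in taxonomy_matches]

    if "xx" in match_langs:
        # Assumption: an xx: match means the tag is ignored altogether.
        return ("skip", None)
    if not taxonomy_matches:
        # Nothing in the taxonomy: goes to the "_new" file.
        return ("new", None)
    if lang in match_langs:
        # Known in its own language: goes to the "_to_update" file.
        return ("to_update", None)
    # Known, but only in another language: goes to the "_exist" file.
    # Assumption: the first match is taken as the replacement.
    return ("exist", taxonomy_matches[0])
```

Under these assumptions, `classify_tag("en:francia", ["an:francia", "es:francia"])` yields `("exist", "an:francia")`, and a tag with no match at all yields `("new", None)`.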

@benbenben2 benbenben2 marked this pull request as draft August 30, 2024 16:59
@benbenben2
Collaborator Author

more improvements

When a tag occurs more than once in the same field (https://world.openfoodfacts.net/product/7610095231406/protein-paprika-snack-vaya), the duplicates are removed before the field is updated.
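
A minimal sketch of that de-duplication, assuming the field is handled as a comma-separated string of tags (the actual field format and the name `dedupe_tag_field` are assumptions for this example):

```python
def dedupe_tag_field(field_value):
    """Remove repeated tags from a comma-separated field, keeping first-seen order."""
    tags = [tag.strip() for tag in field_value.split(",") if tag.strip()]
    # dict.fromkeys keeps insertion order, so the first occurrence of each tag wins.
    return ",".join(dict.fromkeys(tags))
```

Keeping the first occurrence preserves the original tag order, which avoids reshuffling the field when only a duplicate is dropped.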

For origins, the countries taxonomy is also included.

@benbenben2 benbenben2 closed this Dec 16, 2024
@benbenben2 benbenben2 deleted the dq_correct_wrong_lang_for_tags_2 branch December 16, 2024 22:25