feat: parse origin of ingredients for Japanese #9125

benbenben2 · 2023-10-08T13:03:52Z

What

Parse the origin of ingredients for Japanese.

Draft of the changes requested on Slack.

Screenshot

Using the same example as in the tests:

      {
        "ciqual_food_code": "11058",
        "id": "en:salt",
        "origins": "en:japan",
        "percent_estimate": 54.1666666666667,
        "percent_max": 100,
        "percent_min": 8.33333333333333,
        "text": "塩",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "ciqual_food_code": "19402",
        "id": "en:creme-fraiche",
        "origins": "en:japan",
        "percent_estimate": 22.9166666666667,
        "percent_max": 50,
        "percent_min": 0,
        "text": "クレームフレーシュ",
        "vegan": "no",
        "vegetarian": "yes"
      },
      {
        "id": "en:meat",
        "origins": "en:australia",
        "percent_estimate": 11.4583333333333,
        "percent_max": 33.3333333333333,
        "percent_min": 0,
        "text": "肉",
        "vegan": "no",
        "vegetarian": "no"
      },
      {
        "from_palm_oil": "no",
        "id": "en:olive-oil",
        "origins": "en:brazil,en:ethiopia",
        "percent_estimate": 5.72916666666667,
        "percent_max": 25,
        "percent_min": 0,
        "text": "オリーブ油",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "ciqual_food_code": "11018",
        "id": "en:white-wine-vinegar",
        "origins": "en:australia,en:finland",
        "percent_estimate": 2.86458333333334,
        "percent_max": 20,
        "percent_min": 0,
        "text": "白ワインビネガー",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "id": "en:malt",
        "origins": "en:japan,en:south-korea",
        "percent_estimate": 1.43229166666667,
        "percent_max": 16.6666666666667,
        "percent_min": 0,
        "text": "麦芽",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "id": "en:sugar",
        "origins": "en:outside-japan,en:japan",
        "percent_estimate": 0.716145833333336,
        "percent_max": 14.2857142857143,
        "percent_min": 0,
        "text": "糖類",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "id": "en:cocoa",
        "origins": "en:outside-japan,ja:国 (5%未満)",
        "percent_estimate": 0.358072916666671,
        "percent_max": 12.5,
        "percent_min": 0,
        "text": "ココア",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "ciqual_food_code": "20901",
        "id": "en:edamame",
        "origins": "en:hokkaido",
        "percent_estimate": 0.179036458333336,
        "percent_max": 11.1111111111111,
        "percent_min": 0,
        "text": "えだまめ",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "id": "en:breadfruit",
        "origins": "en:sanriku",
        "percent_estimate": 0.0895182291666714,
        "percent_max": 10,
        "percent_min": 0,
        "text": "パンの実",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "ciqual_food_code": "13082",
        "id": "en:clementine",
        "origins": "en:kyushu",
        "percent_estimate": 0.0447591145833357,
        "percent_max": 9.09090909090909,
        "percent_min": 0,
        "text": "クレメンタイン",
        "vegan": "yes",
        "vegetarian": "yes"
      },
      {
        "ciqual_food_code": "20057",
        "id": "en:broccoli",
        "origins": "en:outside-japan",
        "percent_estimate": 0.0447591145833428,
        "percent_max": 8.33333333333333,
        "percent_min": 0,
        "text": "ブロッコリー",
        "vegan": "yes",
        "vegetarian": "yes"
      }
    ],

Related issue(s) and discussion

"origins": "en:australia,en:finland", -> dropped "and more" (オーストラリア又はフィンランド又はその他, i.e. "Australia or Finland or others")
"origins": "en:outside-japan,en:japan", -> that is questionable
"origins": "en:outside-japan,ja:国 (5%未満)", -> not done (5%未満 means "less than 5%")

There are lots of things to review in the code (search for "TODO").

codecov-commenter · 2023-10-08T13:51:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0c9912d) 48.14% compared to head (0be0aa7) 48.70%.
Report is 14 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9125      +/-   ##
==========================================
+ Coverage   48.14%   48.70%   +0.55%     
==========================================
  Files          65       65              
  Lines       20341    20276      -65     
  Branches     4931     4901      -30     
==========================================
+ Hits         9794     9875      +81     
+ Misses       9296     9141     -155     
- Partials     1251     1260       +9

stephanegigandet · 2023-10-09T08:12:14Z

lib/ProductOpener/Ingredients.pm

@@ -1625,11 +1627,20 @@ sub parse_ingredients_text ($product_ref) {
 					}

 					# sel marin (France, Italie)
-					# -> if we have origins, put "origins:" before
-					if (    ($between =~ $separators)
+					# -> if we have origins, put "origins:" before or "製造" at the end for Japanese


That should not be needed, "origins: [some origin in Japanese]" should be recognized later I think.

Naruyoko · 2023-10-14T17:46:00Z

lib/ProductOpener/Ingredients.pm

@@ -765,6 +765,7 @@ my %min_regexp = (
 	en => "min|min\.|minimum",
 	es => "min|min\.|mín|mín\.|mínimo|minimo|minimum",
 	fr => "min|min\.|mini|minimum",
+	ja => "未満",


未満 means "less than", not "minimum"
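
The distinction matters because a "minimum" marker yields a lower bound, while 未満 yields a strict upper bound. A minimal sketch of that difference follows; the two marker tables and the helper `percent_bound` are hypothetical names made up for this illustration and are not the `%min_regexp` code in Ingredients.pm.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical marker tables for this sketch only (not the real %min_regexp).
my %min_marker = (
	en => "minimum|min\\.?",
	fr => "minimum|mini|min\\.?",
);
my %less_than_marker = (
	en => "less than",
	ja => "未満",
);

# Classify a percent annotation as a lower bound ("min") or a strict
# upper bound ("less_than"). Returns (kind, value), or an empty list.
sub percent_bound {
	my ($lc, $text) = @_;
	my $number = qr/(\d+(?:\.\d+)?)\s*[%％]/;
	if (my $less = $less_than_marker{$lc}) {
		# Marker after the number ("5%未満") or before it ("less than 5%").
		return ("less_than", $1 + 0) if $text =~ /$number\s*(?:$less)/i;
		return ("less_than", $1 + 0) if $text =~ /(?:$less)\s*$number/i;
	}
	if (my $min = $min_marker{$lc}) {
		# Marker before the number ("min. 5%").
		return ("min", $1 + 0) if $text =~ /(?:$min)\s*$number/i;
	}
	return;
}

my @bound = percent_bound("ja", "5%未満");
print join(" ", @bound), "\n";
```

Keeping the two kinds of marker in separate tables, as assumed here, avoids reading "5%未満" as "at least 5%".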

sonarqubecloud · 2023-10-15T09:42:56Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

stephanegigandet · 2023-10-25T14:33:22Z

lib/ProductOpener/Ingredients.pm

-						and (exists_taxonomy_tag("origins", canonicalize_taxonomy_tag($ingredients_lc, "origins", $`))))
+					if (
+						(
+							($between =~ /$separators| and /)


Maybe this instead?
It's for things like "Sel marin (France et Italie)"? Maybe add a test for it.

Suggested change

-	($between =~ /$separators| and /)
+	($between =~ /$separators|$and/)

benbenben2 · 2023-10-26

Not sure... Japanese for " and " would be "と" (although it is not in %and in Ingredients.pm yet), and it has to be matched without surrounding spaces. Then, if Japanese words that contain と are separated by と, the split will also cut inside those words.

Alternatively, we could try this:

Suggested change

-	($between =~ /$separators| and /)
+	($between =~ /$separators| and |$and/)

That would use " and " for Japanese and $and for the other defined languages.

What do you think?

stephanegigandet · 2023-11-20

What I don't understand is why you changed "($between =~ $separators)" to "($between =~ /$separators| and /)". Why add "| and "? It would help parse something like "Tomatoes (France and Italy)", in which case we could add a test for it. But it would not match "Tomates (France et Italie)".

That's why I suggested ($between =~ /$separators|$and/).

Note that we don't have と in %and.

And we have "my $and = $and{$ingredients_lc} || " and ";", meaning that in Japanese, $and will be equal to " and ", so it won't match something with と in it.

To make sure it works, could you add test cases like "Tomatoes (France and Italy)", "Tomates (France and Italy)", and a Japanese one with a と in it, which could be "トマト(ときがわ町)" for instance.

benbenben2 · 2023-11-21

I mixed up %and and $and... :-(
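
The fallback discussed in this thread can be tried in isolation. This is a minimal sketch, not the code in Ingredients.pm: the two-entry `%and` table, the simplified `$separators`, and the helper name `between_is_list` are assumptions made for illustration. Only the `$and{$ingredients_lc} || " and "` fallback and the `/$separators|$and/` pattern come from the comments above.

```perl
use strict;
use warnings;
use utf8;

# Assumed excerpt of %and; the thread only states that it has no "ja" entry.
my %and = (
	en => " and ",
	fr => " et ",
);

# Simplified stand-in for $separators: comma, semicolon, ideographic comma.
my $separators = qr/[,;、]/;

# Hypothetical helper: true if the text between parentheses looks like a list.
sub between_is_list {
	my ($ingredients_lc, $between) = @_;
	# Fallback quoted in the thread: languages without an entry get " and ".
	my $and = $and{$ingredients_lc} || " and ";
	return ($between =~ /$separators|$and/) ? 1 : 0;
}

# Text between parentheses for the test cases proposed in the review.
for my $case (
	["en", "France and Italy"],
	["fr", "France and Italy"],
	["fr", "France et Italie"],
	["ja", "ときがわ町"],
) {
	my ($lc, $between) = @$case;
	printf "%s: %d\n", $lc, between_is_list($lc, $between);
}
```

Under these assumptions, "France and Italy" in a French ingredient list is not detected as a list, and ときがわ町 is left whole because the Japanese fallback is " and " rather than と.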

stephanegigandet · 2023-10-25T14:37:18Z

lib/ProductOpener/Ingredients.pm

+
+								# rm additional parenthesis and its content that are sub-ingredient of origing (not parsed for now)
+								# example: "トマト (輸入又は国産 (未満 5%))"" (i.e., "Tomatoes (imported or domestically produced (less than 5%)))"")
+								$origin_string =~ s/\s*\([^)]*\)//g;


I'm a bit wary of removing anything that is inside parentheses, as it could be anything. Maybe leave it as-is, even if we don't parse it yet.

benbenben2 · 2023-10-26

We are in the following conditions:

* if ($sep =~ /(:|\[|\{|\(|\N{U+FF08})/i) { -> in parentheses
* else -> no separator found, or is origin, or contains percent
* else -> does not contain percent
* if ($between =~ /\s*(?:de origine|d'origine|origine|origin|origins|alkuperä|ursprung|oorsprong)\s?:?\s?\b(.*)$/i) -> this is origin
* -> then, we remove the additional parentheses and their content

That is, this only applies when additional information in parentheses follows an origin that is itself already in parentheses.
That is very specific.
However, I cannot tell how many products would be impacted by this change.
So, if you think that it is safer to leave it as-is, please confirm and I will remove that line.

stephanegigandet · 2023-11-20

Ok, we can remove it.
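
To see what the substitution from the diff does on its own, it can be wrapped in a small function. This is a sketch for illustration only: `strip_parenthesized` is a hypothetical name, and the sample strings are made up apart from the example taken from the code comment.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical wrapper around the substitution quoted in the diff.
sub strip_parenthesized {
	my ($origin_string) = @_;
	# Remove optional whitespace plus one "(...)" group.
	# [^)]* stops at the first ")", so nesting is not handled.
	return $origin_string =~ s/\s*\([^)]*\)//gr;
}

binmode(STDOUT, ":encoding(UTF-8)");
for my $origin_string (
	"輸入又は国産 (未満 5%)",             # origin followed by a note
	"France (Brittany), Italy",             # any parenthesized text is dropped
	"トマト (輸入又は国産 (未満 5%))",    # nested groups leave a stray ")"
	"国産（5%未満）",                      # full-width parentheses are untouched
) {
	print strip_parenthesized($origin_string), "\n";
}
```

The second and third inputs show why the reviewer was wary: the pattern drops whatever sits in the parentheses, and on nested groups it leaves an unbalanced closing parenthesis.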

stephanegigandet · 2023-10-25T14:37:38Z

lib/ProductOpener/Ingredients.pm

 									$origin = join(",",
 										map {canonicalize_taxonomy_tag($ingredients_lc, "origins", $_)}
-											split(/,/, $origin_string));
+											split(/$commas| and /, $origin_string));


Suggested change

-	split(/$commas| and /, $origin_string));
+	split(/$commas|$and/, $origin_string));

stephanegigandet

Looks good to me, thank you! Some minor comments / suggestions.

teolemon · 2023-11-09T14:11:26Z

@benbenben2 Can you apply @stephanegigandet 's suggestions so that we can merge ?

benbenben2 · 2023-11-09T17:02:55Z

> @benbenben2 Can you apply @stephanegigandet 's suggestions so that we can merge ?

Would like to but there are unresolved discussions @teolemon. See my last 2 comments.

sonarqubecloud · 2023-11-21T19:00:35Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

benbenben2 · 2023-11-23T19:15:52Z

@stephanegigandet, I'll let you review the latest changes and merge if everything looks fine to you.

stephanegigandet

Perfect, thank you very much @benbenben2

init origin for ja

c5dc33a

benbenben2 added the 🥗 Ingredients, 🥗🔍 Ingredients analysis (https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis), and 📍 Origins (Origins are used for Eco-Score computation. We want to have structured origins.) labels Oct 8, 2023

benbenben2 self-assigned this Oct 8, 2023

github-actions bot added 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🧪 additives 🧪 tests labels Oct 8, 2023

stephanegigandet reviewed Oct 9, 2023

taxonomy: Add all prefectures of Japan (#9145)

a6cf366

* Add all prefectures of Japan
* Add no-sufix aliases for prefectures

benbenben2 assigned benbenben2 and unassigned benbenben2 Oct 12, 2023

improvements

8884018

benbenben2 marked this pull request as ready for review October 14, 2023 17:17

benbenben2 requested a review from a team as a code owner October 14, 2023 17:17

Naruyoko reviewed Oct 14, 2023

benbenben2 added 6 commits October 14, 2023 20:42

Merge branch 'main' into fix_jp_origin

88f9dbb

upd taxo

0be0ae7

improvements

fbf2c5c

rm typo less then != minimum

bd2b575

make lint

cb7963b

rm text in parentheseses after origin

ca6923a

stephanegigandet reviewed Oct 25, 2023

stephanegigandet approved these changes Oct 25, 2023

teolemon added the 🇯🇵 Japan https://jp.openfoodfacts.org/ label Oct 25, 2023

benbenben2 requested a review from stephanegigandet November 7, 2023 16:51

teolemon changed the title ~~feat: init origin for ja~~ feat: parse origin of ingredients for Japanese Nov 9, 2023

Merge branch 'main' into fix_jp_origin

890108c

github-actions bot added ingredients ingredients analysis labels Nov 17, 2023

benbenben2 added 2 commits November 21, 2023 17:34

apply suggested changes

e1a3e8c

added test results

0be0aa7

teolemon removed the ingredients analysis label Nov 24, 2023

stephanegigandet approved these changes Nov 30, 2023

stephanegigandet merged commit 730f621 into main Nov 30, 2023
13 checks passed

stephanegigandet deleted the fix_jp_origin branch November 30, 2023 12:28

openfoodfacts-bot mentioned this pull request Nov 30, 2023

chore(main): 🚀 Open Food Facts Backend - Product Opener - Release 2.23.0. #9387

Merged
