-
-
Notifications
You must be signed in to change notification settings - Fork 404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: parse origin of ingredients for Japanese #9125
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #9125 +/- ##
==========================================
+ Coverage 48.14% 48.70% +0.55%
==========================================
Files 65 65
Lines 20341 20276 -65
Branches 4931 4901 -30
==========================================
+ Hits 9794 9875 +81
+ Misses 9296 9141 -155
- Partials 1251 1260 +9 ☔ View full report in Codecov by Sentry. |
lib/ProductOpener/Ingredients.pm
Outdated
@@ -1625,11 +1627,20 @@ sub parse_ingredients_text ($product_ref) { | |||
} | |||
|
|||
# sel marin (France, Italie) | |||
# -> if we have origins, put "origins:" before | |||
if ( ($between =~ $separators) | |||
# -> if we have origins, put "origins:" before or "製造" at the end for Japanese |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should not be needed, "origins: [some origin in Japanese]" should be recognized later I think.
* Add all prefectures of Japan * Add no-sufix aliases for prefectures
lib/ProductOpener/Ingredients.pm
Outdated
@@ -765,6 +765,7 @@ my %min_regexp = ( | |||
en => "min|min\.|minimum", | |||
es => "min|min\.|mín|mín\.|mínimo|minimo|minimum", | |||
fr => "min|min\.|mini|minimum", | |||
ja => "未満", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
未満 means "less than", not "minimum"
Kudos, SonarCloud Quality Gate passed! |
lib/ProductOpener/Ingredients.pm
Outdated
and (exists_taxonomy_tag("origins", canonicalize_taxonomy_tag($ingredients_lc, "origins", $`)))) | ||
if ( | ||
( | ||
($between =~ /$separators| and /) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this instead?
It's for things like "Sel marin (France et Italie)"? Maybe add a test for it.
($between =~ /$separators| and /) | |
($between =~ /$separators|$and/) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure... Japenese for " and " would be "と" (although not in Ingredients.pm yet in %and) and it has to be without space. Then if you have japanese-words-with-と separated by と, that will split the japanese-words-with-と.
Eventually, we could try this:
($between =~ /$separators| and /) | |
($between =~ /$separators| and |$and/) |
That would use " and " for Japanese and $and for other defined languages.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I don't understand is why you changed "($between =~ $separators)" to "($between =~ /$separators| and /)". Why add "| and "? It would help parse something like "Tomatoes (France and Italy), in which case we could add a test for it. But it would not match "Tomates (France et Italie).
That's why I suggested ($between =~ /$separators|$and/)
Note that we don't have と in %and.
And we have "my $and = $and{$ingredients_lc} || " and ";" meaning that in Japanese, $and will be equal to " and ", so it wont match something with と in it.
To make sure it works, could you add test cases like "Tomatoes (France and Italy)", "Tomates (France and Italy)", and a Japanese one with a と in it which could be "トマト(ときがわ町])" for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mixed up %and and $and... :-(
|
||
# rm additional parenthesis and its content that are sub-ingredient of origing (not parsed for now) | ||
# example: "トマト (輸入又は国産 (未満 5%))"" (i.e., "Tomatoes (imported or domestically produced (less than 5%)))"") | ||
$origin_string =~ s/\s*\([^)]*\)//g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit wary of removing anything that is inside parenthesis, as it could be anything. Maybe leave it as-is, even if we don't parse it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are in the following conditions:
if ($sep =~ /(:|\[|\{|\(|\N{U+FF08})/i) {
-> in parenthesis
else
-> no separator found or is origin or contains percent
else
-> does not contain percent
if ($between =~ /\s*(?:de origine|d'origine|origine|origin|origins|alkuperä|ursprung|oorsprong)\s?:?\s?\b(.*)$/i)
-> this is origin
-> then, we remove additional parenthesis and its content
That is, if we have additional information in parenthesis after origin that is already in parenthesis.
That is very specific.
However, I cannot tell how many products would be impacted by this change.
So, if you think that it is safer to leave it as-is, please confirm me and I will remove that line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we can remove it.
lib/ProductOpener/Ingredients.pm
Outdated
$origin = join(",", | ||
map {canonicalize_taxonomy_tag($ingredients_lc, "origins", $_)} | ||
split(/,/, $origin_string)); | ||
split(/$commas| and /, $origin_string)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split(/$commas| and /, $origin_string)); | |
split(/$commas|$and/, $origin_string)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thank you! Some minor comments / suggestions.
@benbenben2 Can you apply @stephanegigandet 's suggestions so that we can merge ? |
Would like to but there are unresolved discussions @teolemon. See my last 2 comments. |
Kudos, SonarCloud Quality Gate passed! |
@stephanegigandet, I let you review the last changes and merge if everything is fine for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thank you very much @benbenben2
What
parse origin of ingredients for Japanese
draft of the changes requested on Slack.
Screenshot
Using same as in tests
Related issue(s) and discussion
"origins": "en:australia,en:finland", -> dropped "and more" (オーストラリア又はフィンランド又はその他)
"origins": "en:outside-japan,en:japan", -> that is questionable
"origins": "en:outside-japan,ja:国 (5%未満)", -> not done
Lots of things to review in the code (search for "TODO" in the code)