Only populate OCR results in selected language #11215

github-throwaway · 2025-01-08T16:41:39Z

Problem

I'm always frustrated when the OCR extracts ingredients in a language other than the one I selected.

Proposed solution

Thanks to the language setting, the OCR already knows where to start the extraction. When it then hits the ingredients keyword in a different language, it should stop the extraction.

Time per product

4 seconds saved.

Video example of problem

trim.3634E555-DDBF-4A2D-B2A6-B78E57AB58E0.MOV

Part of

#9096

benbenben2 · 2025-01-09T20:29:59Z

The OCR extracts all the text. The text is then processed, and so-called stopwords are used to cut it before and after the ingredients list.
In this case, it removes everything before "Zutaten" (other start words are possible) and everything after a stopword expected at the end of the list (such as "keep in a dry place").

It does not work all the time: it depends on which stopwords are known.
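
A minimal sketch of that cutting step, written in Python rather than the Perl of the server code. The pattern lists and the `cut_ingredients` function are hypothetical illustrations, not the contents of Ingredients.pm.

```python
import re

# Hypothetical, illustrative patterns for German; real lists would be much longer.
START_PATTERNS = [r"zutaten\s*:?"]
END_PATTERNS = [r"trocken\s+lagern", r"mindestens\s+haltbar\s+bis"]


def cut_ingredients(text):
    """Keep only the text between a start keyword and the first end stopword."""
    start = re.search("|".join(START_PATTERNS), text, flags=re.IGNORECASE)
    if start:
        # Drop everything up to and including the start keyword.
        text = text[start.end():]
    end = re.search("|".join(END_PATTERNS), text, flags=re.IGNORECASE)
    if end:
        # Drop the end stopword and everything after it.
        text = text[:end.start()]
    return text.strip()


print(cut_ingredients("Schokolade. Zutaten: Zucker, Kakaobutter. Trocken lagern."))
print(cut_ingredients("Zutaten: Zucker. Ingredients: sugar."))
```

The second call shows the failure mode from this issue: "Ingredients" is not among the known end stopwords, so the English list is kept as well.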

In this particular example, "ingredients" is presumably not a German word, so we could add it to the stopwords for the German language.

It is in this file: Ingredients.pm

(Maybe, just sharing some thoughts: we could add all the stopwords that mark the start of the ingredients list as stopwords that end it. I don't know if it would work. That would need some investigation; for example, it would be problematic when the same word for "ingredients" is used in different languages. Ping @stephane, @aleene.)
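
To make the idea concrete, here is a rough Python sketch under stated assumptions: the `START_WORDS` table and the `cut_at_foreign_keyword` function are hypothetical and only illustrate reusing other languages' start keywords as end markers.

```python
import re

# Hypothetical start keywords per language; illustrative only.
START_WORDS = {
    "de": r"zutaten",
    "fr": r"ingr[eé]dients",
    "nl": r"ingredi[eë]nten",
    "en": r"ingredients",
}


def cut_at_foreign_keyword(text, lang):
    """Start after this language's keyword, stop at any other language's keyword."""
    own = re.search(START_WORDS[lang], text, flags=re.IGNORECASE)
    if own is None:
        return text.strip()
    rest = text[own.end():]
    # Skip patterns identical to our own keyword, since they cannot tell languages apart.
    others = [
        pattern
        for code, pattern in START_WORDS.items()
        if code != lang and pattern != START_WORDS[lang]
    ]
    foreign = re.search("|".join(others), rest, flags=re.IGNORECASE)
    if foreign:
        rest = rest[:foreign.start()]
    return rest.lstrip(" :").strip()


label = "Ingrédients : Sucre, beurre de cacao. Ingrediënten: suiker, cacaoboter."
print(cut_at_foreign_keyword(label, "fr"))
print(cut_at_foreign_keyword(label, "nl"))
```

The collision concern is visible even in this small table: the French pattern also matches the English spelling "Ingredients", so a label that lists English before French would start at the wrong place for `lang="fr"`. The sketch does not solve that.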

At least for the particular example in your issue, @github-throwaway, you can add "Ingr(e|é)dients" to the German stopwords.
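
As a quick check of what that pattern would catch, here is a small Python sketch. The function name is hypothetical, and case-insensitive matching is an assumption about how the stopword would be applied.

```python
import re

# The suggested pattern, treated here as an end-of-list stopword for German text.
FOREIGN_HEADER = re.compile(r"Ingr(e|é)dients", flags=re.IGNORECASE)


def stop_at_foreign_header(german_text):
    """Cut the text at the first English or French ingredients header, if any."""
    match = FOREIGN_HEADER.search(german_text)
    if match is None:
        return german_text
    return german_text[:match.start()].rstrip()


print(stop_at_foreign_header("Zucker, Kakaobutter, Aroma. Ingredients: sugar, cocoa butter."))
print(stop_at_foreign_header("Zucker, Kakaobutter. INGRÉDIENTS : sucre, beurre de cacao."))
```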

stephanegigandet · 2025-01-10T14:12:58Z

That's a great idea. I was hoping that Google Cloud Vision would give us enough data to see which text is in which language, but I tried in one example and it didn't work: https://images.openfoodfacts.org/images/products/304/514/010/5502/ingredients_fr.553.json

textAnnotations: [
  {
    locale: "fr",
    boundingPoly: {
      vertices: [
        { x: 64, y: 60 },
        { x: 1517, y: 60 },
        { x: 1517, y: 1817 },
        { x: 64, y: 1817 }
      ]
    },
    description: "BOChocolat au lait du pays alpin.
Ingrédients : Sucre, beurre de cacao, pâte
de cacao, LAIT écrémé en poudre,
lactosérum en poudre (de LAIT), BEURRE
concentré, émulsifiant (lecithines de
SOJA), pâte de NOISETTE, arome. Cacao:
33 % minimum. PEUT CONTENIR AUTRES
FRUITS À COQUE ET BLE.
CCD Melkchocolade (van Alpenmelk).
Ingrediënten: suiker, cacaoboter, cacaomassa,
magere MELKPOEDER, weipoeder
(van MELK), MELKVET, emulgator
(SOJALECITHINEN), HAZELNOOTPASTA,
aroma. Cacao: ten minste 33 %. KAN
ANDERE NOTEN EN TARWE BEVATTEN."
  },
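
The limitation can be reproduced with a few lines of Python. This is only a sketch: the `response` dict is an abbreviated copy of the excerpt above, and `annotation_locales` is a hypothetical helper, not part of any Google client library.

```python
# Abbreviated copy of the excerpt above, written as a Python dict.
response = {
    "textAnnotations": [
        {
            "locale": "fr",
            "description": (
                "Ingrédients : Sucre, beurre de cacao, pâte de cacao\n"
                "Ingrediënten: suiker, cacaoboter, cacaomassa"
            ),
        }
    ]
}


def annotation_locales(ocr_response):
    """Collect every locale value present in textAnnotations."""
    return {
        annotation["locale"]
        for annotation in ocr_response.get("textAnnotations", [])
        if "locale" in annotation
    }


print(annotation_locales(response))
```

The whole block carries a single locale, "fr", even though its description contains both French and Dutch, so this field alone cannot be used to split the text by language.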

github-throwaway · 2025-01-10T14:32:54Z

What about chaining this API after the OCR?

https://cloud.google.com/translate/docs/advanced/detecting-language-v3
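
A rough Python sketch of what such chaining could look like. Everything here is an assumption for illustration: `keep_language` is a hypothetical helper, and `toy_detector` is a crude stand-in for the real detection call, which would need credentials and a network request.

```python
def keep_language(ocr_text, wanted, detect_language):
    """Keep only the OCR lines that the detector attributes to the wanted language."""
    kept = []
    for line in ocr_text.splitlines():
        if line.strip() and detect_language(line) == wanted:
            kept.append(line)
    return "\n".join(kept)


def toy_detector(line):
    """Stand-in detector: guesses Dutch from a few marker words, French otherwise."""
    dutch_markers = {"van", "suiker", "bevatten"}
    words = {word.strip(".,:()").lower() for word in line.split()}
    return "nl" if words & dutch_markers else "fr"


ocr_text = (
    "Ingrédients : Sucre, beurre de cacao\n"
    "Ingrediënten: suiker, cacaoboter\n"
    "(van MELK), MELKVET"
)
print(keep_language(ocr_text, "fr", toy_detector))
```

Passing the detector in as a function keeps the filtering logic independent of whichever service or library ends up doing the detection. Detecting per line is a design choice of this sketch; short lines give a detector little to work with, so larger chunks may be needed in practice.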

And add some spell checking?

https://hunspell.github.io/


teolemon added this to 🍊 Open Food Facts Server issues Jan 8, 2025

github-project-automation bot moved this to To discuss and validate in 🍊 Open Food Facts Server issues Jan 8, 2025

teolemon added OCR ingredient-list-cutting labels Jan 8, 2025

github-project-automation bot added this to Ingredient analysis Jan 8, 2025

github-project-automation bot moved this to To do in Ingredient analysis Jan 8, 2025

github-throwaway changed the title ~~Only return OCR results in selected language~~ Only populate OCR results in selected language Jan 8, 2025

github-throwaway mentioned this issue Jan 18, 2025

fix: Add more german stopwords #11266

Merged
