Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV export revision (Adding new fields and removing some) #2325

Open
Tracked by #6429
CharlesNepote opened this issue Sep 13, 2019 · 18 comments
Open
Tracked by #6429

CSV export revision (Adding new fields and removing some) #2325

CharlesNepote opened this issue Sep 13, 2019 · 18 comments
Assignees
Labels
CSV exports Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data 🧽 Data quality https://wiki.openfoodfacts.org/Quality export

Comments

@CharlesNepote
Copy link
Member

CharlesNepote commented Sep 13, 2019

CSV has not evolved since a long time. Some people need new data in the CSV. This issue allows to discuss which new fields could be exported.

Fields to add:

  • ingredients_analysis_tags

    • description: this field is computed from ingredients' analysis to know if the product is or not vegan, vegetarian, and/or with palm-oil
    • Eg. ingredients_analysis_tags: ["en:palm-oil","en:non-vegan","en:vegetarian-status-unknown"]
    • rationale: it can be a helper to control data quality
  • nutrient_levels_tags?

    • description: "It represents the traffic lights system made by the UK FSA (Food Safety Administration). It is used by some manufacturers on a voluntary basis in Great Britain, but was rejected by the European Commission in 2010. On Open Food Facts, when the nutritional values are known, the traffic lights are displayed on the product pages. The calculation formula defining the colors of the lights is described on a dedicated page of Open Food Facts website. Nutrient levels can be found here on the website: https://world.openfoodfacts.org/nutrient-levels
    • Example: nutrient_levels_tags: ["en:fat-in-high-quantity","en:saturated-fat-in-high-quantity","en:sugars-in-high-quantity","en:salt-in-low-quantity"]
    • rationale: it can be a helper to control data quality
  • product_quantity

    • description: "This is the normalized quantity of the product in grams (ISO system)."
    • Example: product_quantity: "1500", computed from quantity: "1,5 L"; product_quantity: "320", computed from quantity: "2 x 160 g"
    • rationale: it's easier to play with this data as Open Food Facts already do the computation. We already provide serving_quantity.
  • owner

    • description: "This is the owner of the product, which have sent the product's data to Open Food Facts. The list of owners can be found at https://world.openfoodfacts.org/owners"
    • Example: "owner: "org-carrefour"
    • rationale: it can be a helper to control data quality
  • data_quality_errors_tags

    • description: "Returns a list of all detected errors for the product."
    • Example: data_quality_errors_tags: ["en:nutrition-saturated-fat-greater-than-fat"] (speaks for itself).
    • rationale: 1. Open Food Facts reusers might want to remove the products with quality issues. 2. It can ease to build tools for data quality. It represents ~30K products as of 2022-10.
  • unique_scans_n (contains the number of unique scans of a product (~33% of products))

    • description: "Returns an integer which represent the number of users who have scanned the product at least one time. "Users" are identified by different IPs. This value is not computed in real time but once a year."
    • Example: unique_scans_n: "8".
    • rationale: this is a good proxy to understand which are the most consumed products.
  • popularity_tags

    • description: "The popularity of a product is computed thanks to its number of unique scans. The popularity_tags field groups products by different levels of popularity by year, either in the world, either in the countries where it is popular."
    • Example: popularity_tags: ["top-50000-scans-2019","top-100000-scans-2019","at-least-5-scans-2019","at-least-10-scans-2019","top-75-percent-scans-2019","top-80-percent-scans-2019","top-85-percent-scans-2019","top-90-percent-scans-2019","top-50000-fr-scans-2019","top-100000-fr-scans-2019","top-country-fr-scans-2019","at-least-5-fr-scans-2019","at-least-10-fr-scans-2019"]
    • rationale: this field might be more clear than unique_scans_n, as the latter could suggest this number is a fresh data if not real-time
  • completeness:

    • description: completeness is a float number, between 0 and 1.1, measuring how complete is the product (the higher, the most complete). Currently, we check for more than 10 items: image completeness, product_name, quantity, packaging, brands, categories, origins, emb_codes, expiration_date, ingreditents_text, nutriments (or no_nutrition_data if it is on).
    • Eg. completeness: 0.7625
  • last_image_t:

    • description: It is the date (Unix format) of the last image being uploaded for the product.
    • eg. last_image_t: 1666661491
    • rationale: it can be a helper to control data quality: if the product does not have any image, it should be impossible to fix data quality issues

Fields to delete ??

  • should we keep both created_t / lastmodified_t vs created_datetime / last_modified_datetime (51 Mb lost)
  • additives, which is empty
  • one of the states* three fields; states and states_tags are almost identical, the only difference is that states contains spaces
  • brands or brands_tags: the latter is only the normalized version of the first one (lowercased, unaccented, and replacing spaces and typographic signs by a "-")

Fields that could evolve:

Process:

  • build a new CSV file as a beta version, with a different name than the current one, and let people discuss about it
  • when beta version turn stable/official, keep old version and tell everyone that old version will be unavailable in xx months

Implementation

[to be completed]

@CharlesNepote CharlesNepote added the ✨ Feature Features or enhancements to Open Food Facts server label Sep 13, 2019
@bredowmax
Copy link

Columns I’d definitely love to see included are “vegan?”, “vegetarian?” and “co2 footprint” and from_category_averaged nutrition values for unbranded/unpackaged/fresh foods

@maxmeetrobin
Copy link

My suggestion for fields to be deleted:
-Alpha-Linolenic-Acid 100G | -Arachidic-Acid 100G | -Arachidonic-Acid 100G | -Behenic-Acid 100G | -Butyric-Acid 100G | -Capric-Acid 100G | -Caproic-Acid 100G | -Caprylic-Acid 100G | -Cerotic-Acid 100G | -Dihomo-Gamma-Linolenic-Acid 100G | -Docosahexaenoic-Acid 100G | -Eicosapentaenoic-Acid 100G | -Elaidic-Acid 100G | -Erucic-Acid 100G | -Fructose 100G | -Gamma-Linolenic-Acid 100G | -Glucose 100G | -Gondoic-Acid 100G | -Lactose 100G | -Lauric-Acid 100G | -Lignoceric-Acid 100G | -Linoleic-Acid 100G | -Maltodextrins 100G | -Maltose 100G | -Mead-Acid 100G | -Melissic-Acid 100G | -Montanic-Acid 100G | -Myristic-Acid 100G | -Nervonic-Acid 100G | -Oleic-Acid 100G | -Palmitic-Acid 100G | -Stearic-Acid 100G | -Sucrose 100G

@maxmeetrobin
Copy link

More fields I suggest to be deleted:

Carnitine 100G | Casein 100G | Collagen-Meat-Protein-Ratio 100G | Copper 100G | Folates 100G | Glycemic-Index 100G | Image Ingredients Small Url | Image Nutrition Small Url | Image Small Url | Ingredients From Palm Oil | Manganese 100G | Molybdenum 100G | No Nutriments | Nucleotides 100G | Nutrition-Score-Fr 100G | Omega-6-Fat 100G | Omega-9-Fat 100G | Polyols 100G | Potassium 100G | Selenium 100G | Serum-Proteins 100G | Starch 100G | Taurine 100G

@maxmeetrobin
Copy link

More fields I'd love to add:

  • completeness
  • nutrient_levels (all 4 of them)
  • unique_scans_n
  • ingredients_text

@maxmeetrobin
Copy link

While we're at it, I'd love to move Code as the unique identifier to the very left/beginning of each product listing

@maxmeetrobin
Copy link

For some fields, there are several translations. To keep the CSV as compact as possible, I suggest keeping only the -en translated version

@CharlesNepote
Copy link
Member Author

While we're at it, I'd love to move Code as the unique identifier to the very left/beginning of each product listing

Code is already at the left/beginning of each product, isn't it? See:

code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_fr,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_fr,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_fr,ingredients_text,allergens,allergens_fr,traces,traces_tags,traces_fr,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_fr,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_fr,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_fr,main_category,main_category_fr,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,butyric-acid_100g,caproic-acid_100g,caprylic-acid_100g,capric-acid_100g,lauric-acid_100g,myristic-acid_100g,palmitic-acid_100g,stearic-acid_100g,arachidic-acid_100g,behenic-acid_100g,lignoceric-acid_100g,cerotic-acid_100g,montanic-acid_100g,melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,alpha-linolenic-acid_100g,eicosapentaenoic-acid_100g,docosahexaenoic-acid_100g,omega-6-fat_100g,linoleic-acid_100g,arachidonic-acid_100g,gamma-linolenic-acid_100g,dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,oleic-acid_100g,elaidic-acid_100g,gondoic-acid_100g,mead-acid_100g,erucic-acid_100g,nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,sucrose_100g,glucose_100g,fructose_100g,lactose_100g,maltose_100g,maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0000000000017,http://world-fr.openfoodfacts.org/produit/0000000000017/vitoria-crackers,kiliweb,1529059080,2018-06-15T10:38:00Z,1529059204,2018-06-15T10:40:04Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,0,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:categories-to-be-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-validated, en:photos-uploaded","en:to-be-completed,en:nutrition-facts-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:packaging-code-to-be-completed,en:characteristics-to-be-completed,en:categories-to-be-completed,en:brands-to-be-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:product-name-completed,en:photos-to-be-validated,en:photos-uploaded","A compléter,Informations nutritionnelles complétées,Ingrédients à compléter,Date limite à compléter,Code emballeur à compléter,Caractéristiques à compléter,Catégories à compléter,Marques à compléter,Emballage à compléter,Quantité à compléter,Nom du produit complete,Photos à valider,Photos envoyées",,,https://static.openfoodfacts.org/images/products/000/000/000/0017/front_fr.4.400.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/front_fr.4.200.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/ingredients_fr.7.400.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/ingredients_fr.7.200.jpg,,,1569,,7,3.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,70.1,15,,,,,,,,,,7.8,,,,1.4,0.551181102362205,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

@CharlesNepote
Copy link
Member Author

CharlesNepote commented Sep 28, 2019

@bredowmax:

My suggestion for fields to be deleted:
[...]

Can you provide a reason why? Just notice it is very easy to delete fields after download while you can't build them back. I think it would also be better that you give an explication for each field you want to add / delete / modify.

Keep in mind that some people are downloading the CSV every day for very different reasons (they probably need different fields than you need).

@bredowmax
Copy link

It was my suggestion out of efficiency: the columns I suggested seem mostly empty or redundant. The file size is already 2gb and it will be larger with every column and row we add.

Example Nutriscore: there is nutriscore EN and FR, and they are identical

There are other examples where there is not a single value

I guess a good way would be a poll among the people that use the csv

@teolemon teolemon added Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data export labels Jan 10, 2020
@pieterwillaert
Copy link

Some fields I would like to be added:

  • ecoscore_data.grade
  • ecoscore_data.status
  • ecoscore_data.score

@CharlesNepote
Copy link
Member Author

CharlesNepote commented Sep 7, 2022

Empty fields to be deleted:

  • additives
  • nutrition-score-uk_100g

... and I suspect other ones.

@CharlesNepote CharlesNepote added the 🧽 Data quality https://wiki.openfoodfacts.org/Quality label Oct 15, 2022
@stephanegigandet
Copy link
Contributor

I would not remove any field.

additives: it would be best to fix it (generate it from additives_tags), in order to be consistent with categories etc.

@CharlesNepote
Copy link
Member Author

CharlesNepote commented Oct 18, 2022

I would not remove any field

Ok.

additives: it would be best to fix it (generate it from additives_tags), in order to be consistent with categories etc.

categories is filled by users while additives_tags are computed from ingredients_text I guess (or ingredient_[something]). So it's not consistent to generate additives or to call it that way, don't you think so? Generating additives would also lead to duplicate data like for states, states_tags and sates_en which are nearly the same!

  • states: "en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:origins-to-be-completed, en:categories-to-be-completed, en:brands-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-uploaded"
  • states_tags: "en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:packaging-code-to-be-completed,en:characteristics-to-be-completed,en:origins-to-be-completed,en:categories-to-be-completed,en:brands-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:product-name-completed,en:photos-to-be-uploaded"
  • states_en: "To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Packaging code to be completed,Characteristics to be completed,Origins to be completed,Categories to be completed,Brands completed,Packaging to be completed,Quantity to be completed,Product name completed,Photos to be uploaded"

I think I prefer to keep the additives empty.

@CharlesNepote
Copy link
Member Author

CharlesNepote commented Dec 6, 2022

last_modified_by would be also nice to improve data quality management.

misc as it provides useful information about the product. Eg. en:packagings-with-all-weights.

erythritol

In the CSV there is currently no data to know whether the nutrients per 100g are computed from the values per serving or per 100 g. This is annoying because the values per 100g can be buggy in case of low serving size and rounded values.

@CharlesNepote
Copy link
Member Author

CharlesNepote commented Jan 6, 2023

An update as of today

Since last check, -erythritol has been automatically added.

There are more empty fields than I expected. 4 fields + all fields beginning with a hyphen. Some of them should be useful, such as -sucrose_100g, etc., -erythritol_100g.

cities
allergens_en
no_nutriments
additives

-butyric-acid_100g
-caproic-acid_100g
-caprylic-acid_100g
-capric-acid_100g
-lauric-acid_100g
-myristic-acid_100g
-palmitic-acid_100g
-stearic-acid_100g
-arachidic-acid_100g
-behenic-acid_100g
-lignoceric-acid_100g
-cerotic-acid_100g
-montanic-acid_100g
-melissic-acid_100g

-alpha-linolenic-acid_100g
-eicosapentaenoic-acid_100g
-docosahexaenoic-acid_100g

-linoleic-acid_100g
-arachidonic-acid_100g
-gamma-linolenic-acid_100g
-dihomo-gamma-linolenic-acid_100g

-oleic-acid_100g
-elaidic-acid_100g
-gondoic-acid_100g
-mead-acid_100g
-erucic-acid_100g
-nervonic-acid_100g

-sucrose_100g
-glucose_100g
-fructose_100g
-lactose_100g
-maltose_100g
-maltodextrins_100g

-erythritol_100g

For the fields beginning with, I never used them and didn't notice there were empty until now.

Due to the fact that no-one complained on last update, I suggest to:

  • maybe try to keep the place of each existing column; or not if we decide it's not important
  • replace wrong column names by the right ones (1 case + columns beginning by a hyphen)
  • maybe replace useless fields by new fields: it's a bit hard as there is a logical order of the fields

What we could do:

  • cities does not exist in the API; it's after first_packaging_code_geo and before cities_tags; we can keep it empty or remove it
  • allergens_en does not exist in the API; it's after allergens and before traces; we could replace it by something else.
  • no_nutriments does not exist in the API; it's after serving_quantity and before additives_n; as no_nutrition_data exists in the API I suggest to change the name.
  • additives does not exist and is in between additives_n and additives_tags; we could either keep it empty or remove it
  • add last_modified_by either at the end, either after last_modified_datetime
  • add misc either at the end, either after states

@stephanegigandet
Copy link
Contributor

There are more empty fields than I expected. 4 fields + all fields beginning with a hyphen. Some of them should be useful, such as -sucrose_100g, etc., -erythritol_100g.

It's probably because we added sub sub nutriments, but did not update the export file.

https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl#L258

	$nid =~ s/^-//g;

should be changed to:

	$nid =~ s/^(-+)//g;

@CharlesNepote
Copy link
Member Author

Indeed, I can see it in Food.pm.

@teolemon
Copy link
Member

We clearly need a lighter CSV in addition to the full one. 8GB compressed ? That's too much. We should aim at having one sub 1GB files available as well for beginners/people with older machines…

@teolemon teolemon moved this to To discuss and validate in 🍊 Open Food Facts Server issues Apr 23, 2024
@teolemon teolemon removed the ✨ Feature Features or enhancements to Open Food Facts server label Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CSV exports Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data 🧽 Data quality https://wiki.openfoodfacts.org/Quality export
Projects
Status: To discuss and validate
Status: In progress
Development

No branches or pull requests

6 participants