CSV export revision (Adding new fields and removing some) #2325

CharlesNepote · 2019-09-13T17:27:24Z

CSV has not evolved since a long time. Some people need new data in the CSV. This issue allows to discuss which new fields could be exported.

Fields to add:

ingredients_analysis_tags
- description: this field is computed from ingredients' analysis to know if the product is or not vegan, vegetarian, and/or with palm-oil
- Eg. ingredients_analysis_tags: ["en:palm-oil","en:non-vegan","en:vegetarian-status-unknown"]
- rationale: it can be a helper to control data quality
nutrient_levels_tags?
- description: "It represents the traffic lights system made by the UK FSA (Food Safety Administration). It is used by some manufacturers on a voluntary basis in Great Britain, but was rejected by the European Commission in 2010. On Open Food Facts, when the nutritional values are known, the traffic lights are displayed on the product pages. The calculation formula defining the colors of the lights is described on a dedicated page of Open Food Facts website. Nutrient levels can be found here on the website: https://world.openfoodfacts.org/nutrient-levels
- Example: nutrient_levels_tags: ["en:fat-in-high-quantity","en:saturated-fat-in-high-quantity","en:sugars-in-high-quantity","en:salt-in-low-quantity"]
- rationale: it can be a helper to control data quality
product_quantity
- description: "This is the normalized quantity of the product in grams (ISO system)."
- Example: product_quantity: "1500", computed from quantity: "1,5 L"; product_quantity: "320", computed from quantity: "2 x 160 g"
- rationale: it's easier to play with this data as Open Food Facts already do the computation. We already provide serving_quantity.
owner
- description: "This is the owner of the product, which have sent the product's data to Open Food Facts. The list of owners can be found at https://world.openfoodfacts.org/owners"
- Example: "owner: "org-carrefour"
- rationale: it can be a helper to control data quality
data_quality_errors_tags
- description: "Returns a list of all detected errors for the product."
- Example: data_quality_errors_tags: ["en:nutrition-saturated-fat-greater-than-fat"] (speaks for itself).
- rationale: 1. Open Food Facts reusers might want to remove the products with quality issues. 2. It can ease to build tools for data quality. It represents ~30K products as of 2022-10.
unique_scans_n (contains the number of unique scans of a product (~33% of products))
- description: "Returns an integer which represent the number of users who have scanned the product at least one time. "Users" are identified by different IPs. This value is not computed in real time but once a year."
- Example: unique_scans_n: "8".
- rationale: this is a good proxy to understand which are the most consumed products.
popularity_tags
- description: "The popularity of a product is computed thanks to its number of unique scans. The popularity_tags field groups products by different levels of popularity by year, either in the world, either in the countries where it is popular."
- Example: popularity_tags: ["top-50000-scans-2019","top-100000-scans-2019","at-least-5-scans-2019","at-least-10-scans-2019","top-75-percent-scans-2019","top-80-percent-scans-2019","top-85-percent-scans-2019","top-90-percent-scans-2019","top-50000-fr-scans-2019","top-100000-fr-scans-2019","top-country-fr-scans-2019","at-least-5-fr-scans-2019","at-least-10-fr-scans-2019"]
- rationale: this field might be more clear than unique_scans_n, as the latter could suggest this number is a fresh data if not real-time
completeness:
- description: completeness is a float number, between 0 and 1.1, measuring how complete is the product (the higher, the most complete). Currently, we check for more than 10 items: image completeness, product_name, quantity, packaging, brands, categories, origins, emb_codes, expiration_date, ingreditents_text, nutriments (or no_nutrition_data if it is on).
- Eg. completeness: 0.7625
last_image_t:
- description: It is the date (Unix format) of the last image being uploaded for the product.
- eg. last_image_t: 1666661491
- rationale: it can be a helper to control data quality: if the product does not have any image, it should be impossible to fix data quality issues

Fields to delete ??

should we keep both created_t / lastmodified_t vs created_datetime / last_modified_datetime (51 Mb lost)
additives, which is empty
one of the states* three fields; states and states_tags are almost identical, the only difference is that states contains spaces
brands or brands_tags: the latter is only the normalized version of the first one (lowercased, unaccented, and replacing spaces and typographic signs by a "-")

Fields that could evolve:

fields with URL take a huge weight
- all image URLs begin by https://xxxxx.openfoodfacts.org/images/products/
- It is 866 Mb of data as of 2022-10-12, representing ~13% of the CSV size.

Process:

build a new CSV file as a beta version, with a different name than the current one, and let people discuss about it
when beta version turn stable/official, keep old version and tell everyone that old version will be unavailable in xx months

Implementation

First modify: @export_fields in https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Config_off.pm
Then modify export script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl
And finally end with the documentation: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/html/data/data-fields.txt

[to be completed]

The text was updated successfully, but these errors were encountered:

bredowmax · 2019-09-13T17:30:39Z

Columns I’d definitely love to see included are “vegan?”, “vegetarian?” and “co2 footprint” and from_category_averaged nutrition values for unbranded/unpackaged/fresh foods

maxmeetrobin · 2019-09-13T17:33:39Z

maxmeetrobin · 2019-09-13T17:41:42Z

More fields I suggest to be deleted:

maxmeetrobin · 2019-09-13T17:50:36Z

More fields I'd love to add:

completeness
nutrient_levels (all 4 of them)
unique_scans_n
ingredients_text

maxmeetrobin · 2019-09-13T17:51:38Z

While we're at it, I'd love to move Code as the unique identifier to the very left/beginning of each product listing

maxmeetrobin · 2019-09-13T17:53:16Z

For some fields, there are several translations. To keep the CSV as compact as possible, I suggest keeping only the -en translated version

CharlesNepote · 2019-09-28T08:15:01Z

While we're at it, I'd love to move Code as the unique identifier to the very left/beginning of each product listing

Code is already at the left/beginning of each product, isn't it? See:

code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,packaging,packaging_tags,brands,brands_tags,categories,categories_tags,categories_fr,origins,origins_tags,manufacturing_places,manufacturing_places_tags,labels,labels_tags,labels_fr,emb_codes,emb_codes_tags,first_packaging_code_geo,cities,cities_tags,purchase_places,stores,countries,countries_tags,countries_fr,ingredients_text,allergens,allergens_fr,traces,traces_tags,traces_fr,serving_size,serving_quantity,no_nutriments,additives_n,additives,additives_tags,additives_fr,ingredients_from_palm_oil_n,ingredients_from_palm_oil,ingredients_from_palm_oil_tags,ingredients_that_may_be_from_palm_oil_n,ingredients_that_may_be_from_palm_oil,ingredients_that_may_be_from_palm_oil_tags,nutrition_grade_fr,nova_group,pnns_groups_1,pnns_groups_2,states,states_tags,states_fr,main_category,main_category_fr,image_url,image_small_url,image_ingredients_url,image_ingredients_small_url,image_nutrition_url,image_nutrition_small_url,energy_100g,energy-from-fat_100g,fat_100g,saturated-fat_100g,butyric-acid_100g,caproic-acid_100g,caprylic-acid_100g,capric-acid_100g,lauric-acid_100g,myristic-acid_100g,palmitic-acid_100g,stearic-acid_100g,arachidic-acid_100g,behenic-acid_100g,lignoceric-acid_100g,cerotic-acid_100g,montanic-acid_100g,melissic-acid_100g,monounsaturated-fat_100g,polyunsaturated-fat_100g,omega-3-fat_100g,alpha-linolenic-acid_100g,eicosapentaenoic-acid_100g,docosahexaenoic-acid_100g,omega-6-fat_100g,linoleic-acid_100g,arachidonic-acid_100g,gamma-linolenic-acid_100g,dihomo-gamma-linolenic-acid_100g,omega-9-fat_100g,oleic-acid_100g,elaidic-acid_100g,gondoic-acid_100g,mead-acid_100g,erucic-acid_100g,nervonic-acid_100g,trans-fat_100g,cholesterol_100g,carbohydrates_100g,sugars_100g,sucrose_100g,glucose_100g,fructose_100g,lactose_100g,maltose_100g,maltodextrins_100g,starch_100g,polyols_100g,fiber_100g,proteins_100g,casein_100g,serum-proteins_100g,nucleotides_100g,salt_100g,sodium_100g,alcohol_100g,vitamin-a_100g,beta-carotene_100g,vitamin-d_100g,vitamin-e_100g,vitamin-k_100g,vitamin-c_100g,vitamin-b1_100g,vitamin-b2_100g,vitamin-pp_100g,vitamin-b6_100g,vitamin-b9_100g,folates_100g,vitamin-b12_100g,biotin_100g,pantothenic-acid_100g,silica_100g,bicarbonate_100g,potassium_100g,chloride_100g,calcium_100g,phosphorus_100g,iron_100g,magnesium_100g,zinc_100g,copper_100g,manganese_100g,fluoride_100g,selenium_100g,chromium_100g,molybdenum_100g,iodine_100g,caffeine_100g,taurine_100g,ph_100g,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g,choline_100g,phylloquinone_100g,beta-glucan_100g,inositol_100g,carnitine_100g
0000000000017,http://world-fr.openfoodfacts.org/produit/0000000000017/vitoria-crackers,kiliweb,1529059080,2018-06-15T10:38:00Z,1529059204,2018-06-15T10:40:04Z,Vitória crackers,,,,,,,,,,,,,,,,,,,,,,,,France,en:france,France,,,,,,,,0,,,,,,,,,,,,,,,,"en:to-be-completed, en:nutrition-facts-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:categories-to-be-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-validated, en:photos-uploaded","en:to-be-completed,en:nutrition-facts-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:packaging-code-to-be-completed,en:characteristics-to-be-completed,en:categories-to-be-completed,en:brands-to-be-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:product-name-completed,en:photos-to-be-validated,en:photos-uploaded","A compléter,Informations nutritionnelles complétées,Ingrédients à compléter,Date limite à compléter,Code emballeur à compléter,Caractéristiques à compléter,Catégories à compléter,Marques à compléter,Emballage à compléter,Quantité à compléter,Nom du produit complete,Photos à valider,Photos envoyées",,,https://static.openfoodfacts.org/images/products/000/000/000/0017/front_fr.4.400.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/front_fr.4.200.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/ingredients_fr.7.400.jpg,https://static.openfoodfacts.org/images/products/000/000/000/0017/ingredients_fr.7.200.jpg,,,1569,,7,3.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,70.1,15,,,,,,,,,,7.8,,,,1.4,0.551181102362205,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

CharlesNepote · 2019-09-28T08:22:08Z

@bredowmax:

My suggestion for fields to be deleted:
[...]

Can you provide a reason why? Just notice it is very easy to delete fields after download while you can't build them back. I think it would also be better that you give an explication for each field you want to add / delete / modify.

Keep in mind that some people are downloading the CSV every day for very different reasons (they probably need different fields than you need).

bredowmax · 2019-09-28T09:52:59Z

It was my suggestion out of efficiency: the columns I suggested seem mostly empty or redundant. The file size is already 2gb and it will be larger with every column and row we add.

Example Nutriscore: there is nutriscore EN and FR, and they are identical

There are other examples where there is not a single value

I guess a good way would be a poll among the people that use the csv

pieterwillaert · 2021-02-10T11:02:04Z

Some fields I would like to be added:

ecoscore_data.grade
ecoscore_data.status
ecoscore_data.score

CharlesNepote · 2022-09-07T14:22:38Z

Empty fields to be deleted:

additives
nutrition-score-uk_100g

... and I suspect other ones.

stephanegigandet · 2022-10-18T13:56:47Z

I would not remove any field.

additives: it would be best to fix it (generate it from additives_tags), in order to be consistent with categories etc.

CharlesNepote · 2022-10-18T14:31:45Z

I would not remove any field

Ok.

additives: it would be best to fix it (generate it from additives_tags), in order to be consistent with categories etc.

categories is filled by users while additives_tags are computed from ingredients_text I guess (or ingredient_[something]). So it's not consistent to generate additives or to call it that way, don't you think so? Generating additives would also lead to duplicate data like for states, states_tags and sates_en which are nearly the same!

states: "en:to-be-completed, en:nutrition-facts-to-be-completed, en:ingredients-to-be-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:origins-to-be-completed, en:categories-to-be-completed, en:brands-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-uploaded"
states_tags: "en:to-be-completed,en:nutrition-facts-to-be-completed,en:ingredients-to-be-completed,en:expiration-date-to-be-completed,en:packaging-code-to-be-completed,en:characteristics-to-be-completed,en:origins-to-be-completed,en:categories-to-be-completed,en:brands-completed,en:packaging-to-be-completed,en:quantity-to-be-completed,en:product-name-completed,en:photos-to-be-uploaded"
states_en: "To be completed,Nutrition facts to be completed,Ingredients to be completed,Expiration date to be completed,Packaging code to be completed,Characteristics to be completed,Origins to be completed,Categories to be completed,Brands completed,Packaging to be completed,Quantity to be completed,Product name completed,Photos to be uploaded"

I think I prefer to keep the additives empty.

CharlesNepote · 2022-12-06T20:48:33Z

last_modified_by would be also nice to improve data quality management.

misc as it provides useful information about the product. Eg. en:packagings-with-all-weights.

erythritol

In the CSV there is currently no data to know whether the nutrients per 100g are computed from the values per serving or per 100 g. This is annoying because the values per 100g can be buggy in case of low serving size and rounded values.

CharlesNepote · 2023-01-06T15:25:57Z

An update as of today

Since last check, -erythritol has been automatically added.

There are more empty fields than I expected. 4 fields + all fields beginning with a hyphen. Some of them should be useful, such as -sucrose_100g, etc., -erythritol_100g.

cities
allergens_en
no_nutriments
additives

-butyric-acid_100g
-caproic-acid_100g
-caprylic-acid_100g
-capric-acid_100g
-lauric-acid_100g
-myristic-acid_100g
-palmitic-acid_100g
-stearic-acid_100g
-arachidic-acid_100g
-behenic-acid_100g
-lignoceric-acid_100g
-cerotic-acid_100g
-montanic-acid_100g
-melissic-acid_100g

-alpha-linolenic-acid_100g
-eicosapentaenoic-acid_100g
-docosahexaenoic-acid_100g

-linoleic-acid_100g
-arachidonic-acid_100g
-gamma-linolenic-acid_100g
-dihomo-gamma-linolenic-acid_100g

-oleic-acid_100g
-elaidic-acid_100g
-gondoic-acid_100g
-mead-acid_100g
-erucic-acid_100g
-nervonic-acid_100g

-sucrose_100g
-glucose_100g
-fructose_100g
-lactose_100g
-maltose_100g
-maltodextrins_100g

-erythritol_100g

For the fields beginning with, I never used them and didn't notice there were empty until now.

Due to the fact that no-one complained on last update, I suggest to:

maybe try to keep the place of each existing column; or not if we decide it's not important
replace wrong column names by the right ones (1 case + columns beginning by a hyphen)
maybe replace useless fields by new fields: it's a bit hard as there is a logical order of the fields

What we could do:

cities does not exist in the API; it's after first_packaging_code_geo and before cities_tags; we can keep it empty or remove it
allergens_en does not exist in the API; it's after allergens and before traces; we could replace it by something else.
no_nutriments does not exist in the API; it's after serving_quantity and before additives_n; as no_nutrition_data exists in the API I suggest to change the name.
additives does not exist and is in between additives_n and additives_tags; we could either keep it empty or remove it
add last_modified_by either at the end, either after last_modified_datetime
add misc either at the end, either after states

stephanegigandet · 2023-01-06T16:20:09Z

There are more empty fields than I expected. 4 fields + all fields beginning with a hyphen. Some of them should be useful, such as -sucrose_100g, etc., -erythritol_100g.

It's probably because we added sub sub nutriments, but did not update the export file.

https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl#L258

	$nid =~ s/^-//g;

should be changed to:

	$nid =~ s/^(-+)//g;

CharlesNepote · 2023-01-06T16:31:11Z

Indeed, I can see it in Food.pm.

teolemon · 2023-08-28T14:44:52Z

We clearly need a lighter CSV in addition to the full one. 8GB compressed ? That's too much. We should aim at having one sub 1GB files available as well for beginners/people with older machines…

CharlesNepote added the ✨ Feature Features or enhancements to Open Food Facts server label Sep 13, 2019

teolemon added Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data export labels Jan 10, 2020

teolemon added the CSV exports label Jul 19, 2021

stephanegigandet mentioned this issue Feb 17, 2022

Refactor CSV / XSLX exports to have 1 export format (and code) instead of 3 #6424

Open

teolemon mentioned this issue Feb 18, 2022

⬇️ Data exports (tracker) #6429

Open

teolemon changed the title ~~CSV export revision~~ CSV export revision (Adding new fields and removing some) Feb 18, 2022

teolemon added this to 🍊 Open Food Facts Server issues Mar 31, 2022

CharlesNepote added the 🧽 Data quality https://wiki.openfoodfacts.org/Quality label Oct 15, 2022

CharlesNepote mentioned this issue Oct 18, 2022

feat: more fields in CSV (issue 2325) #7570

Merged

CharlesNepote added this to 🧽 Ensuring Data Quality Dec 20, 2022

CharlesNepote moved this to To do in 🧽 Ensuring Data Quality Dec 20, 2022

CharlesNepote self-assigned this Jan 5, 2023

CharlesNepote mentioned this issue Jan 10, 2023

feat: CSV export enhancements + fix #7970

Merged

CharlesNepote moved this from To do to In progress in 🧽 Ensuring Data Quality Jan 11, 2023

CharlesNepote mentioned this issue Dec 20, 2023

CSV export: add the main language of each product, and delete useless fields #9563

Open

teolemon moved this to To discuss and validate in 🍊 Open Food Facts Server issues Apr 23, 2024

teolemon removed the ✨ Feature Features or enhancements to Open Food Facts server label Oct 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV export revision (Adding new fields and removing some) #2325

CSV export revision (Adding new fields and removing some) #2325

CharlesNepote commented Sep 13, 2019 •

edited

Loading

bredowmax commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

CharlesNepote commented Sep 28, 2019

CharlesNepote commented Sep 28, 2019 •

edited

Loading

bredowmax commented Sep 28, 2019

pieterwillaert commented Feb 10, 2021

CharlesNepote commented Sep 7, 2022 •

edited

Loading

stephanegigandet commented Oct 18, 2022

CharlesNepote commented Oct 18, 2022 •

edited

Loading

CharlesNepote commented Dec 6, 2022 •

edited

Loading

CharlesNepote commented Jan 6, 2023 •

edited

Loading

stephanegigandet commented Jan 6, 2023

CharlesNepote commented Jan 6, 2023

teolemon commented Aug 28, 2023

CSV export revision (Adding new fields and removing some) #2325

CSV export revision (Adding new fields and removing some) #2325

Comments

CharlesNepote commented Sep 13, 2019 • edited Loading

bredowmax commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

maxmeetrobin commented Sep 13, 2019

CharlesNepote commented Sep 28, 2019

CharlesNepote commented Sep 28, 2019 • edited Loading

bredowmax commented Sep 28, 2019

pieterwillaert commented Feb 10, 2021

CharlesNepote commented Sep 7, 2022 • edited Loading

stephanegigandet commented Oct 18, 2022

CharlesNepote commented Oct 18, 2022 • edited Loading

CharlesNepote commented Dec 6, 2022 • edited Loading

CharlesNepote commented Jan 6, 2023 • edited Loading

stephanegigandet commented Jan 6, 2023

CharlesNepote commented Jan 6, 2023

teolemon commented Aug 28, 2023

CharlesNepote commented Sep 13, 2019 •

edited

Loading

CharlesNepote commented Sep 28, 2019 •

edited

Loading

CharlesNepote commented Sep 7, 2022 •

edited

Loading

CharlesNepote commented Oct 18, 2022 •

edited

Loading

CharlesNepote commented Dec 6, 2022 •

edited

Loading

CharlesNepote commented Jan 6, 2023 •

edited

Loading