Generate and update CSV and RDF exports through a PostgreSQL table #10326
I did a bit of an experiment here using the Postgres database that contains all the product data, as previously mentioned here: #8620. You can get a lot of the data needed using the standard Postgres `COPY TO` command, e.g.:

```sql
copy (
  select
    code
    ,'http://world-en.openfoodfacts.org/product/' || code url
    ,data->'created_t' created_t
    ,to_char(to_timestamp((data->>'created_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') created_datetime
    ,data->'last_modified_t' last_modified_t
    ,to_char(to_timestamp((data->>'last_modified_t')::int), 'YYYY-MM-DD"T"HH24:MI:SS"Z"') last_modified_datetime
    ,data->>'product_name' product_name
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'packaging_tags')), ', ') packaging_tags
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'brands_tags')), ', ') brands_tags
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'categories_tags')), ', ') categories_tags
    ,data->>'ingredients_text' ingredients_text
    ,data->'ecoscore_data'->'score' ecoscore_score
    ,data->'ecoscore_data'->>'grade' ecoscore_grade
    ,array_to_string(ARRAY(SELECT json_array_elements_text(data->'data_quality_errors_tags')), ', ') data_quality_errors_tags
    ,data->'nutriments'->'energy-kj_100g' "energy-kj_100g"
    ,data->'nutriments'->'energy-kcal_100g' "energy-kcal_100g"
    ,data->'nutriments'->'energy_100g' "energy_100g"
  from product
  limit 1000
) to '/mnt/e/products.tsv' DELIMITER E'\t' null as ' ' CSV HEADER;
```

However, there are a number of issues:
From my previous experience with off-query, I think that storing the data in a relational format rather than JSON will fix the performance issue (in a small experiment, 1000 records took about 100 ms using this approach).

My suggestion would be that we pre-process the data before loading it into the off-query database to resolve issues 2, 3 and 5. In addition, in order to address issue 2, we would need the taxonomies to be loaded into off-query (I have some code to do that), and we would need to canonicalize all of the tag values of the product before saving, so that we can do a direct SQL join to the taxonomy translations table (see the sketch below). Adding this extra processing to the off-query import code will slow down the import a bit, but it shouldn't have a big impact on the incremental sync from Redis.

In summary, the work items to address this using an off-query approach would be:
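To make the taxonomy join idea concrete, here is a minimal SQL sketch. The `product_tag` and `taxonomy_translation` tables and their columns are hypothetical, purely for illustration, not the actual off-query schema:

```sql
-- Hypothetical relational layout: one row per (product, tag type, canonical tag id),
-- plus per-language names for each taxonomy entry.
select
  p.code,
  string_agg(tt_en.name, ',' order by pt.ordinal) as categories_en,
  string_agg(tt_fr.name, ',' order by pt.ordinal) as categories_fr
from product p
join product_tag pt
  on pt.product_id = p.id
 and pt.tag_type = 'categories'
left join taxonomy_translation tt_en
  on tt_en.tag_type = pt.tag_type and tt_en.tag_id = pt.tag_id and tt_en.lang = 'en'
left join taxonomy_translation tt_fr
  on tt_fr.tag_type = pt.tag_type and tt_fr.tag_id = pt.tag_id and tt_fr.lang = 'fr'
group by p.code;
```

Because the tag values are canonicalized before saving, the join is a plain equality on the tag id, with no per-row string processing left to do at export time.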
It currently takes 4 hours every day (with 1 CPU used the whole time) to generate the 4 CSV and RDF exports:
(+ a long query to MongoDB to go through all products)
The export is done with this script: https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/export_database.pl
The script goes through products one by one, copies some fields, converts some fields (e.g. UNIX timestamps to ISO 8601), adds French and English translations for a few tag fields (categories, labels, etc.), flattens some arrays into strings, and that's it.
I'm thinking that instead of going through all products in MongoDB every day, we could maintain a separate database, similar to the one used in https://github.com/openfoodfacts/openfoodfacts-query, with the same columns we export in CSV. We would then update only changed products using Redis.
There is also an RDF export that we generate at the same time as the CSV; one solution could be to generate and store the RDF for each product in a separate column (see the sketch below).
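As a rough sketch of what that could look like (the `product_export` table and its columns are illustrative only, not an actual off-query schema): a denormalized table keyed by barcode, holding exactly the CSV columns plus an `rdf` column, maintained by an upsert whenever Redis reports a changed product:

```sql
-- Illustrative only: denormalized export table, one row per product.
create table if not exists product_export (
  code                    text primary key,
  url                     text,
  created_t               bigint,
  created_datetime        text,
  last_modified_t         bigint,
  last_modified_datetime  text,
  product_name            text,
  categories_tags         text,
  -- ... the remaining CSV columns ...
  rdf                     text   -- pre-rendered RDF fragment for this product
);

-- On each change event received from Redis, upsert just that product.
insert into product_export (code, url, product_name, rdf)
values ($1, $2, $3, $4)
on conflict (code) do update set
  url = excluded.url,
  product_name = excluded.product_name,
  rdf = excluded.rdf;
```

The daily export would then become a straight `COPY (select ... from product_export) TO ...`, with all the per-product conversion work done incrementally at import time instead of once a day over the whole of MongoDB.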