feat: rebuild taxonomies as they change fixes #6895 (#8027)
* Initial version with just a local cache and all still in version control

* Additional cache for current versions

* Move pattern matching into script

* Remove built files and cache from git

* Ignore patterns

* Make sure taxonomies aren't loaded before they are built

* Load taxonomies after build, but packaging still failing

* Packaging taxonomy fixes for build

* Trying to avoid circular references

* Test data changes TBA

* More check / test fixes

* Make sure lang is always built before taxonomies

* Create build_taxonomies_test target when main containers aren't available

* Adding txt files back for analysis

* Keep results files for analysis

* Re-copied from latest main

* Taxonomy rebuild copied from main

* Example issue. After rebuild of taxonomies this change still wasn't picked up

* Fixes so packaging can build

* Lang fix

* Fixes to ensure building in correct order

* Fixing issues with building from scratch

* Tidy up

* Fix test results

* Remove result.txt again

* Include nutrient_levels in git in case taxonomies built before lang

* Add build-cache submodule

* More test fixes

* Cache update

* Updated to track head

* Update cache reference

* Use github for cache

* Build traces from allergens rather than using symlink

* Remove symlinks for all and put logic in retrieve_tags_taxonomy

* tidy fix

* Remove module and use GitHub API to push

* Perl tidy and removed excess logging

* Reverting unnecessary test expected result changes

* Use taxonomy build cache in pull_request

* Use taxonomy build cache in code_cov.yml

* fix: typo

* Addressing PR feedback

* Typos and ensure cache is updated on build_lang

* Perltidy fix

* More PR feedback

* Perltidy fix

* Perl critic fix

---------

Co-authored-by: Alex Garel <alex@garel.org>
john-gom and alexgarel authored Feb 27, 2023
1 parent 5584cae commit 2a79845
Showing 91 changed files with 1,826 additions and 520,408 deletions.
2 changes: 2 additions & 0 deletions .env
@@ -59,3 +59,5 @@ GEOLITE2_ACCOUNT_ID=
ELASTICSEARCH_HOSTS=
LOG_LEVEL_ROOT=TRACE
LOG_LEVEL_MONGODB=TRACE

BUILD_CACHE_REPO=openfoodfacts/openfoodfacts-build-cache
4 changes: 2 additions & 2 deletions .github/workflows/codecov.yml
@@ -12,7 +12,7 @@ jobs:
- name: rebuild taxonomies
run: |
git ls-files taxonomies/ | xargs -I{} git log -1 --date=format:%Y%m%d%H%M.%S --format='touch -t %ad "{}"' "{}" | bash
make build_taxonomies
make build_taxonomies GITHUB_TOKEN="${{ secrets.TAXONOMY_CACHE_GITHUB_TOKEN }}"
- uses: actions/checkout@master
- name: generate coverage results
run: make cover
@@ -25,4 +25,4 @@ jobs:
- uses: codecov/codecov-action@v3
if: always()
with:
files: cover_db/codecov.json
files: cover_db/codecov.json
2 changes: 1 addition & 1 deletion .github/workflows/pull_request.yml
@@ -87,7 +87,7 @@ jobs:
# see https://stackoverflow.com/a/60984318/2886726
run: |
git ls-files taxonomies/ | xargs -I{} git log -1 --date=format:%Y%m%d%H%M.%S --format='touch -t %ad "{}"' "{}" | bash
make build_taxonomies
make build_taxonomies GITHUB_TOKEN="${{ secrets.TAXONOMY_CACHE_GITHUB_TOKEN }}"
- name: test
run: make tests

10 changes: 9 additions & 1 deletion .gitignore
@@ -2,7 +2,6 @@
lib/ProductOpener/Config.pm
lib/ProductOpener/Config2.pm
lib/ProductOpener/SiteLang.pm
taxonomies/nutrient_levels.txt
po/site-specific

# tests outputs
@@ -67,3 +66,12 @@ html/js/sigma*

# Env files
.env*

# Taxonomies and cache
cache
taxonomies/*.sto
taxonomies/*.result.txt
taxonomies/*.all.txt
/build-cache/taxonomies/*.json
/build-cache/taxonomies/*.sto
/build-cache/taxonomies/*.result.txt
3 changes: 2 additions & 1 deletion Dockerfile
@@ -190,11 +190,12 @@ RUN \
done && \
chown www-data:www-data -R /mnt/podata && \
# Create symlinks of data files that are indeed conf data in /mnt/podata (because we currently mix data and conf data)
for path in data-default ecoscore emb_codes forest-footprint ingredients packager-codes po taxonomies templates; do \
for path in data-default ecoscore emb_codes forest-footprint ingredients packager-codes po taxonomies templates build-cache; do \
ln -sf /opt/product-opener/${path} /mnt/podata/${path}; \
done && \
# Create some necessary files to ensure permissions in volumes
mkdir -p /opt/product-opener/html/data/ && \
mkdir -p /opt/product-opener/html/data/taxonomies/ && \
mkdir -p /opt/product-opener/html/images/ && \
chown www-data:www-data -R /opt/product-opener/html/ && \
# logs dir
24 changes: 12 additions & 12 deletions Makefile
@@ -158,20 +158,21 @@ coverage_txt:
#----------#
build_lang:
@echo "🥫 Rebuild language"
# Run build_lang.pl
${DOCKER_COMPOSE} run --rm backend perl -I/opt/product-opener/lib -I/opt/perl/local/lib/perl5 /opt/product-opener/scripts/build_lang.pl
# Run build_lang.pl
# Languages may build taxonomies on-the-fly so include GITHUB_TOKEN so results can be cached
${DOCKER_COMPOSE} run --rm -e GITHUB_TOKEN=${GITHUB_TOKEN} backend perl -I/opt/product-opener/lib -I/opt/perl/local/lib/perl5 /opt/product-opener/scripts/build_lang.pl

build_lang_test:
# Run build_lang.pl in test env
${DOCKER_COMPOSE_TEST} run --rm backend perl -I/opt/product-opener/lib -I/opt/perl/local/lib/perl5 /opt/product-opener/scripts/build_lang.pl
${DOCKER_COMPOSE_TEST} run --rm -e GITHUB_TOKEN=${GITHUB_TOKEN} backend perl -I/opt/product-opener/lib -I/opt/perl/local/lib/perl5 /opt/product-opener/scripts/build_lang.pl

# use this in dev if you messed up with permissions or user uid/gid
reset_owner:
@echo "🥫 reset owner"
${DOCKER_COMPOSE} run --rm --no-deps --user root backend chown www-data:www-data -R /opt/product-opener/ /mnt/podata /var/log/apache2 /var/log/httpd || true
${DOCKER_COMPOSE} run --rm --no-deps --user root frontend chown www-data:www-data -R /opt/product-opener/html/images/icons/dist /opt/product-opener/html/js/dist /opt/product-opener/html/css/dist
${DOCKER_COMPOSE_TEST} run --rm --no-deps --user root backend chown www-data:www-data -R /opt/product-opener/ /mnt/podata /var/log/apache2 /var/log/httpd || true
${DOCKER_COMPOSE_TEST} run --rm --no-deps --user root frontend chown www-data:www-data -R /opt/product-opener/html/images/icons/dist /opt/product-opener/html/js/dist /opt/product-opener/html/css/dist

init_backend: build_lang
init_backend: build_lang build_taxonomies

create_mongodb_indexes:
@echo "🥫 Creating MongoDB indexes …"
@@ -324,13 +325,12 @@ check_critic:
# Compilation #
#-------------#

build_taxonomies:
@echo "🥫 build taxonomies on ${CPU_COUNT} procs"
${DOCKER_COMPOSE} run --no-deps --rm backend make -C taxonomies -j ${CPU_COUNT}
build_taxonomies: build_lang # build_lang generates the nutrient_level taxonomy source file
@echo "🥫 build taxonomies"
# GITHUB_TOKEN might be empty, but if it's a valid token it enables pushing taxonomies to build cache repository
${DOCKER_COMPOSE} run --no-deps --rm -e GITHUB_TOKEN=${GITHUB_TOKEN} backend /opt/product-opener/scripts/build_tags_taxonomy.pl ${name}

rebuild_taxonomies:
@echo "🥫 re-build all taxonomies on ${CPU_COUNT} procs"
${DOCKER_COMPOSE} run --rm backend make -C taxonomies all_taxonomies -j ${CPU_COUNT}
rebuild_taxonomies: build_taxonomies

#------------#
# Production #
7 changes: 7 additions & 0 deletions build-cache/taxonomies/README.md
@@ -0,0 +1,7 @@
Cached copies of taxonomy build results are stored here.

If no local cache is available then https://github.com/openfoodfacts/openfoodfacts-build-cache is checked for a copy.

If the taxonomy needs to be built then this will be uploaded back to the repo if the GITHUB_TOKEN environment variable is set.

The token is a personal access token, created here: https://github.com/settings/tokens. Only the public_repo scope is needed.
1 change: 1 addition & 0 deletions conf/apache.conf
@@ -29,6 +29,7 @@ PerlPassEnv POSTGRES_PASSWORD
PerlPassEnv LOG_LEVEL_ROOT
PerlPassEnv LOG_LEVEL_MONGODB
PerlPassEnv OFF_LOG_EMAILS
PerlPassEnv BUILD_CACHE_REPO


<IfDefine PERLDB>
1 change: 1 addition & 0 deletions cpanfile
@@ -67,6 +67,7 @@ requires 'JSON::Parse';
requires 'Data::DeepAccess';
requires 'XML::XML2JSON';
requires 'Redis';
requires 'Digest::SHA1';


# Mojolicious/Minion
1 change: 1 addition & 0 deletions docker-compose.yml
@@ -26,6 +26,7 @@ x-backend-conf: &backend-conf
- LOG_LEVEL_ROOT
- LOG_LEVEL_MONGODB
- INFLUXDB_HOST
- BUILD_CACHE_REPO
depends_on:
- memcached
volumes:
28 changes: 28 additions & 0 deletions docs/explanations/taxonomy_build_cache.md
@@ -0,0 +1,28 @@
Taxonomies have a significant impact on OFF processing and automated test results, so they need to be rebuilt before running any tests. However, the build takes some time, so the built taxonomy files are cached in a GitHub repository and only need to be rebuilt when there is a genuine change.

# How it works
A hash is calculated for all of the source files used to build a particular taxonomy and GitHub is then checked to see if a cache already exists for that hash.

If no cached build is found then the taxonomy is rebuilt and cached locally.

If the GITHUB_TOKEN environment variable is set then the cached build is also uploaded to the https://github.com/openfoodfacts/openfoodfacts-build-cache repository. Note that no token is required to download previous cached builds from the repo.
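
The hash-then-lookup step described above can be sketched as follows. This is an illustrative Python sketch only, not the project's code (the real logic lives in the Perl build scripts). The function names, the choice to hash file contents in sorted path order, and the `<name>.<hash>.json` file naming are all assumptions made for the example, and the remote GitHub lookup is left out.

```python
import hashlib
from pathlib import Path


def source_hash(paths):
    """Return one SHA-1 hex digest covering the contents of every source file.

    Files are read in sorted path order so the result does not depend on the
    order in which the caller lists them (an assumption of this sketch).
    """
    digest = hashlib.sha1()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.read_bytes())
    return digest.hexdigest()


def find_cached_build(name, digest, cache_dir):
    """Return the locally cached build for this hash, or None if it is missing.

    The "<name>.<hash>.json" naming is hypothetical. A None result is the point
    at which the remote cache would be checked and, failing that, the taxonomy
    would be rebuilt.
    """
    candidate = Path(cache_dir) / f"{name}.{digest}.json"
    return candidate if candidate.exists() else None
```

Because the hash is part of the cache key, editing any source file yields a new key, so a stale build is never picked up by mistake.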

# Obtaining a token

The GITHUB_TOKEN is a personal access token, created here: https://github.com/settings/tokens. Only the public_repo scope is needed.

# Considerations

When maintaining this code, be aware of the following complications.

## Circular Dependencies

There is a circular dependency between taxonomies, languages and foods. The foods library is used to create the source for the nutrient_levels taxonomy, which uses translations from languages. However, languages depends on the languages taxonomy...

This is currently resolved by building the taxonomy on the fly if it is requested but not currently built.
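
As a rough illustration of "building on the fly", the sketch below builds a taxonomy the first time it is requested and reuses the result afterwards. It is written in Python for brevity and is not the project's Perl code; `build_taxonomy`, `get_taxonomy` and the in-memory `built` dictionary are hypothetical stand-ins.

```python
built = {}        # stand-in for the set of taxonomies that are already built
build_calls = []  # records each build so the behaviour is easy to inspect


def build_taxonomy(name):
    """Hypothetical stand-in for the real, slow build step."""
    build_calls.append(name)
    return {"name": name}


def get_taxonomy(name):
    """Return the taxonomy, building it first if it has not been built yet."""
    if name not in built:
        built[name] = build_taxonomy(name)
    return built[name]
```

With this shape, whichever side of the cycle asks first triggers the build, and later requests simply reuse the result.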

## Taxonomy Dependencies

Some taxonomies perform lookups on others, e.g. additives_classes are referenced by additives, so the referenced taxonomy needs to be built first. The build order is determined in the Config_off.pm file.
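
In the project the build order is a hand-maintained list, but the constraint it has to satisfy can be checked mechanically. The Python sketch below is an illustration under that assumption, using the standard library's `graphlib` (Python 3.9+). The `depends_on` mapping covers only a small subset of taxonomies, with dependencies taken from the comments in Config_off.pm.

```python
from graphlib import TopologicalSorter

# taxonomy -> taxonomies that must be built before it (illustrative subset)
depends_on = {
    "additives": {"additives_classes"},
    "minerals": {"additives_classes"},
    "ingredients": {"allergens"},
    "traces": {"allergens"},
}

# static_order() yields every taxonomy after all of the taxonomies it depends on
build_order = list(TopologicalSorter(depends_on).static_order())
```

Any order produced this way places additives_classes before additives and minerals, and allergens before ingredients and traces. If the mapping contained a cycle, `graphlib` would raise `CycleError`.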


4 changes: 4 additions & 0 deletions lib/ProductOpener/Config2_docker.pm
@@ -52,6 +52,7 @@ BEGIN {
$events_password
$redis_url
%server_options
$build_cache_repo
);
%EXPORT_TAGS = (all => [@EXPORT_OK]);
}
@@ -128,4 +129,7 @@ $redis_url = $ENV{REDIS_URL};
# this one does not seems to be used
minion_admin_server_and_port => "http://0.0.0.0:3003",
);

$build_cache_repo = $ENV{BUILD_CACHE_REPO};

1;
53 changes: 48 additions & 5 deletions lib/ProductOpener/Config_off.pm
@@ -88,6 +88,7 @@ BEGIN {
@edit_rules
$build_cache_repo
);
%EXPORT_TAGS = (all => [@EXPORT_OK]);
}
Expand Down Expand Up @@ -364,6 +365,8 @@ $facets_kp_url = $ProductOpener::Config2::facets_kp_url;

%server_options = %ProductOpener::Config2::server_options;

$build_cache_repo = $ProductOpener::Config2::build_cache_repo;

$reference_timezone = 'Europe/Paris';

$contact_email = 'contact@openfoodfacts.org';
@@ -562,15 +565,55 @@ $options{categories_exempted_from_nutrient_levels} = [
# fields for which we will load taxonomies
# note: taxonomies that are used as properties of other taxonomies must be loaded first
# (e.g. additives_classes are referenced in additives)
# Below is a list of all of the taxonomies with other taxonomies that reference them
# If there are entries in () these are other taxonomies that are combined into this one
#
# additives
# additives_classes: additives, minerals
# allergens: ingredients, traces
# amino_acids
# categories
# countries:
# data_quality
# data_quality_bugs (data_quality)
# data_quality_errors (data_quality)
# data_quality_errors_producers (data_quality)
# data_quality_info (data_quality)
# data_quality_warnings (data_quality)
# data_quality_warnings_producers (data_quality)
# food_groups: categories
# improvements
# ingredients_analysis
# ingredients_processing:
# ingredients (additives_classes, additives, minerals, vitamins, nucleotides, other_nutritional_substances): labels
# labels: categories
# languages:
# minerals
# misc
# nova_groups
# nucleotides
# nutrient_levels
# nutrients
# origins (countries): categories, ingredients, labels
# other_nutritional_substances
# packaging_materials: packaging_recycling, packaging_shapes
# packaging_recycling
# packaging_shapes: packaging_materials, packaging_recycling
# packaging (packaging_materials, packaging_shapes, packaging_recycling, preservation): labels
# periods_after_opening:
# states:
# traces (allergens)
# vitamins

@taxonomy_fields = qw(
states countries languages labels categories food_groups
ingredients ingredients_processing
additives_classes additives vitamins minerals amino_acids nucleotides other_nutritional_substances allergens traces
origins
languages states countries
allergens origins additives_classes ingredients
packaging_shapes packaging_materials packaging_recycling packaging
labels food_groups categories
ingredients_processing
additives vitamins minerals amino_acids nucleotides other_nutritional_substances traces
ingredients_analysis
nutrients nutrient_levels misc nova_groups
packaging packaging_shapes packaging_materials packaging_recycling
periods_after_opening
data_quality data_quality_bugs data_quality_info data_quality_warnings data_quality_errors data_quality_warnings_producers data_quality_errors_producers
improvements