Fix train state and modification time for unfinished project training #722
Conversation
Codecov Report
Patch coverage:
Additional details and impacted files

@@           Coverage Diff           @@
##             main     #722   +/-   ##
=======================================
  Coverage   99.67%   99.67%
=======================================
  Files          89       89
  Lines        6380     6397   +17
=======================================
+ Hits         6359     6376   +17
  Misses         21       21

☔ View full report in Codecov by Sentry.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
I think the implemented approach looks sensible.
Regarding this idea:
> To make the state of projects more coherent, the first step of the training could be to remove old data files. Then there would not be the problem with the modification time of the vectorizer file (updating it without updating the rest of the model could also break the project, either silently or entirely, because of a wrong index): it would be clear that the vectorizer file does not affect the modification time or train state of the project.
> This would also keep the project's datadir clean of any temp files hanging around.
> However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project can remain working (depending on which step fails).
I think that this would be a step in the wrong direction. It would be better to try to make train operations more atomic, so that vectorizers, train files, models etc. are first written into temporary files and only at the very end renamed to replace the existing files (if any). This is how annif.util.atomic_save is intended to work, but it only applies to individual files, not to sets of files that need to be kept in sync (e.g. vectorizer + model).
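A minimal sketch of what such a multi-file atomic save could look like (the helper name `atomic_save_all` and its interface are hypothetical, not part of `annif.util`; it assumes all files live in the same directory so that `os.replace` stays atomic):

```python
import os
import tempfile


def atomic_save_all(writers, directory):
    """Atomically replace a set of related files (e.g. vectorizer + model).

    `writers` maps final filenames to callables that write their content to a
    given temporary path. Only if every writer succeeds are the temp files
    renamed over the final filenames; otherwise the old files stay intact.
    """
    temp_paths = {}
    try:
        for filename, write in writers.items():
            fd, temp_path = tempfile.mkstemp(prefix='tmp-', dir=directory)
            os.close(fd)
            temp_paths[filename] = temp_path
            write(temp_path)
        # every write succeeded: swap all files in at the very end
        for filename, temp_path in temp_paths.items():
            os.replace(temp_path, os.path.join(directory, filename))
    except Exception:
        # a write failed: clean up the partial temp files, keep the old model
        for temp_path in temp_paths.values():
            if os.path.exists(temp_path):
                os.remove(temp_path)
        raise
```

A backend could then call something like `atomic_save_all({'vectorizer': save_vectorizer, 'model': save_model}, datadir)` (names made up for illustration), so an interrupted training never leaves a vectorizer that is newer than the model it belongs to.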
If the initial training of a project is not finished after some files have already been created in the project's data directory, the train state and modification time information turn out incorrect: the project appears to be (fully) trained when it is not, and a modification time is reported. And when retraining of a project is interrupted, the modification time is falsely updated.

This problem can occur more commonly when/if the `--prepare-only` option to the train command is implemented.

This PR makes the methods inquiring the train state and modification time ignore files in the project's datadir matching the patterns `*-train*`, `tmp-*` and `vectorizer`. The `tmp-` prefix is added to all temporary files, because some backends use a tempfile for the model file during training, which can remain after unfinished training, e.g. `stwfsa_predictor1mz8z4im.zip`.

The train and temp files should definitely be ignored, but the vectorizer file case is not so clear.
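As a rough, simplified sketch of the ignore-pattern idea (not the actual implementation in this PR; the helper names below are made up for illustration):

```python
import glob
import os
from fnmatch import fnmatch

# files matching these patterns should not affect train state or modification time
IGNORE_PATTERNS = ('*-train*', 'tmp-*', 'vectorizer')


def _model_files(datadir):
    """Yield files in the project datadir that count towards the trained model."""
    for path in glob.glob(os.path.join(datadir, '*')):
        filename = os.path.basename(path)
        if not any(fnmatch(filename, pattern) for pattern in IGNORE_PATTERNS):
            yield path


def is_trained(datadir):
    # the project counts as trained only if at least one non-ignored file exists
    return any(_model_files(datadir))


def modification_time(datadir):
    # newest mtime over the non-ignored files, or None if there are none
    mtimes = [os.path.getmtime(path) for path in _model_files(datadir)]
    return max(mtimes) if mtimes else None
```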
Instead of using global ignore patterns, this functionality could use the actual model file names/patterns per backend, but the field storing them varies a bit (`MODEL_FILE`, `INDEX_FILE`, `MODEL_FILE_PREFIX`).
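A hedged sketch of what that per-backend alternative might look like, assuming only the class attributes named above and collecting whichever one a backend defines (the helper function itself is hypothetical):

```python
def backend_model_patterns(backend):
    """Collect glob patterns for the files that make up this backend's trained
    model, based on whichever of the known class attributes it defines."""
    patterns = []
    if getattr(backend, 'MODEL_FILE', None):
        patterns.append(backend.MODEL_FILE)
    if getattr(backend, 'INDEX_FILE', None):
        patterns.append(backend.INDEX_FILE)
    if getattr(backend, 'MODEL_FILE_PREFIX', None):
        patterns.append(backend.MODEL_FILE_PREFIX + '*')
    return patterns
```

The train state and modification time methods could then look only at files matching these patterns, rather than excluding files by global ignore patterns.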