Merge pull request #971 from kermitt2/incremental-training
Incremental training for Deep Learning and Wapiti models
kermitt2 authored Nov 25, 2022
2 parents b9d92c6 + fc701f7 commit 2bfadfd
Showing 15 changed files with 219 additions and 53 deletions.
1 change: 1 addition & 0 deletions doc/Grobid-service.md
@@ -751,6 +751,7 @@ Launch a training for a given model. The service returns a training token (a
| | | | type | optional | type of training, `full`, `holdout`, `split`, `nfold`, default is `split` |
| | | | ratio | optional | only considered for `split` training mode, give the ratio (number between 0 and 1) of training and evaluation data when splitting the annotated data, default is `0.9` |
| | | | n | optional | only considered for `nfold` training mode, give the number of folds to be used, default is `10` |
| | | | `incremental` | optional | boolean indicating if the training should be incremental (`1`) starting from the existing model, or not (default, `0`) |

The `type` of training indicates which training and evaluation mode should be used:
- `full`: the whole available training data is used for training a model
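
As an illustration, here is a hedged sketch of a service call launching an incremental training. The endpoint path and the `model` parameter name are placeholders and assumptions, not taken from this diff; the `type`, `ratio` and `incremental` parameters come from the table above.

```bash
# Hypothetical call: replace <training-endpoint> with the actual route
# documented in Grobid-service.md; 8070 is the default GROBID service port.
> curl -X POST \
    -d "model=header" \
    -d "type=split" \
    -d "ratio=0.9" \
    -d "incremental=1" \
    "http://localhost:8070/api/<training-endpoint>"
```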
56 changes: 44 additions & 12 deletions doc/Training-the-models-of-Grobid.md
@@ -28,7 +28,7 @@ Grobid uses different sequence labelling models depending on the labeling task t

* table

The models are located under `grobid/grobid-home/models`. Each of these models can be retrained using amended or additional training data. For production, a model is trained with all the available training data to maximize the performance. For development purposes, it is also possible to evaluate a model with part of the training data.
The models are located under `grobid/grobid-home/models`. Each of these models can be retrained using amended or additional training data. For production, a model is trained with all the available training data to maximize the performance. For development purposes, it is also possible to evaluate a model with part of the training data held out as a frozen set (holdout set), with an automatic random split, or with 10-fold cross-evaluation.

## Train and evaluate

@@ -43,56 +43,88 @@ When generating a new model, a segmentation of data can be done (e.g. 80%-20%) b

There are different ways to generate the new model and run the evaluation: the training and the evaluation of the new model can be run separately or together, and the training data can be split automatically or not. In all cases, the newly generated model is saved directly under grobid-home/models and replaces the previous one. A rollback can be made by replacing the newly generated model with the backup record (`<model name>.wapiti.old`), as sketched below.
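
For instance, for the header model, a rollback could be a simple file copy. This is only a sketch: the actual file names under `grobid-home/models/header/` may differ, so check them before copying.

```bash
# Hypothetical rollback: restore the backed-up Wapiti model over the newly trained one
> cp grobid-home/models/header/model.wapiti.old grobid-home/models/header/model.wapiti
```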

### Train and evaluation in one command
### Train and evaluation in one command (simple mode)

For simple training without particular parameters, a single command can be used as follows. All the available annotated files under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus` will be used for training, and all the available annotated files under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation/` will be used for evaluation.

Under the main project directory `grobid/`, run the following command to execute both training and evaluation:

```bash
> ./gradlew <training goal. I.E: train_name-header>
> ./gradlew train_<name_of_model>
```
Example of goal names: `train_header`, `train_date`, `train_name_header`, `train_name_citation`, `train_citation`, `train_affiliation_address`, `train_fulltext`, `train_patent_citation`, ...

The files used for the training are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`, and the evaluation files under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation`.
Example: `train_header`, `train_date`, `train_name_header`, `train_name_citation`, `train_citation`, `train_affiliation_address`, `train_fulltext`, `train_patent_citation`, ...

Examples for training the header model:

```bash
> ./gradlew train_header
```

Examples for training the model for names in header:

```bash
> ./gradlew train_name_header
```

### Train and evaluation separately
Under the main project directory `grobid/`, execute the following command (be sure to have built the project as indicated in [Install GROBID](Install-Grobid.md)):
### Train and evaluation separately and using more parameters (full mode)

To have more flexibility and options for training and evaluating the models, use the following commands.

First be sure to have the full project libraries locally built (see [Install GROBID](Install-Grobid.md) for more details):

```bash
> ./gradlew clean install
```

Under the main project directory `grobid/`:

**Train** (generate a new model):

Train (generate a new model):
```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <name of the model> -gH grobid-home
```

The training files considered are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`.
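
For example, training the header model (replace `<current version>` with the version of your local build):

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 header -gH grobid-home
```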

The training of the models can be controlled using different parameters. The `nbThreads` parameter in the configuration file `grobid-home/config/grobid.yaml` can be increased to speed up the training. Similarly, modifying the stopping criteria can shorten the training time. Please refer to [this comment](https://github.com/kermitt2/grobid/issues/336#issuecomment-412516422) for more details.

Evaluate:
**Evaluate**:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 1 <name of the model> -gH grobid-home
```

The evaluation files considered are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation`.

Automatically split data, train and evaluate:
**Automatically split data, train and evaluate**:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 <name of the model> -gH grobid-home -s <segmentation ratio as a number between 0 and 1, e.g. 0.8 for 80%>
```

For instance, training the date model with a ratio of 75% for training and 25% for evaluation:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 date -gH grobid-home -s 0.75
```

A ratio of 1.0 means that all the data available under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus/` will be used for training the model, and the evaluation will be empty. *Automatic split data, train and evaluate* is for the moment only available for the following models: header, citation, date, name-citation, name-header and affiliation-address.
A ratio of 1.0 means that all the data available under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus/` will be used for training the model, and the evaluation will be empty.

**Incremental training**:

The previous commands start a training from scratch, using all the available training data in a single training task.
Incremental training starts from an existing, already trained model and applies a further training task using the available training data under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`.

Launching an incremental training is similar to the previous commands, with the additional parameter `-i`. An existing model must be available under `grobid/grobid-home/models/*MODEL*`. For example:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <name of the model> -gH grobid-home -i
```
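
For instance, an incremental training of the date model, starting from the existing model under `grobid/grobid-home/models/date`:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 date -gH grobid-home -i
```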

Several runs with different files to evaluate can be made to have a more reliable evaluation (e.g. 10 fold cross-validation). For the time being, such segmentation and iterative evaluation is not yet implemented.
Note that a full training from scratch with all the training data should normally provide better accuracy than several iterative trainings with partitions of the training data. Incremental training makes sense, for example, when a model has been trained with a lot of data over days or weeks and needs an update, or during the development of training data, when a model must be updated quickly in order to generate new training data.

In incremental training phases, the training parameters might require some updates, in particular to stop the training earlier than in a normal full training.

### N-folds cross-evaluation

grobid-core/src/main/java/org/grobid/core/jni/DeLFTClassifierModel.java
@@ -164,7 +164,8 @@ public String classify(List<String> data) {
public static void trainJNI(String modelName, File trainingData, File outputModel) {
try {
LOGGER.info("Train DeLFT classification model " + modelName + "...");
JEPThreadPoolClassifier.getInstance().run(new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath()));
JEPThreadPoolClassifier.getInstance().run(
new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath()));
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT classification model " + modelName + " task failed", e);
}
@@ -174,13 +175,18 @@ private static class TrainTask implements Runnable {
private String modelName;
private File trainPath;
private File modelPath;
private String architecture;
private boolean incremental;

public TrainTask(String modelName, File trainPath, File modelPath) {
//public TrainTask(String modelName, File trainPath, File modelPath, String architecture, boolean incremental) {
//System.out.println("train thread: " + Thread.currentThread().getId());
this.modelName = modelName;
this.trainPath = trainPath;
this.modelPath = modelPath;
}
this.architecture = null;
this.incremental = false;
}

@Override
public void run() {
@@ -198,15 +204,44 @@ public void run() {
useELMo = "True";
}

String localArgs = "";
if (GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(this.modelName) != -1)
localArgs += ", maxlen="+
GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(this.modelName);

if (GrobidProperties.getInstance().getDelftTrainingBatchSize(this.modelName) != -1)
localArgs += ", batch_size="+
GrobidProperties.getInstance().getDelftTrainingBatchSize(this.modelName);

if (GrobidProperties.getInstance().getDelftTranformer(modelName) != null) {
localArgs += ", transformer="+
GrobidProperties.getInstance().getDelftTranformer(modelName);
}

// init model to be trained
jep.eval("model = Classifier('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+")");
if (this.architecture == null) {
jep.eval("model = Classifier('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+localArgs+")");
} else {
jep.eval("model = Classifier('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+localArgs+
", architecture='"+architecture+"')");
}

// actual training
//start_time = time.time()
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");
//runtime = round(time.time() - start_time, 3)
//print("training runtime: %s seconds " % (runtime))
if (incremental) {
// if incremental training, we need to load the existing model
if (this.modelPath != null &&
this.modelPath.exists() &&
!this.modelPath.isDirectory()) {
jep.eval("model.load('" + this.modelPath.getAbsolutePath() + "')");
jep.eval("model.train(x_train, y_train, x_valid, y_valid, incremental=True)");
} else {
throw new GrobidException("the path to the model to be used for starting incremental training is invalid: " +
this.modelPath.getAbsolutePath());
}
} else
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");

// saving the model
System.out.println(this.modelPath.getAbsolutePath());
@@ -223,6 +258,8 @@ public void run() {
jep.eval("del model");
} catch(JepException e) {
LOGGER.error("DeLFT classification model training via JEP failed", e);
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}
}
@@ -260,7 +297,9 @@ public static void train(String modelName, File trainingData, File outputModel)
LOGGER.error("IO error when training DeLFT classification model " + modelName, e);
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT classification model " + modelName + " task failed", e);
}
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}

public synchronized void close() {
49 changes: 40 additions & 9 deletions grobid-core/src/main/java/org/grobid/core/jni/DeLFTModel.java
@@ -209,11 +209,11 @@ public String label(String data) {
* usually hangs... Possibly issues with IO threads at the level of JEP (output not consumed because
* of \r and no end of line?).
*/
public static void trainJNI(String modelName, File trainingData, File outputModel, String architecture) {
public static void trainJNI(String modelName, File trainingData, File outputModel, String architecture, boolean incremental) {
try {
LOGGER.info("Train DeLFT model " + modelName + "...");
JEPThreadPool.getInstance().run(
new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath(), architecture));
new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath(), architecture, incremental));
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT model " + modelName + " task failed", e);
}
@@ -224,13 +224,15 @@ private static class TrainTask implements Runnable {
private File trainPath;
private File modelPath;
private String architecture;
private boolean incremental;

public TrainTask(String modelName, File trainPath, File modelPath, String architecture) {
public TrainTask(String modelName, File trainPath, File modelPath, String architecture, boolean incremental) {
//System.out.println("train thread: " + Thread.currentThread().getId());
this.modelName = modelName;
this.trainPath = trainPath;
this.modelPath = modelPath;
this.architecture = architecture;
this.incremental = incremental;
}

@Override
@@ -269,11 +271,23 @@ public void run() {
else
jep.eval("model = Sequence('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+localArgs+
", model_type='"+architecture+"')");
", architecture='"+architecture+"')");

// actual training
//start_time = time.time()
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");
if (incremental) {
// if incremental training, we need to load the existing model
if (this.modelPath != null &&
this.modelPath.exists() &&
!this.modelPath.isDirectory()) {
jep.eval("model.load('" + this.modelPath.getAbsolutePath() + "')");
jep.eval("model.train(x_train, y_train, x_valid, y_valid, incremental=True)");
} else {
throw new GrobidException("the path to the model to be used for starting incremental training is invalid: " +
this.modelPath.getAbsolutePath());
}
} else
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");
//runtime = round(time.time() - start_time, 3)
//print("training runtime: %s seconds " % (runtime))

@@ -292,6 +306,8 @@ public void run() {
jep.eval("del model");
} catch(JepException e) {
LOGGER.error("DeLFT model training via JEP failed", e);
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}
}
@@ -300,7 +316,7 @@ public void run() {
* Train with an external process rather than with JNI, this approach appears to be more stable for the
* training process (JNI approach hangs after a while) and does not raise any runtime/integration issues.
*/
public static void train(String modelName, File trainingData, File outputModel, String architecture) {
public static void train(String modelName, File trainingData, File outputModel, String architecture, boolean incremental) {
try {
LOGGER.info("Train DeLFT model " + modelName + "...");
List<String> command = new ArrayList<>();
@@ -322,16 +338,29 @@ public static void train(String modelName, File trainingData, File outputModel,
if (GrobidProperties.getInstance().useELMo(modelName) && modelName.toLowerCase().indexOf("bert") == -1) {
command.add("--use-ELMo");
}

if (GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(modelName) != -1) {
command.add("--max-sequence-length");
command.add(String.valueOf(GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(modelName)));
}

if (GrobidProperties.getInstance().getDelftTrainingBatchSize(modelName) != -1) {
command.add("--batch-size");
command.add(String.valueOf(GrobidProperties.getInstance().getDelftTrainingBatchSize(modelName)));
}
if (incremental) {
command.add("--incremental");

// if incremental training, we need to load the existing model
File modelPath = GrobidProperties.getInstance().getModelPath();
if (modelPath != null &&
modelPath.exists() &&
!modelPath.isDirectory()) {
command.add("--input-model");
command.add(GrobidProperties.getInstance().getModelPath().getAbsolutePath());
} else {
throw new GrobidException("the path to the model to be used for starting incremental training is invalid: " +
GrobidProperties.getInstance().getModelPath().getAbsolutePath());
}
}

ProcessBuilder pb = new ProcessBuilder(command);
File delftPath = new File(GrobidProperties.getInstance().getDeLFTFilePath());
@@ -349,7 +378,9 @@ public static void train(String modelName, File trainingData, File outputModel,
LOGGER.error("IO error when training DeLFT model " + modelName, e);
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT model " + modelName + " task failed", e);
}
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}
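
For readability, here is a hypothetical reconstruction of the external command that the `command` list assembled above could produce for an incremental training. The `grobidTagger.py` script name, the model name, the paths and the numeric values are illustrative assumptions, not taken from this diff; only the flags visible in the `command.add(...)` calls above are shown, and other arguments are elided with `...`.

```bash
# Sketch only: flags mirror the command.add(...) calls above; values are illustrative.
python3 grobidTagger.py citation train ... \
    --max-sequence-length 512 \
    --batch-size 10 \
    --incremental \
    --input-model /path/to/grobid-home/models
```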

public synchronized void close() {