
Incremental training for Deep Learning and Wapiti models #971

Merged · 7 commits · Nov 25, 2022
1 change: 1 addition & 0 deletions doc/Grobid-service.md
@@ -751,6 +751,7 @@ Launch a training for a given model. The service return back a training token (a
| | | | type | optional | type of training, `full`, `holdout`, `split`, `nfold`, default is `split` |
| | | | ratio | optional | only considered for `split` training mode, gives the ratio (a number between 0 and 1) of training and evaluation data when splitting the annotated data, default is `0.9` |
| | | | n | optional | only considered for `nfold` training mode, gives the number of folds to be used, default is `10` |
| | | | `incremental` | optional | boolean indicating if the training should be incremental (`1`) starting from the existing model, or not (default, `0`) |
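
As a minimal usage sketch of this parameter (assuming the training endpoint is `/api/modelTraining` and a service running on `localhost:8070`; adjust both to your deployment):

```bash
# Launch an incremental training of the header model; the endpoint name and
# host below are assumptions, the parameter names follow the table above
curl -X POST \
     -d "model=header" \
     -d "type=split" \
     -d "ratio=0.9" \
     -d "incremental=1" \
     localhost:8070/api/modelTraining
```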

The `type` of training indicates which training and evaluation mode should be used:
- `full`: the whole available training data is used for training a model
56 changes: 44 additions & 12 deletions doc/Training-the-models-of-Grobid.md
@@ -28,7 +28,7 @@ Grobid uses different sequence labelling models depending on the labeling task t

* table

The models are located under `grobid/grobid-home/models`. Each of these models can be retrained using amended or additional training data. For production, a model is trained with all the available training data to maximize the performance. For development purposes, it is also possible to evaluate a model with part of the training data.
The models are located under `grobid/grobid-home/models`. Each of these models can be retrained using amended or additional training data. For production, a model is trained with all the available training data to maximize performance. For development purposes, it is also possible to evaluate a model with part of the training data held out as a frozen set (holdout set), with an automatic random split, or with 10-fold cross-evaluation.

## Train and evaluate

@@ -43,56 +43,88 @@ When generating a new model, a segmentation of data can be done (e.g. 80%-20%) b

There are different ways to generate the new model and run the evaluation: the training and the evaluation of the new model can be run separately or together, and the training data can be split automatically or not. With all methods, the newly generated model is saved directly under `grobid-home/models` and replaces the previous one. A rollback can be made by replacing the newly generated model with the backup record (`<model name>.wapiti.old`).
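
For instance, a minimal rollback sketch for the header model (the exact file names under `grobid-home/models/header/` are an assumption and may differ for your model):

```bash
# Restore the previous header model from its backup record
cp grobid-home/models/header/model.wapiti.old grobid-home/models/header/model.wapiti
```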

### Train and evaluation in one command
### Train and evaluation in one command (simple mode)

For simple training without particular parameters, a single command can be used as follows. All the available annotated files under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus` will be used for training, and all the available annotated files under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation/` will be used for evaluation.

Under the main project directory `grobid/`, run the following command to execute both training and evaluation:

```bash
> ./gradlew <training goal. I.E: train_name-header>
> ./gradlew train_<name_of_model>
```
Example of goal names: `train_header`, `train_date`, `train_name_header`, `train_name_citation`, `train_citation`, `train_affiliation_address`, `train_fulltext`, `train_patent_citation`, ...

The files used for the training are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`, and the evaluation files under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation`.
Examples: `train_header`, `train_date`, `train_name_header`, `train_name_citation`, `train_citation`, `train_affiliation_address`, `train_fulltext`, `train_patent_citation`, ...

Example for training the header model:

```bash
> ./gradlew train_header
```

Example for training the model for names in the header:

```bash
> ./gradlew train_name_header
```

### Train and evaluation separately
Under the main project directory `grobid/`, execute the following command (be sure to have built the project as indicated in [Install GROBID](Install-Grobid.md)):
### Train and evaluation separately and using more parameters (full mode)

To have more flexibility and options for training and evaluating the models, use the following commands.

First, be sure to have the full project libraries built locally (see [Install GROBID](Install-Grobid.md) for more details):

```bash
> ./gradlew clean install
```

Under the main project directory `grobid/`:

**Train** (generate a new model):

Train (generate a new model):
```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <name of the model> -gH grobid-home
```

The training files considered are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`.

The training of the models can be controlled using different parameters. The `nbThreads` parameter in the configuration file `grobid-home/config/grobid.yaml` can be increased to speed up the training. Similarly, modifying the stopping criteria can help speed up the training. Please refer to [this comment](https://github.com/kermitt2/grobid/issues/336#issuecomment-412516422) for more details.
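
For instance, to locate and raise the thread setting from the command line (a sketch; the exact key nesting inside the YAML file may vary between versions, and `sed -i` below uses GNU sed syntax):

```bash
# Find the current Wapiti training thread setting
grep -n "nbThreads" grobid-home/config/grobid.yaml
# Raise it to 8 threads to speed up training
sed -i 's/nbThreads:.*/nbThreads: 8/' grobid-home/config/grobid.yaml
```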

Evaluate:
**Evaluate**:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 1 <name of the model> -gH grobid-home
```

The evaluation files considered are located under `grobid/grobid-trainer/resources/dataset/*MODEL*/evaluation`.

Automatically split data, train and evaluate:
**Automatically split data, train and evaluate**:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 <name of the model> -gH grobid-home -s <segmentation ratio as a number between 0 and 1, e.g. 0.8 for 80%>
```

For instance, training the date model with a ratio of 75% for training and 25% for evaluation:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 date -gH grobid-home -s 0.75
```

A ratio of 1.0 means that all the data available under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus/` will be used for training the model, and the evaluation will be empty. *Automatic split data, train and evaluate* is for the moment only available for the following models: header, citation, date, name-citation, name-header and affiliation-address.
A ratio of 1.0 means that all the data available under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus/` will be used for training the model, and the evaluation will be empty.
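
For instance, following the same command pattern as above, training the date model on all the available annotated data with no held-out evaluation:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 2 date -gH grobid-home -s 1.0
```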

**Incremental training**:

The previous commands start a training from scratch, using all the available training data in a single training task.
Incremental training starts from an existing, already trained model and applies a further training task to it using the available training data under `grobid/grobid-trainer/resources/dataset/*MODEL*/corpus`.

Launching an incremental training is similar to the previous commands, with the additional parameter `-i`. An existing model must be available under `grobid/grobid-home/models/*MODEL*`. For example:

```bash
> java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-<current version>-onejar.jar 0 <name of the model> -gH grobid-home -i
```

Several runs with different files to evaluate can be made to have a more reliable evaluation (e.g. 10 fold cross-validation). For the time being, such segmentation and iterative evaluation is not yet implemented.
Note that a full training from scratch with all the training data should normally provide better accuracy than several iterative trainings on partitions of the data. Incremental training makes sense, for example, when a model has been trained with a lot of data over days or weeks and only an update is required, or during training data development, when a model must be updated quickly in order to generate new training data.

During incremental training phases, the training parameters might require some adjustment to stop the training earlier than in a normal full training.

### N-folds cross-evaluation

@@ -164,7 +164,8 @@ public String classify(List<String> data) {
public static void trainJNI(String modelName, File trainingData, File outputModel) {
try {
LOGGER.info("Train DeLFT classification model " + modelName + "...");
JEPThreadPoolClassifier.getInstance().run(new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath()));
JEPThreadPoolClassifier.getInstance().run(
new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath()));
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT classification model " + modelName + " task failed", e);
}
@@ -174,13 +175,18 @@ private static class TrainTask implements Runnable {
private String modelName;
private File trainPath;
private File modelPath;
private String architecture;
private boolean incremental;

public TrainTask(String modelName, File trainPath, File modelPath) {
//public TrainTask(String modelName, File trainPath, File modelPath, String architecture, boolean incremental) {
//System.out.println("train thread: " + Thread.currentThread().getId());
this.modelName = modelName;
this.trainPath = trainPath;
this.modelPath = modelPath;
}
this.architecture = null;
this.incremental = false;
}

@Override
public void run() {
@@ -198,15 +204,44 @@ public void run() {
useELMo = "True";
}

String localArgs = "";
if (GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(this.modelName) != -1)
localArgs += ", maxlen="+
GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(this.modelName);

if (GrobidProperties.getInstance().getDelftTrainingBatchSize(this.modelName) != -1)
localArgs += ", batch_size="+
GrobidProperties.getInstance().getDelftTrainingBatchSize(this.modelName);

if (GrobidProperties.getInstance().getDelftTranformer(modelName) != null) {
localArgs += ", transformer='"+
GrobidProperties.getInstance().getDelftTranformer(modelName)+"'";
}

// init model to be trained
jep.eval("model = Classifier('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+")");
if (this.architecture == null) {
jep.eval("model = Classifier('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+localArgs+")");
} else {
jep.eval("model = Classifier('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+localArgs+
", architecture='"+architecture+"')");
}

// actual training
//start_time = time.time()
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");
//runtime = round(time.time() - start_time, 3)
//print("training runtime: %s seconds " % (runtime))
if (incremental) {
// if incremental training, we need to load the existing model
if (this.modelPath != null &&
this.modelPath.exists() &&
!this.modelPath.isDirectory()) {
jep.eval("model.load('" + this.modelPath.getAbsolutePath() + "')");
jep.eval("model.train(x_train, y_train, x_valid, y_valid, incremental=True)");
} else {
throw new GrobidException("the path to the model to be used for starting incremental training is invalid: " +
this.modelPath.getAbsolutePath());
}
} else
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");

// saving the model
System.out.println(this.modelPath.getAbsolutePath());
@@ -223,6 +258,8 @@ public void run() {
jep.eval("del model");
} catch(JepException e) {
LOGGER.error("DeLFT classification model training via JEP failed", e);
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}
}
@@ -260,7 +297,9 @@ public static void train(String modelName, File trainingData, File outputModel)
LOGGER.error("IO error when training DeLFT classification model " + modelName, e);
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT classification model " + modelName + " task failed", e);
}
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}

public synchronized void close() {
49 changes: 40 additions & 9 deletions grobid-core/src/main/java/org/grobid/core/jni/DeLFTModel.java
@@ -209,11 +209,11 @@ public String label(String data) {
* usually hangs... Possibly issues with IO threads at the level of JEP (output not consumed because
* of \r and no end of line?).
*/
public static void trainJNI(String modelName, File trainingData, File outputModel, String architecture) {
public static void trainJNI(String modelName, File trainingData, File outputModel, String architecture, boolean incremental) {
try {
LOGGER.info("Train DeLFT model " + modelName + "...");
JEPThreadPool.getInstance().run(
new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath(), architecture));
new TrainTask(modelName, trainingData, GrobidProperties.getInstance().getModelPath(), architecture, incremental));
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT model " + modelName + " task failed", e);
}
@@ -224,13 +224,15 @@ private static class TrainTask implements Runnable {
private File trainPath;
private File modelPath;
private String architecture;
private boolean incremental;

public TrainTask(String modelName, File trainPath, File modelPath, String architecture) {
public TrainTask(String modelName, File trainPath, File modelPath, String architecture, boolean incremental) {
//System.out.println("train thread: " + Thread.currentThread().getId());
this.modelName = modelName;
this.trainPath = trainPath;
this.modelPath = modelPath;
this.architecture = architecture;
this.incremental = incremental;
}

@Override
@@ -269,11 +271,23 @@ public void run() {
else
jep.eval("model = Sequence('"+this.modelName+
"', max_epoch=100, recurrent_dropout=0.50, embeddings_name='glove-840B', use_ELMo="+useELMo+localArgs+
", model_type='"+architecture+"')");
", architecture='"+architecture+"')");

// actual training
//start_time = time.time()
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");
if (incremental) {
// if incremental training, we need to load the existing model
if (this.modelPath != null &&
this.modelPath.exists() &&
!this.modelPath.isDirectory()) {
jep.eval("model.load('" + this.modelPath.getAbsolutePath() + "')");
jep.eval("model.train(x_train, y_train, x_valid, y_valid, incremental=True)");
} else {
throw new GrobidException("the path to the model to be used for starting incremental training is invalid: " +
this.modelPath.getAbsolutePath());
}
} else
jep.eval("model.train(x_train, y_train, x_valid, y_valid)");
//runtime = round(time.time() - start_time, 3)
//print("training runtime: %s seconds " % (runtime))

Expand All @@ -292,6 +306,8 @@ public void run() {
jep.eval("del model");
} catch(JepException e) {
LOGGER.error("DeLFT model training via JEP failed", e);
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}
}
@@ -300,7 +316,7 @@ public void run() {
* Train with an external process rather than with JNI, this approach appears to be more stable for the
* training process (JNI approach hangs after a while) and does not raise any runtime/integration issues.
*/
public static void train(String modelName, File trainingData, File outputModel, String architecture) {
public static void train(String modelName, File trainingData, File outputModel, String architecture, boolean incremental) {
try {
LOGGER.info("Train DeLFT model " + modelName + "...");
List<String> command = new ArrayList<>();
@@ -322,16 +338,29 @@ public static void train(String modelName, File trainingData, File outputModel,
if (GrobidProperties.getInstance().useELMo(modelName) && modelName.toLowerCase().indexOf("bert") == -1) {
command.add("--use-ELMo");
}

if (GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(modelName) != -1) {
command.add("--max-sequence-length");
command.add(String.valueOf(GrobidProperties.getInstance().getDelftTrainingMaxSequenceLength(modelName)));
}

if (GrobidProperties.getInstance().getDelftTrainingBatchSize(modelName) != -1) {
command.add("--batch-size");
command.add(String.valueOf(GrobidProperties.getInstance().getDelftTrainingBatchSize(modelName)));
}
if (incremental) {
command.add("--incremental");

// if incremental training, we need to load the existing model
File modelPath = GrobidProperties.getInstance().getModelPath();
if (modelPath != null &&
modelPath.exists() &&
!modelPath.isDirectory()) {
command.add("--input-model");
command.add(GrobidProperties.getInstance().getModelPath().getAbsolutePath());
} else {
throw new GrobidException("the path to the model to be used for starting incremental training is invalid: " +
GrobidProperties.getInstance().getModelPath().getAbsolutePath());
}
}

ProcessBuilder pb = new ProcessBuilder(command);
File delftPath = new File(GrobidProperties.getInstance().getDeLFTFilePath());
@@ -349,7 +378,9 @@ public static void train(String modelName, File trainingData, File outputModel,
LOGGER.error("IO error when training DeLFT model " + modelName, e);
} catch(InterruptedException e) {
LOGGER.error("Train DeLFT model " + modelName + " task failed", e);
}
} catch(GrobidException e) {
LOGGER.error("GROBID call to DeLFT training via JEP failed", e);
}
}

public synchronized void close() {
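
As a usage sketch of the new signature (not part of the diff above; the `DeLFTModel.train(...)` method and its five-argument signature come from this file, while the model name, paths, and architecture below are illustrative assumptions):

```java
import java.io.File;

import org.grobid.core.jni.DeLFTModel;

// Hypothetical caller: launch an incremental DeLFT training via the external
// process entry point changed above. Paths and names are illustrative only.
public class IncrementalTrainingSketch {
    public static void main(String[] args) {
        File trainingData = new File("/tmp/header-train.data"); // assumed prepared training data
        File outputModel = new File("grobid-home/models");      // assumed output location
        // the final 'true' requests incremental training: the existing model
        // is passed to DeLFT via --input-model and further trained on the data
        DeLFTModel.train("header", trainingData, outputModel, "BidLSTM_CRF", true);
    }
}
```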