Configuration Explained
In T2R2 we use configuration files in YAML format to specify what the pipeline should do. Below we describe the sections that you may include in your config. Some fields are left undescribed to avoid redundancy. You may find an exemplary configuration at the bottom.
- random_state (int, default: None): the seed for random operations
- model_name (str): the HF name of the model you need
- output_dir (str, default: ./results/): the path of the output directory for your model
- num_labels (int, default: 2): the number of labels used
- max_length (int, default: 256): the maximal sequence length
- padding (str, default: max_length): pad up to the given length parameter
- truncation (bool, default: True): controls whether to use truncation
- return_tensors (str, default: pt): the framework whose tensors are returned
- output_path (str, default: best_model): the name for the model file
Exemplary config of model:

```yaml
model:
  max_length: 128
  output_dir: ../../results/
  model_name: distilbert-base-uncased
  num_labels: 12
  padding: max_length
  return_tensors: pt
  truncation: False
  output_path: ../../output_model/
```
Pick the metrics that you need as arguments for this section. The complete list of metrics we handle is available here. To avoid redundant descriptions, you may find the metrics explained under this link.
- name (str): the name of the metric
- args (dict[str, Any], default: None): the dictionary of arguments passed to the metric, where keys are argument names and values are the corresponding argument values
Exemplary config of metrics:

```yaml
metrics:
  - name: f1_score
    args:
      average: macro
  - name: slicing_scores
    args:
      base_directory: 'slicing/'
      default_file_name: "slicing.pickle"
```
For each of the three following sections (Training, Testing and Control) you may additionally provide the following arguments, or place them once in the additional data section to avoid unnecessary repetition:
- data_dir (str, default: ./data/): the path of the directory with the data
- output_dir (str, default: ./results/): the path of the directory for the output
- text_column_id (int, default: 0): the index of the text column
- label_column_id (int, default: 1): the index of the label column
- has_header (bool, default: True): whether the dataset contains a header
Exemplary config of data:

```yaml
data:
  data_dir: "../../data/featuresets/"
  output_dir: "../../results/"
  has_header: False
  text_column_id: 1
  label_column_id: 2
```
You may also enrich the training and testing sections with a selectors subsection. The following paragraphs briefly describe the selectors that we provide.
An implementation of the concept of data cartography, as outlined in the paper https://aclanthology.org/2020.emnlp-main.746/. To fully understand how it works, check this notebook.
Example of data cartography yaml snippet:

```yaml
perform_data_cartography: True
```
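Since perform_data_cartography and its companion data_cartography_results are training fields (see the Training section below), a minimal sketch of where the flag lives in a full config might look like this:

```yaml
training:
  perform_data_cartography: True
  # where the data cartography metrics are written (the documented default)
  data_cartography_results: ./data_cartography_metrics.pickle
```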
The LLM selector was prepared with the intention of running on high-performance machines with a GPU or on Google Colab. To provide you with an LLM small enough to run on your own hardware, we used TheBloke/Mistral-7B-v0.1-AWQ.
For this selector you need to provide a prompt argument in the config. Bear in mind that the LLM is not perfect, and sometimes you might need to work on your prompt. On our side, we feed the model with your prompt and the data converted to a string.
The exemplary snippet in yaml:

```yaml
selectors:
  - name: llm
    args:
      prompt: Generate more synthetic examples
```
A notebook that covers the installation of the additional libraries needed to run the LLM selector is here.
Check our notebook that explains slicing functions here.
Example of slicing functions yaml snippet:

```yaml
selectors:
  - name: slicing
    args:
      result_file: '../../data/slicing/train_slicing.pickle'
      list_of_slicing_functions: [short, textblob_polarity]
```
Undersampling is performed randomly with RandomUnderSampler from imblearn.
Example of undersampling yaml snippet:

```yaml
selectors:
  - name: random_under_sampler
```
We give you an opportunity to use your own selectors.

- Prepare the class you want to use: it should inherit from the Selector class from t2r2.selector and implement its select method.
- When declaring your own selector, provide module_path as one of the arguments.

Below we present a simple example of how to do it.
Example of user selector yaml snippet:

```yaml
selectors:
  - name: UserSelector
    args:
      module_path: ./my_selector.py
```
my_selector.py code:

```python
import pandas as pd

from t2r2.selector import Selector


class UserSelector(Selector):
    def select(self, dataset: pd.DataFrame) -> pd.DataFrame:
        # keep only the first five examples
        return dataset[:5]
```
To force a specific order in which examples will be passed during training:

```yaml
training:
  curriculum_learning: True
```

Then you also need to provide the order column in your training data. The examples will be sorted according to the order column and won't be shuffled.
You can also use a custom selector to dynamically provide the order of your training examples. For example, to pass examples in order of increasing text length:

```python
import pandas as pd

from t2r2.selector import Selector


class ClSelector(Selector):
    def select(self, dataset: pd.DataFrame) -> pd.DataFrame:
        # shorter texts get lower order values, so they are seen first
        dataset["order"] = [len(i) for i in dataset["text"]]
        return dataset
```
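To plug this in, you would declare it like the user selector above; a sketch, assuming the class is saved in a hypothetical ./cl_selector.py:

```yaml
training:
  curriculum_learning: True
  selectors:
    - name: ClSelector
      args:
        module_path: ./cl_selector.py  # hypothetical file containing ClSelector
```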
Training

- dataset_path (str, default: train.csv): the name of the file with the data
- validation_dataset_path (str, default: None): the validation dataset path
- results_file (str, default: train_results.pickle): the name of the file with results
- epochs (int, default: 1): the number of epochs
- batch_size (int, default: 32): the batch size
- learning_rate (float, default: 0.00001): the learning rate
- validation_size (float, default: 0.2): the validation size (between 0 and 1)
- metric_for_best_model (str, default: loss): the metric that characterizes the best model
- perform_data_cartography (bool, default: False): the switch for performing data cartography
- data_cartography_results (str, default: ./data_cartography_metrics.pickle): the name of the result file with data cartography metrics
- curriculum_learning (bool, default: False): the switch for performing curriculum learning
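Putting these fields together, an exemplary training section might look as follows; all keys are documented above, and the values are illustrative rather than recommendations:

```yaml
training:
  dataset_path: train.csv
  results_file: train_results.pickle
  epochs: 1
  batch_size: 32
  learning_rate: 0.00001
  validation_size: 0.2
  metric_for_best_model: loss
```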
Testing

- dataset_path (str, default: test.csv): the name of the file with the data
- results_file (str, default: test_results.pickle): the name of the file with results
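An exemplary testing section built from these fields (the values mirror the documented defaults):

```yaml
testing:
  dataset_path: test.csv
  results_file: test_results.pickle
```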
Control

- dataset_path (str, default: control.csv): the name of the file with the data
- results_file (str, default: control_results.pickle): the name of the file with results
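And the analogous control section (again with the documented defaults):

```yaml
control:
  dataset_path: control.csv
  results_file: control_results.pickle
```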
Enables tracking an experiment (datasets, model and metrics) with the use of MLflow tools.

- experiment_name (str): the experiment name
- tags (dict[str, str]): additional tags such as version
- tracking_uri (str): the endpoint of your server
Exemplary config of mlflow:

```yaml
mlflow:
  experiment_name: 'my_experiment_1'
  tags:
    version: 'v1'
  tracking_uri: "http://localhost:5000"
```
Check our notebook with MLflow enabled here.
Enables versioning of an experiment (datasets, model and metrics), showing differences in parameters as well as in metrics, and easy checkouts.
If you want to switch it on (it is off by default), add the following lines to your config:

```yaml
dvc:
  enabled: true
```

For more tips, check our notebook that presents a workflow with DVC enabled here.
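Finally, a sketch of a complete config.yaml assembled from the snippets above. Values are illustrative, and we assume random_state sits at the top level of the file:

```yaml
random_state: 42  # assumption: a top-level key
model:
  model_name: distilbert-base-uncased
  num_labels: 2
  max_length: 128
metrics:
  - name: f1_score
    args:
      average: macro
data:
  data_dir: ./data/
  has_header: True
training:
  dataset_path: train.csv
  epochs: 1
  selectors:
    - name: random_under_sampler
testing:
  dataset_path: test.csv
control:
  dataset_path: control.csv
mlflow:
  experiment_name: 'my_experiment_1'
  tags:
    version: 'v1'
  tracking_uri: "http://localhost:5000"
dvc:
  enabled: true
```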
You may find exemplary config.yaml files with their notebooks in this directory.