Thank you for considering contributing to SG-NLP! We believe that SG-NLP's future is closely tied to its community, and community contributions will help SG-NLP grow faster.
- Fork the `sgnlp` repository.
- Create a Python virtual environment (version >= 3.8).
- Install the packages in `requirements_dev.txt` at the root of the repository using `pip install -r requirements_dev.txt`. This will install `black`, `flake8`, and `pytest`, which we use for code formatting and testing.
Before you create a pull request to add your model to `sgnlp`, please ensure that you have the following components ready.
- Python scripts / code comprising the following (we'll go into more detail below)
- config.py
- modeling.py
- preprocess.py
- train.py
- eval.py
- utils.py (optional)
- README
- requirements.txt (discouraged)
- Model information (to be included in the README)
- Original paper / source
- Datasets and/or how to obtain them
- Evaluation metrics
- Model size
- Training information
- Model weights and artefacts (to be submitted separately)
- pytorch_model.bin
- config.json
- tokenizer_config.json (optional)
To contribute a model, add a folder for the model at `sgnlp/sgnlp/models/<model_name>`. The following components are required within this folder.
| Folder / File | Description |
|---|---|
| `<model_name>` | Folder containing the modeling, preprocess, config, train, and eval scripts. |
| `<model_name>/config` | Folder containing the JSON configuration files used for the train and eval scripts. |
| `<model_name>/config.py` | Script containing the model config class, which inherits from HuggingFace's `PretrainedConfig` or its family of derived classes. |
| `<model_name>/eval.py` | Script containing code to evaluate the model's performance. |
| `<model_name>/modeling.py` | Script containing the model class, which inherits from HuggingFace's `PreTrainedModel` class or its family of derived classes. |
| `<model_name>/preprocess.py` | Script containing code to preprocess the input text into the model's required input tensors. |
| `<model_name>/tokenization.py` (optional) | Script containing the model tokenizer class, which inherits from HuggingFace's `PreTrainedTokenizer` class or its family of derived classes. |
| `<model_name>/train.py` | Script containing code to train the model. It is recommended to use the `Trainer` class from HuggingFace. |
| `<model_name>/README.md` | README markdown file containing model information such as model source, architecture, evaluation datasets and metrics, model size, and training information. |
To manage the number of dependencies installed with `sgnlp`, contributors are strongly recommended to limit their code to the packages listed in `setup.py`. If additional dependencies are required, please introduce a check in the `__init__.py` at `sgnlp/sgnlp/models/<model_name>/__init__.py`. For example, the Latent Structure Refinement (LSR) model for Relation Extraction requires the `networkx` package. The code snippet from LSR's `__init__.py` checks whether `networkx` is installed when the model is imported. Users will have to install these additional dependencies separately.
```python
from ...utils.requirements import check_requirements

requirements = ["networkx"]
check_requirements(requirements)
```
Model configs contain model architecture information. Typically, this would include hyperparameters for the different layers within the model as well as the loss function. Model configs should inherit from the `PretrainedConfig` class from the `transformers` package. The following is an example from the Cross Lingual Cross Domain Sentiment Analysis model.
```python
from transformers import PretrainedConfig


class UFDClassifierConfig(PretrainedConfig):
    model_type = "classifier"

    def __init__(self, embed_dim=1024, num_class=2, initrange=0.1, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.num_class = num_class
        self.initrange = initrange
```
For models that use or adapt pre-trained configs already available in the `transformers` package, the model config should inherit from the pre-trained config class instead. For example, this model config inherits from `BertConfig`, which is a child class of `PretrainedConfig`.
```python
from transformers import BertConfig


class NewModelConfig(BertConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```
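Once defined, the config behaves like any other HuggingFace config. The following is a minimal usage sketch using the `UFDClassifierConfig` example above; the directory path is illustrative.

```python
# Minimal usage sketch; the directory path is a placeholder.
config = UFDClassifierConfig(embed_dim=768, num_class=3)
config.save_pretrained("path/to/model_dir")    # writes config.json
config = UFDClassifierConfig.from_pretrained("path/to/model_dir")
```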
The `preprocess.py` script and its associated preprocessor class are an addition in `sgnlp`. When implementing various models, the team found that some models required more complex preprocessing. For example, some NLP models take in multiple different text inputs (e.g. different utterances, multiple tweets, or a single question and multiple possible answers) which require different preprocessing steps. The `preprocess.py` script and the preprocessor class are the team's solution to packaging all of these different steps into a single step that is consistent across models.
The preprocessor class inherits from the default `object` class. All preprocessing steps should be executed in the class' `__call__` method. The `__call__` method should return a dictionary containing all the necessary input tensors required by the model. The following code snippet illustrates the `__call__` method from the RECCON Span Extraction model's `RecconSpanExtractionPreprocessor`.
```python
class RecconSpanExtractionPreprocessor:
    def __call__(
        self, data_batch: Dict[str, List[str]]
    ) -> Tuple[
        BatchEncoding,
        List[Dict[str, Union[int, str]]],
        List[SquadExample],
        List[SquadFeatures],
    ]:
        self._check_values_len(data_batch)
        concatenated_batch, evidences = self._concatenate_batch(data_batch)
        dataset, examples, features = load_examples(
            concatenated_batch, self.tokenizer, evaluate=True, output_examples=True
        )

        input_ids = [torch.unsqueeze(instance[0], 0) for instance in dataset]
        attention_mask = [torch.unsqueeze(instance[1], 0) for instance in dataset]
        token_type_ids = [torch.unsqueeze(instance[2], 0) for instance in dataset]

        output = {
            "input_ids": torch.cat(input_ids, axis=0),
            "attention_mask": torch.cat(attention_mask, axis=0),
            "token_type_ids": torch.cat(token_type_ids, axis=0),
        }
        output = BatchEncoding(output)

        return output, evidences, examples, features
```
In the RECCON Span Extraction model, `output` is a dictionary with the token ids, attention masks, and token type ids for the input utterance. `evidences`, `examples`, and `features` are other features required by the RECCON model. The key idea here is to consolidate all the necessary preprocessing steps into a single method to reduce the effort needed to start using the models.
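For a new model, a minimal preprocessor that follows this pattern might look like the sketch below. The class name, tokenizer, and input key are illustrative assumptions rather than part of `sgnlp`.

```python
from typing import Dict, List

from transformers import BatchEncoding, BertTokenizer


class NewModelPreprocessor:
    """Illustrative preprocessor sketch; names and defaults are assumptions."""

    def __init__(self, tokenizer_name: str = "bert-base-uncased", max_length: int = 128):
        self.tokenizer = BertTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

    def __call__(self, data_batch: Dict[str, List[str]]) -> BatchEncoding:
        # Consolidate every preprocessing step here so users call the
        # preprocessor once and feed its output straight to the model.
        return self.tokenizer(
            data_batch["text"],  # hypothetical input key
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
```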
The `tokenization.py` script is optional if `preprocess.py` already contains a tokenizer. All tokenizers should inherit from the `PreTrainedTokenizer` or `PreTrainedTokenizerFast` classes from the `transformers` package.
```python
from transformers import PreTrainedTokenizer


class NewModelTokenizer(PreTrainedTokenizer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```
For models that use or adapt pre-trained tokenizers already available in the `transformers` package, the tokenizer should inherit from the pre-trained tokenizer class instead. For example, the RECCON Span Extraction tokenizer inherits from `BertTokenizer`, which inherits from `PreTrainedTokenizer`.
```python
from transformers import BertTokenizer


class RecconSpanExtractionTokenizer(BertTokenizer):
    """
    Constructs a Reccon Span Extraction tokenizer, derived from the Bert tokenizer.

    Args:
        vocab_file (:obj:`str`):
            Path to the vocabulary file.
        do_lower_case (:obj:`bool`, defaults to :obj:`False`):
            Whether or not to lowercase the input when tokenizing.
    """

    def __init__(self, vocab_file: str, do_lower_case: bool = False, **kwargs) -> None:
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, **kwargs)
```
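As a rough usage sketch, the tokenizer is constructed and called like any other HuggingFace tokenizer; the vocabulary file path below is a placeholder.

```python
# Illustrative usage; the vocab file path is a placeholder.
tokenizer = RecconSpanExtractionTokenizer(vocab_file="path/to/vocab.txt")
encoded = tokenizer("Why did you say that?", return_tensors="pt")
```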
There are 2 steps required to add a new model class. The first step is to introduce a `NewModelPreTrainedModel` class which handles weights instantiation and the downloading and loading of pretrained models. This class should inherit from the `PreTrainedModel` class from `transformers`.

The key things to define are the `config_class` and `base_model_prefix` class attributes and the `_init_weights` method. The `_init_weights` method dictates how the weights for the different layers are instantiated.
```python
from transformers import PreTrainedModel

from .config import LsrConfig


class LsrPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = LsrConfig
    base_model_prefix = "lsr"

    def _init_weights(self, module):
        """Initialize the weights"""
        ...
```
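The body of `_init_weights` is model-specific. As a hedged illustration (not the LSR model's actual initialization), a common pattern is to initialize linear and embedding layers from a normal distribution and reset layer norms; `initializer_range` is an assumed attribute on the model config.

```python
import torch.nn as nn


def _init_weights(self, module):
    """Illustrative _init_weights body, to be defined inside NewModelPreTrainedModel."""
    # Normal-distribution init for linear and embedding layers; reset layer norms.
    # initializer_range is an assumed attribute on the model config.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)
```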
Subsequently, the main model class should inherit from this `NewModelPreTrainedModel` class. The main model class contains the code required to execute the model's forward pass.
```python
class RumourDetectionTwitterModel(RumourDetectionTwitterPreTrainedModel):
    def __init__(self, config: RumourDetectionTwitterConfig):
        super().__init__(config)
        self.config = config
        self.wordEncoder = WordEncoder(self.config)
        self.positionEncoderWord = PositionEncoder(config, self.config.max_length)
        self.positionEncoderTime = PositionEncoder(config, self.config.size)
        self.hierarchicalTransformer = HierarchicalTransformer(self.config)

        if config.loss == "cross_entropy":
            self.loss = nn.CrossEntropyLoss()

        self.init_weights()

    def forward(
        self,
        token_ids: torch.Tensor,
        time_delay_ids: torch.Tensor,
        structure_ids: torch.Tensor,
        token_attention_mask=None,
        post_attention_mask=None,
        labels: Optional[torch.Tensor] = None,
    ):
        X = self.wordEncoder(token_ids)
        word_pos = self.prepare_word_pos(token_ids).to(X.device)
        word_pos = self.positionEncoderWord(word_pos)
        time_delay = self.positionEncoderTime(time_delay_ids)

        logits = self.hierarchicalTransformer(
            X,
            word_pos,
            time_delay,
            structure_ids,
            attention_mask_word=token_attention_mask,
            attention_mask_post=post_attention_mask,
        )

        if labels is not None:
            loss = self.loss(logits, labels)
        else:
            loss = None

        return RumourDetectionTwitterModelOutput(loss=loss, logits=logits)
```
There are 3 key things to note in the above implementation.

1. When initialising the model, it is important to invoke the `init_weights()` method. Note the lack of an underscore at the start of the method name. This is required so that the model weights are initialized using the `_init_weights` method defined in `NewModelPreTrainedModel`.
2. The `forward` method takes in an optional `labels` argument. If this argument is passed to the model, the `forward` method should also return the value of the loss function for that batch of inputs.
3. The `forward` method's output is an object of the `RumourDetectionTwitterModelOutput` dataclass. This dataclass is illustrated in the code snippet below.
```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.file_utils import ModelOutput


@dataclass
class RumourDetectionTwitterModelOutput(ModelOutput):
    """
    Base class for outputs of Rumour Detection models

    Args:
        loss (:obj:`torch.Tensor` of shape `(1,)`, `optional`, returned when :obj:`labels` is provided):
            Classification loss, typically cross entropy. Loss function used is dependent on what is specified in RumourDetectionTwitterConfig.
        logits (:obj:`torch.Tensor` of shape :obj:`(batch_size, num_classes)`):
            Raw logits for each class. num_classes = 4 by default.
    """

    loss: Optional[torch.Tensor] = None
    logits: torch.Tensor = None
```
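Downstream, the output object is accessed by attribute. A rough usage sketch follows; the input tensors are assumed to come from the preprocessor and are illustrative.

```python
# Illustrative call; the input tensors are assumed to come from the preprocessor.
outputs = model(
    token_ids=token_ids,
    time_delay_ids=time_delay_ids,
    structure_ids=structure_ids,
    labels=labels,
)
loss = outputs.loss      # None when labels are not provided
logits = outputs.logits  # shape (batch_size, num_classes)
```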
`train.py` should contain a working implementation of the model training process. A user should be able to train the model from the command line using `python -m train --train_args config/config.json`.
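A minimal entry point could look like the sketch below. The argument name matches the command above, but the config fields and the `train` helper are illustrative assumptions.

```python
import argparse
import json


def train(cfg: dict) -> None:
    # Illustrative placeholder: build the config, model, and preprocessor here,
    # then run the training loop (e.g. with HuggingFace's Trainer).
    ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train the model.")
    parser.add_argument("--train_args", type=str, required=True, help="Path to the training config JSON.")
    args = parser.parse_args()

    with open(args.train_args, "r") as f:
        cfg = json.load(f)

    train(cfg)
```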
`eval.py` should contain a working implementation of the model evaluation process. This script should load the trained model (using the information in `config/config.json`) and evaluate it against the evaluation datasets. The evaluation metrics reported should correspond to those reported in the `README.md`.
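As a rough sketch (the model names, paths, metric, and data loading are all assumptions), evaluation typically loads the trained weights with `from_pretrained` and reports the metrics listed in the README.

```python
import torch

# Illustrative only: NewModel/NewModelConfig, the directory, and the dataloader are placeholders.
config = NewModelConfig.from_pretrained("path/to/trained_model")
model = NewModel.from_pretrained("path/to/trained_model", config=config)
model.eval()

correct, total = 0, 0
with torch.no_grad():
    for batch, labels in eval_dataloader:  # assumed to yield preprocessed tensors
        logits = model(**batch).logits
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.size(0)

print(f"Accuracy: {correct / total:.4f}")
```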
`utils.py` should contain other functions which are useful for `train.py` or `eval.py` but do not directly fit within any of the other scripts above.
The `README` for the model should provide a concise introduction to the model. The following information is required:
- Citation and link to the original paper or source that introduced the model
- Citation and link to the train, validation and test datasets that the model was trained on. If the model was trained and evaluated on licensed datasets, information should be provided on how the SG-NLP team can obtain access to the evaluation (test) dataset. The train dataset may be omitted.
- Evaluation metrics. Please cite the appropriate paper if the evaluation metric is a published benchmark.
- Model size (in terms of the size of the trained model's weights)
- Training information such as hyperparameter values, compute time, and resources used to train the model.
Model weights and artefacts comprise:

- Saved model weights. Specifically, the `pytorch_model.bin` file saved using the `save_pretrained` method from the model class implemented in `modeling.py` (see the sketch after this list). For now, only models implemented in PyTorch are accepted. The team is looking into accepting models implemented in TensorFlow as well.
- Model config. The `config.json` generated when using the `save_pretrained` method from the model config class implemented in `config.py`.
- Any artefacts needed by the tokenizer or preprocessor.
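A minimal export sketch follows; the output directory is a placeholder, and `model` and `config` stand for the trained model and its config.

```python
# Illustrative export; save_pretrained on the model writes both
# pytorch_model.bin and config.json to the target directory.
model.save_pretrained("artefacts/<model_name>")
# The config can also be saved on its own, which writes only config.json.
config.save_pretrained("artefacts/<model_name>")
```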
If you spot a bug, please follow these steps to report it.

- Check the issues list to see whether the bug has already been reported. If an issue has already been created, please comment on that issue with details on how to replicate the bug.
- If there is no issue relevant to the bug, please open a GitHub issue with:
  - A clear description of the bug
  - Information related to your environment (e.g. OS, package versions, Python version)
  - Steps on how to replicate the bug (e.g. a code snippet that we could run to encounter the bug)
One easy way to contribute is to add or refine documentation / docstrings for the models that are currently available. `sgnlp` uses the Google Python Style Guide for our docstrings. Once the docstrings have been added or edited, please submit a pull request.
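For reference, a short Google-style docstring looks like the following; the function itself is purely illustrative.

```python
def concatenate_utterances(utterances: list, separator: str = " ") -> str:
    """Joins a list of utterances into a single string.

    Args:
        utterances (list): Utterances to join, in conversation order.
        separator (str): String inserted between utterances. Defaults to " ".

    Returns:
        str: The concatenated utterances.
    """
    return separator.join(utterances)
```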