docs and simplification of cmd args (NVIDIA#8979)
* docs and simplification of cmd args

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added cicd test

Signed-off-by: arendu <adithya.r@gmail.com>

* added cicd test to needs

Signed-off-by: arendu <adithya.r@gmail.com>

* Update information_retrieval.rst

Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>

* updated to fix wrong file paths

Signed-off-by: arendu <adithya.r@gmail.com>

* update

Signed-off-by: arendu <adithya.r@gmail.com>

* Update cicd-main.yml

Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>

---------

Signed-off-by: arendu <adithya.r@gmail.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
3 people authored Apr 29, 2024
1 parent 6c20bc8 commit 428546f
Showing 6 changed files with 200 additions and 22 deletions.
55 changes: 55 additions & 0 deletions .github/workflows/cicd-main.yml
@@ -4648,6 +4648,60 @@ jobs:
rm -rf examples/nlp/language_modeling/gpt_sft_results
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

  L2_Megatron_GPT_Embedding:
    needs: [cicd-test-container-setup]
    runs-on: self-hosted-azure
    container:
      image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
      options:
        # --user 0:128
        --device=/dev/nvidia0
        --gpus all
        --shm-size=8g
        --env TRANSFORMERS_OFFLINE=0
        --env HYDRA_FULL_ERROR=1
        --volume /mnt/datadrive/TestData:/home/TestData
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - run: |
          rm -rf /home/TestData/nlp/megatron_ir/working_dir
          python examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
              exp_manager.exp_dir='/home/TestData/nlp/megatron_ir/working_dir' \
              model.global_batch_size=4 \
              model.micro_batch_size=4 \
              trainer.devices=1 \
              trainer.num_nodes=1 \
              trainer.max_epochs=null \
              trainer.max_steps=20 \
              trainer.val_check_interval=10 \
              model.restore_from_path='/home/TestData/nlp/megatron_gpt/mcore_45M/megatron_llama.nemo' \
              model.peft.lora_tuning.adapter_dim=8 \
              model.data.validation_ds.query_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
              model.data.validation_ds.doc_file_names=[/home/TestData/nlp/megatron_ir/test_doc.jsonl] \
              model.data.validation_ds.write_embeddings_to_file=True \
              model.data.validation_ds.output_file_path_prefix='/home/TestData/nlp/megatron_ir/working_dir/val_embs' \
              model.data.train_ds.file_names=[/home/TestData/nlp/megatron_ir/train.jsonl]
          python examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
              trainer.devices=1 \
              trainer.num_nodes=1 \
              model.restore_from_path='/home/TestData/nlp/megatron_gpt/mcore_45M/megatron_llama.nemo' \
              model.peft.restore_from_path='/home/TestData/nlp/megatron_ir/working_dir/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo' \
              model.global_batch_size=4 \
              model.micro_batch_size=4 \
              model.peft.lora_tuning.adapter_dim=8 \
              model.data.test_ds.write_embeddings_to_file=True \
              model.data.test_ds.output_file_path_prefix='/home/TestData/nlp/megatron_ir/working_dir/test_embs' \
              model.data.test_ds.query_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
              model.data.test_ds.doc_file_names=[/home/TestData/nlp/megatron_ir/test_doc.jsonl]
          rm -rf /home/TestData/nlp/megatron_ir/working_dir
      - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
        if: "failure()"

  L2_Megatron_GPT_PEFT_Lora_PP2:
    needs: [cicd-test-container-setup]
@@ -6256,6 +6310,7 @@ jobs:
- L2_Megatron_GPT_Pretraining_and_Resume_Training_PP2
- L2_Megatron_GPT_Finetuning_PP2
- L2_Megatron_GPT_Finetuning_StarCoder_PP1
- L2_Megatron_GPT_Embedding
- L2_Megatron_GPT_PEFT_Lora_PP2
- L2_Megatron_GPT_PEFT_Lora_TP2
- L2_Megatron_GPT_Eval
104 changes: 104 additions & 0 deletions docs/source/nlp/information_retrieval.rst
@@ -102,3 +102,107 @@ Then you can fine-tune the sentence-BERT model using the following script:
exp_manager.wandb_logger_kwargs.name=${NAME} \
exp_manager.wandb_logger_kwargs.project=${PROJECT}
GPT Embedding Models
=====================

Recent work has shown that it is also possible to use decoder-only (GPT-style) models to train embedding models.
`Improving Text Embeddings with
Large Language Models <https://arxiv.org/pdf/2401.00368.pdf>`__ is one such recent paper, which served as inspiration for implementing decoder-only embedding training in NeMo.

Training a GPT Embedding Model
-------------------------------

To train GPT Embedding models, we follow a format very similar to SBERT Embedding training, with a couple of differences. GPT Embedding model training expects a ``jsonl`` file in which each line is a JSON object. Here is a truncated example of a data ``jsonl`` file::

{"query": "What did ... 1952-2002 period?", "pos_doc": "Morning (2008) ... has changed little.", "neg_doc": "Even though ... sapiens.", "query_id": "q103151", "doc_id": "d14755"}
{"query": "What type of ... passions?", "pos_doc": "Burke was a leading ... upper classes.", "neg_doc": "Writing to a friend ... Government.", "query_id": "q77959", "doc_id": "d11263"}
{"query": "Since 1999, ... progressed at?", "pos_doc": "Commercial solar water ... as of 2007.", "neg_doc": "The potential solar ... acquire.", "query_id": "q16545", "doc_id": "d1883"}


As shown above, each JSON object should contain the fields ``query``, ``pos_doc``, ``neg_doc``, ``query_id``, and ``doc_id``. The ``query_id`` and ``doc_id`` can be any alphanumeric strings that uniquely identify the ``query`` string and the ``pos_doc`` string, respectively.
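
As a minimal illustration (not part of the NeMo codebase), a Python snippet along the following lines can serialize training examples into the expected JSONL format; the field values below are placeholders:

.. code-block:: python

    # Sketch only: write training examples in the expected JSONL format.
    # Field values are placeholders; real data uses full query/document text.
    import json

    examples = [
        {
            "query": "example question text",
            "pos_doc": "a passage that answers the query",
            "neg_doc": "a passage that does not answer the query",
            "query_id": "q0",
            "doc_id": "d0",
        },
    ]

    with open("train.jsonl", "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")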

During training, the GPT Embedding model employs LoRA (by default) to learn embeddings for the queries and documents such that ``query``-to-``pos_doc`` similarity is maximized while ``query``-to-``neg_doc`` similarity is simultaneously minimized. LoRA allows us to fine-tune large LLMs, such as the Mistral 7B model, with a relatively small number of trainable parameters.
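
Conceptually, the training objective is a temperature-scaled contrastive loss over query, positive, and negative embeddings. The following PyTorch sketch is illustrative only; it is a simplified stand-in rather than the exact loss implemented in NeMo (which also supports in-batch and soft negatives):

.. code-block:: python

    # Simplified sketch of a temperature-scaled contrastive objective.
    # Not the exact NeMo implementation.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.02):
        # Normalize so dot products become cosine similarities.
        q = F.normalize(q_emb, dim=-1)      # [batch, hidden]
        pos = F.normalize(pos_emb, dim=-1)  # [batch, hidden]
        neg = F.normalize(neg_emb, dim=-1)  # [batch, hidden]

        pos_scores = (q * pos).sum(-1, keepdim=True) / temperature  # [batch, 1]
        neg_scores = (q * neg).sum(-1, keepdim=True) / temperature  # [batch, 1]

        # The positive document is class 0; the negative is class 1.
        logits = torch.cat([pos_scores, neg_scores], dim=-1)        # [batch, 2]
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, labels)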

An example command to launch a training job is:

.. code-block:: console

    python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
        exp_manager.exp_dir="PATH_TO_SAVE_LORA_WEIGHTS" \
        model.global_batch_size=4 \  # the exact choice of global batch size is data dependent; typical values are in the range of 32 to 128.
        model.micro_batch_size=4 \  # the exact choice of micro batch size is GPU-memory dependent; 2 to 8 are reasonable values.
        trainer.devices=1 \  # how many GPUs to use per node during training.
        trainer.num_nodes=1 \  # how many nodes to use if a multi-node cluster is available.
        trainer.max_steps=20 \  # how many training steps to run.
        model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
        model.peft.lora_tuning.adapter_dim=16 \  # the low-rank size for the LoRA weights.
        model.data.train_ds.file_names=["train.jsonl"]

The inline comments above are explanatory only; remove them before running the command. The full list of possible run arguments is configurable in ``/examples/nlp/information_retrieval/conf/megatron_gpt_embedder_tuning_config.yaml``. By default, a trained model file is generated in ``PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/``, typically with the extension ``.nemo``.


Inference using a GPT Embedding Model
-------------------------------------

Once trained, the GPT Embedding Model can be used to generate embeddings for queries and corpus documents. Inference can be launched with the following command, using the same base model that was used at training time:

.. code-block:: console

    python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
        model.global_batch_size=4 \
        model.micro_batch_size=4 \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
        model.peft.restore_from_path="PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo" \
        model.data.test_ds.query_file_names=["test_query.jsonl"] \
        model.data.test_ds.doc_file_names=["test_docs.jsonl"] \
        model.data.test_ds.write_embeddings_to_file=True \
        model.data.test_ds.output_file_path_prefix="PATH_TO_SAVE_EMBEDDINGS"

The contents of ``test_query.jsonl`` are expected to be in the following format::

{"query": "What do ... quantities?","query_id": "q11600", "doc_id": "d1172"}
{"query": "What are ... subsectors?", "query_id": "q5831", "doc_id": "d577"}
{"query": "Which article ... Government?", "query_id": "q3037", "doc_id": "d336"}

Here, the ``doc_id`` field is expected to be the ID of the document/passage that is the correct (relevant) passage for the query. Note that since we are in inference mode, query-doc pairs are not required.

The contents of ``test_docs.jsonl`` are expected to be in the following format::

{"pos_doc": "Hormones ... vitamin D.", "doc_id": "d823"}
{"pos_doc": "Historically, Victoria ... October 2016.", "doc_id": "d159"}
{"pos_doc": "Exceptional examples ... Warsaw.", "doc_id": "d1084"}

Once again, we show three examples from each file. Typically, ``test_docs.jsonl`` will contain more items than ``test_query.jsonl`` contains queries.

The inference command will produce two folders:

* ``PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries``
* ``PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc``

The ``X`` in ``consumed_samplesX`` is a number denoting the number of batches consumed. It is not crucial at test time, but it is useful during training, as we will see in the next section. First, let's take a look at ``test_queries``.

.. code-block:: console

    $> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries
    query.ids  query.npy
    $> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries/query.ids
    q11600
    q5831
    q3037

``query.npy`` is a pickled NumPy array containing one row per query embedding, and the ``query.ids`` text file lists the ID of each embedding in the same order.
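
As a small illustrative snippet (assuming the paths above), the embeddings and IDs can be loaded and aligned like this:

.. code-block:: python

    # Illustrative only: load query embeddings and align them with their IDs.
    import numpy as np

    emb_dir = "PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries"
    query_embs = np.load(f"{emb_dir}/query.npy", allow_pickle=True)  # one row per query
    with open(f"{emb_dir}/query.ids") as f:
        query_ids = [line.strip() for line in f]

    assert len(query_ids) == query_embs.shape[0]
    id_to_emb = dict(zip(query_ids, query_embs))  # query_id -> embedding vector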

Similarly, let's look into the ``test_doc`` folder:

.. code-block:: console

    $> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc/
    doc.ids  doc.npy
    $> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc/doc.ids
    d823
    d159
    d1084

We can see that ``test_doc`` has a structure similar to ``test_queries``, but with the IDs and embeddings of the documents from the ``test_docs.jsonl`` file. With this setup, it is possible to evaluate retrieval performance using metrics like MRR or NDCG.
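
As a rough sketch of such an evaluation (illustrative only, assuming each query's relevant ``doc_id`` is taken from ``test_query.jsonl`` and that the ``.npy`` files load as 2-D float arrays), MRR could be computed as follows:

.. code-block:: python

    # Illustrative MRR computation from the generated embedding files.
    import json
    import numpy as np

    out_dir = "PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX"

    def load_ids(path):
        with open(path) as f:
            return [line.strip() for line in f]

    q_embs = np.load(f"{out_dir}/test_queries/query.npy", allow_pickle=True)
    d_embs = np.load(f"{out_dir}/test_doc/doc.npy", allow_pickle=True)
    q_ids = load_ids(f"{out_dir}/test_queries/query.ids")
    d_ids = load_ids(f"{out_dir}/test_doc/doc.ids")

    # Relevance labels: the doc_id recorded for each query in test_query.jsonl.
    relevant = {}
    with open("test_query.jsonl") as f:
        for line in f:
            item = json.loads(line)
            relevant[item["query_id"]] = item["doc_id"]

    # Cosine similarity between every query and every document.
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    d = d_embs / np.linalg.norm(d_embs, axis=1, keepdims=True)
    scores = q @ d.T  # shape: [num_queries, num_docs]

    reciprocal_ranks = []
    for i, qid in enumerate(q_ids):
        ranking = np.argsort(-scores[i])  # document indices, best first
        rank = next(r for r, j in enumerate(ranking, start=1) if d_ids[j] == relevant[qid])
        reciprocal_ranks.append(1.0 / rank)

    print("MRR:", float(np.mean(reciprocal_ranks)))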
@@ -4,7 +4,7 @@ trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
@@ -66,8 +66,14 @@ model:
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
temperature: 0.8
temperature: 0.02
num_soft_negatives: 0 # Number of soft negatives to use for contrastive loss,it should be max(batch_size - 1), 0 means use hard negatives only
use_all_possible_negatives: False # If True, use all possible negatives for contrastive loss, otherwise use num_soft_negatives, if num_soft_negatives is 0, use hard negatives only
post_process: False # should be False.
transformer_engine: True # required to be True for newer versions of Megatron-LM based models
mcore_gpt: True # required to be True for newer versions of Megatron-LM based models
use_flash_attention: True
precision: bf16

peft:
peft_scheme: "lora" # can be either adapter,ia3, or ptuning
@@ -119,8 +125,8 @@ model:
query_file_names: ??? # Path to a list of JSONL files corresponding to the query data. Data format is identical to validation_ds.
doc_file_names: ??? # Path to a list of JSONL files corresponding to the doc data. Data format is identical to validation_ds.
names: ["queries", "doc"] # Names of the corresponding datasets used to log metrics.
global_batch_size: 1
micro_batch_size: 1
global_batch_size: ${global_batch_size}
micro_batch_size: ${micro_batch_size}
shuffle: False
num_workers: 0
pin_memory: True
@@ -4,15 +4,16 @@ trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
max_epochs: 9999
max_epochs: null
max_steps: 20000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10 # frequency with which training steps are logged
val_check_interval: 200 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: 1.0
val_check_interval: ${trainer.max_steps} # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: null
num_sanity_val_steps: 0

exp_manager:
explicit_log_dir: null
@@ -34,7 +35,7 @@ exp_manager:
model_parallel_size: ${model.tensor_model_parallel_size}
always_save_nemo: False
save_best_model: True
create_early_stopping_callback: True
create_early_stopping_callback: False
early_stopping_callback_params:
monitor: "val_loss"
mode: "min"
@@ -54,16 +55,16 @@ model:
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training.
sync_batch_comm: False
megatron_amp_O2: False
megatron_amp_O2: True

## Sequence Parallelism
# Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
# See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
sequence_parallel: False

## Activation Checkpoint
activations_checkpoint_granularity: null # 'selective' or 'full'
activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
activations_checkpoint_granularity: selective # 'selective' or 'full'
activations_checkpoint_method: uniform # 'uniform', 'block', not used with 'selective'
# 'uniform' divides the total number of transformer layers and checkpoints the input activation
# of each chunk at the specified granularity
# 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
@@ -74,9 +75,14 @@ model:
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
temperature: 0.8
temperature: 0.02
num_soft_negatives: 0 # Number of soft negatives to use for contrastive loss,it should be max(batch_size - 1), 0 means use hard negatives only
use_all_possible_negatives: False # If True, use all possible negatives for contrastive loss, otherwise use num_soft_negatives, if num_soft_negatives is 0, use hard negatives only
post_process: False # should be False.
transformer_engine: True # required to be True for newer versions of Megatron-LM based models
mcore_gpt: True # required to be True for newer versions of Megatron-LM based models
use_flash_attention: True
precision: bf16

peft:
peft_scheme: "lora" # can be either adapter,ia3, or ptuning
@@ -135,31 +141,32 @@ model:
num_workers: 0
memmap_workers: 2
pin_memory: True
max_seq_length: 2048
max_seq_length: 512 # Even if the base model can handle longer sequences, 512 is generally a good choice for training efficiency.
min_seq_length: 1
drop_last: True
# Example of how to specify concat_sampling_probabilities
# concat_sampling_probabilities:
# - 0.5
# - 0.25
# - 0.25
concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random'
concat_sampling_probabilities:
- 1.0
label_key: 'output'
add_eos: True
add_bos: False
index_mapping_dir: null # Path to a directory to write index mapping files.
truncation_method: 'right' # Truncation from which position, Options: ['left', 'right']
validation_ds:
query_file_names: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
doc_file_names: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
query_file_names: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
doc_file_names: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
names: ["queries", "doc"] # Names of the corresponding datasets used to log metrics.
global_batch_size: ${model.global_batch_size}
micro_batch_size: ${model.micro_batch_size}
shuffle: False
num_workers: 0
memmap_workers: ${model.data.train_ds.memmap_workers}
pin_memory: True
max_seq_length: 2048
max_seq_length: ${model.data.train_ds.max_seq_length}
min_seq_length: 1
drop_last: False
label_key: ${model.data.train_ds.label_key}
@@ -182,7 +189,7 @@ model:
num_workers: 0
memmap_workers: ${model.data.train_ds.memmap_workers}
pin_memory: True
max_seq_length: 2048
max_seq_length: ${model.data.train_ds.max_seq_length}
min_seq_length: 1
drop_last: False
add_eos: ${model.data.train_ds.add_eos}
@@ -123,8 +123,10 @@ def _build_dataset(self, data_cfg, is_train=True):
_, _, num_train_samples_per_dataset = get_datasets_weights_and_num_samples(data_prefix, num_train_samples)
num_train_samples_after_blend = sum([x[0] for x in num_train_samples_per_dataset])
else:
num_query_samples_per_dataset = [[None]] * len(data_cfg.query_file_names)
num_doc_samples_per_dataset = [[None]] * len(data_cfg.doc_file_names)
num_query_files = len(data_cfg.query_file_names) if data_cfg.query_file_names is not None else 0
num_doc_files = len(data_cfg.doc_file_names) if data_cfg.doc_file_names is not None else 0
num_query_samples_per_dataset = [[None]] * num_query_files
num_doc_samples_per_dataset = [[None]] * num_doc_files

# Check dataset max_seq_legnth and max_position_embeddings size
if (
@@ -174,6 +176,9 @@ def _build_dataset(self, data_cfg, is_train=True):
)
return dataset
else:
if data_cfg.query_file_names is None or data_cfg.doc_file_names is None:
return []

query_dataset = GPTEmbeddingDataset(
file_path=data_cfg.query_file_names[0],
tokenizer=self.tokenizer,
@@ -804,7 +804,8 @@ def build_train_valid_test_datasets(self, stage):
logging.info('Building GPT SFT validation datasets.')
# Wrap this in a list since the general finetuning parent class supports multi-validation.
self._validation_ds = self._build_dataset(self.cfg.data.validation_ds, is_train=False)
logging.info(f'Length of val dataset: {len(self._validation_ds[0])}')
if self._validation_ds:
logging.info(f'Length of val dataset: {len(self._validation_ds[0])}')

if stage != 'validate':
self.maybe_build_test()
