docs and simplification of cmd args (NVIDIA#8979)
* docs and simplification of cmd args

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added cicd test

Signed-off-by: arendu <adithya.r@gmail.com>

* added cicd test to needs

Signed-off-by: arendu <adithya.r@gmail.com>

* Update information_retrieval.rst

Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>

* updated to fix wrong file paths

Signed-off-by: arendu <adithya.r@gmail.com>

* update

Signed-off-by: arendu <adithya.r@gmail.com>

* Update cicd-main.yml

Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>

---------

Signed-off-by: arendu <adithya.r@gmail.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pablo Garay <palenq@gmail.com>
3 people authored Apr 29, 2024
1 parent 6c20bc8 commit 428546f
Showing 6 changed files with 200 additions and 22 deletions.
55 changes: 55 additions & 0 deletions .github/workflows/cicd-main.yml
@@ -4648,6 +4648,60 @@ jobs:
rm -rf examples/nlp/language_modeling/gpt_sft_results
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

  L2_Megatron_GPT_Embedding:
    needs: [cicd-test-container-setup]
    runs-on: self-hosted-azure
    container:
      image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
      options:
        # --user 0:128
        --device=/dev/nvidia0
        --gpus all
        --shm-size=8g
        --env TRANSFORMERS_OFFLINE=0
        --env HYDRA_FULL_ERROR=1
        --volume /mnt/datadrive/TestData:/home/TestData
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2
      - run: |
          rm -rf /home/TestData/nlp/megatron_ir/working_dir
          python examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
              exp_manager.exp_dir='/home/TestData/nlp/megatron_ir/working_dir' \
              model.global_batch_size=4 \
              model.micro_batch_size=4 \
              trainer.devices=1 \
              trainer.num_nodes=1 \
              trainer.max_epochs=null \
              trainer.max_steps=20 \
              trainer.val_check_interval=10 \
              model.restore_from_path='/home/TestData/nlp/megatron_gpt/mcore_45M/megatron_llama.nemo' \
              model.peft.lora_tuning.adapter_dim=8 \
              model.data.validation_ds.query_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
              model.data.validation_ds.doc_file_names=[/home/TestData/nlp/megatron_ir/test_doc.jsonl] \
              model.data.validation_ds.write_embeddings_to_file=True \
              model.data.validation_ds.output_file_path_prefix='/home/TestData/nlp/megatron_ir/working_dir/val_embs' \
              model.data.train_ds.file_names=[/home/TestData/nlp/megatron_ir/train.jsonl]
          python examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
              trainer.devices=1 \
              trainer.num_nodes=1 \
              model.restore_from_path='/home/TestData/nlp/megatron_gpt/mcore_45M/megatron_llama.nemo' \
              model.peft.restore_from_path='/home/TestData/nlp/megatron_ir/working_dir/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo' \
              model.global_batch_size=4 \
              model.micro_batch_size=4 \
              model.peft.lora_tuning.adapter_dim=8 \
              model.data.test_ds.write_embeddings_to_file=True \
              model.data.test_ds.output_file_path_prefix='/home/TestData/nlp/megatron_ir/working_dir/test_embs' \
              model.data.test_ds.query_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
              model.data.test_ds.doc_file_names=[/home/TestData/nlp/megatron_ir/test_doc.jsonl]
          rm -rf /home/TestData/nlp/megatron_ir/working_dir
      - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
        if: "failure()"

  L2_Megatron_GPT_PEFT_Lora_PP2:
    needs: [cicd-test-container-setup]
@@ -6256,6 +6310,7 @@ jobs:
- L2_Megatron_GPT_Pretraining_and_Resume_Training_PP2
- L2_Megatron_GPT_Finetuning_PP2
- L2_Megatron_GPT_Finetuning_StarCoder_PP1
- L2_Megatron_GPT_Embedding
- L2_Megatron_GPT_PEFT_Lora_PP2
- L2_Megatron_GPT_PEFT_Lora_TP2
- L2_Megatron_GPT_Eval
104 changes: 104 additions & 0 deletions docs/source/nlp/information_retrieval.rst
@@ -102,3 +102,107 @@ Then you can fine-tune the sentence-BERT model using the following script:
exp_manager.wandb_logger_kwargs.name=${NAME} \
exp_manager.wandb_logger_kwargs.project=${PROJECT}
GPT Embedding Models
=====================

Recent work has shown that it is also possible to use decoder-only (GPT-style) models to train embedding models.
`Improving Text Embeddings with
Large Language Models <https://arxiv.org/pdf/2401.00368.pdf>`__ is one such recent paper, which served as inspiration for implementing decoder-only embedding training in NeMo.

Training a GPT Embedding Model
-------------------------------

To train GPT Embedding models, we follow a format very similar to SBERT Embedding training, with a couple of differences. GPT Embedding model training expects a ``jsonl`` file in which each line is a JSON object. Here is a truncated example of a data ``jsonl`` file::

{"query": "What did ... 1952-2002 period?", "pos_doc": "Morning (2008) ... has changed little.", "neg_doc": "Even though ... sapiens.", "query_id": "q103151", "doc_id": "d14755"}
{"query": "What type of ... passions?", "pos_doc": "Burke was a leading ... upper classes.", "neg_doc": "Writing to a friend ... Government.", "query_id": "q77959", "doc_id": "d11263"}
{"query": "Since 1999, ... progressed at?", "pos_doc": "Commercial solar water ... as of 2007.", "neg_doc": "The potential solar ... acquire.", "query_id": "q16545", "doc_id": "d1883"}


As shown above, each JSON object should contain the fields ``query``, ``pos_doc``, ``neg_doc``, ``query_id``, and ``doc_id``. The ``query_id`` and ``doc_id`` can be any alphanumeric strings that uniquely identify the ``query`` string and the ``pos_doc`` string, respectively.
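
As a minimal illustration (not part of the NeMo codebase), a Python snippet along the following lines can serialize training examples into the expected JSONL format; the field values below are placeholders:

.. code-block:: python

    # Sketch only: write training examples in the expected JSONL format.
    # Field values are placeholders; real data uses full query/document text.
    import json

    examples = [
        {
            "query": "example question text",
            "pos_doc": "a passage that answers the query",
            "neg_doc": "a passage that does not answer the query",
            "query_id": "q0",
            "doc_id": "d0",
        },
    ]

    with open("train.jsonl", "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")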

During training, the GPT Embedding model employs LoRA (by default) to learn embeddings for the queries and documents such that ``query``-to-``pos_doc`` similarity is maximized while ``query``-to-``neg_doc`` similarity is simultaneously minimized. LoRA allows us to fine-tune large LLMs, such as the Mistral 7B model, with a relatively small number of trainable parameters.
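
Conceptually, the training objective is a temperature-scaled contrastive loss over query, positive, and negative embeddings. The following PyTorch sketch is illustrative only; it is a simplified stand-in rather than the exact loss implemented in NeMo (which also supports in-batch and soft negatives):

.. code-block:: python

    # Simplified sketch of a temperature-scaled contrastive objective.
    # Not the exact NeMo implementation.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.02):
        # Normalize so dot products become cosine similarities.
        q = F.normalize(q_emb, dim=-1)      # [batch, hidden]
        pos = F.normalize(pos_emb, dim=-1)  # [batch, hidden]
        neg = F.normalize(neg_emb, dim=-1)  # [batch, hidden]

        pos_scores = (q * pos).sum(-1, keepdim=True) / temperature  # [batch, 1]
        neg_scores = (q * neg).sum(-1, keepdim=True) / temperature  # [batch, 1]

        # The positive document is class 0; the negative is class 1.
        logits = torch.cat([pos_scores, neg_scores], dim=-1)        # [batch, 2]
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        return F.cross_entropy(logits, labels)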

An example command to launch a training job is:

.. code-block:: console

    python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
        exp_manager.exp_dir="PATH_TO_SAVE_LORA_WEIGHTS" \
        model.global_batch_size=4 \  # the exact choice of global batch size is data dependent; typical values are in the range of 32 to 128.
        model.micro_batch_size=4 \  # the exact choice of micro batch size is GPU-memory dependent; 2 to 8 are reasonable values.
        trainer.devices=1 \  # how many GPUs to use per node during training.
        trainer.num_nodes=1 \  # how many nodes to use if a multi-node cluster is available.
        trainer.max_steps=20 \  # how many training steps to run.
        model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
        model.peft.lora_tuning.adapter_dim=16 \  # the low-rank size for the LoRA weights.
        model.data.train_ds.file_names=["train.jsonl"]

The inline comments above are explanatory only; remove them before running the command. The full list of possible run arguments is configurable in ``/examples/nlp/information_retrieval/conf/megatron_gpt_embedder_tuning_config.yaml``. By default, a trained model file is generated in ``PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/``, typically with the extension ``.nemo``.


Inference using a GPT Embedding Model
-------------------------------------

Once trained, the GPT Embedding Model can be used to generate embeddings for queries and corpus documents. Inference can be launched with the following command, using the same base model that was used at training time:

.. code-block:: console

    python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
        model.global_batch_size=4 \
        model.micro_batch_size=4 \
        trainer.devices=1 \
        trainer.num_nodes=1 \
        model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
        model.peft.restore_from_path="PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo" \
        model.data.test_ds.query_file_names=["test_query.jsonl"] \
        model.data.test_ds.doc_file_names=["test_docs.jsonl"] \
        model.data.test_ds.write_embeddings_to_file=True \
        model.data.test_ds.output_file_path_prefix="PATH_TO_SAVE_EMBEDDINGS"

The contents of ``test_query.jsonl`` are expected to be in the following format::

{"query": "What do ... quantities?","query_id": "q11600", "doc_id": "d1172"}
{"query": "What are ... subsectors?", "query_id": "q5831", "doc_id": "d577"}
{"query": "Which article ... Government?", "query_id": "q3037", "doc_id": "d336"}

Here, the ``doc_id`` field is expected to be the ID of the document/passage that is the correct (relevant) passage for the query. Note that since we are in inference mode, query-doc pairs are not required.

The contents of ``test_docs.jsonl`` are expected to be in the following format::

{"pos_doc": "Hormones ... vitamin D.", "doc_id": "d823"}
{"pos_doc": "Historically, Victoria ... October 2016.", "doc_id": "d159"}
{"pos_doc": "Exceptional examples ... Warsaw.", "doc_id": "d1084"}

Once again, we show three examples from each file. Typically, ``test_docs.jsonl`` will contain more items than ``test_query.jsonl`` contains queries.

The inference command will produce two folders:

* ``PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries``
* ``PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc``

The ``X`` in ``consumed_samplesX`` is a number denoting the number of batches consumed. It is not crucial at test time, but it is useful during training, as we will see in the next section. First, let's take a look at ``test_queries``.

.. code-block:: console

    $> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries
    query.ids  query.npy
    $> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries/query.ids
    q11600
    q5831
    q3037

``query.npy`` is a pickled NumPy array containing one row per query embedding, and the ``query.ids`` text file lists the ID of each embedding in the same order.
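
As a small illustrative snippet (assuming the paths above), the embeddings and IDs can be loaded and aligned like this:

.. code-block:: python

    # Illustrative only: load query embeddings and align them with their IDs.
    import numpy as np

    emb_dir = "PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries"
    query_embs = np.load(f"{emb_dir}/query.npy", allow_pickle=True)  # one row per query
    with open(f"{emb_dir}/query.ids") as f:
        query_ids = [line.strip() for line in f]

    assert len(query_ids) == query_embs.shape[0]
    id_to_emb = dict(zip(query_ids, query_embs))  # query_id -> embedding vector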

Similarly, let's look into the ``test_doc`` folder:

.. code-block:: console

    $> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc/
    doc.ids  doc.npy
    $> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc/doc.ids
    d823
    d159
    d1084

We can see that ``test_doc`` has a structure similar to ``test_queries``, but with the IDs and embeddings of the documents from the ``test_docs.jsonl`` file. With this setup, it is possible to evaluate retrieval performance using metrics like MRR or NDCG.
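
As a rough sketch of such an evaluation (illustrative only, assuming each query's relevant ``doc_id`` is taken from ``test_query.jsonl`` and that the ``.npy`` files load as 2-D float arrays), MRR could be computed as follows:

.. code-block:: python

    # Illustrative MRR computation from the generated embedding files.
    import json
    import numpy as np

    out_dir = "PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX"

    def load_ids(path):
        with open(path) as f:
            return [line.strip() for line in f]

    q_embs = np.load(f"{out_dir}/test_queries/query.npy", allow_pickle=True)
    d_embs = np.load(f"{out_dir}/test_doc/doc.npy", allow_pickle=True)
    q_ids = load_ids(f"{out_dir}/test_queries/query.ids")
    d_ids = load_ids(f"{out_dir}/test_doc/doc.ids")

    # Relevance labels: the doc_id recorded for each query in test_query.jsonl.
    relevant = {}
    with open("test_query.jsonl") as f:
        for line in f:
            item = json.loads(line)
            relevant[item["query_id"]] = item["doc_id"]

    # Cosine similarity between every query and every document.
    q = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
    d = d_embs / np.linalg.norm(d_embs, axis=1, keepdims=True)
    scores = q @ d.T  # shape: [num_queries, num_docs]

    reciprocal_ranks = []
    for i, qid in enumerate(q_ids):
        ranking = np.argsort(-scores[i])  # document indices, best first
        rank = next(r for r, j in enumerate(ranking, start=1) if d_ids[j] == relevant[qid])
        reciprocal_ranks.append(1.0 / rank)

    print("MRR:", float(np.mean(reciprocal_ranks)))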
@@ -4,7 +4,7 @@ trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
@@ -66,8 +66,14 @@ model:
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
temperature: 0.8
temperature: 0.02
num_soft_negatives: 0 # Number of soft negatives to use for contrastive loss,it should be max(batch_size - 1), 0 means use hard negatives only
use_all_possible_negatives: False # If True, use all possible negatives for contrastive loss, otherwise use num_soft_negatives, if num_soft_negatives is 0, use hard negatives only
post_process: False # should be False.
transformer_engine: True # required to be True for newer versions of Megatron-LM based models
mcore_gpt: True # required to be True for newer versions of Megatron-LM based models
use_flash_attention: True
precision: bf16

peft:
peft_scheme: "lora" # can be either adapter,ia3, or ptuning
@@ -119,8 +125,8 @@ model:
query_file_names: ??? # Path to a list of JSONL files corresponding to the query data. Data format is identical to validation_ds.
doc_file_names: ??? # Path to a list of JSONL files corresponding to the doc data. Data format is identical to validation_ds.
names: ["queries", "doc"] # Names of the corresponding datasets used to log metrics.
global_batch_size: 1
micro_batch_size: 1
global_batch_size: ${global_batch_size}
micro_batch_size: ${micro_batch_size}
shuffle: False
num_workers: 0
pin_memory: True
@@ -4,15 +4,16 @@ trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
max_epochs: 9999
max_epochs: null
max_steps: 20000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10 # frequency with which training steps are logged
val_check_interval: 200 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: 1.0
val_check_interval: ${trainer.max_steps} # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: null
num_sanity_val_steps: 0

exp_manager:
explicit_log_dir: null
@@ -34,7 +35,7 @@ exp_manager:
model_parallel_size: ${model.tensor_model_parallel_size}
always_save_nemo: False
save_best_model: True
create_early_stopping_callback: True
create_early_stopping_callback: False
early_stopping_callback_params:
monitor: "val_loss"
mode: "min"
@@ -54,16 +55,16 @@ model:
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training.
sync_batch_comm: False
megatron_amp_O2: False
megatron_amp_O2: True

## Sequence Parallelism
# Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
# See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
sequence_parallel: False

## Activation Checkpoint
activations_checkpoint_granularity: null # 'selective' or 'full'
activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
activations_checkpoint_granularity: selective # 'selective' or 'full'
activations_checkpoint_method: uniform # 'uniform', 'block', not used with 'selective'
# 'uniform' divides the total number of transformer layers and checkpoints the input activation
# of each chunk at the specified granularity
# 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
@@ -74,9 +75,14 @@ model:
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
temperature: 0.8
temperature: 0.02
num_soft_negatives: 0 # Number of soft negatives to use for contrastive loss,it should be max(batch_size - 1), 0 means use hard negatives only
use_all_possible_negatives: False # If True, use all possible negatives for contrastive loss, otherwise use num_soft_negatives, if num_soft_negatives is 0, use hard negatives only
post_process: False # should be False.
transformer_engine: True # required to be True for newer versions of Megatron-LM based models
mcore_gpt: True # required to be True for newer versions of Megatron-LM based models
use_flash_attention: True
precision: bf16

peft:
peft_scheme: "lora" # can be either adapter,ia3, or ptuning
@@ -135,31 +141,32 @@ model:
num_workers: 0
memmap_workers: 2
pin_memory: True
max_seq_length: 2048
max_seq_length: 512 # Even if the base model can handle longer sequences, 512 is generally a good choice for training efficiency.
min_seq_length: 1
drop_last: True
# Example of how to specify concat_sampling_probabilities
# concat_sampling_probabilities:
# - 0.5
# - 0.25
# - 0.25
concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random'
concat_sampling_probabilities:
- 1.0
label_key: 'output'
add_eos: True
add_bos: False
index_mapping_dir: null # Path to a directory to write index mapping files.
truncation_method: 'right' # Truncation from which position, Options: ['left', 'right']
validation_ds:
query_file_names: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
doc_file_names: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
query_file_names: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
doc_file_names: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
names: ["queries", "doc"] # Names of the corresponding datasets used to log metrics.
global_batch_size: ${model.global_batch_size}
micro_batch_size: ${model.micro_batch_size}
shuffle: False
num_workers: 0
memmap_workers: ${model.data.train_ds.memmap_workers}
pin_memory: True
max_seq_length: 2048
max_seq_length: ${model.data.train_ds.max_seq_length}
min_seq_length: 1
drop_last: False
label_key: ${model.data.train_ds.label_key}
@@ -182,7 +189,7 @@ model:
num_workers: 0
memmap_workers: ${model.data.train_ds.memmap_workers}
pin_memory: True
max_seq_length: 2048
max_seq_length: ${model.data.train_ds.max_seq_length}
min_seq_length: 1
drop_last: False
add_eos: ${model.data.train_ds.add_eos}
@@ -123,8 +123,10 @@ def _build_dataset(self, data_cfg, is_train=True):
_, _, num_train_samples_per_dataset = get_datasets_weights_and_num_samples(data_prefix, num_train_samples)
num_train_samples_after_blend = sum([x[0] for x in num_train_samples_per_dataset])
else:
num_query_samples_per_dataset = [[None]] * len(data_cfg.query_file_names)
num_doc_samples_per_dataset = [[None]] * len(data_cfg.doc_file_names)
num_query_files = len(data_cfg.query_file_names) if data_cfg.query_file_names is not None else 0
num_doc_files = len(data_cfg.doc_file_names) if data_cfg.doc_file_names is not None else 0
num_query_samples_per_dataset = [[None]] * num_query_files
num_doc_samples_per_dataset = [[None]] * num_doc_files

# Check dataset max_seq_legnth and max_position_embeddings size
if (
@@ -174,6 +176,9 @@ def _build_dataset(self, data_cfg, is_train=True):
)
return dataset
else:
if data_cfg.query_file_names is None or data_cfg.doc_file_names is None:
return []

query_dataset = GPTEmbeddingDataset(
file_path=data_cfg.query_file_names[0],
tokenizer=self.tokenizer,
@@ -804,7 +804,8 @@ def build_train_valid_test_datasets(self, stage):
logging.info('Building GPT SFT validation datasets.')
# Wrap this in a list since the general finetuning parent class supports multi-validation.
self._validation_ds = self._build_dataset(self.cfg.data.validation_ds, is_train=False)
logging.info(f'Length of val dataset: {len(self._validation_ds[0])}')
if self._validation_ds:
logging.info(f'Length of val dataset: {len(self._validation_ds[0])}')

if stage != 'validate':
self.maybe_build_test()
