docs and simplification of cmd args #8979

Merged (13 commits) on Apr 29, 2024
54 changes: 54 additions & 0 deletions .github/workflows/cicd-main.yml
@@ -4644,6 +4644,59 @@ jobs:
rm -rf examples/nlp/language_modeling/gpt_sft_results
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

L2_Megatron_GPT_Embedding:
needs: [cicd-test-container-setup]
runs-on: self-hosted-azure
container:
image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
options:
# --user 0:128
--device=/dev/nvidia0
--gpus all
--shm-size=8g
--env TRANSFORMERS_OFFLINE=0
--env HYDRA_FULL_ERROR=1
--volume /mnt/datadrive/TestData:/home/TestData
steps:
- name: Checkout repository
uses: actions/checkout@v2
- run: |
rm -rf /home/TestData/nlp/megatron_ir/working_dir

python examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
exp_manager.exp_dir='/home/TestData/nlp/megatron_ir/working_dir' \
model.global_batch_size=4 \
model.micro_batch_size=4 \
trainer.devices=1 \
trainer.num_nodes=1 \
trainer.max_epochs=null \
trainer.max_steps=20 \
trainer.val_check_interval=10 \
model.restore_from_path='/home/TestData/nlp/megatron_gpt/mcore_45M/megatron_llama.nemo' \
model.peft.lora_tuning.adapter_dim=8 \
model.data.validation_ds.query_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
model.data.validation_ds.doc_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
model.data.validation_ds.write_embeddings_to_file=True \
model.data.validation_ds.output_file_path_prefix='/home/TestData/nlp/megatron_ir/working_dir/val_embs' \
model.data.train_ds.file_names=[/home/TestData/nlp/megatron_ir/train.jsonl]


python examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
trainer.devices=1 \
trainer.num_nodes=1 \
model.restore_from_path='/home/TestData/nlp/megatron_gpt/mcore_45M/megatron_llama.nemo' \
model.peft.restore_from_path='/home/TestData/nlp/megatron_ir/working_dir/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo' \
model.data.test_ds.query_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
model.data.test_ds.doc_file_names=[/home/TestData/nlp/megatron_ir/test_query.jsonl] \
model.global_batch_size=4 \
model.micro_batch_size=4 \
model.data.test_ds.write_embeddings_to_file=True \
model.data.test_ds.output_file_path_prefix='/home/TestData/nlp/megatron_ir/working_dir/test_embs'

rm -rf /home/TestData/nlp/megatron_ir/working_dir
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"

L2_Megatron_GPT_PEFT_Lora_PP2:
needs: [cicd-test-container-setup]
@@ -6252,6 +6305,7 @@ jobs:
- L2_Megatron_GPT_Pretraining_and_Resume_Training_PP2
- L2_Megatron_GPT_Finetuning_PP2
- L2_Megatron_GPT_Finetuning_StarCoder_PP1
- L2_Megatron_GPT_Embedding
- L2_Megatron_GPT_PEFT_Lora_PP2
- L2_Megatron_GPT_PEFT_Lora_TP2
- L2_Megatron_GPT_Eval
104 changes: 104 additions & 0 deletions docs/source/nlp/information_retrieval.rst
@@ -102,3 +102,107 @@ Then you can fine-tune the sentence-BERT model using the following script:
exp_manager.wandb_logger_kwargs.name=${NAME} \
exp_manager.wandb_logger_kwargs.project=${PROJECT}

GPT Embedding Models
=====================

Recent work has shown that it is also possible to use decoder-only (GPT-style) models to train embedding models.
`Improving Text Embeddings with
Large Language Models <https://arxiv.org/pdf/2401.00368.pdf>`__ is one such recent paper, and it served as the inspiration for implementing decoder-only embedding training in NeMo.

Training a GPT Embedding Model
-------------------------------

To train GPT Embedding models, we follow a format very similar to SBERT Embedding training, with a couple of differences. GPT Embedding model training expects a ``jsonl`` file in which each line is a JSON object. Here is a truncated example of a data ``jsonl`` file::

{"query": "What did ... 1952-2002 period?", "pos_doc": "Morning (2008) ... has changed little.", "neg_doc": "Even though ... sapiens.", "query_id": "q103151", "doc_id": "d14755"}
{"query": "What type of ... passions?", "pos_doc": "Burke was a leading ... upper classes.", "neg_doc": "Writing to a friend ... Government.", "query_id": "q77959", "doc_id": "d11263"}
{"query": "Since 1999, ... progressed at?", "pos_doc": "Commercial solar water ... as of 2007.", "neg_doc": "The potential solar ... acquire.", "query_id": "q16545", "doc_id": "d1883"}


As shown above, each JSON object should contain the following fields: ``query``, ``pos_doc``, ``neg_doc``, ``query_id``, and ``doc_id``. The ``query_id`` and ``doc_id`` can be any alphanumeric strings that uniquely map to the ``query`` string and ``pos_doc`` string, respectively.
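
For reference, the following is a minimal Python sketch (not a NeMo utility; the file name and field contents are placeholders) showing how a training ``jsonl`` file in this format could be assembled and validated:

.. code-block:: python

    import json

    REQUIRED_FIELDS = {"query", "pos_doc", "neg_doc", "query_id", "doc_id"}

    examples = [
        {
            "query": "An example question?",
            "pos_doc": "A passage that answers the query.",
            "neg_doc": "A passage that does not answer the query.",
            "query_id": "q103151",
            "doc_id": "d14755",
        },
    ]

    with open("train.jsonl", "w") as f:
        for ex in examples:
            # Each line must be a standalone JSON object containing all five fields.
            missing = REQUIRED_FIELDS - set(ex)
            assert not missing, f"missing fields: {missing}"
            f.write(json.dumps(ex) + "\n")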

During training, the GPT Embedding model employs LoRA (by default) to learn embeddings for the queries and documents such that the ``query``-to-``pos_doc`` similarity is maximized while the ``query``-to-``neg_doc`` similarity is simultaneously minimized. LoRA allows us to fine-tune large LLMs, such as the Mistral 7B model, with a relatively small number of trainable parameters.
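
Conceptually, the objective is an in-batch contrastive loss over cosine similarities between query and document embeddings. The PyTorch snippet below is an illustrative sketch of that idea only, not the exact loss used inside NeMo; the ``temperature`` argument merely mirrors the config option of the same name:

.. code-block:: python

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.02):
        # q_emb, pos_emb, neg_emb: [batch, hidden] embeddings produced by the model.
        q = F.normalize(q_emb, dim=-1)
        docs = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)
        # Cosine similarity of every query against every candidate document.
        logits = q @ docs.t() / temperature  # shape: [batch, 2 * batch]
        # The positive document for query i sits at column i.
        labels = torch.arange(q.size(0), device=q.device)
        return F.cross_entropy(logits, labels)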

An example command to launch a training job is

.. code-block:: console

python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
exp_manager.exp_dir="PATH_TO_SAVE_LORA_WEIGHTS" \
model.global_batch_size=4 \ # the exact choice of global batch size is data dependent; typical values are in the range of 32 to 128.
model.micro_batch_size=4 \ # the exact choice of micro batch size is GPU-memory dependent; 2 to 8 are reasonable values.
trainer.devices=1 \ # how many GPUs to use per node during training.
trainer.num_nodes=1 \ # how many nodes to use if a multi-node cluster is available.
trainer.max_steps=20 \ # how many training steps to run.
model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
model.peft.lora_tuning.adapter_dim=16 \ # the low-rank size for lora weights.
model.data.train_ds.file_names=["train.jsonl"]

The full list of possible run arguments is configurable in ``/examples/nlp/information_retrieval/conf/megatron_gpt_embedder_tuning_config.yaml``. By default, a trained model file should be generated in ``PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/``, typically with the extension ``.nemo``.


Inference using a GPT Embedding Model
-------------------------------------

Once trained, the GPT Embedding Model can be used to generate embeddings for queries and corpus documents. We can launch inference using the following command:

.. code-block:: console

python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
model.global_batch_size=4 \
model.micro_batch_size=4 \
trainer.devices=1 \
trainer.num_nodes=1 \
model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \ # Same base model used at training time.
model.peft.restore_from_path="PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo" \
model.data.test_ds.query_file_names=["test_queries.jsonl"] \
model.data.test_ds.doc_file_names=["test_docs.jsonl"] \
model.data.test_ds.write_embeddings_to_file=True \
model.data.test_ds.output_file_path_prefix="PATH_TO_SAVE_EMBEDDINGS"

The contents of ``test_queries.jsonl`` are expected to be in the following format::

{"query": "What do ... quantities?","query_id": "q11600", "doc_id": "d1172"}
{"query": "What are ... subsectors?", "query_id": "q5831", "doc_id": "d577"}
{"query": "Which article ... Government?", "query_id": "q3037", "doc_id": "d336"}

Here, the ``doc_id`` field is expected to be the id of the document/passage that is the correct passage for the query. Note that since we are in inference mode, query-doc pairs are not required.

The contents of ``test_docs.jsonl`` are expected to be in the following format::

{"pos_doc": "Hormones ... vitamin D.", "doc_id": "d823"}
{"pos_doc": "Historically, Victoria ... October 2016.", "doc_id": "d159"}
{"pos_doc": "Exceptional examples ... Warsaw.", "doc_id": "d1084"}

Once again, we show three examples from each file. Typically, ``test_docs.jsonl`` will contain more items than there are queries in ``test_queries.jsonl``.

The inference command will result in two folders

* ``PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries``
* ``PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_docs``

The ``X`` in the folder name ``consumed_samplesX`` denotes the number of batches consumed. It is not crucial at test time, but it is useful during training, as we will see in the next section. First, let's take a look at ``test_queries``.

.. code-block:: console

$> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries
query.ids query.npy
$> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries/query.ids
q11600
q5831
q3037

``query.npy`` is a NumPy array file containing rows of query embeddings, and the ``query.ids`` text file lists the id of each embedding in the same order.
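
For example, the embeddings and their ids can be loaded back with a few lines of NumPy (a sketch; substitute your own output directory for the placeholder path):

.. code-block:: python

    import numpy as np

    prefix = "PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries"
    query_embs = np.load(f"{prefix}/query.npy")   # array of shape [num_queries, hidden]
    with open(f"{prefix}/query.ids") as f:
        query_ids = [line.strip() for line in f]  # id of each embedding row, in order

    assert len(query_ids) == query_embs.shape[0]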

Similarly, let's look at the ``test_doc`` folder:

.. code-block:: console

$> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc/
doc.ids doc.npy
$> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_doc/doc.ids
d823
d159
d1084

We can see that ``test_doc`` has a similar structure to ``test_queries``, but with the ids and embeddings of the documents from the ``test_docs.jsonl`` file. With this setup, it is possible to evaluate retrieval performance using metrics like MRR or NDCG.
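
As an illustration, a minimal sketch of an MRR computation over these files could look as follows. This is not a NeMo utility; it assumes each query's ``doc_id`` in ``test_queries.jsonl`` identifies its single relevant document:

.. code-block:: python

    import numpy as np

    def mean_reciprocal_rank(query_embs, query_ids, doc_embs, doc_ids, relevant):
        # relevant: dict mapping query_id -> doc_id of its correct passage.
        q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
        d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
        scores = q @ d.T  # cosine similarity matrix, [num_queries, num_docs]
        reciprocal_ranks = []
        for i, qid in enumerate(query_ids):
            order = np.argsort(-scores[i])  # best-scoring documents first
            rank = next(j for j, idx in enumerate(order) if doc_ids[idx] == relevant[qid])
            reciprocal_ranks.append(1.0 / (rank + 1))
        return float(np.mean(reciprocal_ranks))
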
@@ -4,7 +4,7 @@ trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
@@ -66,8 +66,14 @@ model:
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
temperature: 0.8
temperature: 0.02
num_soft_negatives: 0 # Number of soft negatives to use for contrastive loss, it should be max(batch_size - 1), 0 means use hard negatives only
use_all_possible_negatives: False # If True, use all possible negatives for contrastive loss, otherwise use num_soft_negatives, if num_soft_negatives is 0, use hard negatives only
post_process: False # should be False.
transformer_engine: True # required to be True for newer versions of Megatron-LM based models
mcore_gpt: True # required to be True for newer versions of Megatron-LM based models
use_flash_attention: True
precision: bf16

peft:
peft_scheme: "lora" # can be either adapter,ia3, or ptuning
@@ -119,8 +125,8 @@ model:
query_file_names: ??? # Path to a list of JSONL files corresponding to the query data. Data format is identical to validation_ds.
doc_file_names: ??? # Path to a list of JSONL files corresponding to the doc data. Data format is identical to validation_ds.
names: ["queries", "doc"] # Names of the corresponding datasets used to log metrics.
global_batch_size: 1
micro_batch_size: 1
global_batch_size: ${global_batch_size}
micro_batch_size: ${micro_batch_size}
shuffle: False
num_workers: 0
pin_memory: True
@@ -4,15 +4,16 @@ trainer:
devices: 1
accelerator: gpu
num_nodes: 1
precision: 16
precision: bf16
logger: False # logger provided by exp_manager
enable_checkpointing: False
use_distributed_sampler: False
max_epochs: 9999
max_epochs: null
max_steps: 20000 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
log_every_n_steps: 10 # frequency with which training steps are logged
val_check_interval: 200 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: 1.0
val_check_interval: ${trainer.max_steps} # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
gradient_clip_val: null
num_sanity_val_steps: 0

exp_manager:
explicit_log_dir: null
@@ -34,7 +35,7 @@ exp_manager:
model_parallel_size: ${model.tensor_model_parallel_size}
always_save_nemo: False
save_best_model: True
create_early_stopping_callback: True
create_early_stopping_callback: False
early_stopping_callback_params:
monitor: "val_loss"
mode: "min"
@@ -54,16 +55,16 @@ model:
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
save_nemo_on_validation_end: False # Saves an inference ready .nemo file every time a checkpoint is saved during training.
sync_batch_comm: False
megatron_amp_O2: False
megatron_amp_O2: True

## Sequence Parallelism
# Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
# See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
sequence_parallel: False

## Activation Checkpoint
activations_checkpoint_granularity: null # 'selective' or 'full'
activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
activations_checkpoint_granularity: selective # 'selective' or 'full'
activations_checkpoint_method: uniform # 'uniform', 'block', not used with 'selective'
# 'uniform' divides the total number of transformer layers and checkpoints the input activation
# of each chunk at the specified granularity
# 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
@@ -74,9 +75,14 @@ model:
hidden_dropout: 0.0
attention_dropout: 0.0
ffn_dropout: 0.0
temperature: 0.8
temperature: 0.02
num_soft_negatives: 0 # Number of soft negatives to use for contrastive loss, it should be max(batch_size - 1), 0 means use hard negatives only
use_all_possible_negatives: False # If True, use all possible negatives for contrastive loss, otherwise use num_soft_negatives, if num_soft_negatives is 0, use hard negatives only
post_process: False # should be False.
transformer_engine: True # required to be True for newer versions of Megatron-LM based models
mcore_gpt: True # required to be True for newer versions of Megatron-LM based models
use_flash_attention: True
precision: bf16

peft:
peft_scheme: "lora" # can be either adapter,ia3, or ptuning
@@ -135,31 +141,32 @@ model:
num_workers: 0
memmap_workers: 2
pin_memory: True
max_seq_length: 2048
max_seq_length: 512 # Even if the base model can handle longer sequences, 512 is generally a good choice for training efficiency.
min_seq_length: 1
drop_last: True
# Example of how to specify concat_sampling_probabilities
# concat_sampling_probabilities:
# - 0.5
# - 0.25
# - 0.25
concat_sampling_probabilities: null # When providing a list of datasets, this arg defines the sampling probabilities from each dataset when strategy='random'
concat_sampling_probabilities:
- 1.0
label_key: 'output'
add_eos: True
add_bos: False
index_mapping_dir: null # Path to a directory to write index mapping files.
truncation_method: 'right' # Truncation from which position, Options: ['left', 'right']
validation_ds:
query_file_names: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
doc_file_names: ??? # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
query_file_names: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
doc_file_names: null # Path to a list of JSONL files corresponding to the source data. Data format is identical to train_ds.
names: ["queries", "doc"] # Names of the corresponding datasets used to log metrics.
global_batch_size: ${model.global_batch_size}
micro_batch_size: ${model.micro_batch_size}
shuffle: False
num_workers: 0
memmap_workers: ${model.data.train_ds.memmap_workers}
pin_memory: True
max_seq_length: 2048
max_seq_length: ${model.data.train_ds.max_seq_length}
min_seq_length: 1
drop_last: False
label_key: ${model.data.train_ds.label_key}
@@ -182,7 +189,7 @@ model:
num_workers: 0
memmap_workers: ${model.data.train_ds.memmap_workers}
pin_memory: True
max_seq_length: 2048
max_seq_length: ${model.data.train_ds.max_seq_length}
min_seq_length: 1
drop_last: False
add_eos: ${model.data.train_ds.add_eos}
@@ -123,8 +123,10 @@
_, _, num_train_samples_per_dataset = get_datasets_weights_and_num_samples(data_prefix, num_train_samples)
num_train_samples_after_blend = sum([x[0] for x in num_train_samples_per_dataset])
else:
num_query_samples_per_dataset = [[None]] * len(data_cfg.query_file_names)
num_doc_samples_per_dataset = [[None]] * len(data_cfg.doc_file_names)
num_query_files = len(data_cfg.query_file_names) if data_cfg.query_file_names is not None else 0
num_doc_files = len(data_cfg.doc_file_names) if data_cfg.doc_file_names is not None else 0
num_query_samples_per_dataset = [[None]] * num_query_files

num_doc_samples_per_dataset = [[None]] * num_doc_files

Code scanning / CodeQL notice: local variables num_query_samples_per_dataset and num_doc_samples_per_dataset are not used.

# Check dataset max_seq_legnth and max_position_embeddings size
if (
@@ -174,6 +176,9 @@
)
return dataset
else:
if data_cfg.query_file_names is None or data_cfg.doc_file_names is None:
return []

query_dataset = GPTEmbeddingDataset(
file_path=data_cfg.query_file_names[0],
tokenizer=self.tokenizer,
@@ -804,7 +804,8 @@ def build_train_valid_test_datasets(self, stage):
logging.info('Building GPT SFT validation datasets.')
# Wrap this in a list since the general finetuning parent class supports multi-validation.
self._validation_ds = self._build_dataset(self.cfg.data.validation_ds, is_train=False)
logging.info(f'Length of val dataset: {len(self._validation_ds[0])}')
if self._validation_ds:
logging.info(f'Length of val dataset: {len(self._validation_ds[0])}')

if stage != 'validate':
self.maybe_build_test()