update llm finetune methods #44
Conversation
Copilot reviewed 13 out of 28 changed files in this pull request and generated 3 suggestions.
Files not reviewed (15)
- examples/lm_finetune/llama_fullfinetune.sh: Language not supported
- examples/lm_finetune/llama_reward_modeling.sh: Language not supported
- examples/lm_finetune/create_reward_model.ipynb: Evaluated as low risk
- fusion_bench/dataset/llama/__init__.py: Evaluated as low risk
- fusion_bench/mixins/__init__.py: Evaluated as low risk
- config/modelpool/SeqenceClassificationModelPool/llama_preference700k.yaml: Evaluated as low risk
- config/method/lm_finetune/bradly_terry_rm.yaml: Evaluated as low risk
- fusion_bench/mixins/fabric_training.py: Evaluated as low risk
- fusion_bench/method/lm_finetune/fullfinetune_sft.py: Evaluated as low risk
- fusion_bench/modelpool/__init__.py: Evaluated as low risk
- fusion_bench/method/lm_finetune/__init__.py: Evaluated as low risk
- fusion_bench/modelpool/causal_lm/causal_lm.py: Evaluated as low risk
- fusion_bench/method/__init__.py: Evaluated as low risk
- config/method/lm_finetune/fullfinetune_sft.yaml: Evaluated as low risk
- fusion_bench/dataset/llama/alpaca.py: Evaluated as low risk
Comments skipped due to low confidence (2)
fusion_bench/dataset/llama/preference_700k.py:59
- Ensure that `rank_zero_only` is properly initialized or imported to avoid potential runtime errors.
if cache_path is not None and rank_zero_only.rank == 0:
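The guard in question follows the usual "only rank 0 writes the cache" pattern used in distributed training. A pure-Python sketch, with `RankZeroOnly` and `maybe_cache` as hypothetical stand-ins for Lightning's `rank_zero_only` utility and the repository's caching code:

```python
class RankZeroOnly:
    """Minimal stand-in for Lightning's rank_zero_only utility, which
    exposes the current process rank (illustrative only)."""
    rank = 0  # a single-process run is always rank 0


rank_zero_only = RankZeroOnly()


def maybe_cache(dataset, cache_path):
    """Write the processed dataset to disk only on the main process,
    so multiple workers do not race on the same cache file."""
    if cache_path is not None and rank_zero_only.rank == 0:
        return f"saved to {cache_path}"  # real code would call save_to_disk
    return "skipped"


print(maybe_cache(["sample"], "/tmp/preference_700k.cache"))
```

Without the rank check, every worker in a multi-GPU run would attempt the same write, which is the runtime error the reviewer is warning about if `rank_zero_only` is missing.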
fusion_bench/modelpool/seq_classification_lm/__init__.py:2
- The class name 'SeqenceClassificationModelPool' is misspelled. It should be 'SequenceClassificationModelPool'.
from .seq_classification_lm import SeqenceClassificationModelPool
def tokenize(sample):
    # ? is it necessary to `.replace(tokenizer.bos_token, "")`?
[nitpick] The comment contains a question, which might be confusing. It should be revised to provide clear information.

Suggested change:
- # ? is it necessary to `.replace(tokenizer.bos_token, "")`?
+ # Remove the beginning-of-sequence token from the tokenized text
Copilot is powered by AI, so mistakes are possible. Review output carefully before use.
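For context on what the questioned line does: stripping `tokenizer.bos_token` from decoded text avoids a duplicated BOS marker when the string is concatenated or tokenized again. A minimal sketch with `ToyTokenizer` as a hypothetical stand-in for a Hugging Face tokenizer:

```python
class ToyTokenizer:
    """Minimal stand-in for a Hugging Face tokenizer (illustrative only)."""
    bos_token = "<s>"

    def decode(self, ids):
        # Pretend every decoded string starts with the BOS token,
        # as many chat tokenizers do by default.
        return self.bos_token + " " + " ".join(str(i) for i in ids)


tokenizer = ToyTokenizer()
decoded = tokenizer.decode([101, 102])
# The pattern the review comment refers to: drop the BOS marker so it is
# not duplicated when the text is re-tokenized later.
cleaned = decoded.replace(tokenizer.bos_token, "").strip()
print(cleaned)  # → 101 102
```

Whether the `.replace` is necessary depends on whether the tokenizer adds BOS during both decoding and re-encoding, which is exactly the ambiguity the original comment raises.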
def bradly_terry_rm_collate(
    batch: List[Dict[str, List[int]]],
    pad_token_id: int = 0,
    padding_side="right",
The `padding_side` argument is not valid for the `pad_sequence` function and should be removed.

Suggested change:
- padding_side="right",
+ padding_side=None,
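Note that only recent PyTorch releases added a `padding_side` argument to `torch.nn.utils.rnn.pad_sequence`; on older versions the collate function would indeed have to pad manually. A pure-Python sketch of the idea, with `pad_batch` as a hypothetical helper (not the repository's actual collate code):

```python
from typing import List


def pad_batch(sequences: List[List[int]],
              pad_value: int = 0,
              padding_side: str = "right") -> List[List[int]]:
    """Pad variable-length token-id lists to a rectangle, on either side.

    Illustrative stand-in for pad_sequence with a padding_side option;
    real code would build tensors, not nested lists.
    """
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        pad = [pad_value] * (width - len(seq))
        padded.append(seq + pad if padding_side == "right" else pad + seq)
    return padded


batch = [[1, 2, 3], [4]]
print(pad_batch(batch))                       # → [[1, 2, 3], [4, 0, 0]]
print(pad_batch(batch, padding_side="left"))  # → [[1, 2, 3], [0, 0, 4]]
```

Left padding is the common choice for decoder-only generation, while right padding suits training-time loss masking, which is presumably why the collate function exposes the option at all.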
@@ -64,3 +64,57 @@ def padded_collate_sft(
        collated_batch[key] = [x[key] for x in batch]

    return collated_batch


def bradly_terry_rm_collate(
The function name `bradly_terry_rm_collate` should be corrected to `bradley_terry_rm_collate`.

Suggested change:
- def bradly_terry_rm_collate(
+ def bradley_terry_rm_collate(
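For context on the name being corrected: the Bradley-Terry model underlies the standard reward-modeling loss, which trains the reward of the chosen response to exceed that of the rejected one. A minimal scalar sketch (illustrative only; real code operates on batched tensors):

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected
    one under the Bradley-Terry model: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# The loss shrinks as the chosen reward pulls further ahead of the rejected one.
print(bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.5, 0.0))  # → True
```

The collate function's job is to pair each chosen/rejected example into one padded batch so this margin can be computed in a single forward pass.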
Copilot reviewed 13 out of 28 changed files in this pull request and generated 4 suggestions.
Files not reviewed (15)
- examples/lm_finetune/llama_fullfinetune.sh: Language not supported
- examples/lm_finetune/llama_reward_modeling.sh: Language not supported
- examples/lm_finetune/create_reward_model.ipynb: Evaluated as low risk
- fusion_bench/dataset/llama/__init__.py: Evaluated as low risk
- config/modelpool/SeqenceClassificationModelPool/llama_preference700k.yaml: Evaluated as low risk
- fusion_bench/dataset/llama/alpaca.py: Evaluated as low risk
- fusion_bench/modelpool/seq_classification_lm/__init__.py: Evaluated as low risk
- fusion_bench/dataset/llama/metamathqa.py: Evaluated as low risk
- fusion_bench/modelpool/causal_lm/causal_lm.py: Evaluated as low risk
- fusion_bench/method/lm_finetune/__init__.py: Evaluated as low risk
- fusion_bench/method/__init__.py: Evaluated as low risk
- config/method/lm_finetune/fullfinetune_sft.yaml: Evaluated as low risk
- fusion_bench/mixins/__init__.py: Evaluated as low risk
- fusion_bench/method/lm_finetune/peftfinetune_sft.py: Evaluated as low risk
- fusion_bench/modelpool/__init__.py: Evaluated as low risk
Comments skipped due to low confidence (1)
fusion_bench/dataset/llama/preference_700k.py:41
- [nitpick] The comment questioning the necessity of replacing the `bos_token` is ambiguous. Please clarify or remove it if it's not needed.
# ? is it necessary to `.replace(tokenizer.bos_token, "")`?
os.symlink(
    save_path,
    os.path.join(self.log_dir, "checkpoints", "latest_model.ckpt"),
    os.path.isdir(save_path),
The `os.path.isdir(save_path)` check is incorrect. The `save_path` is a file path, not a directory path. This should be removed.

Suggested change:
- os.path.isdir(save_path),
+ os.path.join(self.log_dir, 'checkpoints', 'latest_model.ckpt')
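For what it's worth, `os.symlink`'s third positional argument is `target_is_directory`, which only affects Windows (where file and directory symlinks are distinct), so passing `os.path.isdir(save_path)` is a common portable idiom and the suggestion above is debatable. A self-contained sketch with illustrative temporary paths:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Illustrative checkpoint file standing in for save_path.
    save_path = os.path.join(tmp, "model.ckpt")
    with open(save_path, "w") as f:
        f.write("weights")

    # Third argument is target_is_directory; harmless on POSIX, required
    # to pick the right symlink kind on Windows.
    link = os.path.join(tmp, "latest_model.ckpt")
    os.symlink(save_path, link, target_is_directory=os.path.isdir(save_path))

    is_link = os.path.islink(link)
    points_back = os.readlink(link) == save_path
    print(is_link, points_back)  # → True True
```

Keeping the flag costs nothing on Linux and keeps the checkpoint-linking code correct if the project is ever run on Windows.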
)
if self.max_epochs > 0:
    self._expected_total_steps.append(
        len(train_dataloader) * self.max_epochs // self.accumulate_grad_batches
The `compute_expected_total_steps` method should account for the case when `self.accumulate_grad_batches` is 0 to avoid division-by-zero errors.
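The guard the reviewer asks for can be sketched as follows, with `expected_total_steps` as a hypothetical free-function stand-in for the method (not the repository's actual code):

```python
def expected_total_steps(num_batches: int,
                         max_epochs: int,
                         accumulate_grad_batches: int) -> int:
    """Number of optimizer steps for a run: batches per epoch times epochs,
    divided by the gradient-accumulation factor. Rejects a zero or negative
    accumulation setting instead of raising ZeroDivisionError mid-training."""
    if accumulate_grad_batches < 1:
        raise ValueError(
            f"accumulate_grad_batches must be >= 1, got {accumulate_grad_batches}"
        )
    return num_batches * max_epochs // accumulate_grad_batches


print(expected_total_steps(100, 3, 4))  # → 75
```

Validating the setting once, up front, gives a clear configuration error rather than a ZeroDivisionError deep inside the training loop.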
[torch.tensor(x["input_ids"]) for x in converted_batch],
batch_first=True,
padding_value=pad_token_id,
padding_side=padding_side,
The `pad_sequence` function does not support the `padding_side` parameter. This will cause a runtime error. Please remove the `padding_side` parameter from the `pad_sequence` calls.

Suggested change:
- padding_side=padding_side,
+ padding_value=pad_token_id,
[torch.tensor(x["attention_mask"]) for x in converted_batch],
batch_first=True,
padding_value=0,
padding_side=padding_side,
The `pad_sequence` function does not support the `padding_side` parameter. This will cause a runtime error. Please remove the `padding_side` parameter from the `pad_sequence` calls.

Suggested change:
- padding_side=padding_side,
+ padding_value=0,
No description provided.