
Fix retriever only training #25

Merged: 1 commit merged into main from fix-train-retriever-only on Sep 17, 2023
Conversation

SachiraKuruppu (Contributor)

  • Use bitsandbytes to reduce the model size.
  • Move the PEFT config inside the model to be consistent with the e2e RAG.
  • Remove hardcoded values and add command-line arguments to make the script generic, so it can train with different datasets (a rough sketch of this setup follows the list).
  • Reduce the tokenizer max length to 128, since our current data is under 100 words. (Maybe this should also be configurable via the command line.)
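For reference, a minimal sketch of the setup described above. The argument names, defaults, LoRA hyperparameters, and the Retriever class are illustrative assumptions, not the exact script in this PR.

# Illustrative sketch only: argument names, defaults, and LoRA settings are assumptions.
import argparse

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoTokenizer


def parse_args():
    parser = argparse.ArgumentParser(description="Retriever-only training")
    parser.add_argument("--model_name", required=True, help="Base encoder to fine-tune")
    parser.add_argument("--dataset_path", required=True, help="Train on any dataset, not a hardcoded one")
    parser.add_argument("--max_length", type=int, default=128, help="Tokenizer max length")
    return parser.parse_args()


class Retriever(torch.nn.Module):
    def __init__(self, model_name: str):
        super().__init__()
        # Load in 8-bit via bitsandbytes to reduce the model's memory footprint.
        self.model = AutoModel.from_pretrained(model_name, load_in_8bit=True, device_map={"": 0})
        # Keep the PEFT config inside the model, mirroring the e2e RAG setup.
        peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none")
        self.model = get_peft_model(self.model, peft_config)


if __name__ == "__main__":
    args = parse_args()
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    retriever = Retriever(args.model_name)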

@shamanez (Member)

@rsachira could you also document the issue with padding=True for future reference? It would be super helpful if you could add a two-sentence summary of our conversation. :)

@SachiraKuruppu (Contributor, Author)

@shamanez This is what I remember from our conversation. Correct me if I'm wrong.

In the tokenizer, setting padding to "max_length" pads every example to the same fixed length. The downside is that we may not need that many tokens to encode the passages.

result_ = tokenizer(queries, padding="max_length", max_length=128, truncation=True)

Alternatively, we can let the tokenizer pad dynamically to the longest sequence in the batch by setting padding to True.

result_ = tokenizer(queries, padding=True, max_length=128, truncation=True)  # padded only to the longest sequence in the batch

However, this is not a good approach for us when it comes to in-batch negative contrastive learning, because it can end up ordering the input IDs by length.
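To spell out why batch composition matters (my paraphrase of the issue, not text from the PR): with in-batch negatives, every other passage in the batch serves as a negative for a given query, so if examples are ordered or grouped by length, each query's negatives become systematically similar. A minimal sketch of that loss, with hypothetical names:

# Minimal sketch of in-batch negative contrastive loss; function and variable names are hypothetical.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    # query_emb, passage_emb: (batch_size, hidden_dim); row i is a positive (query, passage) pair.
    scores = query_emb @ passage_emb.T     # similarity of every query with every passage in the batch
    labels = torch.arange(scores.size(0))  # the diagonal entries are the positives
    return F.cross_entropy(scores, labels) # all off-diagonal passages act as negatives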

Review thread on the model-loading code:

self.model = AutoModel.from_pretrained(
    model_name,
    load_in_8bit=True,    # 8-bit loading via bitsandbytes to cut memory use
    device_map={"": 0},   # place the whole model on GPU 0
)

shamanez (Member): Remove this line. You don't need this

SachiraKuruppu (Author): Done

@SachiraKuruppu merged commit 387c79f into main on Sep 17, 2023
1 check passed
@SachiraKuruppu deleted the fix-train-retriever-only branch on September 17, 2023 00:16