
Fix retriever only training #25

Merged: 1 commit merged into main from fix-train-retriever-only on Sep 17, 2023
Conversation

SachiraKuruppu (Contributor)

  • Use bitsandbytes to reduce the model size.
  • Move the PEFT config inside the model to be consistent with the e2e RAG.
  • Remove hardcoded values and add command-line arguments to make the script generic, so it can train with different datasets (a rough sketch of this setup follows the list).
  • Reduce the tokenizer max length to 128, since our current data is under 100 words. (Maybe this should also be configurable via the command line.)
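For reference, a minimal sketch of the setup described above. The argument names, defaults, LoRA hyperparameters, and the Retriever class are illustrative assumptions, not the exact script in this PR.

# Illustrative sketch only: argument names, defaults, and LoRA settings are assumptions.
import argparse

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoTokenizer


def parse_args():
    parser = argparse.ArgumentParser(description="Retriever-only training")
    parser.add_argument("--model_name", required=True, help="Base encoder to fine-tune")
    parser.add_argument("--dataset_path", required=True, help="Train on any dataset, not a hardcoded one")
    parser.add_argument("--max_length", type=int, default=128, help="Tokenizer max length")
    return parser.parse_args()


class Retriever(torch.nn.Module):
    def __init__(self, model_name: str):
        super().__init__()
        # Load in 8-bit via bitsandbytes to reduce the model's memory footprint.
        self.model = AutoModel.from_pretrained(model_name, load_in_8bit=True, device_map={"": 0})
        # Keep the PEFT config inside the model, mirroring the e2e RAG setup.
        peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none")
        self.model = get_peft_model(self.model, peft_config)


if __name__ == "__main__":
    args = parse_args()
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    retriever = Retriever(args.model_name)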

@shamanez (Member)

@rsachira could you also document the issue with padding=True for future reference? It would be super helpful if you could add a two-sentence summary of our conversation. :)

@SachiraKuruppu (Contributor, Author)

@shamanez This is what I remember from our conversation. Correct me if I'm wrong.

In the tokenizer, setting padding to "max_length" pads every example to the same fixed length. The downside is that we may not need that many tokens to encode the passages.

result_ = tokenizer(queries, padding="max_length", max_length=128, truncation=True)

Alternatively, we can let the tokenizer pad dynamically to the longest sequence in the batch by setting padding to True.

result_ = tokenizer(queries, padding=True, max_length=128, truncation=True)  # padded only to the longest sequence in the batch

However, this is not a good approach for us when it comes to in-batch negative contrastive learning, because it can end up ordering the input IDs by length.
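To spell out why batch composition matters (my paraphrase of the issue, not text from the PR): with in-batch negatives, every other passage in the batch serves as a negative for a given query, so if examples are ordered or grouped by length, each query's negatives become systematically similar. A minimal sketch of that loss, with hypothetical names:

# Minimal sketch of in-batch negative contrastive loss; function and variable names are hypothetical.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    # query_emb, passage_emb: (batch_size, hidden_dim); row i is a positive (query, passage) pair.
    scores = query_emb @ passage_emb.T     # similarity of every query with every passage in the batch
    labels = torch.arange(scores.size(0))  # the diagonal entries are the positives
    return F.cross_entropy(scores, labels) # all off-diagonal passages act as negatives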

Review thread on the model-loading code:

self.model = AutoModel.from_pretrained(
    model_name,
    load_in_8bit=True,    # 8-bit loading via bitsandbytes to cut memory use
    device_map={"": 0},   # place the whole model on GPU 0
)

shamanez (Member): Remove this line. You don't need this

SachiraKuruppu (Author): Done

@SachiraKuruppu merged commit 387c79f into main on Sep 17, 2023
1 check passed
@SachiraKuruppu deleted the fix-train-retriever-only branch on September 17, 2023 00:16