Hard negative mining for Retriever fine-tuning #523
base: main
Conversation
Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
Signed-off-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>
You should update the README for this too. It should answer the following questions:
- What is hard negative mining?
- When should I use hard negative mining?
- How does this version of hard negative mining work?
- How can I use these scripts?
@@ -124,7 +124,7 @@ def __call__(self, embeddings_dataset: DocumentDataset):
         )

         with performance_report_if_with_ts_suffix(self.profile_dir, "clustering-model"):
-            embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]
+            # embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]
Why was this change made?
@@ -3,7 +3,7 @@ base_url: https://integrate.api.nvidia.com/v1
 api_key: "your api key here"

 # LLM Generator Module
-generator_model: mistralai/mixtral-8x22b-instruct-v0.1
+generator_model: nvdev/mistralai/mixtral-8x22b-instruct-v0.1
Revert
@@ -57,7 +57,7 @@ generator_user_prompt_template: |


 #Easiness filter (embedding model as judge)
-easiness_filter: nvidia/nv-embedqa-e5-v5
+easiness_filter: nvdev/nvidia/nv-embedqa-e5-v5
Revert
# HARD-NEGATIVE MINING parameters
# base_url: "https://integrate.api.nvidia.com/v1"
model_name: "sentence-transformers/all-MiniLM-L6-v2"
# "intfloat/e5-large-unsupervised"
Remove the commented-out config.
# SEMANTIC CLUSTERING parameters
min_cluster_size: 50
max_number_clusters: 200
cluster_output_dir: "/home/viraman/qa_bug_fix/curator_outputs/clusters"
Change these directories to the defaults in your config.py.
df = df.rename(columns={"documents": "pos_doc"})
df = df[["question", "pos_doc", "neg_doc"]]

return DocumentDataset(dd.from_pandas(df))
Why are you doing `dd.from_pandas` here? I don't see where `df` becomes a pandas dataframe. `df = df.to_backend("pandas")` means `df` is still a distributed Dask dataframe, just with a pandas backend.
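A minimal sketch of the distinction the reviewer is making, assuming plain dask.dataframe (the DocumentDataset wrapper is left out):

import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"question": ["q1"], "pos_doc": ["d1"], "neg_doc": ["d2"]})
df = dd.from_pandas(pdf, npartitions=1)

# to_backend("pandas") only swaps the partition backend; the result is
# still a lazy, distributed Dask dataframe, not an in-memory pandas one.
df = df.to_backend("pandas")
assert isinstance(df, dd.DataFrame)

# So the module can return it directly, with no dd.from_pandas round trip:
# return DocumentDataset(df)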
df["min_pos_score"] = "" | ||
|
||
with ProgressBar(dt=1): | ||
df = df.map_partitions(self._process_partition, meta=df).compute() |
Do not call `compute` here. It's up to the user to decide when to call `compute`, outside of the modules.
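A hedged sketch of what that would look like for the snippet above (the surrounding class and `_process_partition` are assumed from context):

def __call__(self, dataset: DocumentDataset) -> DocumentDataset:
    df = dataset.df
    df["min_pos_score"] = ""
    # stay lazy: no ProgressBar or .compute() inside the module
    df = df.map_partitions(self._process_partition, meta=df)
    return DocumentDataset(df)

# The caller then materializes results when ready:
# mined = miner(dataset)
# result_df = mined.df.compute()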
# "nvdev/nvidia/nv-embedqa-e5-v5"
# "nvdev/nvidia/llama-3.2-nv-embedqa-1b-v1"
model_type: "hf"
query_prefix: "query:"
Is there a reason why the defaults in the dataclass don't match the defaults in the yaml?
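One way to keep the two in sync, as a sketch (the dataclass name and fields are illustrative, taken from the yaml above rather than from the actual config.py):

from dataclasses import dataclass

@dataclass
class HardNegativeMiningConfig:
    # defaults mirror the tutorial yaml so an omitted key behaves the same
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    model_type: str = "hf"
    query_prefix: str = "query:"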
return SentenceTransformer(model_name_or_path, trust_remote_code=True)


class HardNegativeMiner:
This seems like a general enough module that we could move it into `nemo_curator/` and have this tutorial just show how to use it, what do you think? If we did, you'd also need to inherit from `nemo_curator.modules.base.BaseModule`.
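If it did move, the shape might look like this sketch; the `BaseModule` constructor arguments are an assumption here, not checked against the current nemo_curator source:

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.base import BaseModule

class HardNegativeMiner(BaseModule):
    def __init__(self, model_name: str, model_type: str = "hf"):
        super().__init__(input_backend="pandas")  # assumed signature
        self.model_name = model_name
        self.model_type = model_type

    def call(self, dataset: DocumentDataset) -> DocumentDataset:
        # mining logic from this tutorial script would move here
        ...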
    lambda x: self._get_nim_embedding(x, "query")
)
elif self.model_type == "hf":
    self.hf_model = load_object_on_worker(
@VibhuJawa should this be using crossfit?
Yes, it can be; crossfit has support for these kinds of embeddings. This will need some rework to make that happen. I am not sure about the time constraints at play here.
def _groupby_question(self, pdf):
    pdf2 = pdf.groupby("question").agg({"documents": set})
    pdf2["documents"] = pdf2["documents"].map(lambda x: list(x))
    del pdf
Why call `del` here?
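For context, `pdf` is a per-partition pandas frame that goes out of scope when the function returns, so the `del` appears to buy nothing. A minimal equivalent without it:

def _groupby_question(self, pdf):
    # aggregate all documents for each question into a list
    out = pdf.groupby("question").agg({"documents": set})
    out["documents"] = out["documents"].map(list)
    return out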
df = dataset.df
df = df.to_backend("pandas")
df = df[["question", "documents"]]
df = df.map_partitions(self._groupby_question).reset_index()
Are you confident that all the questions will be in the same partition, or should you do a shuffle first?
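If duplicates of a question can span partitions, one option (a sketch, assuming a dask dataframe) is to shuffle on the key before the per-partition groupby:

# co-locate all rows for each question in the same partition first
df = df.shuffle(on="question")
df = df.map_partitions(self._groupby_question).reset_index()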
"hard negative mining algorithm not mentioned in config, using default" | ||
) | ||
self.hard_neg_mining_algorithm = "topk_percpos" | ||
if self.hard_neg_mining_algorithm == "topk_percpos": |
Can we abstract the algorithms and their arguments into a separate class? Follow a similar pattern to the common crawl extraction algorithms. We should have one base `HardNegativeMiningAlgorithm` class that defines a single method with a common signature across all subclasses. Instances of those subclasses can be passed in as arguments to this class.
If you feel like too much of the state of the `HardNegativeMiner` class would need to be passed to this algorithm, then we could instead consider making `HardNegativeMiner` an ABC that these specific methods inherit from. Let me know what you think.
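A rough sketch of that first option (all names illustrative, not from the codebase):

from abc import ABC, abstractmethod
import pandas as pd

class HardNegativeMiningAlgorithm(ABC):
    @abstractmethod
    def mine(self, partition: pd.DataFrame) -> pd.DataFrame:
        """Return the partition with hard negatives attached to each row."""

class TopKPercPos(HardNegativeMiningAlgorithm):
    def __init__(self, percpos_threshold: float = 0.95):
        self.percpos_threshold = percpos_threshold

    def mine(self, partition: pd.DataFrame) -> pd.DataFrame:
        # keep top-k negatives scoring below percpos_threshold * positive score
        ...

# Instances would then be passed in:
# miner = HardNegativeMiner(algorithm=TopKPercPos(0.95))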
Took an initial look, will take a deeper look when I have a bit more time.
    self, dataset: DocumentDataset
) -> DocumentDataset:
    df = dataset.df
    n_data = df.compute().shape[0]  # number of row items
This will make sure we don't do double computation here and only pull in the shape:
Suggested change:
-    n_data = df.compute().shape[0]  # number of row items
+    df = df.persist()
+    n_data = df.shape[0].compute()
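For what it's worth, `persist()` materializes the partitions once on the cluster, so the row count and any downstream steps reuse those materialized partitions instead of re-executing the whole graph from scratch.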
def _process_partition(self, df_p: pd.DataFrame):

    if self.model_type == "nvidia":
Can we change this `model_type` name to `nim` or `open_ai_client`? `model_type="nvidia"` is confusing.
Suggested change:
-        if self.model_type == "nvidia":
+        if self.model_type == "nim":
Description
Provides functionality to create training datasets for retriever customization.