
Hard negative mining for Retriever fine-tuning #523

Open
wants to merge 19 commits into base: main
Conversation

@vinay-raman (Contributor) commented Feb 5, 2025

Description

Provides functionality to create training datasets with mined hard negatives for retriever customization.

Usage

  1. Semantically cluster documents into partitions:
python3 repartition.py --input-dir=<input_directory> --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>
  2. Mine hard negatives separately for each of the partitions:
python3 mine_hard_negatives.py --input-dir=<output_directory_of_step1>/clustered_dataset/ --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

vinay-raman and others added 17 commits February 5, 2025 10:18
Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
…idia.com
Signed-off-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>
@ryantwolf ryantwolf self-requested a review February 24, 2025 18:29
@ryantwolf (Collaborator) left a comment:
You should update the README for this too. It should answer the following questions:

  1. What is hard negative mining?
  2. When should I use hard negative mining?
  3. How does this version of hard negative mining work?
  4. How can I use these scripts?

@@ -124,7 +124,7 @@ def __call__(self, embeddings_dataset: DocumentDataset):
)

with performance_report_if_with_ts_suffix(self.profile_dir, "clustering-model"):
embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]
# embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]
Why was this change made?

@@ -3,7 +3,7 @@ base_url: https://integrate.api.nvidia.com/v1
api_key: "your api key here"

# LLM Generator Module
generator_model: mistralai/mixtral-8x22b-instruct-v0.1
generator_model: nvdev/mistralai/mixtral-8x22b-instruct-v0.1
Revert

@@ -57,7 +57,7 @@ generator_user_prompt_template: |


#Easiness filter (embedding model as judge)
easiness_filter: nvidia/nv-embedqa-e5-v5
easiness_filter: nvdev/nvidia/nv-embedqa-e5-v5
Revert

# HARD-NEGATIVE MINING parameters
# base_url: "https://integrate.api.nvidia.com/v1"
model_name: "sentence-transformers/all-MiniLM-L6-v2"
# "intfloat/e5-large-unsupervised"
Remove commented out config.

# SEMANTIC CLUSTERING parameters
min_cluster_size: 50
max_number_clusters: 200
cluster_output_dir: "/home/viraman/qa_bug_fix/curator_outputs/clusters"
Change these directories to the defaults in your config.py.

df = df.rename(columns={"documents": "pos_doc"})
df = df[["question", "pos_doc", "neg_doc"]]

return DocumentDataset(dd.from_pandas(df))
Why are you doing dd.from_pandas here? I don't see where df becomes a pandas dataframe. df = df.to_backend("pandas") makes it such that df is still a Dask dataframe with a pandas backend, but it's still a distributed dataframe.

df["min_pos_score"] = ""

with ProgressBar(dt=1):
df = df.map_partitions(self._process_partition, meta=df).compute()
Do not call compute. It's up to the user when to call compute outside of the modules.

# nvdev/nvidia/nv-embedqa-e5-v5"
# "nvdev/nvidia/llama-3.2-nv-embedqa-1b-v1"
model_type: "hf"
query_prefix: "query:"
Is there a reason why the defaults in the dataclass don't match the defaults in the yaml?

return SentenceTransformer(model_name_or_path, trust_remote_code=True)


class HardNegativeMiner:
This seems like a general enough module that we could move it into nemo_curator/ and have this tutorial just show how to use it, what do you think? If we did, you'd also need to inherit from nemo_curator.modules.base.BaseModule.

lambda x: self._get_nim_embedding(x, "query")
)
elif self.model_type == "hf":
self.hf_model = load_object_on_worker(
@VibhuJawa should this be using crossfit?

Yes, it can be; crossfit has support for these kinds of embeddings. This will need some rework to make that happen. I am not sure about the time constraints at play here.

def _groupby_question(self, pdf):
pdf2 = pdf.groupby("question").agg({"documents": set})
pdf2["documents"] = pdf2["documents"].map(lambda x: list(x))
del pdf
Why call this del?

df = dataset.df
df = df.to_backend("pandas")
df = df[["question", "documents"]]
df = df.map_partitions(self._groupby_question).reset_index()
Are you confident that all the questions will be in the same partition, or should you do a shuffle first?

"hard negative mining algorithm not mentioned in config, using default"
)
self.hard_neg_mining_algorithm = "topk_percpos"
if self.hard_neg_mining_algorithm == "topk_percpos":
Can we abstract the algorithms and their arguments into a separate class? Follow a similar pattern to the common crawl extraction algorithms. We should have one base HardNegativeMiningAlgorithm class that defines a single method with a common signature across all subclasses. Instances of those subclasses can be passed in as arguments to this class.

If you feel like too much of the state of the HardNegativeMiner class would need to be passed to this algorithm, then we could instead consider making HardNegativeMiner an ABC that these specific methods inherit from. Let me know what you think.
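A rough sketch of the first option (all class names, column names, and the scoring scheme below are hypothetical, not the actual NeMo Curator API):

```python
from abc import ABC, abstractmethod
import pandas as pd

class HardNegativeMiningAlgorithm(ABC):
    """One common signature shared by every mining strategy."""

    @abstractmethod
    def mine(self, scores: pd.DataFrame) -> pd.DataFrame:
        """Given per-candidate scores for one question, return hard negatives."""

class TopKPercPos(HardNegativeMiningAlgorithm):
    """Keep the top-k candidates scoring below a fraction of the positive score."""

    def __init__(self, k: int = 2, perc_pos: float = 0.95):
        self.k = k
        self.perc_pos = perc_pos

    def mine(self, scores: pd.DataFrame) -> pd.DataFrame:
        # A candidate is a hard negative if it scores high, but not so high
        # that it is likely a false negative (above perc_pos * pos_score).
        threshold = scores["pos_score"] * self.perc_pos
        candidates = scores[scores["neg_score"] < threshold]
        return candidates.nlargest(self.k, "neg_score")

class HardNegativeMiner:
    """The miner is configured with an algorithm instance, not a string."""

    def __init__(self, algorithm: HardNegativeMiningAlgorithm):
        self.algorithm = algorithm
```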

@VibhuJawa (Collaborator) left a comment:
Took an initial look; will take a deeper look when I have more time.

self, dataset: DocumentDataset
) -> DocumentDataset:
df = dataset.df
n_data = df.compute().shape[0] # number of row items
This will make sure we don't do double computation here and only pull in the shape:

Suggested change
n_data = df.compute().shape[0] # number of row items
df = df.persist()
n_data = df.shape[0].compute()


def _process_partition(self, df_p: pd.DataFrame):

if self.model_type == "nvidia":
Can we change this model_type name to nim or open_ai_client? model_type="nvidia" is confusing.

Suggested change
if self.model_type == "nvidia":
if self.model_type == "nim":


vinay-raman and others added 2 commits February 26, 2025 06:00
Signed-off-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>
…ed after sync with latest main, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>