
Hard negative mining for Retriever fine-tuning #523

Open
wants to merge 19 commits into base: main
Conversation

@vinay-raman (Contributor) commented Feb 5, 2025

Description

Provides functionality to create training datasets with mined hard negatives for retriever customization.

Usage

  1. Semantically cluster documents into partitions:
python3 repartition.py --input-dir=<input_directory> --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>
  2. Mine hard negatives separately for each of the partitions:
python3 mine_hard_negatives.py --input-dir=<output_directory_of_step1>/clustered_dataset/ --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

vinay-raman and others added 17 commits February 5, 2025 10:18
Signed-off-by: viraman <viraman@nvidia.com>
Signed-off-by: Vinay Raman <viraman@nvidia.com>
…idia.com
Signed-off-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>
@ryantwolf ryantwolf self-requested a review February 24, 2025 18:29
@ryantwolf (Collaborator) left a comment:
You should update the README for this too. It should answer the following questions:

  1. What is hard negative mining?
  2. When should I use hard negative mining?
  3. How does this version of hard negative mining work?
  4. How can I use these scripts?

@@ -124,7 +124,7 @@ def __call__(self, embeddings_dataset: DocumentDataset):
)

with performance_report_if_with_ts_suffix(self.profile_dir, "clustering-model"):
embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]
# embeddings_df = embeddings_df[[self.id_col, self.embedding_col]]
Why was this change made?

@@ -3,7 +3,7 @@ base_url: https://integrate.api.nvidia.com/v1
api_key: "your api key here"

# LLM Generator Module
generator_model: mistralai/mixtral-8x22b-instruct-v0.1
generator_model: nvdev/mistralai/mixtral-8x22b-instruct-v0.1
Revert

@@ -57,7 +57,7 @@ generator_user_prompt_template: |


#Easiness filter (embedding model as judge)
easiness_filter: nvidia/nv-embedqa-e5-v5
easiness_filter: nvdev/nvidia/nv-embedqa-e5-v5
Revert

# HARD-NEGATIVE MINING parameters
# base_url: "https://integrate.api.nvidia.com/v1"
model_name: "sentence-transformers/all-MiniLM-L6-v2"
# "intfloat/e5-large-unsupervised"
Remove commented out config.

# SEMANTIC CLUSTERING parameters
min_cluster_size: 50
max_number_clusters: 200
cluster_output_dir: "/home/viraman/qa_bug_fix/curator_outputs/clusters"
Change these directories to the defaults in your config.py.

df = df.rename(columns={"documents": "pos_doc"})
df = df[["question", "pos_doc", "neg_doc"]]

return DocumentDataset(dd.from_pandas(df))
Why are you doing dd.from_pandas here? I don't see where df becomes a pandas dataframe. df = df.to_backend("pandas") makes it such that df is still a Dask dataframe with a pandas backend, but it's still a distributed dataframe.

df["min_pos_score"] = ""

with ProgressBar(dt=1):
df = df.map_partitions(self._process_partition, meta=df).compute()
Do not call compute. It's up to the user when to call compute outside of the modules.

# nvdev/nvidia/nv-embedqa-e5-v5"
# "nvdev/nvidia/llama-3.2-nv-embedqa-1b-v1"
model_type: "hf"
query_prefix: "query:"
Is there a reason why the defaults in the dataclass don't match the defaults in the yaml?

return SentenceTransformer(model_name_or_path, trust_remote_code=True)


class HardNegativeMiner:
This seems like a general enough module that we could move it into nemo_curator/ and have this tutorial just show how to use it, what do you think? If we did, you'd also need to inherit from nemo_curator.modules.base.BaseModule.

lambda x: self._get_nim_embedding(x, "query")
)
elif self.model_type == "hf":
self.hf_model = load_object_on_worker(
@VibhuJawa should this be using crossfit?

Yes, it can be; crossfit has support for these kinds of embeddings. This will need some rework to make that happen. I am not sure about the time constraints at play here.

def _groupby_question(self, pdf):
pdf2 = pdf.groupby("question").agg({"documents": set})
pdf2["documents"] = pdf2["documents"].map(lambda x: list(x))
del pdf
Why call this del?

df = dataset.df
df = df.to_backend("pandas")
df = df[["question", "documents"]]
df = df.map_partitions(self._groupby_question).reset_index()
Are you confident that all the questions will be in the same partition, or should you do a shuffle first?

"hard negative mining algorithm not mentioned in config, using default"
)
self.hard_neg_mining_algorithm = "topk_percpos"
if self.hard_neg_mining_algorithm == "topk_percpos":
Can we abstract the algorithms and their arguments into a separate class? Follow a similar pattern to the common crawl extraction algorithms. We should have one base HardNegativeMiningAlgorithm class that defines a single method with a common signature across all subclasses. Instances of those subclasses can be passed in as arguments to this class.

If you feel like too much of the state of the HardNegativeMiner class would need to be passed to this algorithm, then we could instead consider making HardNegativeMiner an ABC that these specific methods inherit from. Let me know what you think.
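A rough sketch of the first option (all class names, column names, and the scoring scheme below are hypothetical, not the actual NeMo Curator API):

```python
from abc import ABC, abstractmethod
import pandas as pd

class HardNegativeMiningAlgorithm(ABC):
    """One common signature shared by every mining strategy."""

    @abstractmethod
    def mine(self, scores: pd.DataFrame) -> pd.DataFrame:
        """Given per-candidate scores for one question, return hard negatives."""

class TopKPercPos(HardNegativeMiningAlgorithm):
    """Keep the top-k candidates scoring below a fraction of the positive score."""

    def __init__(self, k: int = 2, perc_pos: float = 0.95):
        self.k = k
        self.perc_pos = perc_pos

    def mine(self, scores: pd.DataFrame) -> pd.DataFrame:
        # A candidate is a hard negative if it scores high, but not so high
        # that it is likely a false negative (above perc_pos * pos_score).
        threshold = scores["pos_score"] * self.perc_pos
        candidates = scores[scores["neg_score"] < threshold]
        return candidates.nlargest(self.k, "neg_score")

class HardNegativeMiner:
    """The miner is configured with an algorithm instance, not a string."""

    def __init__(self, algorithm: HardNegativeMiningAlgorithm):
        self.algorithm = algorithm
```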

@VibhuJawa (Collaborator) left a comment:
Took an initial look; will take a deeper look when I have more time.

self, dataset: DocumentDataset
) -> DocumentDataset:
df = dataset.df
n_data = df.compute().shape[0] # number of row items
This will make sure we don't do double computation here and only pull in the shape:

Suggested change
n_data = df.compute().shape[0] # number of row items
df = df.persist()
n_data = df.shape[0].compute()


def _process_partition(self, df_p: pd.DataFrame):

if self.model_type == "nvidia":
Can we change this model_type name to nim or open_ai_client? model_type="nvidia" is confusing.

Suggested change
if self.model_type == "nvidia":
if self.model_type == "nim":


vinay-raman and others added 2 commits February 26, 2025 06:00
Signed-off-by: vinay-raman <98057837+vinay-raman@users.noreply.github.com>
…ed after sync with latest main, Signed-off by viraman@nvidia.com

Signed-off-by: viraman <viraman@nvidia.com>