
Add GPL adaptation tutorial #2632

Merged: 3 commits, Jun 26, 2022
Conversation

@vblagoje (Member) commented Jun 3, 2022

Proposed changes:

  • Adds GPL tutorial

cc @TuanaCelik

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@vblagoje vblagoje requested a review from TuanaCelik June 3, 2022 14:31
@vblagoje vblagoje added the type:documentation Improvements on the docs label Jun 3, 2022
@TuanaCelik (Contributor) left a comment
Hey @vblagoje - This looks good. A few things I would do:

  • Some hand holding maybe. Some explanation on what Generative Pseudo Labeling is and then short descriptions of what a cell block is about to do just above it (you already have this for some of them)
  • I think we may have to add this to the headers here but I will check this.
  • I would maybe include the full name 'Generative Pseudo Labeling' in the title(s) too. If I understand the tutorial correctly something like this might fit wdyt?: "Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval" or "Domain Adaptation of Dense Retrieval with GPL" (fair if you think it's too long)

@vblagoje (Member, Author) commented Jun 3, 2022

All good points @TuanaCelik. Will make the recommended changes.

# Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval
#### Note: Adapted to Haystack from Nils Reimers' original [notebook](https://colab.research.google.com/gist/jamescalam/d2c888775c87f9882bb7c379a96adbc8/gpl-domain-adaptation.ipynb#scrollTo=183ff7ab)

NLP models we use every day were trained on a large corpus of data, reflecting the World from the past. What if some significant world-changing events occur in the meantime, and we want our models to know about them? Like the COVID pandemic, for example? It doesn't make sense to retrain the models from scratch as that would be wasteful. What if we could update the models with new essential data instead of retraining the models from scratch? **Generative Pseudo Labeling (GPL)** to the rescue.
Contributor:
How about:
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.

vblagoje (Author): Perfect

In the below example, we demonstrate this for a simple query: "How is COVID-19 transmitted?".
Contributor:
The example below shows you how to use GPL to fine-tune a model so that it can answer the query: "How is COVID-19 transmitted?".



As model, we use TAS-B: A DistilBERT model that achieves state-of-the-art performance on MS MARCO (500k queries from Bing Search Engine). Both DistilBERT and MS MARCO were created with data from 2018 and before, hence, it lacks the knowledge of any COVID-related information.
Contributor:
We're using TAS-B: a DistilBERT model...
Both DistilBERT and MS MARCO were trained on data from 2018 and before, so they don't have any COVID-related information.



In your example we use a small collection of just 4 documents. If you search with this model, you get the following results (dot-score & document):
Contributor:
For this example, we're using just four documents. When you ask the model "How is COVID-19 transmitted?", here are the answers that you get (dot-score and document):

- 91.54 Polio is transmitted via contaminated water or food


As we see, the correct document is just ranked on 3rd place behind how Ebola and HIV are transmitted.
Contributor:
You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.
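As a runnable illustration of the problem, here is a toy dot-product ranking. The three-dimensional vectors are invented for this sketch; real TAS-B embeddings are 768-dimensional, and these scores are not the tutorial's actual numbers.

```python
# Toy dot-product ranking with invented embeddings (illustration only).
docs = {
    "Ebola is transmitted via direct contact with body fluids": [0.9, 0.3, 0.0],
    "HIV is transmitted via sexual contact": [0.8, 0.4, 0.0],
    "COVID-19 is transmitted via respiratory droplets": [0.7, 0.3, 0.2],
    "Polio is transmitted via contaminated water or food": [0.6, 0.2, 0.1],
}
query = [1.0, 0.2, 0.1]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Rank documents by descending dot-score against the query vector.
ranked = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
for rank, doc in enumerate(ranked, start=1):
    print(rank, round(dot(query, docs[doc]), 2), doc)
```

With these made-up vectors the COVID document lands third, behind Ebola and HIV, mirroring the behaviour described above.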


"""# Use PseudoLabelGenerator to generate Retriever adaptation training data

PseudoLabelGenerator#run will execute all three steps of the GPL [algorithm](https://github.com/UKPLab/gpl#how-does-gpl-work):
Contributor:
I think # before "run" is not needed here.

vblagoje (Author):
Ok, sure.

1. Question generation
2. Negative mining
3. Pseudo labeling (margin scoring)

The output of the `PseudoLabelGenerator` is the training data we'll use to adapt our `EmbeddingRetriever`
Contributor:
missing full stop
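The pseudo-labeling step (margin scoring) can be sketched in isolation. The scores below are hypothetical cross-encoder outputs invented for illustration; in the tutorial, `PseudoLabelGenerator` computes all of this internally.

```python
def margin_label(positive_score: float, negative_score: float) -> float:
    """GPL's training target: the cross-encoder score margin between the
    (query, positive passage) pair and the (query, mined negative) pair."""
    return positive_score - negative_score

# Hypothetical cross-encoder scores for two generated queries:
# (query, score for positive passage, score for mined negative).
examples = [
    ("how is covid-19 transmitted", 9.2, 1.4),
    ("what are coronavirus symptoms", 7.8, 3.1),
]
labels = [margin_label(pos, neg) for _, pos, neg in examples]
print(labels)
```

The retriever is then trained to reproduce these margins, which is what makes GPL robust to noisy generated queries.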


"""# Update the Retriever

Now that we have the generated training data produced by `PseudoLabelGenerator` we'll update the `EmbeddingRetriever`. Let's take a peek at the training data
Contributor:
..PseudoLabelGenerator, (comma)
... data. (full stop)


retriever.train(output["gpl_labels"])

"""## Verify EmbeddingRetriever has been adapted; save it for future use
Contributor:
Verify that EmbeddingRetriever is adapted and save it for future use



We'll repeat our query and verify the Retriever is now more aware of the completely new concept such as COVID (i.e. ranks #1 in our query).
Contributor:
Let's repeat our query to see if the Retriever learned about COVID and can now rank it as #1 among the answers.
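A self-contained sketch of that check, again with invented vectors rather than real adapted embeddings, and a plain JSON dump standing in for saving the retriever model:

```python
import json
import os
import tempfile

# Invented post-adaptation embeddings: the COVID document should now win.
adapted_docs = {
    "Ebola is transmitted via direct contact with body fluids": [0.6, 0.3],
    "HIV is transmitted via sexual contact": [0.5, 0.4],
    "COVID-19 is transmitted via respiratory droplets": [0.9, 0.2],
}
query_vec = [1.0, 0.1]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The top-scoring document for the repeated query.
top = max(adapted_docs, key=lambda d: dot(query_vec, adapted_docs[d]))
print(top)

# Persist the adapted vectors; this stands in for saving the retriever.
path = os.path.join(tempfile.mkdtemp(), "adapted_embeddings.json")
with open(path, "w") as f:
    json.dump(adapted_docs, f)
```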

@vblagoje (Author):
@agnieszka-m would you please confirm your recommended changes are included? TY

@vblagoje (Author):
Seems like it is good to go now @agnieszka-m

The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Contributor:
Suggested change
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.

## Efficient Domain Adaptation with GPL
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
Contributor:
Suggested change
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains and data.

We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
Contributor:
Suggested change
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the COVID knowledge.

document_store.update_embeddings(retriever)
```

## Optionally download pre-generated questions or even generate them outside of Haystack
Contributor:
(Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack



The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
Contributor:
Suggested change
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub.
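For orientation, here is one plausible shape for such question/document pairs, sketched with an in-memory JSONL stand-in. The records are invented for illustration, not rows from the actual hub dataset.

```python
import io
import json

# Made-up JSONL records standing in for the downloaded question/document pairs.
raw = io.StringIO(
    '{"question": "how does covid-19 spread", "document": "COVID-19 spreads mainly via respiratory droplets."}\n'
    '{"question": "what is sars-cov-2", "document": "SARS-CoV-2 is the coronavirus that causes COVID-19."}\n'
)
pairs = [json.loads(line) for line in raw]
questions = [p["question"] for p in pairs]
print(questions)
```

Note how the generated questions largely reuse words from their source passages, which is why a pre-COVID question-generation model can still produce sensible queries here.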

Contributor:
Suggested change
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the new COVID knowledge.

show_examples(org_model)

"""# Get Some Data on COVID-19
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
Contributor:
Suggested change
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As a dataset, we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
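A small sketch of preparing such records for indexing. The two records are invented examples, not rows from TREC-COVID, and the content/meta dict layout is an assumption about the document shape used downstream, not the library's exact API.

```python
# Invented (title, abstract) records; TREC-COVID provides ~10k real ones.
records = [
    {"title": "Transmission routes of SARS-CoV-2",
     "abstract": "We review evidence for droplet and airborne transmission."},
    {"title": "Clinical features of COVID-19",
     "abstract": "Fever and cough are the most common presenting symptoms."},
]

# Concatenate title + abstract into the searchable text of each document.
documents = [
    {"content": f'{r["title"]} {r["abstract"]}', "meta": {"title": r["title"]}}
    for r in records
]
print(len(documents))
```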

)
document_store.update_embeddings(retriever)

"""## Optionally download pre-generated questions or even generate them outside of Haystack
Contributor:
Suggested change
"""## Optionally download pre-generated questions or even generate them outside of Haystack
"""## (Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack




retriever.train(output["gpl_labels"])

"""## Verify that EmbeddingRetriever is adapted and save it for future use
Contributor:
Suggested change
"""## Verify that EmbeddingRetriever is adapted and save it for future use
"""## Verify That EmbeddingRetriever Is Adapted and Save It for Future Use

@agnieszka-m (Contributor):
Just minor updates and it's good to go

@vblagoje vblagoje merged commit b08c5f8 into deepset-ai:master Jun 26, 2022
@julian-risch (Member):
Hi @agnieszka-m @vblagoje the tutorial doesn't show up on our docs website. It still needs to be added here: https://github.com/deepset-ai/haystack-website/blob/ab49b74cdee82153f56348372e1d3ec2d294d32a/docs/latest/menu.json#L144

@vblagoje (Author):
@agnieszka-m I'll make the PR

Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* Add GPL adaptation tutorial

* Latest round of Aga's corrections

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@vblagoje vblagoje deleted the gpl_tutorial branch February 28, 2023 12:08