
Add GPL adaptation tutorial #2632

Merged: 3 commits, Jun 26, 2022
Conversation

@vblagoje (Member) commented Jun 3, 2022

Proposed changes:

  • Adds GPL tutorial

cc @TuanaCelik

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@vblagoje vblagoje requested a review from TuanaCelik June 3, 2022 14:31
@vblagoje vblagoje added the type:documentation Improvements on the docs label Jun 3, 2022
@TuanaCelik (Contributor) left a comment
Hey @vblagoje - This looks good. A few things I would do:

  • Some hand holding maybe. Some explanation on what Generative Pseudo Labeling is and then short descriptions of what a cell block is about to do just above it (you already have this for some of them)
  • I think we may have to add this to the headers here but I will check this.
  • I would maybe include the full name 'Generative Pseudo Labeling' in the title(s) too. If I understand the tutorial correctly something like this might fit wdyt?: "Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval" or "Domain Adaptation of Dense Retrieval with GPL" (fair if you think it's too long)

@vblagoje (Member, Author) commented Jun 3, 2022

All good points @TuanaCelik. Will make the recommended changes.

# Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval
#### Note: Adapted to Haystack from Nils Reimers' original [notebook](https://colab.research.google.com/gist/jamescalam/d2c888775c87f9882bb7c379a96adbc8/gpl-domain-adaptation.ipynb#scrollTo=183ff7ab)

NLP models we use every day were trained on a large corpus of data, reflecting the World from the past. What if some significant world-changing events occur in the meantime, and we want our models to know about them? Like the COVID pandemic, for example? It doesn't make sense to retrain the models from scratch as that would be wasteful. What if we could update the models with new essential data instead of retraining the models from scratch? **Generative Pseudo Labeling (GPL)** to the rescue.
Contributor:
How about:
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.

vblagoje (Author): Perfect

In the below example, we demonstrate this for a simple query: "How is COVID-19 transmitted?".
Contributor:
The example below shows you how to use GPL to fine-tune a model so that it can answer the query: "How is COVID-19 transmitted?".



As model, we use TAS-B: A DistilBERT model that achieves state-of-the-art performance on MS MARCO (500k queries from Bing Search Engine). Both DistilBERT and MS MARCO were created with data from 2018 and before, hence, it lacks the knowledge of any COVID-related information.
Contributor:
We're using TAS-B: a DistilBERT model...
Both DistilBERT and MS MARCO were trained on data from 2018 and before, so they don't have any COVID-related information.



In your example we use a small collection of just 4 documents. If you search with this model, you get the following results (dot-score & document):
Contributor:
For this example, we're using just four documents. When you ask the model "How is COVID-19 transmitted?", here are the answers that you get (dot-score and document):

- 91.54 Polio is transmitted via contaminated water or food


As we see, the correct document is just ranked on 3rd place behind how Ebola and HIV are transmitted.
Contributor:
You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.
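As a runnable illustration of the problem, here is a toy dot-product ranking. The three-dimensional vectors are invented for this sketch; real TAS-B embeddings are 768-dimensional, and these scores are not the tutorial's actual numbers.

```python
# Toy dot-product ranking with invented embeddings (illustration only).
docs = {
    "Ebola is transmitted via direct contact with body fluids": [0.9, 0.3, 0.0],
    "HIV is transmitted via sexual contact": [0.8, 0.4, 0.0],
    "COVID-19 is transmitted via respiratory droplets": [0.7, 0.3, 0.2],
    "Polio is transmitted via contaminated water or food": [0.6, 0.2, 0.1],
}
query = [1.0, 0.2, 0.1]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Rank documents by descending dot-score against the query vector.
ranked = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
for rank, doc in enumerate(ranked, start=1):
    print(rank, round(dot(query, docs[doc]), 2), doc)
```

With these made-up vectors the COVID document lands third, behind Ebola and HIV, mirroring the behaviour described above.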


"""# Use PseudoLabelGenerator to generate Retriever adaptation training data

PseudoLabelGenerator#run will execute all three steps of the GPL [algorithm](https://github.com/UKPLab/gpl#how-does-gpl-work):
Contributor:
I think # before "run" is not needed here.

vblagoje (Author):
Ok, sure.

1. Question generation
2. Negative mining
3. Pseudo labeling (margin scoring)

The output of the `PseudoLabelGenerator` is the training data we'll use to adapt our `EmbeddingRetriever`
Contributor:
missing full stop
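The pseudo-labeling step (margin scoring) can be sketched in isolation. The scores below are hypothetical cross-encoder outputs invented for illustration; in the tutorial, `PseudoLabelGenerator` computes all of this internally.

```python
def margin_label(positive_score: float, negative_score: float) -> float:
    """GPL's training target: the cross-encoder score margin between the
    (query, positive passage) pair and the (query, mined negative) pair."""
    return positive_score - negative_score

# Hypothetical cross-encoder scores for two generated queries:
# (query, score for positive passage, score for mined negative).
examples = [
    ("how is covid-19 transmitted", 9.2, 1.4),
    ("what are coronavirus symptoms", 7.8, 3.1),
]
labels = [margin_label(pos, neg) for _, pos, neg in examples]
print(labels)
```

The retriever is then trained to reproduce these margins, which is what makes GPL robust to noisy generated queries.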


"""# Update the Retriever

Now that we have the generated training data produced by `PseudoLabelGenerator` we'll update the `EmbeddingRetriever`. Let's take a peek at the training data
Contributor:
..PseudoLabelGenerator, (comma)
... data. (full stop)


retriever.train(output["gpl_labels"])

"""## Verify EmbeddingRetriever has been adapted; save it for future use
Contributor:
Verify that EmbeddingRetriever is adapted and save it for future use



We'll repeat our query and verify the Retriever is now more aware of the completely new concept such as COVID (i.e. ranks #1 in our query).
Contributor:
Let's repeat our query to see if the Retriever learned about COVID and can now rank it as #1 among the answers.
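A self-contained sketch of that check, again with invented vectors rather than real adapted embeddings, and a plain JSON dump standing in for saving the retriever model:

```python
import json
import os
import tempfile

# Invented post-adaptation embeddings: the COVID document should now win.
adapted_docs = {
    "Ebola is transmitted via direct contact with body fluids": [0.6, 0.3],
    "HIV is transmitted via sexual contact": [0.5, 0.4],
    "COVID-19 is transmitted via respiratory droplets": [0.9, 0.2],
}
query_vec = [1.0, 0.1]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The top-scoring document for the repeated query.
top = max(adapted_docs, key=lambda d: dot(query_vec, adapted_docs[d]))
print(top)

# Persist the adapted vectors; this stands in for saving the retriever.
path = os.path.join(tempfile.mkdtemp(), "adapted_embeddings.json")
with open(path, "w") as f:
    json.dump(adapted_docs, f)
```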

@vblagoje (Author):
@agnieszka-m would you please confirm your recommended changes are included? TY

@vblagoje (Author):
Seems like it is good to go now @agnieszka-m

The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Contributor:
Suggested change
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.

## Efficient Domain Adaptation with GPL
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
Contributor:
Suggested change
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains and data.

We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
Contributor:
Suggested change
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the COVID knowledge.

document_store.update_embeddings(retriever)
```

## Optionally download pre-generated questions or even generate them outside of Haystack
Contributor:
(Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack



The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
Contributor:
Suggested change
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub.
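For orientation, here is one plausible shape for such question/document pairs, sketched with an in-memory JSONL stand-in. The records are invented for illustration, not rows from the actual hub dataset.

```python
import io
import json

# Made-up JSONL records standing in for the downloaded question/document pairs.
raw = io.StringIO(
    '{"question": "how does covid-19 spread", "document": "COVID-19 spreads mainly via respiratory droplets."}\n'
    '{"question": "what is sars-cov-2", "document": "SARS-CoV-2 is the coronavirus that causes COVID-19."}\n'
)
pairs = [json.loads(line) for line in raw]
questions = [p["question"] for p in pairs]
print(questions)
```

Note how the generated questions largely reuse words from their source passages, which is why a pre-COVID question-generation model can still produce sensible queries here.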

Contributor:
Suggested change
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the new COVID knowledge.

show_examples(org_model)

"""# Get Some Data on COVID-19
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
Contributor:
Suggested change
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As a dataset, we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
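A small sketch of preparing such records for indexing. The two records are invented examples, not rows from TREC-COVID, and the content/meta dict layout is an assumption about the document shape used downstream, not the library's exact API.

```python
# Invented (title, abstract) records; TREC-COVID provides ~10k real ones.
records = [
    {"title": "Transmission routes of SARS-CoV-2",
     "abstract": "We review evidence for droplet and airborne transmission."},
    {"title": "Clinical features of COVID-19",
     "abstract": "Fever and cough are the most common presenting symptoms."},
]

# Concatenate title + abstract into the searchable text of each document.
documents = [
    {"content": f'{r["title"]} {r["abstract"]}', "meta": {"title": r["title"]}}
    for r in records
]
print(len(documents))
```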

)
document_store.update_embeddings(retriever)

"""## Optionally download pre-generated questions or even generate them outside of Haystack
Contributor:
Suggested change
"""## Optionally download pre-generated questions or even generate them outside of Haystack
"""## (Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack




retriever.train(output["gpl_labels"])

"""## Verify that EmbeddingRetriever is adapted and save it for future use
Contributor:
Suggested change
"""## Verify that EmbeddingRetriever is adapted and save it for future use
"""## Verify That EmbeddingRetriever Is Adapted and Save It for Future Use

@agnieszka-m (Contributor):
Just minor updates and it's good to go

@vblagoje vblagoje merged commit b08c5f8 into deepset-ai:master Jun 26, 2022
@julian-risch (Member):
Hi @agnieszka-m @vblagoje the tutorial doesn't show up on our docs website. It still needs to be added here: https://github.com/deepset-ai/haystack-website/blob/ab49b74cdee82153f56348372e1d3ec2d294d32a/docs/latest/menu.json#L144

@vblagoje (Author):
@agnieszka-m I'll make the PR

Krak91 pushed a commit to Krak91/haystack that referenced this pull request Jul 26, 2022
* Add GPL adaptation tutorial

* Latest round of Aga's corrections

* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@vblagoje vblagoje deleted the gpl_tutorial branch February 28, 2023 12:08