Add GPL adaptation tutorial #2632
Conversation
Hey @vblagoje - This looks good. A few things I would do:
- Some hand holding maybe. Some explanation on what Generative Pseudo Labeling is and then short descriptions of what a cell block is about to do just above it (you already have this for some of them)
- I think we may have to add this to the headers here but I will check this.
- I would maybe include the full name 'Generative Pseudo Labeling' in the title(s) too. If I understand the tutorial correctly something like this might fit wdyt?: "Generative Pseudo Labeling for Domain Adaptation of Dense Retrieval" or "Domain Adaptation of Dense Retrieval with GPL" (fair if you think it's too long)
All good points @TuanaCelik. Will make the recommended changes.
tutorials/Tutorial17_GPL.py
Outdated
# Generative Pseudo Labeling for Domain Adaptation of Dense Retrievals
#### Note: Adapted to Haystack from Nils Reimers' original [notebook](https://colab.research.google.com/gist/jamescalam/d2c888775c87f9882bb7c379a96adbc8/gpl-domain-adaptation.ipynb#scrollTo=183ff7ab)

NLP models we use every day were trained on a large corpus of data, reflecting the World from the past. What if some significant world-changing events occur in the meantime, and we want our models to know about them? Like the COVID pandemic, for example? It doesn't make sense to retrain the models from scratch as that would be wasteful. What if we could update the models with new essential data instead of retraining the models from scratch? **Generative Pseudo Labeling (GPL)** to the rescue.
How about:
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
Perfect
tutorials/Tutorial17_GPL.py
Outdated
NLP models we use every day were trained on a large corpus of data, reflecting the World from the past. What if some significant world-changing events occur in the meantime, and we want our models to know about them? Like the COVID pandemic, for example? It doesn't make sense to retrain the models from scratch as that would be wasteful. What if we could update the models with new essential data instead of retraining the models from scratch? **Generative Pseudo Labeling (GPL)** to the rescue.

In the below example, we demonstrate this for a simple query: "How is COVID-19 transmitted?".
The example below shows you how to use GPL to fine-tune a model so that it can answer the query: "How is COVID-19 transmitted?".
tutorials/Tutorial17_GPL.py
Outdated
In the below example, we demonstrate this for a simple query: "How is COVID-19 transmitted?".

As model, we use TAS-B: A DistilBERT model that achieves state-of-the-art performance on MS MARCO (500k queries from Bing Search Engine). Both DistilBERT and MS MARCO were created with data from 2018 and before, hence, it lacks the knowledge of any COVID-related information.
We're using TAS-B: a DistilBERT model...
Both DistilBERT and MS MARCO were trained on data from 2018 and before, so they don't have any COVID-related information.
tutorials/Tutorial17_GPL.py
Outdated
As model, we use TAS-B: A DistilBERT model that achieves state-of-the-art performance on MS MARCO (500k queries from Bing Search Engine). Both DistilBERT and MS MARCO were created with data from 2018 and before, hence, it lacks the knowledge of any COVID-related information.

In your example we use a small collection of just 4 documents. If you search with this model, you get the following results (dot-score & document):
For this example, we're using just four documents. When you ask the model "How is COVID-19 transmitted?", here are the answers that you get (dot-score and document):
tutorials/Tutorial17_GPL.py
Outdated
- 91.54 Polio is transmitted via contaminated water or food

As we see, the correct document is just ranked on 3rd place behind how Ebola and HIV are transmitted.
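The dot-score ranking being discussed can be sketched with plain vectors. The 3-d "embeddings" below are made up purely for illustration; in the tutorial the scores come from 768-dimensional TAS-B embeddings:

```python
def dot(a, b):
    """Dot-product similarity, the scoring function TAS-B retrieval uses."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-d embeddings (illustrative only; real ones come from the retriever).
query = [1.0, 0.2, 0.0]
docs = {
    "Ebola is transmitted via direct contact with body fluids": [0.9, 0.4, 0.1],
    "HIV is transmitted via sex or sharing needles": [0.85, 0.5, 0.0],
    "Corona is transmitted via the air": [0.8, 0.1, 0.3],
}

# Rank documents by dot score, highest first.
ranked = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
# With these toy vectors, the COVID document lands in last place,
# mirroring the ranking problem the tutorial describes.
```

With the made-up vectors above, the COVID answer is out-scored by the Ebola and HIV passages, which is exactly the failure mode GPL fine-tuning is meant to fix.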
You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.
tutorials/Tutorial17_GPL.py
Outdated
"""# Use PseudoLabelGenerator to generate Retriever adaptation training data

PseudoLabelGenerator#run will execute all three steps of the GPL [algorithm](https://github.com/UKPLab/gpl#how-does-gpl-work):
I think # before "run" is not needed here.
Ok, sure.
tutorials/Tutorial17_GPL.py
Outdated
2. Negative mining | ||
3. Pseudo labeling (margin scoring) | ||
|
||
The output of the `PseudoLabelGenerator` is the training data we'll use to adapt our `EmbeddingRetriever` |
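The negative-mining and margin-scoring steps can be sketched in plain Python. The word-overlap scorer below is a toy stand-in for the cross-encoder a real GPL run uses; all names here are illustrative, not the PseudoLabelGenerator API:

```python
def overlap_score(query, doc):
    """Toy relevance score: count of shared lowercase words.
    (A real GPL run scores query/passage pairs with a cross-encoder.)"""
    return len(set(query.lower().split()) & set(doc.lower().split()))

query = "how is covid transmitted"          # a generated query (step 1)
positive = "covid is transmitted via the air"
corpus = [
    "ebola is transmitted via direct contact",
    "polio is transmitted via contaminated water",
]

# Step 2, negative mining: pick the highest-scoring passage that is not
# the positive one as a "hard" negative.
negative = max(corpus, key=lambda d: overlap_score(query, d))

# Step 3, pseudo labeling: the training label is the score margin
# between the positive and the mined negative.
margin = overlap_score(query, positive) - overlap_score(query, negative)
```

The `(query, positive, negative, margin)` tuples produced this way are the kind of training signal the retriever is then adapted on, with the toy scorer replaced by a real cross-encoder.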
missing full stop
tutorials/Tutorial17_GPL.py
Outdated
|
||
"""# Update the Retriever | ||
|
||
Now that we have the generated training data produced by `PseudoLabelGenerator` we'll update the `EmbeddingRetriever`. Let's take a peek at the training data |
..PseudoLabelGenerator
, (comma)
... data. (full stop)
tutorials/Tutorial17_GPL.py
Outdated
|
||
retriever.train(output["gpl_labels"]) | ||
|
||
"""## Verify EmbeddingRetriever has been adapted; save it for future use |
Verify that EmbeddingRetriever is adapted and save it for future use
tutorials/Tutorial17_GPL.py
Outdated
|
||
"""## Verify EmbeddingRetriever has been adapted; save it for future use | ||
|
||
We'll repeat our query and verify the Retriever is now more aware of the completely new concept such as COVID (i.e. ranks #1 in our query). |
Let's repeat our query to see if the Retriever learned about COVID and can now rank it as #1 among the answers.
@agnieszka-m would you please confirm your recommended changes are included? TY
Seems like it is good to go now @agnieszka-m
docs/_src/tutorials/tutorials/18.md
Outdated
# Generative Pseudo Labeling for Domain Adaptation of Dense Retrievals
#### Note: Adapted to Haystack from Nils Reimers' original [notebook](https://colab.research.google.com/gist/jamescalam/d2c888775c87f9882bb7c379a96adbc8/gpl-domain-adaptation.ipynb#scrollTo=183ff7ab)

The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing evens, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue.
The NLP models we use every day were trained on a corpus of data that reflects the world from the past. In the meantime, we've experienced world-changing events, like the COVID pandemics, and we'd like our models to know about them. Training a model from scratch is tedious work but what if we could just update the models with new data? Generative Pseudo Labeling comes to the rescue. |
docs/_src/tutorials/tutorials/18.md
Outdated
You can see that the correct document is only third, outranked by Ebola and HIV information. Let's see how we can make this better.

## Efficient Domain Adaptation with GPL
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains and data. |
docs/_src/tutorials/tutorials/18.md
Outdated
## Efficient Domain Adaptation with GPL
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.

We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the COVID knowledge.
docs/_src/tutorials/tutorials/18.md
Outdated
document_store.update_embeddings(retriever)
```

## Optionally download pre-generated questions or even generate them outside of Haystack
(Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack
docs/_src/tutorials/tutorials/18.md
Outdated
## Optionally download pre-generated questions or even generate them outside of Haystack

The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub. |
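The point that a pre-COVID model "can still produce sensible queries by copying words from the input text" can be illustrated with a crude heuristic. The tutorial uses a trained query-generation model for this step; the stand-in below just copies content words and is purely illustrative:

```python
import re

def toy_query(passage, n_words=5):
    """Toy stand-in for learned query generation: copy the first few
    content words from the passage. (The real pipeline uses a trained
    seq2seq query generator, not this heuristic.)"""
    stop = {"the", "a", "an", "of", "and", "is", "are", "in", "on", "to", "via"}
    words = [w for w in re.findall(r"[a-z0-9-]+", passage.lower()) if w not in stop]
    return " ".join(words[:n_words])

q = toy_query("COVID-19 is transmitted mainly via droplets in the air")
```

Even this crude copy-words approach yields a query containing "covid-19" and "transmitted", which is why question generation works on unseen domains: the generated queries inherit the domain vocabulary from the passages themselves.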
tutorials/Tutorial18_GPL.py
Outdated
## Efficient Domain Adaptation with GPL
This notebook demonstrates [Generative Pseudo Labeling (GPL)](https://arxiv.org/abs/2112.07577), an efficient approach to adapt existing dense retrieval models to new domains & data.

We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection 10k scientific papers on COVID-19 and then fine-tune within 15-60 minutes (depending on your GPU) to include the new COVID knowledge into our model.
We get a collection of 10k scientific papers on COVID-19 and then fine-tune the model within 15-60 minutes (depending on your GPU) so that it includes the new COVID knowledge. |
tutorials/Tutorial18_GPL.py
Outdated
show_examples(org_model)

"""# Get Some Data on COVID-19
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As dataset we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid).
We select 10k scientific publications (title + abstract) that are connected to COVID-19. As a dataset, we use [TREC-COVID-19](https://huggingface.co/datasets/nreimers/trec-covid). |
tutorials/Tutorial18_GPL.py
Outdated
)
document_store.update_embeddings(retriever)

"""## Optionally download pre-generated questions or even generate them outside of Haystack
"""## Optionally download pre-generated questions or even generate them outside of Haystack
"""## (Optional) Download Pre-Generated Questions or Generate Them Outside of Haystack |
tutorials/Tutorial18_GPL.py
Outdated
"""## Optionally download pre-generated questions or even generate them outside of Haystack

The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on GPU used), we'll download question/document pairs directly from the HuggingFace hub.
The first step of the GPL algorithm requires us to generate questions for a given text passage. Even though our pre-COVID trained model hasn't seen any COVID-related content, it can still produce sensible queries by copying words from the input text. As generating questions from 10k documents is a bit slow (depending on the GPU used), we'll download question/document pairs directly from the Hugging Face hub. |
tutorials/Tutorial18_GPL.py
Outdated
retriever.train(output["gpl_labels"])

"""## Verify that EmbeddingRetriever is adapted and save it for future use
"""## Verify that EmbeddingRetriever is adapted and save it for future use
"""## Verify That EmbeddingRetriever Is Adapted and Save It for Future Use |
Just minor updates and it's good to go.
Hi @agnieszka-m @vblagoje the tutorial doesn't show up on our docs website. It still needs to be added here: https://github.com/deepset-ai/haystack-website/blob/ab49b74cdee82153f56348372e1d3ec2d294d32a/docs/latest/menu.json#L144
@agnieszka-m I'll make the PR
* Add GPL adaptation tutorial
* Latest round of Aga's corrections
* Update Documentation & Code Style

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Proposed changes:
cc @TuanaCelik