diff --git a/docs/guides/weak-supervision.ipynb b/docs/guides/weak-supervision.ipynb index f227c68f4db..70411174449 100644 --- a/docs/guides/weak-supervision.ipynb +++ b/docs/guides/weak-supervision.ipynb @@ -33,9 +33,10 @@ "\n", "### Weak labeling using the UI\n", "\n", - "Since version 0.8.0 you can find and define rules directly in the UI. The [Define rules mode](../reference/webapp/define_rules.md) is found below the [Annotate mode](../reference/webapp/annotate_records.md) on the right sidebar.\n", - "\n", - "The video below shows how you can interactively find and save rules with the UI. For more a full example check the Weak supervision tutorial.\n", + "Since version 0.8.0 you can find and define rules directly in the UI. \n", + "The [Define rules mode](../reference/webapp/define_rules.md) is found in the right side bar of the [Dataset page](../reference/webapp/dataset.md).\n", + "The video below shows how you can interactively find and save rules with the UI. \n", + "For a full example check the [Weak supervision tutorial](../tutorials/weak-supervision-with-rubrix.ipynb).\n", "\n", "\n", "\n", @@ -112,7 +113,7 @@ "4. Once you are satisfied with your weak labels, use the matrix of the `WeakLabels` instance with your library/method of choice to build a training set or even train a downstream text classification model.\n", "\n", "\n", - "This guide shows you an end-to-end example using Snorkel and Flyingsquid. Let's get started!" + "This guide shows you an end-to-end example using Snorkel, Flyingsquid and Weasel. Let's get started!" ] }, { @@ -127,7 +128,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 2, "id": "6b1e00af-c6f9-42aa-81aa-2976c9591b36", "metadata": {}, "outputs": [ @@ -211,15 +212,22 @@ "" ], "text/plain": [ - " Unnamed: 0 author date text label video\n", - "0 0 Alessandro leite 2014-11-05T22:21:36 pls http://www10.vakinha.com.br/VaquinhaE.aspx... 
-1.0 1\n", - "1 1 Salim Tayara 2014-11-02T14:33:30 if your like drones, plz subscribe to Kamal Ta... -1.0 1\n", - "2 2 Phuc Ly 2014-01-20T15:27:47 go here to check the views :3 -1.0 1\n", - "3 3 DropShotSk8r 2014-01-19T04:27:18 Came here to check the views, goodbye. -1.0 1\n", - "4 4 css403 2014-11-07T14:25:48 i am 2,126,492,636 viewer :D -1.0 1" + " Unnamed: 0 author date \\\n", + "0 0 Alessandro leite 2014-11-05T22:21:36 \n", + "1 1 Salim Tayara 2014-11-02T14:33:30 \n", + "2 2 Phuc Ly 2014-01-20T15:27:47 \n", + "3 3 DropShotSk8r 2014-01-19T04:27:18 \n", + "4 4 css403 2014-11-07T14:25:48 \n", + "\n", + " text label video \n", + "0 pls http://www10.vakinha.com.br/VaquinhaE.aspx... -1.0 1 \n", + "1 if your like drones, plz subscribe to Kamal Ta... -1.0 1 \n", + "2 go here to check the views :3 -1.0 1 \n", + "3 Came here to check the views, goodbye. -1.0 1 \n", + "4 i am 2,126,492,636 viewer :D -1.0 1 " ] }, - "execution_count": 11, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } @@ -283,7 +291,7 @@ "id": "baafda4f-45c0-49d6-9c37-7473c6888ebe", "metadata": {}, "source": [ - "After this step, you have a fully browsable dataset available at `http://localhost:6900/weak_supervision_yt` (or the base URL where your Rubrix instance is hosted)." + "After this step, you have a fully browsable dataset available that you can access via the [Rubrix web app](../reference/webapp/index.md)." ] }, { @@ -293,7 +301,9 @@ "source": [ "## 2. Defining rules\n", "\n", - "Let's now define some of the rules proposed in the tutorial [Snorkel Intro Tutorial: Data Labeling](https://www.snorkel.org/use-cases/01-spam-tutorial). Most of these rules can be defined directly in the UI using [Elasticsearch's query string DSL](../reference/webapp/search_records.md). \n", + "Let's now define some of the rules proposed in the tutorial [Snorkel Intro Tutorial: Data Labeling](https://www.snorkel.org/use-cases/01-spam-tutorial). 
\n", + "Most of these rules can be defined directly with our web app in the [Define rules mode](../reference/webapp/define_rules.md) and [Elasticsearch's query strings](../reference/webapp/search_records.md). \n", + "Afterward, you can conveniently load them into your notebook with the [load_rules function](../reference/python/python_labeling.rst#rubrix.labeling.text_classification.rule.load_rules).\n", "\n", "Rules can also be defined programmatically as shown below. Depending on your use case and team structure you can mix and match both interfaces (UI or Python).\n", "\n", @@ -571,6 +581,231 @@ "Let's see some examples:" ] }, + { + "cell_type": "markdown", + "id": "1fb78718", + "metadata": {}, + "source": [ + "### A simple majority vote\n", + "\n", + "As a first example, we will show you how to use the `WeakLabels` object together with a simple majority vote model.\n", + "For this we will take the implementation by Snorkel, and evaluate it with the help of sklearn's metrics module." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3ba07e7", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install snorkel scikit-learn -qqq" + ] + }, + { + "cell_type": "markdown", + "id": "08eb2b66", + "metadata": {}, + "source": [ + "The majority vote model is arguably the most straightforward label model.\n", + "On a per-record basis, it simply counts the votes for each label returned by the rules, and takes the label with the most votes.\n", + "Snorkel provides a neat implementation of this logic in its `MajorityLabelVoter`."
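
The counting logic described above can be sketched in a few lines of NumPy. This is only an illustration and not Snorkel's implementation: the function name `majority_vote`, the toy matrix, and the convention that `-1` means "abstain" are assumptions made for this example.

```python
import numpy as np

ABSTAIN = -1  # assumed convention: a rule that does not fire returns -1


def majority_vote(L: np.ndarray, cardinality: int) -> np.ndarray:
    """Return the most voted label per row, or ABSTAIN on ties and empty rows."""
    predictions = np.full(L.shape[0], ABSTAIN, dtype=int)
    for i, row in enumerate(L):
        votes = row[row != ABSTAIN]  # drop rules that abstained on this record
        if votes.size == 0:
            continue  # no rule fired, keep the abstention
        counts = np.bincount(votes, minlength=cardinality)
        winners = np.flatnonzero(counts == counts.max())
        if winners.size == 1:  # a unique winner, otherwise it is a tie
            predictions[i] = winners[0]
    return predictions


# hypothetical weak label matrix: 4 records (rows) x 3 rules (columns)
toy_matrix = np.array(
    [
        [0, 0, 1],     # two votes for label 0, one for label 1
        [-1, 1, -1],   # a single vote for label 1
        [0, 1, -1],    # tie between label 0 and label 1
        [-1, -1, -1],  # no rule fired
    ]
)
toy_predictions = majority_vote(toy_matrix, cardinality=2)
print(toy_predictions)
```

In this sketch, the last two records end up as abstentions: one because of a tie, the other because no rule voted at all.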
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15888261", + "metadata": {}, + "outputs": [], + "source": [ + "from snorkel.labeling.model import MajorityLabelVoter\n", + "\n", + "# instantiate the majority vote label model\n", + "majority_model = MajorityLabelVoter()" + ] + }, + { + "cell_type": "markdown", + "id": "293d163f-ba08-424c-93de-ef0420531ca9", + "metadata": {}, + "source": [ + "Let's go on and evaluate this baseline.\n", + "To break ties when there is no majority vote, we choose the _\"random\"_ policy that randomly selects one of the tied labels. \n", + "In this way we avoid a bias towards label models that produce fewer but more certain weak labels, and make the comparison between the different label models fairer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d588f00-c8e1-4bec-8e99-bd78639257b5", + "metadata": {}, + "outputs": [], + "source": [ + "# compute accuracy\n", + "majority_model.score(\n", + " L=weak_labels.matrix(has_annotation=True),\n", + " Y=weak_labels.annotation(),\n", + " tie_break_policy=\"random\",\n", + ")\n", + "# {'accuracy': 0.844}" + ] + }, + { + "cell_type": "markdown", + "id": "a619247f-3d9d-44ae-bb9f-d07c9120c1dc", + "metadata": {}, + "source": [ + "As we will see further down, an accuracy of 0.844 is a very decent baseline.\n", + "If we simply ignored tiebreaks and abstentions (by setting the tiebreak policy to _\"abstain\"_), we would obtain an accuracy of nearly 0.96.\n", + "\n", + "When predicting weak labels to train a down-stream model, you probably want to discard the abstentions and tiebreaks."
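
To see why the two tie-break policies mentioned above lead to different accuracies, here is a small, self-contained sketch. It is a simplification and not Snorkel's scoring code: the helper names, the toy arrays, and the choice to fill every abstained prediction with a uniformly random label are assumptions made for illustration.

```python
import numpy as np

ABSTAIN = -1  # assumed convention for "no prediction"


def accuracy_ignoring_abstentions(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Accuracy computed only over records that received a prediction."""
    mask = y_pred != ABSTAIN
    return float((y_true[mask] == y_pred[mask]).mean())


def accuracy_with_random_fill(
    y_true: np.ndarray, y_pred: np.ndarray, cardinality: int, seed: int = 0
) -> float:
    """Accuracy over all records, after replacing abstentions by random labels."""
    rng = np.random.default_rng(seed)
    filled = y_pred.copy()
    abstained = filled == ABSTAIN
    filled[abstained] = rng.integers(0, cardinality, size=int(abstained.sum()))
    return float((y_true == filled).mean())


# hypothetical gold labels and label model predictions
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, -1, 0, -1, 1])

acc_abstain = accuracy_ignoring_abstentions(y_true, y_pred)
acc_random = accuracy_with_random_fill(y_true, y_pred, cardinality=2)
print(acc_abstain, acc_random)
```

Ignoring abstentions scores only the records the model was willing to label, so the resulting number tends to look better than a score computed over every record.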
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8de05c3-84ef-40de-8f73-0157a4aa1074", + "metadata": {}, + "outputs": [], + "source": [ + "# get predictions for training a down-stream model\n", + "predictions = majority_model.predict(L=weak_labels.matrix(has_annotation=False))\n", + "\n", + "# records for training\n", + "training_records = weak_labels.records(has_annotation=False)\n", + "\n", + "# mask to ignore abstentions/tiebreaks\n", + "idx = predictions != -1\n", + "\n", + "# combine records and predictions\n", + "training_data = pd.DataFrame(\n", + " [\n", + " {\"text\": rec.inputs[\"text\"], \"label\": weak_labels.int2label[label]} \n", + " for rec, label in zip(training_records, predictions)\n", + " ]\n", + ")[idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 240, + "id": "39e06fd0-caa6-4707-a667-030b52ad4be9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "

|  | text | label |
|---|---|---|
| 0 | Hey I'm a British youtuber!!<br />I upload... | SPAM |
| 1 | NOKIA spotted | HAM |
| 2 | Dance :) | HAM |
| 3 | You guys should check out this EXTRAORDINARY w... | SPAM |
| 4 | Need money ? check my channel and subscribe,so... | SPAM |
| ... | ... | ... |
| 1579 | Please check out my acoustic cover channel :) ... | SPAM |
| 1580 | PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!... | SPAM |
| 1581 | <a href="http://www.gofundme.com/Helpmypitbull... | SPAM |
| 1582 | I love this song so much!:-D I've heard it so ... | HAM |
| 1585 | Check out this video on YouTube: | SPAM |

1055 rows × 2 columns

|  | text | label |
|---|---|---|
| 0 | Hey I'm a British youtuber!!<br />I upload... | SPAM |
| 1 | NOKIA spotted | HAM |
| 2 | Dance :) | HAM |
| 3 | You guys should check out this EXTRAORDINARY w... | SPAM |
| 4 | Need money ? check my channel and subscribe,so... | SPAM |
| ... | ... | ... |
| 1172 | Please check out my acoustic cover channel :) ... | SPAM |
| 1173 | PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!... | SPAM |
| 1174 | <a href="http://www.gofundme.com/Helpmypitbull... | SPAM |
| 1175 | I love this song so much!:-D I've heard it so ... | HAM |
| 1176 | Check out this video on YouTube: | SPAM |

1177 rows × 2 columns
\n", + "" + ], + "text/plain": [ + " text label\n", + "0 Hey I'm a British youtuber!!\n",