diff --git a/docs/guides/weak-supervision.ipynb b/docs/guides/weak-supervision.ipynb index f227c68f4db..70411174449 100644 --- a/docs/guides/weak-supervision.ipynb +++ b/docs/guides/weak-supervision.ipynb @@ -33,9 +33,10 @@ "\n", "### Weak labeling using the UI\n", "\n", - "Since version 0.8.0 you can find and define rules directly in the UI. The [Define rules mode](../reference/webapp/define_rules.md) is found below the [Annotate mode](../reference/webapp/annotate_records.md) on the right sidebar.\n", - "\n", - "The video below shows how you can interactively find and save rules with the UI. For more a full example check the Weak supervision tutorial.\n", + "Since version 0.8.0 you can find and define rules directly in the UI. \n", + "The [Define rules mode](../reference/webapp/define_rules.md) is found in the right sidebar of the [Dataset page](../reference/webapp/dataset.md).\n", + "The video below shows how you can interactively find and save rules with the UI. \n", + "For a full example, check the [Weak supervision tutorial](../tutorials/weak-supervision-with-rubrix.ipynb).\n", "\n", "\n", "\n", @@ -112,7 +113,7 @@ "4. Once you are satisfied with your weak labels, use the matrix of the `WeakLabels` instance with your library/method of choice to build a training set or even train a downstream text classification model.\n", "\n", "\n", - "This guide shows you an end-to-end example using Snorkel and Flyingsquid. Let's get started!" + "This guide shows you an end-to-end example using Snorkel, FlyingSquid and Weasel. Let's get started!" ] }, { @@ -127,7 +128,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 2, "id": "6b1e00af-c6f9-42aa-81aa-2976c9591b36", "metadata": {}, "outputs": [ @@ -211,15 +212,22 @@ "" ], "text/plain": [ - " Unnamed: 0 author date text label video\n", - "0 0 Alessandro leite 2014-11-05T22:21:36 pls http://www10.vakinha.com.br/VaquinhaE.aspx... 
-1.0 1\n", - "1 1 Salim Tayara 2014-11-02T14:33:30 if your like drones, plz subscribe to Kamal Ta... -1.0 1\n", - "2 2 Phuc Ly 2014-01-20T15:27:47 go here to check the views :3 -1.0 1\n", - "3 3 DropShotSk8r 2014-01-19T04:27:18 Came here to check the views, goodbye. -1.0 1\n", - "4 4 css403 2014-11-07T14:25:48 i am 2,126,492,636 viewer :D -1.0 1" + " Unnamed: 0 author date \\\n", + "0 0 Alessandro leite 2014-11-05T22:21:36 \n", + "1 1 Salim Tayara 2014-11-02T14:33:30 \n", + "2 2 Phuc Ly 2014-01-20T15:27:47 \n", + "3 3 DropShotSk8r 2014-01-19T04:27:18 \n", + "4 4 css403 2014-11-07T14:25:48 \n", + "\n", + " text label video \n", + "0 pls http://www10.vakinha.com.br/VaquinhaE.aspx... -1.0 1 \n", + "1 if your like drones, plz subscribe to Kamal Ta... -1.0 1 \n", + "2 go here to check the views :3 -1.0 1 \n", + "3 Came here to check the views, goodbye. -1.0 1 \n", + "4 i am 2,126,492,636 viewer :D -1.0 1 " ] }, - "execution_count": 11, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } @@ -283,7 +291,7 @@ "id": "baafda4f-45c0-49d6-9c37-7473c6888ebe", "metadata": {}, "source": [ - "After this step, you have a fully browsable dataset available at `http://localhost:6900/weak_supervision_yt` (or the base URL where your Rubrix instance is hosted)." + "After this step, you have a fully browsable dataset available that you can access via the [Rubrix web app](../reference/webapp/index.md)." ] }, { @@ -293,7 +301,9 @@ "source": [ "## 2. Defining rules\n", "\n", - "Let's now define some of the rules proposed in the tutorial [Snorkel Intro Tutorial: Data Labeling](https://www.snorkel.org/use-cases/01-spam-tutorial). Most of these rules can be defined directly in the UI using [Elasticsearch's query string DSL](../reference/webapp/search_records.md). \n", + "Let's now define some of the rules proposed in the tutorial [Snorkel Intro Tutorial: Data Labeling](https://www.snorkel.org/use-cases/01-spam-tutorial). 
\n", + "Most of these rules can be defined directly in our web app, in the [Define rules mode](../reference/webapp/define_rules.md), using [Elasticsearch's query strings](../reference/webapp/search_records.md). \n", + "Afterward, you can conveniently load them into your notebook with the [load_rules function](../reference/python/python_labeling.rst#rubrix.labeling.text_classification.rule.load_rules).\n", "\n", "Rules can also be defined programmatically as shown below. Depending on your use case and team structure you can mix and match both interfaces (UI or Python).\n", "\n", @@ -571,6 +581,231 @@ "Let's see some examples:" ] }, + { + "cell_type": "markdown", + "id": "1fb78718", + "metadata": {}, + "source": [ + "### A simple majority vote\n", + "\n", + "As a first example, we will show you how to use the `WeakLabels` object together with a simple majority vote model.\n", + "For this we will take the implementation by Snorkel, and evaluate it with the help of sklearn's metrics module." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3ba07e7", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install snorkel scikit-learn -qqq" + ] + }, + { + "cell_type": "markdown", + "id": "08eb2b66", + "metadata": {}, + "source": [ + "The majority vote model is arguably the most straightforward label model.\n", + "On a per-record basis, it simply counts the votes for each label returned by the rules, and picks the label with the most votes.\n", + "Snorkel provides a neat implementation of this logic in its `MajorityLabelVoter`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15888261", + "metadata": {}, + "outputs": [], + "source": [ + "from snorkel.labeling.model import MajorityLabelVoter\n", + "\n", + "# instantiate the majority vote label model\n", + "majority_model = MajorityLabelVoter()" + ] + }, + { + "cell_type": "markdown", + "id": "293d163f-ba08-424c-93de-ef0420531ca9", + "metadata": {}, + "source": [ + "Let's go on and evaluate this baseline.\n", + "To break ties when there is no majority vote, we choose the _\"random\"_ policy that randomly selects one of the tied labels. \n", + "In this way we avoid a bias towards label models that produce fewer but more certain weak labels, and make the comparison between the different label models fairer." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1d588f00-c8e1-4bec-8e99-bd78639257b5", + "metadata": {}, + "outputs": [], + "source": [ + "# compute accuracy\n", + "majority_model.score(\n", + " L=weak_labels.matrix(has_annotation=True),\n", + " Y=weak_labels.annotation(),\n", + " tie_break_policy=\"random\",\n", + ")\n", + "# {'accuracy': 0.844}" + ] + }, + { + "cell_type": "markdown", + "id": "a619247f-3d9d-44ae-bb9f-d07c9120c1dc", + "metadata": {}, + "source": [ + "As we will see further down, an accuracy of 0.844 is a very decent baseline.\n", + "If we simply ignored ties and abstentions (by setting the tie-break policy to _\"abstain\"_), we would obtain an accuracy of nearly 0.96.\n", + "\n", + "When predicting weak labels to train a downstream model, you probably want to discard the abstentions and ties." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8de05c3-84ef-40de-8f73-0157a4aa1074", + "metadata": {}, + "outputs": [], + "source": [ + "# get predictions for training a down-stream model\n", + "predictions = majority_model.predict(L=weak_labels.matrix(has_annotation=False))\n", + "\n", + "# records for training\n", + "training_records = weak_labels.records(has_annotation=False)\n", + "\n", + "# mask to ignore abstentions/tiebreaks\n", + "idx = predictions != -1\n", + "\n", + "# combine records and predictions\n", + "training_data = pd.DataFrame(\n", + " [\n", + " {\"text\": rec.inputs[\"text\"], \"label\": weak_labels.int2label[label]} \n", + " for rec, label in zip(training_records, predictions)\n", + " ]\n", + ")[idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 240, + "id": "39e06fd0-caa6-4707-a667-030b52ad4be9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
0Hey I&#39;m a British youtuber!!<br />I upload...SPAM
1NOKIA spottedHAM
2Dance :)HAM
3You guys should check out this EXTRAORDINARY w...SPAM
4Need money ? check my channel and subscribe,so...SPAM
.........
1579Please check out my acoustic cover channel :) ...SPAM
1580PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!...SPAM
1581<a href=\"http://www.gofundme.com/Helpmypitbull...SPAM
1582I love this song so much!:-D I've heard it so ...HAM
1585Check out this video on YouTube:SPAM
\n", + "

1055 rows × 2 columns

\n", + "
" + ], + "text/plain": [ + " text label\n", + "0 Hey I'm a British youtuber!!
I upload... SPAM\n", + "1 NOKIA spotted HAM\n", + "2 Dance :) HAM\n", + "3 You guys should check out this EXTRAORDINARY w... SPAM\n", + "4 Need money ? check my channel and subscribe,so... SPAM\n", + "... ... ...\n", + "1579 Please check out my acoustic cover channel :) ... SPAM\n", + "1580 PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!... SPAM\n", + "1581 \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
0Hey I&#39;m a British youtuber!!<br />I upload...SPAM
1NOKIA spottedHAM
2Dance :)HAM
3You guys should check out this EXTRAORDINARY w...SPAM
4Need money ? check my channel and subscribe,so...SPAM
.........
1172Please check out my acoustic cover channel :) ...SPAM
1173PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!...SPAM
1174<a href=\"http://www.gofundme.com/Helpmypitbull...SPAM
1175I love this song so much!:-D I've heard it so ...HAM
1176Check out this video on YouTube:SPAM
\n", + "

1177 rows × 2 columns

\n", + "" + ], + "text/plain": [ + " text label\n", + "0 Hey I'm a British youtuber!!
I upload... SPAM\n", + "1 NOKIA spotted HAM\n", + "2 Dance :) HAM\n", + "3 You guys should check out this EXTRAORDINARY w... SPAM\n", + "4 Need money ? check my channel and subscribe,so... SPAM\n", + "... ... ...\n", + "1172 Please check out my acoustic cover channel :) ... SPAM\n", + "1173 PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!... SPAM\n", + "1174
\n", + "\n", + "Note\n", + "\n", + "For an example of how to use the `WeakLabels` object with Snorkel's raw `LabelModel` class, you can check out the [WeakLabels reference](../reference/python/python_labeling.rst#rubrix.labeling.text_classification.weak_labels.WeakLabels).\n", + " \n", + "" ] }, { @@ -667,13 +1061,22 @@ "from rubrix.labeling.text_classification import FlyingSquid\n", "\n", "# we pass our WeakLabels instance to our FlyingSquid label model\n", - "label_model = FlyingSquid(weak_labels)\n", + "flyingsquid_model = FlyingSquid(weak_labels)\n", "\n", "# we fit the model\n", - "label_model.fit()\n", - "\n", + "flyingsquid_model.fit()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d81dbf5-edf9-4c8c-a876-5fcb16d73090", + "metadata": {}, + "outputs": [], + "source": [ "# we check its performance\n", - "label_model.score()" + "flyingsquid_model.score(tie_break_policy=\"random\")\n", + "# {'accuracy': 0.832, ...}" ] }, { @@ -681,8 +1084,9 @@ "id": "d282da18-cc57-437a-bfd1-c13a9ac1aec4", "metadata": {}, "source": [ - "After fitting your label model, you can quickly explore its predictions, before building a training set for training a downstream text classifier. \n", + "If we simply ignored ties and abstentions in the score (by setting the tie-break policy to _\"abstain\"_), we would obtain an accuracy of about 0.93.\n", "\n", + "After fitting your label model, you can quickly explore its predictions, before building a training set for training a downstream text classifier. 
\n", "This step is useful for validation, manual revision, or defining score thresholds for accepting labels from your label model (for example, only considering labels with a score greater then 0.8.)" ] }, @@ -694,10 +1098,137 @@ "outputs": [], "source": [ "# get your training records with the predictions of the label model\n", - "records_for_training = label_model.predict()\n", + "records_for_training = flyingsquid_model.predict()\n", "\n", "# log the records to a new dataset in Rubrix\n", - "rb.log(records_for_training, name=\"flyingsquid_results\")" + "rb.log(records_for_training, name=\"flyingsquid_results\")\n", + "\n", + "# extract training data\n", + "training_data = pd.DataFrame(\n", + " [\n", + " {\"text\": rec.inputs[\"text\"], \"label\": rec.prediction[0][0]} \n", + " for rec in records_for_training\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 231, + "id": "0d641340-f82b-4af8-b86b-7c06eaf59f61", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
0Hey I&#39;m a British youtuber!!<br />I upload...SPAM
1NOKIA spottedHAM
2Dance :)HAM
3You guys should check out this EXTRAORDINARY w...SPAM
4Need money ? check my channel and subscribe,so...SPAM
.........
1172Please check out my acoustic cover channel :) ...SPAM
1173PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!...SPAM
1174<a href=\"http://www.gofundme.com/Helpmypitbull...SPAM
1175I love this song so much!:-D I've heard it so ...HAM
1176Check out this video on YouTube:SPAM
\n", + "

1177 rows × 2 columns

\n", + "
" + ], + "text/plain": [ + " text label\n", + "0 Hey I'm a British youtuber!!
I upload... SPAM\n", + "1 NOKIA spotted HAM\n", + "2 Dance :) HAM\n", + "3 You guys should check out this EXTRAORDINARY w... SPAM\n", + "4 Need money ? check my channel and subscribe,so... SPAM\n", + "... ... ...\n", + "1172 Please check out my acoustic cover channel :) ... SPAM\n", + "1173 PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!... SPAM\n", + "1174