Implement NLP transforms, KS alarm + detector #163

Open · wants to merge 29 commits into base: dev

Changes from all commits · 29 commits
1940738
draft representation, alarm, detector
Anmol-Srivastava Aug 8, 2023
3b109a0
add toolz to requirements, break up detector components
Anmol-Srivastava Aug 8, 2023
350ff69
more prototyping on alarm/representation
Anmol-Srivastava Aug 9, 2023
8fbadf4
rename module to nlp_experimental, add to prototype
Anmol-Srivastava Aug 10, 2023
8b6baca
add representation workflow tests
Anmol-Srivastava Aug 10, 2023
a1a8fb6
add basic alarm workflow tests
Anmol-Srivastava Aug 10, 2023
12a9dcb
flesh out detector functions
Anmol-Srivastava Aug 10, 2023
316d474
emergency push to allow collaboration
Anmol-Srivastava Aug 18, 2023
184020d
comments
Anmol-Srivastava Aug 18, 2023
196de54
formatting
tms-bananaquit Aug 23, 2023
f2a0033
tweaks to transforms
Anmol-Srivastava Aug 31, 2023
0bde53a
Merge branch '152-nlp-detector-outline' of https://github.com/mitre/m…
tms-bananaquit Aug 31, 2023
348b20f
add outline for KS-based alarm, detector, NLP transforms
Anmol-Srivastava Oct 17, 2023
ced0bf6
exclude experimental work from test suite
Anmol-Srivastava Oct 17, 2023
29303ed
fill in recalibration step
Anmol-Srivastava Oct 19, 2023
e619b65
try different method to exclude experimental from cov test
Anmol-Srivastava Oct 19, 2023
1df0165
tweak to exclusion targets
Anmol-Srivastava Oct 19, 2023
352eba3
add alibi detect citation, reference in docstrings
Anmol-Srivastava Oct 24, 2023
d3ddd41
fix recalibration logic, finish docstrings for detector
Anmol-Srivastava Nov 6, 2023
21e23ac
add docstrings for alarm, add Alarm ABC
Anmol-Srivastava Nov 7, 2023
e4ecdaf
add docstrings to transforms
Anmol-Srivastava Nov 7, 2023
5af2ce7
rework top-level docstrings into init
Anmol-Srivastava Nov 8, 2023
84f8204
fill in documentation in NLP notebook
Anmol-Srivastava Nov 8, 2023
81fda3e
move dependencies to experimental option, document in notebook
Anmol-Srivastava Nov 10, 2023
db2a611
remove padding warning from nlp example
Anmol-Srivastava Nov 10, 2023
72bec69
try new build os section in RTD YML
Anmol-Srivastava Nov 13, 2023
0a61c27
remove version from python section in RTD YML
Anmol-Srivastava Nov 13, 2023
ca02be5
add experimental install option to RTD YML
Anmol-Srivastava Nov 13, 2023
11d6299
commit data sample for nlp example in RTD, run nbconvert
Anmol-Srivastava Nov 16, 2023
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -30,4 +30,4 @@ jobs:
      - name: Unit and coverage tests
        run: |
          pytest tests/menelaus --cov=menelaus/ --cov-report term
          coverage report -m --fail-under=100 --omit=menelaus/experimental/*
9 changes: 7 additions & 2 deletions .readthedocs.yml
@@ -3,10 +3,15 @@ version: 2
sphinx:
  configuration: docs/source/conf.py

build:
  os: ubuntu-22.04
  tools:
    python: "3.8"

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - dev
        - experimental
Binary file not shown.
205 changes: 142 additions & 63 deletions docs/source/examples/nlp/wilds_datasets.ipynb
@@ -4,130 +4,209 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Overview \n",
"## Overview\n",
"\n",
"This notebook is a work in progress. Eventually, the contents will demonstrate an NLP-based drift detection algorithm in action, but until the feature is developed, it shows the loading and use of two datasets to be used in the examples:\n",
"This example demonstrates an experimental NLP-based drift detection algorithm. It uses the \"Civil Comments\" dataset ([link](https://github.com/p-lambda/wilds/blob/main/wilds/datasets/civilcomments_dataset.py) to a Python loading script with additional details/links) from the `wilds` library, which contains online comments meant to be used in toxicity classification problems.\n",
"\n",
"- Civil Comments dataset: online comments to be used in toxicity classification problems \n",
"- Amazon Reviews dataset: amazon reviews to be used in a variety of NLP problems\n",
"This example and the experimental modules often pull directly and indirectly from [`alibi-detect`](https://github.com/SeldonIO/alibi-detect/tree/master) and its own [example(s)](https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_text_imdb.html).\n",
"\n",
"The data is accessed by using the `wilds` library, which contains several such datasets and wraps them in an API as shown below. \n",
"## Notes\n",
"\n",
"#### Imports"
"This code is experimental, and has notable issues:\n",
"- transform functions are very slow, on even moderate batch sizes\n",
"- detector design is not generalized, and may not work on streaming problems, or with data representations of different types/shapes\n",
"- some warnings below are not addressed\n",
"- if not present, `toolz`, `tensorflow`, and `transformers` must be added via the `experimental` install option, and are not included by default\n",
"\n",
"## Imports\n",
"\n",
"Code (transforms, alarm, detector) is pulled from the experimental module in `menelaus`, which is live but not fully tested. Note that commented code shows `wilds` modules being used to access and save the dataset to disk, but are excluded to save time. The example hence assumes the dataset is locally available."
]
},
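{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the extra dependencies are missing, the next cell is a minimal install sketch (an illustrative addition, assuming the `experimental` extra defined in this PR's packaging changes; adjust the package spec to your environment):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: install the optional `experimental` dependencies\n",
"# (toolz, tensorflow, transformers) if they are not already present.\n",
"# !pip install \"menelaus[experimental]\""
]
},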
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from wilds import get_dataset"
"import pickle\n",
"# import pandas as pd\n",
"# from wilds import get_dataset\n",
"\n",
"from menelaus.experimental.transform import auto_tokenize, extract_embedding, uae_reduce_dimension\n",
"from menelaus.experimental.detector import Detector\n",
"from menelaus.experimental.alarm import KolmogorovSmirnovAlarm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load Data\n",
"## Load Data\n",
"\n",
"Since some of the experimental modules are not very performant, the dataset is loaded and then limited to the first 300 data points (comments), which are split into three sequential batches of 100.\n",
"\n",
"Note that initially, the large data files need to be downloaded first. Later examples may assume the data is already stored to disk."
"__Note__: for convenience in generating documentation, the sample is itself saved locally and read from disk in the below examples, but the commented code describes the steps. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# civil comments\n",
"# dataset_civil = get_dataset(dataset=\"civilcomments\", download=True, root_dir=\"./wilds_datasets\")\n",
"# dataset_civil = pd.read_csv('wilds_datasets/civilcomments_v1.0/all_data_with_identities.csv')\n",
"# dataset_civil = dataset_civil['comment_text'][:300].tolist()\n",
"\n",
"# with open('civil_comments_sample.pkl', 'wb') as f:\n",
"# pickle.dump(dataset_civil, f)\n",
"\n",
"dataset_civil = None\n",
"with open('civil_comments_sample.pkl', 'rb') as f:\n",
" dataset_civil = pickle.load(f)\n",
"\n",
"batch1 = dataset_civil[:100]\n",
"batch2 = dataset_civil[100:200]\n",
"batch3 = dataset_civil[200:300]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transforms Pipeline\n",
"\n",
"The major step is to initialize the transform functions that will be applied to the comments, to turn them into detector-compatible representations. \n",
"\n",
"First, the comments must be tokenized:\n",
"- set up an `AutoTokenizer` model from the `transformers` library with a convenience function, by specifying the desired model name and other arguments\n",
"- the convenience function lets the configured tokenizer be called repeatedly, using batch 1 as the training data\n",
"\n",
"Then, the tokens must be made into embeddings:\n",
"- an initial transform function uses a `transformers` model to extract embeddings from given tokens\n",
"- the subsequent transform function reduces the dimension via an `UntrainedAutoEncoder` to a manageable size"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# tokens \n",
"tokenizer = auto_tokenize(model_name='bert-base-cased', padding='longest', return_tensors='tf')\n",
"tokens = tokenizer(data=batch1)\n",
"\n",
"# embedding (TODO abstract this layers line)\n",
"layers = [-_ for _ in range(1, 8 + 1)]\n",
"embedder = extract_embedding(model_name='bert-base-cased', embedding_type='hidden_state', layers=layers)\n",
"\n",
"# dimension reduction via Untrained AutoEncoder\n",
"uae_reduce = uae_reduce_dimension(enc_dim=32)"
]
},
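{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, illustrative sketch of the configure-then-call pattern used by these transforms, written with plain closures. The names `make_tokenizer` and `sketch_tokenizer` are hypothetical and are not part of `menelaus.experimental.transform`, whose internals may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: configure a transform once, then call it repeatedly on batches.\n",
"from transformers import AutoTokenizer\n",
"\n",
"def make_tokenizer(model_name, **tokenizer_kwargs):\n",
"    tok = AutoTokenizer.from_pretrained(model_name)\n",
"    def transform(data):\n",
"        # data: list of raw comment strings -> batch of token tensors\n",
"        return tok(data, **tokenizer_kwargs)\n",
"    return transform\n",
"\n",
"# sketch_tokenizer = make_tokenizer('bert-base-cased', padding='longest', return_tensors='tf')\n",
"# sketch_tokens = sketch_tokenizer(batch1)"
]
},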
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Detector Setup\n",
"\n",
"Next a detector is setup. First, a `KolmogorovSmirnovAlarm` is initialized with default settings. When the amount of columns (which reject the null KS test hypothesis) exceeds the default ratio (0.25), this alarm will indicate drift has occurred. \n",
"\n",
"Then the detector is constructed. It is given the initialized alarm, and the ordered list of transforms configured above. The detector is then made to step through each available batch, and its state is printed as output. Note that the first batch establishes the reference data, the second establishes the test data, and the third will require recalibration (test is combined into reference) if drift is detected."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|█████████▉| 1988272128/1989805589 [06:39<00:00, 4982930.49Byte/s]\n"
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting ./wilds_datasets\\amazon_v2.1\\archive.tar.gz to ./wilds_datasets\\amazon_v2.1\n",
"\n",
"It took 7.56 minutes to download and uncompress the dataset.\n",
"State after initial batch: baseline\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<wilds.datasets.amazon_dataset.AmazonDataset at 0x26f9518ac50>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# amazon reviews\n",
"dataset_amazon = get_dataset(dataset=\"amazon\", download=True, root_dir=\"./wilds_datasets\")\n",
"dataset_amazon"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading dataset to ./wilds_datasets\\civilcomments_v1.0...\n",
"You can also download the dataset manually at https://wilds.stanford.edu/downloads.\n",
"Downloading https://worksheets.codalab.org/rest/bundles/0x8cd3de0634154aeaad2ee6eb96723c6e/contents/blob/ to ./wilds_datasets\\civilcomments_v1.0\\archive.tar.gz\n"
"\n",
"State after test batch: alarm\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"90914816Byte [00:17, 5109891.58Byte/s] \n"
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting ./wilds_datasets\\civilcomments_v1.0\\archive.tar.gz to ./wilds_datasets\\civilcomments_v1.0\n",
"\n",
"It took 0.33 minutes to download and uncompress the dataset.\n",
"State after new batch, recalibration: alarm\n",
"\n"
]
}
],
"source": [
"# civil comments\n",
"dataset_civil = get_dataset(dataset=\"civilcomments\", download=True, root_dir=\"./wilds_datasets\")\n",
"dataset_civil"
"# detector + set reference\n",
"ks_alarm = KolmogorovSmirnovAlarm()\n",
"detector = Detector(alarm=ks_alarm, transforms=[tokenizer, embedder, uae_reduce])\n",
"detector.step(batch1)\n",
"print(f\"\\nState after initial batch: {detector.state}\\n\")\n",
"\n",
"# detector + add test \n",
"detector.step(batch2)\n",
"print(f\"\\nState after test batch: {detector.state}\\n\")\n",
"\n",
"# recalibrate and re-evaluate (XXX - all batches must be same length)\n",
"detector.step(batch3)\n",
"print(f\"\\nState after new batch, recalibration: {detector.state}\\n\")"
]
},
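{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a companion to the run above, the next cell is a minimal, illustrative sketch of the column-wise KS decision rule described earlier, written against `scipy` directly. The name `ks_alarm_sketch` and the significance level are assumptions for illustration; the actual `KolmogorovSmirnovAlarm` may differ, e.g. in how it corrects for multiple tests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: per-column two-sample KS tests over reduced representations.\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"def ks_alarm_sketch(reference, test, alpha=0.05, ratio_threshold=0.25):\n",
"    # reference, test: arrays of shape (n_samples, n_dims), e.g. 32-dim embeddings\n",
"    n_cols = reference.shape[1]\n",
"    rejections = sum(\n",
"        stats.ks_2samp(reference[:, j], test[:, j]).pvalue < alpha\n",
"        for j in range(n_cols)\n",
"    )\n",
"    # alarm when the fraction of rejecting columns exceeds the threshold\n",
"    return (rejections / n_cols) > ratio_threshold\n",
"\n",
"# On drift, recalibration (as described above) might fold test into reference:\n",
"# reference = np.vstack([reference, test])"
]
},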
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Notes\n",
"\n",
"We can see the baseline state after processing the initial batch, an alarm raised after observing test data, and then another alarm signal after a new test batch is observed and the reference is internally recalibrated."
]
}
],
9 changes: 9 additions & 0 deletions docs/source/refs.bib
@@ -183,4 +183,13 @@ @misc{souza2020
year={2020},
howpublished="\url{https://arxiv.org/abs/2005.00113}",
note={Online; accessed 20-July-2022},
}

@software{alibi-detect,
title = {Alibi Detect: Algorithms for outlier, adversarial and drift detection},
author = {Van Looveren, Arnaud and Klaise, Janis and Vacanti, Giovanni and Cobb, Oliver and Scillitoe, Ashley and Samoilescu, Robert and Athorne, Alex},
url = {https://github.com/SeldonIO/alibi-detect},
version = {0.11.4},
date = {2023-07-07},
year = {2019}
}