Implement NLP transforms, KS alarm + detector #163

Open · wants to merge 29 commits into base: dev

Changes from all commits · 29 commits
1940738
draft representation, alarm, detector
Anmol-Srivastava Aug 8, 2023
3b109a0
add toolz to requirements, break up detector components
Anmol-Srivastava Aug 8, 2023
350ff69
more prototyping on alarm/representation
Anmol-Srivastava Aug 9, 2023
8fbadf4
rename module to nlp_experimental, add to prototype
Anmol-Srivastava Aug 10, 2023
8b6baca
add representation workflow tests
Anmol-Srivastava Aug 10, 2023
a1a8fb6
add basic alarm workflow tests
Anmol-Srivastava Aug 10, 2023
12a9dcb
flesh out detector functions
Anmol-Srivastava Aug 10, 2023
316d474
emergency push to allow collaboration
Anmol-Srivastava Aug 18, 2023
184020d
comments
Anmol-Srivastava Aug 18, 2023
196de54
formatting
tms-bananaquit Aug 23, 2023
f2a0033
tweaks to transforms
Anmol-Srivastava Aug 31, 2023
0bde53a
Merge branch '152-nlp-detector-outline' of https://github.com/mitre/m…
tms-bananaquit Aug 31, 2023
348b20f
add outline for KS-based alarm, detector, NLP transforms
Anmol-Srivastava Oct 17, 2023
ced0bf6
exclude experimental work from test suite
Anmol-Srivastava Oct 17, 2023
29303ed
fill in recalibration step
Anmol-Srivastava Oct 19, 2023
e619b65
try different method to exclude experimental from cov test
Anmol-Srivastava Oct 19, 2023
1df0165
tweak to exclusion targets
Anmol-Srivastava Oct 19, 2023
352eba3
add alibi detect citation, reference in docstrings
Anmol-Srivastava Oct 24, 2023
d3ddd41
fix recalibration logic, finish docstrings for detector
Anmol-Srivastava Nov 6, 2023
21e23ac
add docstrings for alarm, add Alarm ABC
Anmol-Srivastava Nov 7, 2023
e4ecdaf
add docstrings to transforms
Anmol-Srivastava Nov 7, 2023
5af2ce7
rework top-level docstrings into init
Anmol-Srivastava Nov 8, 2023
84f8204
fill in documentation in NLP notebook
Anmol-Srivastava Nov 8, 2023
81fda3e
move dependencies to experimental option, document in notebook
Anmol-Srivastava Nov 10, 2023
db2a611
remove padding warning from nlp example
Anmol-Srivastava Nov 10, 2023
72bec69
try new build os section in RTD YML
Anmol-Srivastava Nov 13, 2023
0a61c27
remove version from python section in RTD YML
Anmol-Srivastava Nov 13, 2023
ca02be5
add experimental install option to RTD YML
Anmol-Srivastava Nov 13, 2023
11d6299
commit data sample for nlp example in RTD, run nbconvert
Anmol-Srivastava Nov 16, 2023
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -30,4 +30,4 @@ jobs:
      - name: Unit and coverage tests
        run: |
          pytest tests/menelaus --cov=menelaus/ --cov-report term
          coverage report -m --fail-under=100 --omit=menelaus/experimental/*
9 changes: 7 additions & 2 deletions .readthedocs.yml
@@ -3,10 +3,15 @@ version: 2
sphinx:
  configuration: docs/source/conf.py

build:
  os: ubuntu-22.04
  tools:
    python: "3.8"

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - dev
        - experimental
Binary file not shown.
205 changes: 142 additions & 63 deletions docs/source/examples/nlp/wilds_datasets.ipynb
@@ -4,130 +4,209 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Overview \n",
"## Overview\n",
"\n",
"This notebook is a work in progress. Eventually, the contents will demonstrate an NLP-based drift detection algorithm in action, but until the feature is developed, it shows the loading and use of two datasets to be used in the examples:\n",
"This example demonstrates an experimental NLP-based drift detection algorithm. It uses the \"Civil Comments\" dataset ([link](https://github.com/p-lambda/wilds/blob/main/wilds/datasets/civilcomments_dataset.py) to a Python loading script with additional details/links) from the `wilds` library, which contains online comments meant to be used in toxicity classification problems.\n",
"\n",
"- Civil Comments dataset: online comments to be used in toxicity classification problems \n",
"- Amazon Reviews dataset: amazon reviews to be used in a variety of NLP problems\n",
"This example and the experimental modules often pull directly and indirectly from [`alibi-detect`](https://github.com/SeldonIO/alibi-detect/tree/master) and its own [example(s)](https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_text_imdb.html).\n",
"\n",
"The data is accessed by using the `wilds` library, which contains several such datasets and wraps them in an API as shown below. \n",
"## Notes\n",
"\n",
"#### Imports"
"This code is experimental, and has notable issues:\n",
"- transform functions are very slow, on even moderate batch sizes\n",
"- detector design is not generalized, and may not work on streaming problems, or with data representations of different types/shapes\n",
"- some warnings below are not addressed\n",
"- if not present, `toolz`, `tensorflow`, and `transformers` must be added via the `experimental` install option, and are not included by default\n",
"\n",
"## Imports\n",
"\n",
"Code (transforms, alarm, detector) is pulled from the experimental module in `menelaus`, which is live but not fully tested. Note that commented code shows `wilds` modules being used to access and save the dataset to disk, but are excluded to save time. The example hence assumes the dataset is locally available."
]
},
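{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the extra dependencies are missing, the next cell is a minimal install sketch (an illustrative addition, assuming the `experimental` extra defined in this PR's packaging changes; adjust the package spec to your environment):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: install the optional `experimental` dependencies\n",
"# (toolz, tensorflow, transformers) if they are not already present.\n",
"# !pip install \"menelaus[experimental]\""
]
},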
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from wilds import get_dataset"
"import pickle\n",
"# import pandas as pd\n",
"# from wilds import get_dataset\n",
"\n",
"from menelaus.experimental.transform import auto_tokenize, extract_embedding, uae_reduce_dimension\n",
"from menelaus.experimental.detector import Detector\n",
"from menelaus.experimental.alarm import KolmogorovSmirnovAlarm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load Data\n",
"## Load Data\n",
"\n",
"Since some of the experimental modules are not very performant, the dataset is loaded and then limited to the first 300 data points (comments), which are split into three sequential batches of 100.\n",
"\n",
"Note that initially, the large data files need to be downloaded first. Later examples may assume the data is already stored to disk."
"__Note__: for convenience in generating documentation, the sample is itself saved locally and read from disk in the below examples, but the commented code describes the steps. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# civil comments\n",
"# dataset_civil = get_dataset(dataset=\"civilcomments\", download=True, root_dir=\"./wilds_datasets\")\n",
"# dataset_civil = pd.read_csv('wilds_datasets/civilcomments_v1.0/all_data_with_identities.csv')\n",
"# dataset_civil = dataset_civil['comment_text'][:300].tolist()\n",
"\n",
"# with open('civil_comments_sample.pkl', 'wb') as f:\n",
"# pickle.dump(dataset_civil, f)\n",
"\n",
"dataset_civil = None\n",
"with open('civil_comments_sample.pkl', 'rb') as f:\n",
" dataset_civil = pickle.load(f)\n",
"\n",
"batch1 = dataset_civil[:100]\n",
"batch2 = dataset_civil[100:200]\n",
"batch3 = dataset_civil[200:300]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transforms Pipeline\n",
"\n",
"The major step is to initialize the transform functions that will be applied to the comments, to turn them into detector-compatible representations. \n",
"\n",
"First, the comments must be tokenized:\n",
"- set up an `AutoTokenizer` model from the `transformers` library with a convenience function, by specifying the desired model name and other arguments\n",
"- the convenience function lets the configured tokenizer be called repeatedly, using batch 1 as the training data\n",
"\n",
"Then, the tokens must be made into embeddings:\n",
"- an initial transform function uses a `transformers` model to extract embeddings from given tokens\n",
"- the subsequent transform function reduces the dimension via an `UntrainedAutoEncoder` to a manageable size"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# tokens \n",
"tokenizer = auto_tokenize(model_name='bert-base-cased', padding='longest', return_tensors='tf')\n",
"tokens = tokenizer(data=batch1)\n",
"\n",
"# embedding (TODO abstract this layers line)\n",
"layers = [-_ for _ in range(1, 8 + 1)]\n",
"embedder = extract_embedding(model_name='bert-base-cased', embedding_type='hidden_state', layers=layers)\n",
"\n",
"# dimension reduction via Untrained AutoEncoder\n",
"uae_reduce = uae_reduce_dimension(enc_dim=32)"
]
},
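{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is a minimal, illustrative sketch of the configure-then-call pattern used by these transforms, written with plain closures. The names `make_tokenizer` and `sketch_tokenizer` are hypothetical and are not part of `menelaus.experimental.transform`, whose internals may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: configure a transform once, then call it repeatedly on batches.\n",
"from transformers import AutoTokenizer\n",
"\n",
"def make_tokenizer(model_name, **tokenizer_kwargs):\n",
"    tok = AutoTokenizer.from_pretrained(model_name)\n",
"    def transform(data):\n",
"        # data: list of raw comment strings -> batch of token tensors\n",
"        return tok(data, **tokenizer_kwargs)\n",
"    return transform\n",
"\n",
"# sketch_tokenizer = make_tokenizer('bert-base-cased', padding='longest', return_tensors='tf')\n",
"# sketch_tokens = sketch_tokenizer(batch1)"
]
},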
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Detector Setup\n",
"\n",
"Next a detector is setup. First, a `KolmogorovSmirnovAlarm` is initialized with default settings. When the amount of columns (which reject the null KS test hypothesis) exceeds the default ratio (0.25), this alarm will indicate drift has occurred. \n",
"\n",
"Then the detector is constructed. It is given the initialized alarm, and the ordered list of transforms configured above. The detector is then made to step through each available batch, and its state is printed as output. Note that the first batch establishes the reference data, the second establishes the test data, and the third will require recalibration (test is combined into reference) if drift is detected."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|█████████▉| 1988272128/1989805589 [06:39<00:00, 4982930.49Byte/s]\n"
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting ./wilds_datasets\\amazon_v2.1\\archive.tar.gz to ./wilds_datasets\\amazon_v2.1\n",
"\n",
"It took 7.56 minutes to download and uncompress the dataset.\n",
"State after initial batch: baseline\n",
"\n"
]
},
{
"data": {
"text/plain": [
"<wilds.datasets.amazon_dataset.AmazonDataset at 0x26f9518ac50>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# amazon reviews\n",
"dataset_amazon = get_dataset(dataset=\"amazon\", download=True, root_dir=\"./wilds_datasets\")\n",
"dataset_amazon"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading dataset to ./wilds_datasets\\civilcomments_v1.0...\n",
"You can also download the dataset manually at https://wilds.stanford.edu/downloads.\n",
"Downloading https://worksheets.codalab.org/rest/bundles/0x8cd3de0634154aeaad2ee6eb96723c6e/contents/blob/ to ./wilds_datasets\\civilcomments_v1.0\\archive.tar.gz\n"
"\n",
"State after test batch: alarm\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"90914816Byte [00:17, 5109891.58Byte/s] \n"
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
"If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting ./wilds_datasets\\civilcomments_v1.0\\archive.tar.gz to ./wilds_datasets\\civilcomments_v1.0\n",
"\n",
"It took 0.33 minutes to download and uncompress the dataset.\n",
"State after new batch, recalibration: alarm\n",
"\n"
]
}
],
"source": [
"# civil comments\n",
"dataset_civil = get_dataset(dataset=\"civilcomments\", download=True, root_dir=\"./wilds_datasets\")\n",
"dataset_civil"
"# detector + set reference\n",
"ks_alarm = KolmogorovSmirnovAlarm()\n",
"detector = Detector(alarm=ks_alarm, transforms=[tokenizer, embedder, uae_reduce])\n",
"detector.step(batch1)\n",
"print(f\"\\nState after initial batch: {detector.state}\\n\")\n",
"\n",
"# detector + add test \n",
"detector.step(batch2)\n",
"print(f\"\\nState after test batch: {detector.state}\\n\")\n",
"\n",
"# recalibrate and re-evaluate (XXX - all batches must be same length)\n",
"detector.step(batch3)\n",
"print(f\"\\nState after new batch, recalibration: {detector.state}\\n\")"
]
},
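{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a companion to the run above, the next cell is a minimal, illustrative sketch of the column-wise KS decision rule described earlier, written against `scipy` directly. The name `ks_alarm_sketch` and the significance level are assumptions for illustration; the actual `KolmogorovSmirnovAlarm` may differ, e.g. in how it corrects for multiple tests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: per-column two-sample KS tests over reduced representations.\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"def ks_alarm_sketch(reference, test, alpha=0.05, ratio_threshold=0.25):\n",
"    # reference, test: arrays of shape (n_samples, n_dims), e.g. 32-dim embeddings\n",
"    n_cols = reference.shape[1]\n",
"    rejections = sum(\n",
"        stats.ks_2samp(reference[:, j], test[:, j]).pvalue < alpha\n",
"        for j in range(n_cols)\n",
"    )\n",
"    # alarm when the fraction of rejecting columns exceeds the threshold\n",
"    return (rejections / n_cols) > ratio_threshold\n",
"\n",
"# On drift, recalibration (as described above) might fold test into reference:\n",
"# reference = np.vstack([reference, test])"
]
},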
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Notes\n",
"\n",
"We can see the baseline state after processing the initial batch, an alarm raised after observing test data, and then another alarm signal after a new test batch is observed and the reference is internally recalibrated."
]
}
],
9 changes: 9 additions & 0 deletions docs/source/refs.bib
@@ -183,4 +183,13 @@ @misc{souza2020
year={2020},
howpublished="\url{https://arxiv.org/abs/2005.00113}",
note={Online; accessed 20-July-2022},
}

@software{alibi-detect,
title = {Alibi Detect: Algorithms for outlier, adversarial and drift detection},
author = {Van Looveren, Arnaud and Klaise, Janis and Vacanti, Giovanni and Cobb, Oliver and Scillitoe, Ashley and Samoilescu, Robert and Athorne, Alex},
url = {https://github.com/SeldonIO/alibi-detect},
version = {0.11.4},
date = {2023-07-07},
year = {2019}
}