Commit
Check embedding update (#295)
* Add Examples Notebook (#294)

* Urgent fix to remove LIWC lexicons from public repo (#279)

* delete small test lexicons

* move .pkl files to assets and remove from GH

* filesystem cleanup

* update certainty pickle location

* remove unpickling certainty

* remove lexicons from pyproject

* change lexical pkl path

* add error handling when lexicons are not found

* update warning message

* add legal caveat and update name of certainty pkl to be correct

* ensure lexicons are ignored

* Update Documentation (Complete Conceptual Documentation, Document Assumptions) (#289)

* new docs

* lexicons hotfix

* emilys doc edits

* update deprecated github actions to latest

* update last remaining text features

* update index

* update docs

* update index

* update docs

* update docs and the feature dictionary

* add basics.rst

* add new basics page

* update docs

---------

Co-authored-by: Xinlan Emily Hu <xehu@wharton.upenn.edu>
Co-authored-by: Xinlan Emily Hu <xehu@cs.stanford.edu>

* update torch requirements to resolve compatibility issue on torch end (#290)

* Update Website (#291)

* website updates

* renaming tpm-website to website

* deploying via gh-pages

* changed from tpm-website to website

* deployed website

* copyright and team

* team headshots and footer

* edits to the pages

* website updates

* updated links

* updated homepage

* link updates

* mobile compatibility

* mobile adjustments

* navbar mobile updates

* whitespace edits

* homepage updates

* feature table

* website updates

* renaming tpm-website to website

* deploying via gh-pages

* changed from tpm-website to website

* edits to the pages

* website updates

* updated links

* updated homepage

* link updates

* mobile compatibility

* mobile adjustments

* navbar mobile updates

* homepage updates

* add table of features

* updated team page titles

* include flask in requirements.txt

* updates to table of features

* load pages from top

* fix to 404 issues

* moved build under website folder

* updates to package launch

* hyperlink ./setup.sh

* fix nav bar sizing and hamburger logo

* include preprint

* updates to "getting started"

* update team

---------

Co-authored-by: amytangzheng <amy.tang.zheng@gmail.com>

* update documentation for clarity and correct typos in positivity z-score and information exchange and liwc

* add demo notebook

* update notebook and add information to docs

* update documentation

---------

Co-authored-by: Shruti Agarwal <46203852+agshruti12@users.noreply.github.com>
Co-authored-by: amytangzheng <amy.tang.zheng@gmail.com>

* update check embeddings with tqdm loading bar and BERT tokenization update

* (1) allow BERT sentiments to be generated from the messages with punctuation, rather than the preprocessed messages; (2) batch BERT sentiment generation to make it more efficient; (3) add loading bar for generation of chat-level features

---------

Co-authored-by: Shruti Agarwal <46203852+agshruti12@users.noreply.github.com>
Co-authored-by: amytangzheng <amy.tang.zheng@gmail.com>
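The batching change described in the final bullet can be sketched roughly as follows. This is a minimal sketch, not the package's actual code: the real `generate_bert` runs a RoBERTa sentiment model over each batch and wraps the loop in a `tqdm` progress bar; here the model call is replaced by a placeholder stub, and the function names are illustrative.

```python
def chunked(items, batch_size=64):
    """Yield successive batches of `items`, mirroring the batch_size=64 default."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def score_batch(texts):
    # Placeholder: the real implementation runs a RoBERTa sentiment model here
    # and returns positive/negative/neutral scores for each message.
    return [{"positive": 0.0, "negative": 0.0, "neutral": 1.0} for _ in texts]

def generate_sentiments(messages, batch_size=64):
    """Score messages batch-by-batch instead of one at a time."""
    scores = []
    # In the real code this loop is wrapped in tqdm(...) to show progress.
    for batch in chunked(messages, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

Batching the model calls amortizes per-call overhead, which is where the efficiency gain described in the commit message comes from.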
3 people committed Sep 23, 2024
1 parent 6a05e80 commit 1c72695
Showing 26 changed files with 219 additions and 91 deletions.
Binary file modified: docs/build/doctrees/environment.pickle (not shown)
Binary file modified: docs/build/doctrees/examples.doctree (not shown)
Binary file modified: docs/build/doctrees/feature_builder.doctree (not shown)
Binary file modified: docs/build/doctrees/index.doctree (not shown)
Binary file modified: docs/build/doctrees/utils/check_embeddings.doctree (not shown)
2 changes: 1 addition & 1 deletion docs/build/html/_sources/examples.rst.txt
Original file line number Diff line number Diff line change
@@ -90,7 +90,7 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
output_file_path_conv_level = "./jury_output_conversation_level.csv",
turns = True
)
jury_feature_builder.featurize(col="message")
jury_feature_builder.featurize()
Basic Input Columns
^^^^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/build/html/_sources/index.rst.txt
@@ -76,7 +76,7 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
)
# this line of code runs the FeatureBuilder on your data
my_feature_builder.featurize(col="message")
my_feature_builder.featurize()
Use the Table of Contents below to learn more about our tool. We recommend that you begin in the "Introduction" section, then explore other sections of the documentation as they become relevant to you. We recommend reading :ref:`basics` for a high-level overview of the requirements and parameters, and then reading through the :ref:`examples` for a detailed walkthrough and discussion of considerations.

2 changes: 1 addition & 1 deletion docs/build/html/examples.html
@@ -160,7 +160,7 @@ <h3>Configuring the FeatureBuilder<a class="headerlink" href="#configuring-the-f
<span class="n">output_file_path_conv_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_conversation_level.csv&quot;</span><span class="p">,</span>
<span class="n">turns</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>
<span class="n">jury_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">(</span><span class="n">col</span><span class="o">=</span><span class="s2">&quot;message&quot;</span><span class="p">)</span>
<span class="n">jury_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">()</span>
</pre></div>
</div>
<section id="basic-input-columns">
11 changes: 4 additions & 7 deletions docs/build/html/feature_builder.html
@@ -174,22 +174,19 @@

<dl class="py method">
<dt class="sig sig-object py" id="feature_builder.FeatureBuilder.featurize">
<span class="sig-name descname"><span class="pre">featurize</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">col</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'message'</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#feature_builder.FeatureBuilder.featurize" title="Link to this definition"></a></dt>
<span class="sig-name descname"><span class="pre">featurize</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#feature_builder.FeatureBuilder.featurize" title="Link to this definition"></a></dt>
<dd><p>Main driver function for feature generation.</p>
<p>This function creates chat-level features, generates features for different
truncation percentages of the data if specified, and produces user-level and
conversation-level features. Finally, the features are saved into the
designated output files.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>col</strong> (<em>str</em><em>, </em><em>optional</em>) – Column to preprocess, defaults to “message”</p>
<dt class="field-odd">Returns<span class="colon">:</span></dt>
<dd class="field-odd"><p>None</p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dt class="field-even">Return type<span class="colon">:</span></dt>
<dd class="field-even"><p>None</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p>None</p>
</dd>
</dl>
</dd></dl>

2 changes: 1 addition & 1 deletion docs/build/html/index.html
@@ -139,7 +139,7 @@ <h2>Using the Package<a class="headerlink" href="#using-the-package" title="Link
<span class="p">)</span>

<span class="c1"># this line of code runs the FeatureBuilder on your data</span>
<span class="n">my_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">(</span><span class="n">col</span><span class="o">=</span><span class="s2">&quot;message&quot;</span><span class="p">)</span>
<span class="n">my_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">()</span>
</pre></div>
</div>
<p>Use the Table of Contents below to learn more about our tool. We recommend that you begin in the “Introduction” section, then explore other sections of the documentation as they become relevant to you. We recommend reading <a class="reference internal" href="basics.html#basics"><span class="std std-ref">The Basics</span></a> for a high-level overview of the requirements and parameters, and then reading through the <a class="reference internal" href="examples.html#examples"><span class="std std-ref">Worked Example</span></a> for a detailed walkthrough and discussion of considerations.</p>
2 changes: 1 addition & 1 deletion docs/build/html/searchindex.js

Large diffs are not rendered by default.

13 changes: 7 additions & 6 deletions docs/build/html/utils/check_embeddings.html
@@ -132,14 +132,15 @@

<dl class="py function">
<dt class="sig sig-object py" id="utils.check_embeddings.generate_bert">
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">generate_bert</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">chat_data</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_path</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">message_col</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.generate_bert" title="Link to this definition"></a></dt>
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">generate_bert</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">chat_data</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_path</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">message_col</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">batch_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">64</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.generate_bert" title="Link to this definition"></a></dt>
<dd><p>Generates RoBERTa sentiment scores for the given chat data and saves them to a CSV file.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>chat_data</strong> (<em>pd.DataFrame</em>) – Contains message data to be analyzed for sentiments.</p></li>
<li><p><strong>output_path</strong> (<em>str</em>) – Path to save the CSV file containing sentiment scores.</p></li>
<li><p><strong>message_col</strong> (<em>str</em><em>, </em><em>optional</em>) – A string representing the column name that should be selected as the message. Defaults to “message”.</p></li>
<li><p><strong>batch_size</strong> (<em>int</em>) – The size of each batch for processing sentiment analysis. Defaults to 64.</p></li>
</ul>
</dd>
<dt class="field-even">Raises<span class="colon">:</span></dt>
@@ -224,17 +225,17 @@

<dl class="py function">
<dt class="sig sig-object py" id="utils.check_embeddings.get_sentiment">
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">get_sentiment</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">text</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.get_sentiment" title="Link to this definition"></a></dt>
<dd><p>Analyzes the sentiment of the given text using a BERT model and returns the scores for positive, negative, and neutral sentiments.</p>
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">get_sentiment</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">texts</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.get_sentiment" title="Link to this definition"></a></dt>
<dd><p>Analyzes the sentiment of the given list of texts using a BERT model and returns a DataFrame with scores for positive, negative, and neutral sentiments.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>text</strong> (<em>str</em><em> or </em><em>None</em>) – The input text to analyze.</p>
<dd class="field-odd"><p><strong>texts</strong> (<em>list</em><em> of </em><em>str</em>) – The list of input texts to analyze.</p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>A dictionary with sentiment scores.</p>
<dd class="field-even"><p>A DataFrame with sentiment scores.</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p>dict</p>
<dd class="field-odd"><p>pd.DataFrame</p>
</dd>
</dl>
</dd></dl>
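The new list-based `get_sentiment` API documented above can be sketched as follows. This is a sketch under assumptions: the real function runs a BERT-family sentiment model over the texts, while here each text receives placeholder scores so that only the shape of the input and output is illustrated.

```python
import pandas as pd

def get_sentiment(texts):
    """Sketch of the batched API: takes a list of strings and returns a
    DataFrame with one row per text and positive/negative/neutral columns.
    The real implementation computes these scores with a sentiment model;
    the values below are placeholders."""
    rows = [{"positive": 0.0, "negative": 0.0, "neutral": 1.0} for _ in texts]
    return pd.DataFrame(rows)
```

Returning a DataFrame (rather than a per-text dict, as in the previous API) lets the caller concatenate sentiment scores directly onto the chat data.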
8 changes: 7 additions & 1 deletion docs/source/basics.rst
@@ -49,6 +49,10 @@ Package Assumptions

8. **Vector Data Cache**: Your data's vector data will be cached in **vector_directory**. This directory will be created if it doesn’t exist, but its contents should be reserved for cached vector files.

* Note: v0.1.3 and earlier compute vectors using _preprocessed_ text by default, which drops capitalization and punctuation. However, this can affect the interpretation of sentiment vectors; for example, "Hello!" has more positive sentiment than "hello." Consequently, from v0.1.4 onwards, we compute vectors using the raw input text, including punctuation and capitalization. To restore the previous behavior, please set **compute_vectors_from_preprocessed** to True.

* Additionally, we assume that empty messages are equivalent to "NaN vector," defined `here <https://raw.githubusercontent.com/Watts-Lab/team_comm_tools/refs/heads/main/src/team_comm_tools/features/assets/nan_vector.txt>`_.

9. **Output Files**: We generate three outputs: **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features).

* This should be a *path*, not just a filename. For example, "./my_file.csv", not just "my_file.csv."
@@ -79,4 +83,6 @@ Here are some parameters that can be customized. For more details, refer to the

4. **ner_training_df** and **ner_cutoff**: Measure the number of named entities in each utterance (see :ref:`named_entity_recognition`).

5. **regenerate_vectors**: Force-regenerate vector data even if it already exists.
5. **regenerate_vectors**: Force-regenerate vector data even if it already exists.

6. **compute_vectors_from_preprocessed**: Computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v0.1.3 and earlier, but we now default to computing metrics on the unpreprocessed text (which INCLUDES capitalization and punctuation), and this parameter now defaults to False.
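The column-selection logic this parameter controls can be sketched as below. The function name `vector_column` is illustrative; the logic mirrors the `vector_colname` assignment in the repository's `feature_builder.py`.

```python
def vector_column(message_col, compute_vectors_from_preprocessed=False):
    """Sketch of how the FeatureBuilder picks which text column to embed:
    the preprocessed column (lowercased, punctuation stripped) when the flag
    is True, otherwise the retained copy of the original text."""
    if compute_vectors_from_preprocessed:
        return message_col            # this column is later preprocessed in place
    return message_col + "_original"  # copy that keeps capitalization and punctuation
```

With the default `False`, vectors are built from the `"<message_col>_original"` column, preserving cues like "Hello!" versus "hello."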
2 changes: 1 addition & 1 deletion docs/source/examples.rst
@@ -90,7 +90,7 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
output_file_path_conv_level = "./jury_output_conversation_level.csv",
turns = True
)
jury_feature_builder.featurize(col="message")
jury_feature_builder.featurize()
Basic Input Columns
^^^^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -76,7 +76,7 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
)
# this line of code runs the FeatureBuilder on your data
my_feature_builder.featurize(col="message")
my_feature_builder.featurize()
Use the Table of Contents below to learn more about our tool. We recommend that you begin in the "Introduction" section, then explore other sections of the documentation as they become relevant to you. We recommend reading :ref:`basics` for a high-level overview of the requirements and parameters, and then reading through the :ref:`examples` for a detailed walkthrough and discussion of considerations.

26 changes: 18 additions & 8 deletions examples/featurize.py
@@ -45,9 +45,14 @@
output_file_path_chat_level = "./jury_TINY_output_chat_level.csv",
output_file_path_user_level = "./jury_TINY_output_user_level.csv",
output_file_path_conv_level = "./jury_TINY_output_conversation_level.csv",
turns = False
turns = False,
custom_features = [
"(BERT) Mimicry",
"Moving Mimicry",
"Forward Flow",
"Discursive Diversity"]
)
tiny_juries_feature_builder.featurize(col="message")
tiny_juries_feature_builder.featurize()

# Tiny multi-task
tiny_multi_task_feature_builder = FeatureBuilder(
Expand All @@ -59,21 +64,26 @@
output_file_path_conv_level = "./multi_task_TINY_output_conversation_level_stageId_cumulative.csv",
turns = False
)
tiny_multi_task_feature_builder.featurize(col="message")
tiny_multi_task_feature_builder.featurize()

# FULL DATASETS BELOW ------------------------------------- #

# Juries
# jury_feature_builder = FeatureBuilder(
# input_df = juries_df,
# grouping_keys = ["batch_num", "round_num"],
# grouping_keys = ["batch_num", "round_num"],
# vector_directory = "./vector_data/",
# output_file_path_chat_level = "./jury_output_chat_level.csv",
# output_file_path_user_level = "./jury_output_user_level.csv",
# output_file_path_conv_level = "./jury_output_conversation_level.csv",
# turns = True
# turns = True,
# custom_features = [
# "(BERT) Mimicry",
# "Moving Mimicry",
# "Forward Flow",
# "Discursive Diversity"]
# )
# jury_feature_builder.featurize(col="message")
# jury_feature_builder.featurize()

# # CSOP (Abdullah)
# csop_feature_builder = FeatureBuilder(
@@ -84,7 +94,7 @@
# output_file_path_conv_level = "./csop_output_conversation_level.csv",
# turns = True
# )
# csop_feature_builder.featurize(col="message")
# csop_feature_builder.featurize()


# # CSOP II (Nak Won Rim)
Expand All @@ -96,4 +106,4 @@
# output_file_path_conv_level = "./csopII_output_conversation_level.csv",
# turns = True
# )
# csopII_feature_builder.featurize(col="message")
# csopII_feature_builder.featurize()
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "team_comm_tools"
version = "0.1.3"
version = "0.1.4"
requires-python = ">= 3.10"
dependencies = [
"chardet>=3.0.4",
@@ -36,6 +36,7 @@ dependencies = [
"torchaudio==2.4.1",
"torchvision==0.19.1",
"transformers==4.44.0",
"tqdm>=4.66.5",
"tzdata>=2023.3",
"tzlocal==5.2"
]
1 change: 1 addition & 0 deletions requirements.txt
@@ -25,5 +25,6 @@ torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
transformers==4.44.0
tqdm>=4.66.5
tzdata>=2023.3
tzlocal==5.2
23 changes: 15 additions & 8 deletions src/team_comm_tools/feature_builder.py
@@ -85,6 +85,9 @@ class FeatureBuilder:
:param regenerate_vectors: If true, will regenerate vector data even if it already exists. Defaults to False.
:type regenerate_vectors: bool, optional
:param compute_vectors_from_preprocessed: If true, computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v0.1.3 and earlier, but we now default to computing metrics on the unpreprocessed text (which INCLUDES capitalization and punctuation). Defaults to False.
:type compute_vectors_from_preprocessed: bool, optional
:return: The FeatureBuilder doesn't return anything; instead, it writes the generated features to files in the specified paths. It will also print out its progress, so you should see "All Done!" in the terminal, which will indicate that the features have been generated.
:rtype: None
Expand All @@ -108,14 +111,16 @@ def __init__(
within_task = False,
ner_training_df: pd.DataFrame = None,
ner_cutoff: int = 0.9,
regenerate_vectors: bool = False
regenerate_vectors: bool = False,
compute_vectors_from_preprocessed: bool = False
) -> None:

# Defining input and output paths.
self.chat_data = input_df.copy()
self.orig_data = input_df.copy()
self.ner_training = ner_training_df
self.vector_directory = vector_directory

print("Initializing Featurization...")
self.output_file_path_conv_level = output_file_path_conv_level
self.output_file_path_user_level = output_file_path_user_level
@@ -218,6 +223,11 @@ def __init__(
self.ner_cutoff = ner_cutoff
self.regenerate_vectors = regenerate_vectors

if(compute_vectors_from_preprocessed == True):
self.vector_colname = self.message_col # because the message col will eventually get preprocessed
else:
self.vector_colname = self.message_col + "_original" # because this contains the original message

# check grouping rules
if self.conversation_id_col not in self.chat_data.columns and len(self.grouping_keys)==0:
if(self.conversation_id_col == "conversation_num"):
Expand All @@ -240,7 +250,7 @@ def __init__(

# set new identifier column for cumulative grouping.
if self.cumulative_grouping and len(grouping_keys) == 3:
print("NOTE: User has requested cumulative grouping. Auto-generating the key `conversation_num` as the conversation identifier for cumulative convrersations.")
print("NOTE: User has requested cumulative grouping. Auto-generating the key `conversation_num` as the conversation identifier for cumulative conversations.")
self.conversation_id_col = "conversation_num"

# Input columns are the columns that come in the raw chat data
@@ -338,7 +348,7 @@ def __init__(
if(not need_sentiment and feature_dict[feature]["bert_sentiment_data"]):
need_sentiment = True

check_embeddings(self.chat_data, self.vect_path, self.bert_path, need_sentence, need_sentiment, self.regenerate_vectors, self.message_col)
check_embeddings(self.chat_data, self.vect_path, self.bert_path, need_sentence, need_sentiment, self.regenerate_vectors, message_col = self.vector_colname)

if(need_sentence):
self.vect_data = pd.read_csv(self.vect_path, encoding='mac_roman')
@@ -401,7 +411,7 @@ def merge_conv_data_with_original(self) -> None:
if {'index'}.issubset(self.conv_data.columns):
self.conv_data = self.conv_data.drop(columns=['index'])

def featurize(self, col: str="message") -> None:
def featurize(self) -> None:
"""
Main driver function for feature generation.
Expand All @@ -410,9 +420,6 @@ def featurize(self, col: str="message") -> None:
conversation-level features. Finally, the features are saved into the
designated output files.
:param col: Column to preprocess, defaults to "message"
:type col: str, optional
:return: None
:rtype: None
"""
@@ -494,7 +501,7 @@ def preprocess_chat_data(self) -> None:
# create new column that retains punctuation
self.chat_data["message_lower_with_punc"] = self.chat_data[self.message_col].astype(str).apply(preprocess_text_lowercase_but_retain_punctuation)

# Preprocessing the text in `col` and then overwriting the column `col`.
# Preprocessing the text in `message_col` and then overwriting the column `message_col`.
# TODO: We should probably use classes to abstract preprocessing module as well?
self.chat_data[self.message_col] = self.chat_data[self.message_col].astype(str).apply(preprocess_text)

