Commit
Check embedding update (#295)
* Add Examples Notebook (#294)

* Urgent fix to remove LIWC lexicons from public repo (#279)

* delete small test lexicons

* move .pkl files to assets and remove from GH

* filesystem cleanup

* update certainty pickle location

* remove unpickling certainty

* remove lexicons from pyproject

* change lexical pkl path

* add error handling when lexicons are not found

* update warning message

* add legal caveat and update name of certainty pkl to be correct

* ensure lexicons are ignored

* Update Documentation (Complete Conceptual Documentation, Document Assumptions) (#289)

* new docs

* lexicons hotfix

* emilys doc edits

* update deprecated github actions to latest

* update last remaining text features

* update index

* update docs

* update index

* update docs

* update docs and the feature dictionary

* add basics.rst

* add new basics page

* update docs

---------

Co-authored-by: Xinlan Emily Hu <xehu@wharton.upenn.edu>
Co-authored-by: Xinlan Emily Hu <xehu@cs.stanford.edu>

* update torch requirements to resolve compatibility issue on torch end (#290)

* Update Website (#291)

* website updates

* renaming tpm-website to website

* deploying via gh-pages

* changed from tpm-website to website

* deployed website

* copyright and team

* team headshots and footer

* edits to the pages

* website updates

* updated links

* updated homepage

* link updates

* mobile compatibility

* mobile adjustments

* navbar mobile updates

* whitespace edits

* homepage updates

* feature table

* website updates

* renaming tpm-website to website

* deploying via gh-pages

* changed from tpm-website to website

* edits to the pages

* website updates

* updated links

* updated homepage

* link updates

* mobile compatibility

* mobile adjustments

* navbar mobile updates

* homepage updates

* add table of features

* updated team page titles

* include flask in requirements.txt

* updates to table of features

* load pages from top

* fix to 404 issues

* moved build under website folder

* updates to package launch

* hyperlink ./setup.sh

* fix nav bar sizing and hamburger logo

* include preprint

* updates to "getting started"

* update team

---------

Co-authored-by: amytangzheng <amy.tang.zheng@gmail.com>

* update documentation for clarity and correct typos in positivity z-score and information exchange and liwc

* add demo notebook

* update notebook and add information to docs

* update documentation

---------

Co-authored-by: Shruti Agarwal <46203852+agshruti12@users.noreply.github.com>
Co-authored-by: amytangzheng <amy.tang.zheng@gmail.com>

* update check embeddings with tqdm loading bar and BERT tokenization update

* (1) allow BERT sentiments to be generated from the messages with punctuation, rather than the preprocessed messages; (2) batch BERT sentiment generation to make it more efficient; (3) add loading bar for generation of chat-level features

---------

Co-authored-by: Shruti Agarwal <46203852+agshruti12@users.noreply.github.com>
Co-authored-by: amytangzheng <amy.tang.zheng@gmail.com>
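The batching change described in the final bullet can be sketched roughly as follows. This is a minimal sketch, not the package's actual code: the real `generate_bert` runs a RoBERTa sentiment model over each batch and wraps the loop in a `tqdm` progress bar; here the model call is replaced by a placeholder stub, and the function names are illustrative.

```python
def chunked(items, batch_size=64):
    """Yield successive batches of `items`, mirroring the batch_size=64 default."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def score_batch(texts):
    # Placeholder: the real implementation runs a RoBERTa sentiment model here
    # and returns positive/negative/neutral scores for each message.
    return [{"positive": 0.0, "negative": 0.0, "neutral": 1.0} for _ in texts]

def generate_sentiments(messages, batch_size=64):
    """Score messages batch-by-batch instead of one at a time."""
    scores = []
    # In the real code this loop is wrapped in tqdm(...) to show progress.
    for batch in chunked(messages, batch_size):
        scores.extend(score_batch(batch))
    return scores
```

Batching the model calls amortizes per-call overhead, which is where the efficiency gain described in the commit message comes from.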
3 people committed Sep 23, 2024
1 parent 6a05e80 commit 1c72695
Showing 26 changed files with 219 additions and 91 deletions.
Binary file modified: docs/build/doctrees/environment.pickle (not shown)
Binary file modified: docs/build/doctrees/examples.doctree (not shown)
Binary file modified: docs/build/doctrees/feature_builder.doctree (not shown)
Binary file modified: docs/build/doctrees/index.doctree (not shown)
Binary file modified: docs/build/doctrees/utils/check_embeddings.doctree (not shown)
2 changes: 1 addition & 1 deletion docs/build/html/_sources/examples.rst.txt
Original file line number Diff line number Diff line change
@@ -90,7 +90,7 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
output_file_path_conv_level = "./jury_output_conversation_level.csv",
turns = True
)
jury_feature_builder.featurize(col="message")
jury_feature_builder.featurize()
Basic Input Columns
^^^^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/build/html/_sources/index.rst.txt
@@ -76,7 +76,7 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
)
# this line of code runs the FeatureBuilder on your data
my_feature_builder.featurize(col="message")
my_feature_builder.featurize()
Use the Table of Contents below to learn more about our tool. We recommend that you begin in the "Introduction" section, then explore other sections of the documentation as they become relevant to you. We recommend reading :ref:`basics` for a high-level overview of the requirements and parameters, and then reading through the :ref:`examples` for a detailed walkthrough and discussion of considerations.

2 changes: 1 addition & 1 deletion docs/build/html/examples.html
@@ -160,7 +160,7 @@ <h3>Configuring the FeatureBuilder<a class="headerlink" href="#configuring-the-f
<span class="n">output_file_path_conv_level</span> <span class="o">=</span> <span class="s2">&quot;./jury_output_conversation_level.csv&quot;</span><span class="p">,</span>
<span class="n">turns</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>
<span class="n">jury_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">(</span><span class="n">col</span><span class="o">=</span><span class="s2">&quot;message&quot;</span><span class="p">)</span>
<span class="n">jury_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">()</span>
</pre></div>
</div>
<section id="basic-input-columns">
11 changes: 4 additions & 7 deletions docs/build/html/feature_builder.html
@@ -174,22 +174,19 @@

<dl class="py method">
<dt class="sig sig-object py" id="feature_builder.FeatureBuilder.featurize">
<span class="sig-name descname"><span class="pre">featurize</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">col</span></span><span class="p"><span class="pre">:</span></span><span class="w"> </span><span class="n"><span class="pre">str</span></span><span class="w"> </span><span class="o"><span class="pre">=</span></span><span class="w"> </span><span class="default_value"><span class="pre">'message'</span></span></em><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#feature_builder.FeatureBuilder.featurize" title="Link to this definition"></a></dt>
<span class="sig-name descname"><span class="pre">featurize</span></span><span class="sig-paren">(</span><span class="sig-paren">)</span> <span class="sig-return"><span class="sig-return-icon">&#x2192;</span> <span class="sig-return-typehint"><span class="pre">None</span></span></span><a class="headerlink" href="#feature_builder.FeatureBuilder.featurize" title="Link to this definition"></a></dt>
<dd><p>Main driver function for feature generation.</p>
<p>This function creates chat-level features, generates features for different
truncation percentages of the data if specified, and produces user-level and
conversation-level features. Finally, the features are saved into the
designated output files.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>col</strong> (<em>str</em><em>, </em><em>optional</em>) – Column to preprocess, defaults to “message”</p>
<dt class="field-odd">Returns<span class="colon">:</span></dt>
<dd class="field-odd"><p>None</p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dt class="field-even">Return type<span class="colon">:</span></dt>
<dd class="field-even"><p>None</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p>None</p>
</dd>
</dl>
</dd></dl>

2 changes: 1 addition & 1 deletion docs/build/html/index.html
@@ -139,7 +139,7 @@ <h2>Using the Package<a class="headerlink" href="#using-the-package" title="Link
<span class="p">)</span>

<span class="c1"># this line of code runs the FeatureBuilder on your data</span>
<span class="n">my_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">(</span><span class="n">col</span><span class="o">=</span><span class="s2">&quot;message&quot;</span><span class="p">)</span>
<span class="n">my_feature_builder</span><span class="o">.</span><span class="n">featurize</span><span class="p">()</span>
</pre></div>
</div>
<p>Use the Table of Contents below to learn more about our tool. We recommend that you begin in the “Introduction” section, then explore other sections of the documentation as they become relevant to you. We recommend reading <a class="reference internal" href="basics.html#basics"><span class="std std-ref">The Basics</span></a> for a high-level overview of the requirements and parameters, and then reading through the <a class="reference internal" href="examples.html#examples"><span class="std std-ref">Worked Example</span></a> for a detailed walkthrough and discussion of considerations.</p>
2 changes: 1 addition & 1 deletion docs/build/html/searchindex.js

Large diffs are not rendered by default.

13 changes: 7 additions & 6 deletions docs/build/html/utils/check_embeddings.html
@@ -132,14 +132,15 @@

<dl class="py function">
<dt class="sig sig-object py" id="utils.check_embeddings.generate_bert">
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">generate_bert</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">chat_data</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_path</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">message_col</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.generate_bert" title="Link to this definition"></a></dt>
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">generate_bert</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">chat_data</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">output_path</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">message_col</span></span></em>, <em class="sig-param"><span class="n"><span class="pre">batch_size</span></span><span class="o"><span class="pre">=</span></span><span class="default_value"><span class="pre">64</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.generate_bert" title="Link to this definition"></a></dt>
<dd><p>Generates RoBERTa sentiment scores for the given chat data and saves them to a CSV file.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><ul class="simple">
<li><p><strong>chat_data</strong> (<em>pd.DataFrame</em>) – Contains message data to be analyzed for sentiments.</p></li>
<li><p><strong>output_path</strong> (<em>str</em>) – Path to save the CSV file containing sentiment scores.</p></li>
<li><p><strong>message_col</strong> (<em>str</em><em>, </em><em>optional</em>) – A string representing the column name that should be selected as the message. Defaults to “message”.</p></li>
<li><p><strong>batch_size</strong> (<em>int</em>) – The size of each batch for processing sentiment analysis. Defaults to 64.</p></li>
</ul>
</dd>
<dt class="field-even">Raises<span class="colon">:</span></dt>
@@ -224,17 +225,17 @@

<dl class="py function">
<dt class="sig sig-object py" id="utils.check_embeddings.get_sentiment">
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">get_sentiment</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">text</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.get_sentiment" title="Link to this definition"></a></dt>
<dd><p>Analyzes the sentiment of the given text using a BERT model and returns the scores for positive, negative, and neutral sentiments.</p>
<span class="sig-prename descclassname"><span class="pre">utils.check_embeddings.</span></span><span class="sig-name descname"><span class="pre">get_sentiment</span></span><span class="sig-paren">(</span><em class="sig-param"><span class="n"><span class="pre">texts</span></span></em><span class="sig-paren">)</span><a class="headerlink" href="#utils.check_embeddings.get_sentiment" title="Link to this definition"></a></dt>
<dd><p>Analyzes the sentiment of the given list of texts using a BERT model and returns a DataFrame with scores for positive, negative, and neutral sentiments.</p>
<dl class="field-list simple">
<dt class="field-odd">Parameters<span class="colon">:</span></dt>
<dd class="field-odd"><p><strong>text</strong> (<em>str</em><em> or </em><em>None</em>) – The input text to analyze.</p>
<dd class="field-odd"><p><strong>texts</strong> (<em>list</em><em> of </em><em>str</em>) – The list of input texts to analyze.</p>
</dd>
<dt class="field-even">Returns<span class="colon">:</span></dt>
<dd class="field-even"><p>A dictionary with sentiment scores.</p>
<dd class="field-even"><p>A DataFrame with sentiment scores.</p>
</dd>
<dt class="field-odd">Return type<span class="colon">:</span></dt>
<dd class="field-odd"><p>dict</p>
<dd class="field-odd"><p>pd.DataFrame</p>
</dd>
</dl>
</dd></dl>
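The new list-based `get_sentiment` API documented above can be sketched as follows. This is a sketch under assumptions: the real function runs a BERT-family sentiment model over the texts, while here each text receives placeholder scores so that only the shape of the input and output is illustrated.

```python
import pandas as pd

def get_sentiment(texts):
    """Sketch of the batched API: takes a list of strings and returns a
    DataFrame with one row per text and positive/negative/neutral columns.
    The real implementation computes these scores with a sentiment model;
    the values below are placeholders."""
    rows = [{"positive": 0.0, "negative": 0.0, "neutral": 1.0} for _ in texts]
    return pd.DataFrame(rows)
```

Returning a DataFrame (rather than a per-text dict, as in the previous API) lets the caller concatenate sentiment scores directly onto the chat data.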
8 changes: 7 additions & 1 deletion docs/source/basics.rst
@@ -49,6 +49,10 @@ Package Assumptions

8. **Vector Data Cache**: Your data's vector data will be cached in **vector_directory**. This directory will be created if it doesn’t exist, but its contents should be reserved for cached vector files.

* Note: v0.1.3 and earlier compute vectors using _preprocessed_ text by default, which drops capitalization and punctuation. However, this can affect the interpretation of sentiment vectors; for example, "Hello!" has more positive sentiment than "hello." Consequently, from v0.1.4 onwards, we compute vectors using the raw input text, including punctuation and capitalization. To restore the previous behavior, please set **compute_vectors_from_preprocessed** to True.

* Additionally, we assume that empty messages are equivalent to "NaN vector," defined `here <https://raw.githubusercontent.com/Watts-Lab/team_comm_tools/refs/heads/main/src/team_comm_tools/features/assets/nan_vector.txt>`_.

9. **Output Files**: We generate three outputs: **output_file_path_chat_level** (Utterance- or Chat-Level Features), **output_file_path_user_level** (Speaker- or User-Level Features), and **output_file_path_conv_level** (Conversation-Level Features).

* This should be a *path*, not just a filename. For example, "./my_file.csv", not just "my_file.csv."
@@ -79,4 +83,6 @@ Here are some parameters that can be customized. For more details, refer to the

4. **ner_training_df** and **ner_cutoff**: Measure the number of named entities in each utterance (see :ref:`named_entity_recognition`).

5. **regenerate_vectors**: Force-regenerate vector data even if it already exists.
5. **regenerate_vectors**: Force-regenerate vector data even if it already exists.

6. **compute_vectors_from_preprocessed**: Computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v0.1.3 and earlier, but we now default to computing metrics on the unpreprocessed text (which INCLUDES capitalization and punctuation), and this parameter now defaults to False.
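The column-selection logic this parameter controls can be sketched as below. The function name `vector_column` is illustrative; the logic mirrors the `vector_colname` assignment in the repository's `feature_builder.py`.

```python
def vector_column(message_col, compute_vectors_from_preprocessed=False):
    """Sketch of how the FeatureBuilder picks which text column to embed:
    the preprocessed column (lowercased, punctuation stripped) when the flag
    is True, otherwise the retained copy of the original text."""
    if compute_vectors_from_preprocessed:
        return message_col            # this column is later preprocessed in place
    return message_col + "_original"  # copy that keeps capitalization and punctuation
```

With the default `False`, vectors are built from the `"<message_col>_original"` column, preserving cues like "Hello!" versus "hello."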
2 changes: 1 addition & 1 deletion docs/source/examples.rst
@@ -90,7 +90,7 @@ Now we are ready to call the FeatureBuilder on our data. All we need to do is de
output_file_path_conv_level = "./jury_output_conversation_level.csv",
turns = True
)
jury_feature_builder.featurize(col="message")
jury_feature_builder.featurize()
Basic Input Columns
^^^^^^^^^^^^^^^^^^^^
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -76,7 +76,7 @@ Once you import the tool, you will be able to declare a FeatureBuilder object, w
)
# this line of code runs the FeatureBuilder on your data
my_feature_builder.featurize(col="message")
my_feature_builder.featurize()
Use the Table of Contents below to learn more about our tool. We recommend that you begin in the "Introduction" section, then explore other sections of the documentation as they become relevant to you. We recommend reading :ref:`basics` for a high-level overview of the requirements and parameters, and then reading through the :ref:`examples` for a detailed walkthrough and discussion of considerations.

26 changes: 18 additions & 8 deletions examples/featurize.py
@@ -45,9 +45,14 @@
output_file_path_chat_level = "./jury_TINY_output_chat_level.csv",
output_file_path_user_level = "./jury_TINY_output_user_level.csv",
output_file_path_conv_level = "./jury_TINY_output_conversation_level.csv",
turns = False
turns = False,
custom_features = [
"(BERT) Mimicry",
"Moving Mimicry",
"Forward Flow",
"Discursive Diversity"]
)
tiny_juries_feature_builder.featurize(col="message")
tiny_juries_feature_builder.featurize()

# Tiny multi-task
tiny_multi_task_feature_builder = FeatureBuilder(
Expand All @@ -59,21 +64,26 @@
output_file_path_conv_level = "./multi_task_TINY_output_conversation_level_stageId_cumulative.csv",
turns = False
)
tiny_multi_task_feature_builder.featurize(col="message")
tiny_multi_task_feature_builder.featurize()

# FULL DATASETS BELOW ------------------------------------- #

# Juries
# jury_feature_builder = FeatureBuilder(
# input_df = juries_df,
# grouping_keys = ["batch_num", "round_num"],
# grouping_keys = ["batch_num", "round_num"],
# vector_directory = "./vector_data/",
# output_file_path_chat_level = "./jury_output_chat_level.csv",
# output_file_path_user_level = "./jury_output_user_level.csv",
# output_file_path_conv_level = "./jury_output_conversation_level.csv",
# turns = True
# turns = True,
# custom_features = [
# "(BERT) Mimicry",
# "Moving Mimicry",
# "Forward Flow",
# "Discursive Diversity"]
# )
# jury_feature_builder.featurize(col="message")
# jury_feature_builder.featurize()

# # CSOP (Abdullah)
# csop_feature_builder = FeatureBuilder(
@@ -84,7 +94,7 @@
# output_file_path_conv_level = "./csop_output_conversation_level.csv",
# turns = True
# )
# csop_feature_builder.featurize(col="message")
# csop_feature_builder.featurize()


# # CSOP II (Nak Won Rim)
Expand All @@ -96,4 +106,4 @@
# output_file_path_conv_level = "./csopII_output_conversation_level.csv",
# turns = True
# )
# csopII_feature_builder.featurize(col="message")
# csopII_feature_builder.featurize()
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "team_comm_tools"
version = "0.1.3"
version = "0.1.4"
requires-python = ">= 3.10"
dependencies = [
"chardet>=3.0.4",
@@ -36,6 +36,7 @@ dependencies = [
"torchaudio==2.4.1",
"torchvision==0.19.1",
"transformers==4.44.0",
"tqdm>=4.66.5",
"tzdata>=2023.3",
"tzlocal==5.2"
]
1 change: 1 addition & 0 deletions requirements.txt
@@ -25,5 +25,6 @@ torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
transformers==4.44.0
tqdm>=4.66.5
tzdata>=2023.3
tzlocal==5.2
23 changes: 15 additions & 8 deletions src/team_comm_tools/feature_builder.py
@@ -85,6 +85,9 @@ class FeatureBuilder:
:param regenerate_vectors: If true, will regenerate vector data even if it already exists. Defaults to False.
:type regenerate_vectors: bool, optional
:param compute_vectors_from_preprocessed: If true, computes vectors using preprocessed text (that is, with capitalization and punctuation removed). This was the default behavior for v0.1.3 and earlier, but we now default to computing metrics on the unpreprocessed text (which INCLUDES capitalization and punctuation). Defaults to False.
:type compute_vectors_from_preprocessed: bool, optional
:return: The FeatureBuilder doesn't return anything; instead, it writes the generated features to files in the specified paths. It will also print out its progress, so you should see "All Done!" in the terminal, which will indicate that the features have been generated.
:rtype: None
Expand All @@ -108,14 +111,16 @@ def __init__(
within_task = False,
ner_training_df: pd.DataFrame = None,
ner_cutoff: int = 0.9,
regenerate_vectors: bool = False
regenerate_vectors: bool = False,
compute_vectors_from_preprocessed: bool = False
) -> None:

# Defining input and output paths.
self.chat_data = input_df.copy()
self.orig_data = input_df.copy()
self.ner_training = ner_training_df
self.vector_directory = vector_directory

print("Initializing Featurization...")
self.output_file_path_conv_level = output_file_path_conv_level
self.output_file_path_user_level = output_file_path_user_level
@@ -218,6 +223,11 @@ def __init__(
self.ner_cutoff = ner_cutoff
self.regenerate_vectors = regenerate_vectors

if(compute_vectors_from_preprocessed == True):
self.vector_colname = self.message_col # because the message col will eventually get preprocessed
else:
self.vector_colname = self.message_col + "_original" # because this contains the original message

# check grouping rules
if self.conversation_id_col not in self.chat_data.columns and len(self.grouping_keys)==0:
if(self.conversation_id_col == "conversation_num"):
Expand All @@ -240,7 +250,7 @@ def __init__(

# set new identifier column for cumulative grouping.
if self.cumulative_grouping and len(grouping_keys) == 3:
print("NOTE: User has requested cumulative grouping. Auto-generating the key `conversation_num` as the conversation identifier for cumulative convrersations.")
print("NOTE: User has requested cumulative grouping. Auto-generating the key `conversation_num` as the conversation identifier for cumulative conversations.")
self.conversation_id_col = "conversation_num"

# Input columns are the columns that come in the raw chat data
@@ -338,7 +348,7 @@ def __init__(
if(not need_sentiment and feature_dict[feature]["bert_sentiment_data"]):
need_sentiment = True

check_embeddings(self.chat_data, self.vect_path, self.bert_path, need_sentence, need_sentiment, self.regenerate_vectors, self.message_col)
check_embeddings(self.chat_data, self.vect_path, self.bert_path, need_sentence, need_sentiment, self.regenerate_vectors, message_col = self.vector_colname)

if(need_sentence):
self.vect_data = pd.read_csv(self.vect_path, encoding='mac_roman')
@@ -401,7 +411,7 @@ def merge_conv_data_with_original(self) -> None:
if {'index'}.issubset(self.conv_data.columns):
self.conv_data = self.conv_data.drop(columns=['index'])

def featurize(self, col: str="message") -> None:
def featurize(self) -> None:
"""
Main driver function for feature generation.
Expand All @@ -410,9 +420,6 @@ def featurize(self, col: str="message") -> None:
conversation-level features. Finally, the features are saved into the
designated output files.
:param col: Column to preprocess, defaults to "message"
:type col: str, optional
:return: None
:rtype: None
"""
@@ -494,7 +501,7 @@ def preprocess_chat_data(self) -> None:
# create new column that retains punctuation
self.chat_data["message_lower_with_punc"] = self.chat_data[self.message_col].astype(str).apply(preprocess_text_lowercase_but_retain_punctuation)

# Preprocessing the text in `col` and then overwriting the column `col`.
# Preprocessing the text in `message_col` and then overwriting the column `message_col`.
# TODO: We should probably use classes to abstract preprocessing module as well?
self.chat_data[self.message_col] = self.chat_data[self.message_col].astype(str).apply(preprocess_text)

