Commit
Priya/docs 2 (#269)
* added docstrings for politeness_v2.py

* added docstrings for politeness_v2_helper.py

* added docstrings for reddit_tags.py

* added docstrings for word_mimicry.py

* Update index.rst

politeness_v2, politeness_v2_helper, reddit_tags, word_mimicry

* documentation

* index updated

* textblob polarity and subjectivity

* textblob polarity and subjectivity

* added feature names in .rst files

* proportion of first person pronouns

* hedges

* dale chall score

* time difference

* positivity z scores

* positivity z scores

* politeness strategies - Convokit

* replaced TEMPLATE with feature name for conceptual features

* implemented suggestions

* implemented suggestions

* reset version number

---------

Co-authored-by: Xinlan Emily Hu <xehu@cs.stanford.edu>
Co-authored-by: sundy1994 <yuxuanzh@seas.upenn.edu>
3 people authored Aug 8, 2024
1 parent 83c33d3 commit 800abed
Showing 15 changed files with 619 additions and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/features_conceptual/TEMPLATE.rst
@@ -1,4 +1,4 @@
.. _TEMPLATE:
.. _TEMPLATE:

FEATURE NAME
============
51 changes: 51 additions & 0 deletions docs/source/features_conceptual/dale_chall_score.rst
@@ -0,0 +1,51 @@
.. _dale_chall_score:

Dale-Chall Score
================

High-Level Intuition
*********************
A score that provides a numeric gauge of the comprehension difficulty readers encounter when reading a text.

Citation
*********
`Cao, et al. (2020) <https://dl.acm.org/doi/pdf/10.1145/3432929?casa_token=B5WlyazkwNIAAAAA:E-1nT55uQnGslAHCfO21sdeaXfaefJsT5ZpU2hq49eagiYaGSGpohlmTyUn4NslWtNOZuAl3XvcFXQ>`_

Implementation Basics
**********************

The Dale–Chall readability formula is a readability test that provides a numeric gauge of the comprehension difficulty that readers encounter when reading a text. It uses a list of 3,000 words that groups of fourth-grade American students could reliably understand, and considers any word not on that list to be difficult.

The formula for calculating the raw score of the Dale–Chall readability score (1948) is given below:

0.1579 × (difficult words / total words × 100) + 0.0496 × (total words / total sentences)
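
As an illustration, here is a minimal Python sketch of the raw-score formula above; the counting of "difficult words" and the tokenization into words and sentences are simplified assumptions, not the toolkit's actual implementation:

.. code-block:: python

    def dale_chall_raw_score(n_difficult: int, n_words: int, n_sentences: int) -> float:
        """Raw Dale-Chall readability score, per the 1948 formula."""
        percent_difficult = n_difficult / n_words * 100
        return 0.1579 * percent_difficult + 0.0496 * (n_words / n_sentences)

    # Example: 10 difficult words out of 100 words spread over 5 sentences
    print(dale_chall_raw_score(10, 100, 5))  # 0.1579 * 10 + 0.0496 * 20 = 2.571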

Scores range from 0 to 10; details can be found below:

`Dale Chall Score <https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula>`_

Credits: Wikipedia

Implementation Notes/Caveats
*****************************
NA

Interpreting the Feature
*************************

Scores range from 0 to 10, and can be interpreted as:

========  ============================================================
Score     Notes
========  ============================================================
≤ 4.9     easily understood by an average 4th-grade student or lower
5.0–5.9   easily understood by an average 5th- or 6th-grade student
6.0–6.9   easily understood by an average 7th- or 8th-grade student
7.0–7.9   easily understood by an average 9th- or 10th-grade student
8.0–8.9   easily understood by an average 11th- or 12th-grade student
9.0–9.9   easily understood by an average college student
========  ============================================================

Related Features
*****************
NA
29 changes: 29 additions & 0 deletions docs/source/features_conceptual/hedge.rst
@@ -0,0 +1,29 @@
.. _hedge:

Hedge
============

High-Level Intuition
*********************
Captures whether a speaker appears to “hedge” their statement and express lack of certainty.

Citation
*********
`Ranganath et al. (2013) <https://web.stanford.edu/~jurafsky/pubs/ranganath2013.pdf>`_

Implementation Basics
**********************
A score of 1 is assigned if hedge phrases ("I think," "a little," "maybe," "possibly") are present, and a score of 0 is assigned otherwise.
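
A minimal sketch of this bag-of-words check, assuming a small illustrative subset of the module's phrase list:

.. code-block:: python

    HEDGE_PHRASES = ["i think", "a little", "maybe", "possibly"]  # illustrative subset

    def hedge(text: str) -> int:
        """Return 1 if any hedge phrase appears in the message, else 0."""
        lowered = text.lower()
        return int(any(phrase in lowered for phrase in HEDGE_PHRASES))

    print(hedge("Maybe we should wait."))  # 1
    print(hedge("We should wait."))        # 0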

Implementation Notes/Caveats
*****************************
This is a bag-of-words feature, which is a naive approach to detecting hedges.

Interpreting the Feature
*************************
A score of 1 indicates that at least one hedge phrase is present in the message; a score of 0 indicates that none are.


Related Features
*****************
Politeness Strategies, which also measure hedges
12 changes: 12 additions & 0 deletions docs/source/features_conceptual/index.rst
@@ -14,7 +14,19 @@ Utterance- (Chat) Level Features
:maxdepth: 1

named_entity_recognition
information_exchange
message_length
message_quantity
online_discussion_tags
word_ttr
textblob_polarity
textblob_subjectivity
proportion_of_first_person_pronouns
hedge
dale_chall_score
time_difference
positivity_z_score
politeness_strategies

Conversation-Level Features
****************************
66 changes: 66 additions & 0 deletions docs/source/features_conceptual/information_exchange.rst
@@ -0,0 +1,66 @@
.. _information_exchange:

Information Exchange
====================

High-Level Intuition
*********************
Actual "information" exchanged, i.e. word count minus first-person singular pronouns,z-scored at both the chat and conversation levels.

Citation
*********
`Tausczik and Pennebaker (2013), Improving Teamwork Using Real-Time Language Feedback <https://www.cs.cmu.edu/~ylataus/files/TausczikPennebaker2013.pdf>`_

Implementation Basics
**********************
Word count minus first-person singular pronouns is taken as the measure of information exchange, then converted to z-scores (a minimal sketch follows the steps below).

1. Word count minus first-person singular pronouns --> "info_exchange_wordcount"
2. Compute the z-score at the chat level: compute z-score for each message across all conversations --> "zscore_chats"
3. Compute the z-score at the conversation level: group by batch and round, then compute the z-score for each conversation --> "zscore_conversation"
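
A minimal sketch of steps 1 and 2, assuming simple whitespace tokenization, a small pronoun list, and a pandas dataframe of chats; the toolkit's actual tokenization and grouping columns may differ:

.. code-block:: python

    import re

    import pandas as pd

    FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}

    def info_exchange_wordcount(text: str) -> int:
        """Word count minus first-person singular pronouns."""
        words = re.sub(r"[^\w\s]", "", text.lower()).split()
        return len(words) - sum(w in FIRST_PERSON_SINGULAR for w in words)

    df = pd.DataFrame({"message": [
        "I went to the store.",
        "Bought some groceries for dinner.",
        "It's raining today.",
    ]})
    df["info_exchange_wordcount"] = df["message"].apply(info_exchange_wordcount)

    # z-score across all chats (population standard deviation, ddof=0)
    col = df["info_exchange_wordcount"]
    df["zscore_chats"] = (col - col.mean()) / col.std(ddof=0)
    print(df)  # word counts 4, 5, 3 -> z-scores 0, 1.22, -1.22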

Implementation Notes/Caveats
*****************************
1. Personal opinion plays an important role in on-task communication, and this feature specifically excludes first-person pronouns
that might be indicative of personal opinions; this might not be ideal in all cases.
2. This method does not capture the quality of the information itself, because it relies solely on quantity. A person might say many words without any of them being meaningful to the topic.


Interpreting the Feature
*************************

For this example, we assume the dataset consists of a single conversation.

Example:
Messages in a conversation:
1. "I went to the store."
- info_exchange_wordcount: 4 (5 words minus 1 first person pronoun "I")

2. "Bought some groceries for dinner."
- info_exchange_wordcount: 5 (5 words minus 0 first person pronouns)

3. "It's raining today."
- info_exchange_wordcount: 3 (3 words minus 0 first person pronouns)

Mean = 4, Standard deviation ≈ 0.82

z-scores:
Read more about z-scores here: https://www.statology.org/z-score-python/

- **zscore_chats**:
- Message 1: 0
- Message 2: 1.22
- Message 3: -1.22

- **zscore_conversation**: Same values as above since it's one conversation.

Interpretation:
- zscore_chats:
- 0: Average information exchange.
- 1.22: Higher-than-average.
- -1.22: Lower-than-average.

Related Features
*****************

Generally strongly correlated with message length, especially when first-person pronoun use is low.
35 changes: 35 additions & 0 deletions docs/source/features_conceptual/message_length.rst
@@ -0,0 +1,35 @@
.. _message_length:

Message Length
==============

High-Level Intuition
*********************
Returns the number of words in a message/utterance

Citation
*********
NA

Implementation Basics
**********************

Returns the number of words in a message/utterance by splitting on whitespace, after preprocessing to remove punctuation.
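
A minimal sketch, assuming punctuation is stripped with a simple regular expression before splitting; the toolkit's actual preprocessing may differ:

.. code-block:: python

    import re

    def message_length(text: str) -> int:
        """Count words after removing punctuation and splitting on whitespace."""
        cleaned = re.sub(r"[^\w\s]", "", text)
        return len(cleaned.split())

    print(message_length("Hi"))                               # 1
    print(message_length("Hello, How are you doing today?"))  # 6
    print(message_length("?????"))                            # 0 (see caveat below)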

Implementation Notes/Caveats
*****************************
This feature does not recognize successive punctuation marks as words.
For example, for "?????", the message length will be 0.

Interpreting the Feature
*************************

Analyzing word count can help in understanding the nature of the interaction: whether it is more casual and quick-paced, or detailed and thorough.
Longer messages may indicate more detailed explanations, more extensive engagement, or more complex topics being discussed.
Conversely, shorter messages might be more direct, concise, or reflect quick interactions.

For example, a curt "Hi" has a message length of 1, whereas a more detailed "Hello, How are you doing today?" has a message length of 6.

Related Features
*****************
NA
33 changes: 33 additions & 0 deletions docs/source/features_conceptual/message_quantity.rst
@@ -0,0 +1,33 @@
.. _message_quantity:

Message Quantity
================

High-Level Intuition
*********************
This function by itself is trivial; by definition, each message counts as 1. However, at the conversation level, we use this function to count the total number of messages/utterances via aggregation.

Citation
*********
NA

Implementation Basics
**********************

This function is trivial; by definition, each message counts as 1. However, at the conversation level, we use this function to count the total number of messages/utterances via aggregation.

Implementation Notes/Caveats
*****************************
This feature becomes relevant at the conversation level, but is trivial at the chat level.

Interpreting the Feature
*************************

This feature provides a measure of the conversation's length and activity.
A higher count indicates a more extensive interaction, while a lower count may suggest a brief one.
It is important to check this feature when comparing different conversations, as the number of utterances can be a confounder and affect the outcomes of the conversation.


Related Features
*****************
NA
53 changes: 53 additions & 0 deletions docs/source/features_conceptual/online_discussions_tags.rst
@@ -0,0 +1,53 @@
.. _online_discussion_tag:

Online Discussion Tags
======================

High-Level Intuition
*********************
This feature computes metrics specific to communication in an online setting, such as counts of capitalized words, hyperlinks, and quotes, among others noted below.

Citation
*********
NA

Implementation Basics
**********************

Calculates a number of metrics specific to communication in an online setting (a minimal sketch for two of these follows the list):

1. Num All Caps: Number of words that are in all caps
2. Num Links: Number of links to external resources
3. Num Reddit Users: Number of usernames referred to, in u/RedditUser format
4. Num Emphasis: The number of times someone used **emphasis** in their message
5. Num Bullet Points: The number of bullet points used in a message
6. Num Line Breaks: The number of line breaks in a message
7. Num Quotes: The number of "quotes" in a message
8. Num Block Quotes Responses: The number of times someone uses a block quote (">"), indicating a longer quotation
9. Num Ellipses: The number of times someone uses ellipses (...) in their message
10. Num Parentheses: The number of sets of fully closed parenthetical statements in a message
11. Num Emoji: The number of emoticons in a message, e.g., ":)"
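
To make the detection approach concrete, here is a minimal sketch of the first two metrics, using illustrative regular expressions that approximate, but are not, the toolkit's actual patterns:

.. code-block:: python

    import re

    def num_all_caps(text: str) -> int:
        """Count words in all caps (two or more letters, to skip a bare 'I')."""
        return len(re.findall(r"\b[A-Z]{2,}\b", text))

    def num_links(text: str) -> int:
        """Count hyperlinks to external resources."""
        return len(re.findall(r"https?://\S+|www\.\S+", text))

    print(num_all_caps("This is a sentence with SEVERAL words in ALL CAPS."))  # 3
    print(num_links("See https://example.com and www.example.org"))           # 2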

Implementation Notes/Caveats
*****************************
1. This feature should be run on text that has not been preprocessed to remove punctuation or hyperlinks, and before conversion to lowercase.
2. The "Reddit Users" feature might not be informative in non-Reddit contexts.

Interpreting the Feature
*************************
Note: These are a few examples for illustration. This is not a comprehensive list.

1. Num All Caps:
Example: This is a sentence with SEVERAL words in ALL CAPS.

Interpretation: This can be used to understand the number of emphasized words, which are often associated with high arousal.

7. Num Quotes:
Example: Oh, yet another of the "amazing" meetings where we discuss the same thing for hours!

Interpretation: The quotation marks here can be read as marking a sarcastic comment.


Related Features
*****************
NA
85 changes: 85 additions & 0 deletions docs/source/features_conceptual/politeness_receptiveness_markers.rst
@@ -0,0 +1,85 @@
.. _politeness_receptiveness_markers:

Politeness Receptiveness Markers
================================

High-Level Intuition
*********************
A collection of conversational markers that indicate the use of politeness/receptiveness.

Citation
*********
`Yeomans et al. (2020) <https://www.mikeyeomans.info/papers/receptiveness.pdf>`_
`SECR Module (For computing features from Yeomans et al., 2020) <https://github.com/bbevis/SECR/tree/main>`_

Implementation Basics
**********************

We follow a framework very similar to the SECR Module to compute 39 politeness features for each chat in a conversation. The chats are first preprocessed in the following ways (a minimal sketch of these steps follows the list):

1. Convert all words to lowercase
2. Remove/expand contractions (e.g., don't to do not; can't to cannot; let's to let us)
3. Ensure all characters are legal traditional A–Z alphabet letters by using corresponding regular expressions
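
A minimal sketch of these preprocessing steps, with a small illustrative contraction map; the module's actual mapping is more extensive:

.. code-block:: python

    import re

    CONTRACTIONS = {"don't": "do not", "can't": "cannot", "let's": "let us"}  # illustrative subset

    def preprocess(text: str) -> str:
        text = text.lower()
        for contraction, expanded in CONTRACTIONS.items():
            text = text.replace(contraction, expanded)
        # keep only traditional a-z letters and whitespace
        return re.sub(r"[^a-z\s]", "", text)

    print(preprocess("Don't worry, let's go!"))  # "do not worry let us go"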

We then calculate the general categories of features in different ways, following a structure similar to the SECR module (a sketch of the bag-of-words step appears after this list).

1. count_matches and Adverb_Limiter: calculate features using a standard bag-of-words approach, detecting the number of keywords from a pre-specified list stored in keywords.py.
2. get_dep_pairs/get_dep_pairs_noneg: use spaCy to get dependency pairs for relevant words, using token.dep_ to differentiate negated forms.
3. Question: question-related features are computed by counting the number of question words in a chat.
4. word_start: detect certain conjunctions/affirmation words using a pre-specified dictionary.

The corresponding counts are then returned, concatenated to the original dataframe.
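
As a rough illustration of the count_matches-style bag-of-words step, here is a sketch with a hypothetical, heavily abbreviated keyword dictionary standing in for keywords.py:

.. code-block:: python

    import pandas as pd

    # hypothetical, heavily abbreviated stand-in for the lists in keywords.py
    KEYWORDS = {
        "Hedges": ["maybe", "perhaps", "i think"],
        "Gratitude": ["thanks", "grateful", "appreciate"],
    }

    def count_matches(text: str) -> dict:
        """Count keyword occurrences per category in a single chat."""
        lowered = text.lower()
        return {name: sum(lowered.count(kw) for kw in kws)
                for name, kws in KEYWORDS.items()}

    chats = pd.DataFrame({"message": [
        "Thanks, maybe we can try that.",
        "I think we should stop.",
    ]})
    counts = chats["message"].apply(count_matches).apply(pd.Series)
    result = pd.concat([chats, counts], axis=1)  # counts concatenated to the original dataframe
    print(result)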


Implementation Notes/Caveats
*****************************
NA

Interpreting the Feature
*************************

The SECR module contains the following 39 features.

Impersonal_Pronoun
First_Person_Single
Hedges
Negation
Subjectivity
Negative_Emotion
Reasoning
Agreement
Second_Person
Adverb_Limiter
Disagreement
Acknowledgement
First_Person_Plural
For_Me
WH_Questions
YesNo_Questions
Bare_Command
Truth_Intensifier
Apology
Ask_Agency
By_The_Way
Can_You
Conjunction_Start
Could_You
Filler_Pause
For_You
Formal_Title
Give_Agency
Affirmation
Gratitude
Hello
Informal_Title
Let_Me_Know
Swearing
Reassurance
Please
Positive_Emotion
Goodbye
Token_count

Related Features
*****************
Politeness Strategies