* added docstrings for politeness_v2.py
* added docstrings for politeness_v2_helper.py
* added docstrings for reddit_tags.py
* added docstrings for word_mimicry.py
* Update index.rst: politeness_v2, politeness_v2_helper, reddit_tags, word_mimicry
* documentation
* index updated
* textblob polarity and subjectivity
* textblob polarity and subjectivity
* added feature names in .rst files
* proportion of first person pronouns
* hedges
* dale chall score
* time difference
* positivity z scores
* positivity z scores
* politeness strategies - Convokit
* replaced TEMPLATE with feature name for conceptual features
* implemented suggestions
* implemented suggestions
* reset version number

---------

Co-authored-by: Xinlan Emily Hu <xehu@cs.stanford.edu>
Co-authored-by: sundy1994 <yuxuanzh@seas.upenn.edu>
1 parent 83c33d3 · commit 800abed
Showing 15 changed files with 619 additions and 1 deletion.
@@ -1,4 +1,4 @@
.. _TEMPLATE:

FEATURE NAME
============
@@ -0,0 +1,51 @@
.. _dale_chall_score:

Dale-Chall Score
================

High-Level Intuition
*********************
A score that provides a numeric gauge of the comprehension difficulty that readers encounter when reading a text.

Citation
*********
`Cao, et al. (2020) <https://dl.acm.org/doi/pdf/10.1145/3432929?casa_token=B5WlyazkwNIAAAAA:E-1nT55uQnGslAHCfO21sdeaXfaefJsT5ZpU2hq49eagiYaGSGpohlmTyUn4NslWtNOZuAl3XvcFXQ>`_

Implementation Basics
**********************

The Dale-Chall readability formula is a readability test that provides a numeric gauge of the comprehension difficulty that readers encounter when reading a text. It uses a list of 3,000 words that groups of fourth-grade American students could reliably understand; any word not on that list is considered difficult.

The formula for calculating the raw Dale-Chall readability score (1948) is given below:

raw score = 0.1579 * (difficult words / total words * 100) + 0.0496 * (total words / total sentences)
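
As a minimal illustration of this formula (a sketch, not the package's implementation), the raw score can be computed from simple counts; how "difficult words" are identified against the easy-word list is left out here.

.. code-block:: python

   def dale_chall_raw_score(n_words: int, n_sentences: int, n_difficult: int) -> float:
       """Raw Dale-Chall (1948) score from word, sentence, and difficult-word counts."""
       pct_difficult = 100.0 * n_difficult / n_words
       return 0.1579 * pct_difficult + 0.0496 * (n_words / n_sentences)

   # A 100-word, 5-sentence text with 10 words outside the easy-word list:
   # 0.1579 * 10 + 0.0496 * (100 / 5) = 2.571
   print(dale_chall_raw_score(100, 5, 10))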

Scores range from 0 to 10; details can be found below:

`Dale-Chall Score <https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula>`_

Credits: Wikipedia

Implementation Notes/Caveats
*****************************
NA

Interpreting the Feature
*************************

Scores range from 0 to 10 and can be interpreted as:

============  ============================================================
Score         Notes
============  ============================================================
4.9 or below  easily understood by an average 4th-grade student or lower
5.0–5.9       easily understood by an average 5th- or 6th-grade student
6.0–6.9       easily understood by an average 7th- or 8th-grade student
7.0–7.9       easily understood by an average 9th- or 10th-grade student
8.0–8.9       easily understood by an average 11th- or 12th-grade student
9.0–9.9       easily understood by an average college student
============  ============================================================

Related Features
*****************
NA
@@ -0,0 +1,29 @@
.. _hedge:

Hedge
============

High-Level Intuition
*********************
Captures whether a speaker appears to "hedge" their statement and express a lack of certainty.

Citation
*********
`Ranganath, et al. (2013) <https://web.stanford.edu/~jurafsky/pubs/ranganath2013.pdf>`_

Implementation Basics
**********************
A score of 1 is assigned if hedge phrases ("I think," "a little," "maybe," "possibly") are present, and a score of 0 is assigned otherwise.
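
A minimal sketch of this bag-of-words check; the phrase list below is an illustrative subset, not the package's full hedge lexicon.

.. code-block:: python

   # Illustrative subset of hedge phrases; the actual lexicon is larger.
   HEDGE_PHRASES = ["i think", "a little", "maybe", "possibly"]

   def has_hedge(message: str) -> int:
       """Return 1 if any hedge phrase appears in the message, else 0."""
       text = message.lower()
       return int(any(phrase in text for phrase in HEDGE_PHRASES))

   # has_hedge("Maybe we could add a section on caveats")  -> 1
   # has_hedge("We will add a section on caveats")         -> 0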

Implementation Notes/Caveats
*****************************
This is a bag-of-words feature, which is a naive approach to detecting hedges.

Interpreting the Feature
*************************
A score of 1 is assigned if hedge phrases ("I think," "a little," "maybe," "possibly") are present, and a score of 0 is assigned otherwise.

Related Features
*****************
Politeness Strategies, which also measures hedges.
@@ -0,0 +1,66 @@
.. _information_exchange:

Information Exchange
====================

High-Level Intuition
*********************
The actual "information" exchanged, i.e., word count minus first-person singular pronouns, z-scored at both the chat and conversation levels.

Citation
*********
`Tausczik and Pennebaker (2013), Improving Teamwork Using Real-Time Language Feedback <https://www.cs.cmu.edu/~ylataus/files/TausczikPennebaker2013.pdf>`_

Implementation Basics
**********************
Word count minus first-person singular pronouns is taken as a measure of information exchange, then converted to z-scores.

1. Word count minus first-person singular pronouns --> "info_exchange_wordcount"
2. Compute the z-score at the chat level: compute the z-score for each message across all conversations --> "zscore_chats"
3. Compute the z-score at the conversation level: group by batch and round, then compute the z-score for each conversation --> "zscore_conversation"

Implementation Notes/Caveats
*****************************
1. Personal opinion plays an important part in on-task communication, and this feature specifically excludes first-person pronouns that might be indicative of personal opinions, which might not be ideal in all cases.
2. This method does not capture the quality of the information itself because it relies solely on quantity. A person might say a lot of words, none of which are meaningful to the topic.

Interpreting the Feature
*************************

Assume a dataset that consists of a single conversation with the following messages:

1. "I went to the store."
   - info_exchange_wordcount: 4 (5 words minus 1 first-person pronoun, "I")

2. "Bought some groceries for dinner."
   - info_exchange_wordcount: 5 (5 words minus 0 first-person pronouns)

3. "It's raining today."
   - info_exchange_wordcount: 3 (3 words minus 0 first-person pronouns)

Mean = 4, standard deviation ≈ 0.82

z-scores (read more about z-scores here: https://www.statology.org/z-score-python/):

- **zscore_chats**:
  - Message 1: 0
  - Message 2: 1.22
  - Message 3: -1.22

- **zscore_conversation**: same values as above, since there is only one conversation.

Interpretation:

- zscore_chats:
  - 0: average information exchange
  - 1.22: higher than average
  - -1.22: lower than average
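
A small numpy sketch reproducing the numbers in this worked example; it assumes the word and pronoun counts are already available, and the actual implementation computes zscore_conversation by additionally grouping on conversation identifiers.

.. code-block:: python

   import numpy as np

   # Per-message word counts and first-person singular pronoun counts from the example.
   word_counts = np.array([5, 5, 3])
   fps_pronouns = np.array([1, 0, 0])

   info_exchange_wordcount = word_counts - fps_pronouns     # [4, 5, 3]

   # z-score across all chats (population standard deviation, as in the example)
   mean = info_exchange_wordcount.mean()                    # 4.0
   std = info_exchange_wordcount.std()                      # ~0.82
   zscore_chats = (info_exchange_wordcount - mean) / std    # [0.0, 1.22, -1.22]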

Related Features
*****************

Generally strongly correlated with message length, especially in cases where there is not much pronoun use.
@@ -0,0 +1,35 @@
.. _message_length:

Message Length
==============

High-Level Intuition
*********************
Returns the number of words in a message/utterance.

Citation
*********
NA

Implementation Basics
**********************

Returns the number of words in a message/utterance by splitting on whitespace, after preprocessing to remove punctuation.

Implementation Notes/Caveats
*****************************
This feature does not recognize successive punctuation marks as words.
For example, for "?????", the message length will be 0.
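
A minimal sketch of this counting logic, assuming punctuation is stripped with a simple regular expression (an illustration, not necessarily the package's exact preprocessing):

.. code-block:: python

   import re

   def message_length(message: str) -> int:
       """Count words by removing punctuation and splitting on whitespace."""
       cleaned = re.sub(r"[^\w\s]", "", message)
       return len(cleaned.split())

   # message_length("Hello, How are you doing today?")  -> 6
   # message_length("?????")                            -> 0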

Interpreting the Feature
*************************

Analyzing word count can help in understanding the nature of the interaction: whether it is more casual and quick-paced or detailed and thorough.
Longer messages may indicate more detailed explanations, more extensive engagement, or more complex topics being discussed.
Conversely, shorter messages might be more direct, concise, or reflect quick interactions.

For example, a curt "Hi" has a message length of 1, whereas a more detailed "Hello, How are you doing today?" has a message length of 6.

Related Features
*****************
NA
@@ -0,0 +1,33 @@
.. _message_quantity:

Message Quantity
================

High-Level Intuition
*********************
This function by itself is trivial; by definition, each message counts as 1. However, at the conversation level, we use this function to count the total number of messages/utterances via aggregation.

Citation
*********
NA

Implementation Basics
**********************

This function is trivial; by definition, each message counts as 1. However, at the conversation level, we use this function to count the total number of messages/utterances via aggregation.
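
A minimal pandas sketch of the aggregation idea; the column names here are illustrative assumptions rather than the package's actual schema.

.. code-block:: python

   import pandas as pd

   chats = pd.DataFrame({
       "conversation_id": ["conv1", "conv1", "conv1", "conv2"],
       "message": ["Hi", "Hello!", "How are you?", "Quick question"],
   })

   chats["num_messages"] = 1  # trivially 1 per chat
   message_quantity = chats.groupby("conversation_id")["num_messages"].sum()
   # conv1 -> 3, conv2 -> 1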

Implementation Notes/Caveats
*****************************
This feature becomes relevant at the conversation level, but is trivial at the chat level.

Interpreting the Feature
*************************

This feature provides a measure of the conversation's length and activity.
A higher count indicates a more extensive interaction, while a lower count may suggest a brief one.
It is important to check this feature when comparing different conversations, as the number of utterances can be a confounder and affect the outcomes of the conversation.

Related Features
*****************
NA
docs/source/features_conceptual/online_discussions_tags.rst (53 additions, 0 deletions)
@@ -0,0 +1,53 @@
.. _online_discussion_tag:

Online Discussion Tags
======================

High-Level Intuition
*********************
This feature detects special markers specific to communication in an online setting, such as capitalized words, hyperlinks, and quotes, amongst others noted below.

Citation
*********
NA

Implementation Basics
**********************

Calculates a number of metrics specific to communication in an online setting (a sketch of two of these checks follows the list):

1. Num all caps: Number of words that are in all caps
2. Num links: Number of links to external resources
3. Num Reddit Users: Number of usernames referred to, in u/RedditUser format
4. Num Emphasis: The number of times someone used **emphasis** in their message
5. Num Bullet Points: The number of bullet points used in a message
6. Num Line Breaks: The number of line breaks in a message
7. Num Quotes: The number of "quotes" in a message
8. Num Block Quotes Responses: The number of times someone uses a block quote (">"), indicating a longer quotation
9. Num Ellipses: The number of times someone uses ellipses (...) in their message
10. Num Parentheses: The number of sets of fully closed parenthetical statements in a message
11. Num Emoji: The number of emoticons in a message, e.g., ":)"
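
As referenced above, a rough sketch of how two of these tags could be detected with regular expressions; the patterns are illustrative approximations, not the package's exact ones.

.. code-block:: python

   import re

   def num_all_caps(text: str) -> int:
       """Count words written entirely in capital letters (two or more letters)."""
       return len(re.findall(r"\b[A-Z]{2,}\b", text))

   def num_links(text: str) -> int:
       """Count links to external resources."""
       return len(re.findall(r"https?://\S+|www\.\S+", text))

   # num_all_caps("This is a sentence with SEVERAL words in ALL CAPS.")  -> 3
   # num_links("See https://example.com and www.example.org")            -> 2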

Implementation Notes/Caveats
*****************************
1. This feature should be run on text that has not been preprocessed to remove punctuation or hyperlinks, and before conversion to lowercase.
2. The "Reddit Users" feature might not be informative in non-Reddit contexts.

Interpreting the Feature
*************************
Note: These are a few examples for illustration. This is not a comprehensive list.

1. Num all caps:
   Example: This is a sentence with SEVERAL words in ALL CAPS.

   Interpretation: This can be used to understand the number of emphasized words, often associated with high arousal.

7. Num Quotes:
   Example: Oh, yet another of the "amazing" meetings where we discuss the same thing for hours!

   Interpretation: Can be interpreted as a sarcastic comment.

Related Features
*****************
NA
docs/source/features_conceptual/politeness_receptiveness_markers.rst (85 additions, 0 deletions)
@@ -0,0 +1,85 @@
.. _politeness_receptiveness_markers:

Politeness Receptiveness Markers
================================

High-Level Intuition
*********************
A collection of conversational markers that indicate the use of politeness / receptiveness.

Citation
*********
`Yeomans et al., (2020) <https://www.mikeyeomans.info/papers/receptiveness.pdf>`_

`SECR Module (for computing features from Yeomans et al., 2020) <https://github.com/bbevis/SECR/tree/main>`_

Implementation Basics
**********************

We follow a framework very similar to the SECR Module to compute 39 politeness features for each chat in a conversation. The chats are first preprocessed in the following ways:

1. Convert all words to lowercase
2. Remove/expand contractions (e.g., don't to do not; can't to cannot; let's to let us)
3. Ensure all characters are legal traditional A-Z alphabet letters by using the corresponding regular expressions

We then calculate the general categories of features in different ways, following a similar structure to the SECR module (a sketch of the first approach is shown below):

1. count_matches and Adverb_Limiter: calculate features using a standard bag-of-words approach, detecting the number of keywords from a pre-specified list stored in keywords.py.
2. get_dep_pairs/get_dep_pairs_noneg: use spaCy to get dependency pairs for relevant words, using token.dep_ to distinguish negated forms.
3. Question: question-related features are computed by counting the number of question words in a chat.
4. word_start: detect certain conjunctions/affirmation words using a pre-specified dictionary.

The corresponding counts are then returned, concatenated to the original dataframe.
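
A minimal sketch of the bag-of-words counting step (approach 1 above); the keyword lists here are small illustrative stand-ins for the lexicons stored in keywords.py.

.. code-block:: python

   # Illustrative keyword lists; the real lexicons in keywords.py are much larger.
   KEYWORDS = {
       "Hedges": ["i think", "maybe", "possibly", "sort of"],
       "Gratitude": ["thanks", "appreciate"],
   }

   def count_matches(message: str) -> dict:
       """Count keyword occurrences per category in a lowercased chat."""
       text = message.lower()
       return {
           category: sum(text.count(phrase) for phrase in phrases)
           for category, phrases in KEYWORDS.items()
       }

   # count_matches("Thanks, I think that could possibly work")
   # -> {"Hedges": 2, "Gratitude": 1}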

Implementation Notes/Caveats
*****************************
NA

Interpreting the Feature
*************************

The SECR module contains the following 39 features:

Impersonal_Pronoun
First_Person_Single
Hedges
Negation
Subjectivity
Negative_Emotion
Reasoning
Agreement
Second_Person
Adverb_Limiter
Disagreement
Acknowledgement
First_Person_Plural
For_Me
WH_Questions
YesNo_Questions
Bare_Command
Truth_Intensifier
Apology
Ask_Agency
By_The_Way
Can_You
Conjunction_Start
Could_You
Filler_Pause
For_You
Formal_Title
Give_Agency
Affirmation
Gratitude
Hello
Informal_Title
Let_Me_Know
Swearing
Reassurance
Please
Positive_Emotion
Goodbye
Token_count

Related Features
*****************
Politeness Strategies