Commit
Priya/docs 2 (#269)
* added docstrings for politeness_v2.py

* added docstrings for politeness_v2_helper.py

* added docstrings for reddit_tags.py

* added docstrings for word_mimicry.py

* Update index.rst

politeness_v2, politeness_v2_helper, reddit_tags, word_mimicry

* documentation

* index updated

* textblob polarity and subjectivity

* textblob polarity and subjectivity

* added feature names in .rst files

* proportion of first person pronouns

* hedges

* dale chall score

* time difference

* positivity z scores

* positivity z scores

* politeness strategies - Convokit

* replaced TEMPLATE with feature name for conceptual features

* implemented suggestions

* implemented suggestions

* reset version number

---------

Co-authored-by: Xinlan Emily Hu <xehu@cs.stanford.edu>
Co-authored-by: sundy1994 <yuxuanzh@seas.upenn.edu>
3 people authored Aug 8, 2024
1 parent 83c33d3 commit 800abed
Showing 15 changed files with 619 additions and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/features_conceptual/TEMPLATE.rst
@@ -1,4 +1,4 @@
.. _TEMPLATE:
.. _TEMPLATE:

FEATURE NAME
============
51 changes: 51 additions & 0 deletions docs/source/features_conceptual/dale_chall_score.rst
@@ -0,0 +1,51 @@
.. _dale_chall_score:

Dale-Chall Score
================

High-Level Intuition
*********************
A score that provides a numeric gauge of the comprehension difficulty readers encounter when reading a text.

Citation
*********
`Cao, et al. (2020) <https://dl.acm.org/doi/pdf/10.1145/3432929?casa_token=B5WlyazkwNIAAAAA:E-1nT55uQnGslAHCfO21sdeaXfaefJsT5ZpU2hq49eagiYaGSGpohlmTyUn4NslWtNOZuAl3XvcFXQ>`_

Implementation Basics
**********************

The Dale–Chall readability formula is a readability test that provides a numeric gauge of the comprehension difficulty that readers encounter when reading a text. It uses a list of 3,000 words that groups of fourth-grade American students could reliably understand, and considers any word not on that list to be difficult.

The formula for calculating the raw score of the Dale–Chall readability score (1948) is given below:

0.1579 × (difficult words / total words × 100) + 0.0496 × (total words / total sentences)
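
As an illustration, here is a minimal Python sketch of the raw-score formula above; the counting of "difficult words" and the tokenization into words and sentences are simplified assumptions, not the toolkit's actual implementation:

.. code-block:: python

    def dale_chall_raw_score(n_difficult: int, n_words: int, n_sentences: int) -> float:
        """Raw Dale-Chall readability score, per the 1948 formula."""
        percent_difficult = n_difficult / n_words * 100
        return 0.1579 * percent_difficult + 0.0496 * (n_words / n_sentences)

    # Example: 10 difficult words out of 100 words spread over 5 sentences
    print(dale_chall_raw_score(10, 100, 5))  # 0.1579 * 10 + 0.0496 * 20 = 2.571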

Scores range from 0 to 10; details can be found below:

`Dale Chall Score <https://en.wikipedia.org/wiki/Dale%E2%80%93Chall_readability_formula>`_

Credits: Wikipedia

Implementation Notes/Caveats
*****************************
NA

Interpreting the Feature
*************************

Scores range from 0 to 10, and can be interpreted as:

========  ============================================================
Score     Notes
========  ============================================================
≤ 4.9     easily understood by an average 4th-grade student or lower
5.0–5.9   easily understood by an average 5th- or 6th-grade student
6.0–6.9   easily understood by an average 7th- or 8th-grade student
7.0–7.9   easily understood by an average 9th- or 10th-grade student
8.0–8.9   easily understood by an average 11th- or 12th-grade student
9.0–9.9   easily understood by an average college student
========  ============================================================

Related Features
*****************
NA
29 changes: 29 additions & 0 deletions docs/source/features_conceptual/hedge.rst
@@ -0,0 +1,29 @@
.. _hedge:

Hedge
============

High-Level Intuition
*********************
Captures whether a speaker appears to “hedge” their statement and express lack of certainty.

Citation
*********
`Ranganath et al. (2013) <https://web.stanford.edu/~jurafsky/pubs/ranganath2013.pdf>`_

Implementation Basics
**********************
A score of 1 is assigned if hedge phrases ("I think," "a little," "maybe," "possibly") are present, and a score of 0 is assigned otherwise.
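
A minimal sketch of this bag-of-words check, assuming a small illustrative subset of the module's phrase list:

.. code-block:: python

    HEDGE_PHRASES = ["i think", "a little", "maybe", "possibly"]  # illustrative subset

    def hedge(text: str) -> int:
        """Return 1 if any hedge phrase appears in the message, else 0."""
        lowered = text.lower()
        return int(any(phrase in lowered for phrase in HEDGE_PHRASES))

    print(hedge("Maybe we should wait."))  # 1
    print(hedge("We should wait."))        # 0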

Implementation Notes/Caveats
*****************************
This is a bag-of-words feature, which is a naive approach to detecting hedges.

Interpreting the Feature
*************************
A score of 1 indicates that at least one hedge phrase is present in the message; a score of 0 indicates that none are.


Related Features
*****************
Politeness Strategies, which also measure hedges
12 changes: 12 additions & 0 deletions docs/source/features_conceptual/index.rst
@@ -14,7 +14,19 @@ Utterance- (Chat) Level Features
:maxdepth: 1

named_entity_recognition
information_exchange
message_length
message_quantity
online_discussion_tags
word_ttr
textblob_polarity
textblob_subjectivity
proportion_of_first_person_pronouns
hedge
dale_chall_score
time_difference
positivity_z_score
politeness_strategies

Conversation-Level Features
****************************
66 changes: 66 additions & 0 deletions docs/source/features_conceptual/information_exchange.rst
@@ -0,0 +1,66 @@
.. _information_exchange:

Information Exchange
====================

High-Level Intuition
*********************
Actual "information" exchanged, i.e. word count minus first-person singular pronouns,z-scored at both the chat and conversation levels.

Citation
*********
`Tausczik and Pennebaker (2013), Improving Teamwork Using Real-Time Language Feedback <https://www.cs.cmu.edu/~ylataus/files/TausczikPennebaker2013.pdf>`_

Implementation Basics
**********************
Word count minus first-person singular pronouns is taken as the measure of information exchange, then converted to z-scores (a minimal sketch follows the steps below).

1. Word count minus first-person singular pronouns --> "info_exchange_wordcount"
2. Compute the z-score at the chat level: compute z-score for each message across all conversations --> "zscore_chats"
3. Compute the z-score at the conversation level: group by batch and round, then compute the z-score for each conversation --> "zscore_conversation"
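
A minimal sketch of steps 1 and 2, assuming simple whitespace tokenization, a small pronoun list, and a pandas dataframe of chats; the toolkit's actual tokenization and grouping columns may differ:

.. code-block:: python

    import re

    import pandas as pd

    FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}

    def info_exchange_wordcount(text: str) -> int:
        """Word count minus first-person singular pronouns."""
        words = re.sub(r"[^\w\s]", "", text.lower()).split()
        return len(words) - sum(w in FIRST_PERSON_SINGULAR for w in words)

    df = pd.DataFrame({"message": [
        "I went to the store.",
        "Bought some groceries for dinner.",
        "It's raining today.",
    ]})
    df["info_exchange_wordcount"] = df["message"].apply(info_exchange_wordcount)

    # z-score across all chats (population standard deviation, ddof=0)
    col = df["info_exchange_wordcount"]
    df["zscore_chats"] = (col - col.mean()) / col.std(ddof=0)
    print(df)  # word counts 4, 5, 3 -> z-scores 0, 1.22, -1.22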

Implementation Notes/Caveats
*****************************
1. Personal opinion plays an important role in on-task communication, and this feature specifically excludes first-person pronouns
that might be indicative of personal opinions; this might not be ideal in all cases.
2. This method does not capture the quality of the information itself, because it relies solely on quantity. A person might say many words without any of them being meaningful to the topic.


Interpreting the Feature
*************************

For this example, we assume the dataset consists of a single conversation.

Example:
Messages in a conversation:
1. "I went to the store."
- info_exchange_wordcount: 4 (5 words minus 1 first person pronoun "I")

2. "Bought some groceries for dinner."
- info_exchange_wordcount: 5 (5 words minus 0 first person pronouns)

3. "It's raining today."
- info_exchange_wordcount: 3 (3 words minus 0 first person pronouns)

Mean = 4, Standard deviation ≈ 0.82

z-scores:
Read more about z-scores here: https://www.statology.org/z-score-python/

- **zscore_chats**:
- Message 1: 0
- Message 2: 1.22
- Message 3: -1.22

- **zscore_conversation**: Same values as above since it's one conversation.

Interpretation:
- zscore_chats:
- 0: Average information exchange.
- 1.22: Higher-than-average.
- -1.22: Lower-than-average.

Related Features
*****************

Generally strongly correlated with message length, especially when first-person pronoun use is low.
35 changes: 35 additions & 0 deletions docs/source/features_conceptual/message_length.rst
@@ -0,0 +1,35 @@
.. _message_length:

Message Length
==============

High-Level Intuition
*********************
Returns the number of words in a message/utterance

Citation
*********
NA

Implementation Basics
**********************

Returns the number of words in a message/utterance by splitting on whitespace, after preprocessing to remove punctuation.
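
A minimal sketch, assuming punctuation is stripped with a simple regular expression before splitting; the toolkit's actual preprocessing may differ:

.. code-block:: python

    import re

    def message_length(text: str) -> int:
        """Count words after removing punctuation and splitting on whitespace."""
        cleaned = re.sub(r"[^\w\s]", "", text)
        return len(cleaned.split())

    print(message_length("Hi"))                               # 1
    print(message_length("Hello, How are you doing today?"))  # 6
    print(message_length("?????"))                            # 0 (see caveat below)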

Implementation Notes/Caveats
*****************************
This feature does not recognize successive punctuation marks as words.
For example, for "?????", the message length will be 0.

Interpreting the Feature
*************************

Analyzing word count can help in understanding the nature of the interaction: whether it is more casual and quick-paced, or detailed and thorough.
Longer messages may indicate more detailed explanations, more extensive engagement, or more complex topics being discussed.
Conversely, shorter messages might be more direct, concise, or reflect quick interactions.

For example, a curt "Hi" has a message length of 1, whereas a more detailed "Hello, How are you doing today?" has a message length of 6.

Related Features
*****************
NA
33 changes: 33 additions & 0 deletions docs/source/features_conceptual/message_quantity.rst
@@ -0,0 +1,33 @@
.. _message_quantity:

Message Quantity
================

High-Level Intuition
*********************
This function by itself is trivial; by definition, each message counts as 1. However, at the conversation level, we use this function to count the total number of messages/utterances via aggregation.

Citation
*********
NA

Implementation Basics
**********************

This function is trivial; by definition, each message counts as 1. However, at the conversation level, we use this function to count the total number of messages/utterances via aggregation.

Implementation Notes/Caveats
*****************************
This feature becomes relevant at the conversation level, but is trivial at the chat level.

Interpreting the Feature
*************************

This feature provides a measure of the conversation's length and activity.
A higher count indicates a more extensive interaction, while a lower count may suggest a brief one.
It is important to check this feature when comparing different conversations, as the number of utterances can be a confounder and affect the outcomes of the conversation.


Related Features
*****************
NA
53 changes: 53 additions & 0 deletions docs/source/features_conceptual/online_discussions_tags.rst
@@ -0,0 +1,53 @@
.. _online_discussion_tag:

Online Discussion Tags
======================

High-Level Intuition
*********************
This feature computes metrics specific to communication in an online setting, such as counts of capitalized words, hyperlinks, and quotes, among others noted below.

Citation
*********
NA

Implementation Basics
**********************

Calculates a number of metrics specific to communication in an online setting (a minimal sketch for two of these follows the list):

1. Num All Caps: Number of words that are in all caps
2. Num Links: Number of links to external resources
3. Num Reddit Users: Number of usernames referred to, in u/RedditUser format
4. Num Emphasis: The number of times someone used **emphasis** in their message
5. Num Bullet Points: The number of bullet points used in a message
6. Num Line Breaks: The number of line breaks in a message
7. Num Quotes: The number of "quotes" in a message
8. Num Block Quotes Responses: The number of times someone uses a block quote (">"), indicating a longer quotation
9. Num Ellipses: The number of times someone uses ellipses (...) in their message
10. Num Parentheses: The number of sets of fully closed parenthetical statements in a message
11. Num Emoji: The number of emoticons in a message, e.g., ":)"
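
To make the detection approach concrete, here is a minimal sketch of the first two metrics, using illustrative regular expressions that approximate, but are not, the toolkit's actual patterns:

.. code-block:: python

    import re

    def num_all_caps(text: str) -> int:
        """Count words in all caps (two or more letters, to skip a bare 'I')."""
        return len(re.findall(r"\b[A-Z]{2,}\b", text))

    def num_links(text: str) -> int:
        """Count hyperlinks to external resources."""
        return len(re.findall(r"https?://\S+|www\.\S+", text))

    print(num_all_caps("This is a sentence with SEVERAL words in ALL CAPS."))  # 3
    print(num_links("See https://example.com and www.example.org"))           # 2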

Implementation Notes/Caveats
*****************************
1. This feature should be run on text that has not been preprocessed to remove punctuation or hyperlinks, and before conversion to lowercase.
2. The "Reddit Users" feature might not be informative in non-Reddit contexts.

Interpreting the Feature
*************************
Note: These are a few examples for illustration. This is not a comprehensive list.

1. Num All Caps:
Example: This is a sentence with SEVERAL words in ALL CAPS.

Interpretation: This can be used to understand the number of emphasized words, which are often associated with high arousal.

7. Num Quotes:
Example: Oh, yet another of the "amazing" meetings where we discuss the same thing for hours!

Interpretation: The quotation marks here can be read as marking a sarcastic comment.


Related Features
*****************
NA
85 changes: 85 additions & 0 deletions docs/source/features_conceptual/politeness_receptiveness_markers.rst
@@ -0,0 +1,85 @@
.. _politeness_receptiveness_markers:

Politeness Receptiveness Markers
================================

High-Level Intuition
*********************
A collection of conversational markers that indicate the use of politeness/receptiveness.

Citation
*********
`Yeomans et al. (2020) <https://www.mikeyeomans.info/papers/receptiveness.pdf>`_
`SECR Module (For computing features from Yeomans et al., 2020) <https://github.com/bbevis/SECR/tree/main>`_

Implementation Basics
**********************

We follow a framework very similar to the SECR Module to compute 39 politeness features for each chat in a conversation. The chats are first preprocessed in the following ways (a minimal sketch of these steps follows the list):

1. Convert all words to lowercase
2. Remove/expand contractions (e.g., don't to do not; can't to cannot; let's to let us)
3. Ensure all characters are legal traditional A–Z alphabet letters by using corresponding regular expressions
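
A minimal sketch of these preprocessing steps, with a small illustrative contraction map; the module's actual mapping is more extensive:

.. code-block:: python

    import re

    CONTRACTIONS = {"don't": "do not", "can't": "cannot", "let's": "let us"}  # illustrative subset

    def preprocess(text: str) -> str:
        text = text.lower()
        for contraction, expanded in CONTRACTIONS.items():
            text = text.replace(contraction, expanded)
        # keep only traditional a-z letters and whitespace
        return re.sub(r"[^a-z\s]", "", text)

    print(preprocess("Don't worry, let's go!"))  # "do not worry let us go"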

We then calculate the general categories of features in different ways, following a structure similar to the SECR module (a sketch of the bag-of-words step appears after this list).

1. count_matches and Adverb_Limiter: calculate features using a standard bag-of-words approach, detecting the number of keywords from a pre-specified list stored in keywords.py.
2. get_dep_pairs/get_dep_pairs_noneg: use spaCy to get dependency pairs for relevant words, using token.dep_ to differentiate negated forms.
3. Question: question-related features are computed by counting the number of question words in a chat.
4. word_start: detect certain conjunctions/affirmation words using a pre-specified dictionary.

The corresponding counts are then returned, concatenated to the original dataframe.
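
As a rough illustration of the count_matches-style bag-of-words step, here is a sketch with a hypothetical, heavily abbreviated keyword dictionary standing in for keywords.py:

.. code-block:: python

    import pandas as pd

    # hypothetical, heavily abbreviated stand-in for the lists in keywords.py
    KEYWORDS = {
        "Hedges": ["maybe", "perhaps", "i think"],
        "Gratitude": ["thanks", "grateful", "appreciate"],
    }

    def count_matches(text: str) -> dict:
        """Count keyword occurrences per category in a single chat."""
        lowered = text.lower()
        return {name: sum(lowered.count(kw) for kw in kws)
                for name, kws in KEYWORDS.items()}

    chats = pd.DataFrame({"message": [
        "Thanks, maybe we can try that.",
        "I think we should stop.",
    ]})
    counts = chats["message"].apply(count_matches).apply(pd.Series)
    result = pd.concat([chats, counts], axis=1)  # counts concatenated to the original dataframe
    print(result)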


Implementation Notes/Caveats
*****************************
NA

Interpreting the Feature
*************************

The SECR module contains the following 39 features.

Impersonal_Pronoun
First_Person_Single
Hedges
Negation
Subjectivity
Negative_Emotion
Reasoning
Agreement
Second_Person
Adverb_Limiter
Disagreement
Acknowledgement
First_Person_Plural
For_Me
WH_Questions
YesNo_Questions
Bare_Command
Truth_Intensifier
Apology
Ask_Agency
By_The_Way
Can_You
Conjunction_Start
Could_You
Filler_Pause
For_You
Formal_Title
Give_Agency
Affirmation
Gratitude
Hello
Informal_Title
Let_Me_Know
Swearing
Reassurance
Please
Positive_Emotion
Goodbye
Token_count

Related Features
*****************
Politeness Strategies