update in line with dev

Watts-Lab · Aug 15, 2024 · 10edefc
2 parents 139e7d3 + 80faa60
commit 10edefc

Showing 241 changed files with 6,755 additions and 1,040 deletions.
diff --git a/.github/workflows/github-actions-test-simple.yml b/.github/workflows/github-actions-test-simple.yml
@@ -4,9 +4,9 @@ on: [push]
 jobs:
  Testing-Features:
  runs-on: ubuntu-latest
- defaults:
- run:
- working-directory: ./src
+ # defaults:
+ #  run:
+ #  working-directory: ./src/team_comm_tools/
  steps:
  - name: Check out repository code
  uses: actions/checkout@v4
@@ -18,24 +18,20 @@ jobs:
 
  - name: Install dependencies
  run: |
- pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz --no-deps
- pip install -r requirements.txt
+ python -m pip install --upgrade pip
+ ./setup.sh
 
- - name: Additional setup
- run: |
- git submodule update --init --recursive
- python import_nltk.py
+ - name: Install package in editable mode
+ run: pip install -e .
 
  - name: Run featurizer
  run: |
- cd ..
  cd tests
  python3 run_tests.py
  python3 run_package_grouping_tests.py
 
  - name: Run tests
  run: |
- cd ..
  cd tests
  pytest test_feature_metrics.py
  pytest test_package.py

diff --git a/.gitignore b/.gitignore
@@ -1,14 +1,42 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Distribution / packaging
+.Python
+env/
+venv/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# IntelliJ
+.idea/
+
+# macOS specific files
 .DS_Store
-src/data/.DS_Store
-src/.DS_Store
-src/features/lexicons/liwc_lexicons/
-src/features/lexicons/liwc_lexicons_small_test/
-src/features/lexicons/certainty.txt
-src/modules/
-src/output/*
-src/__pycache__/
-src/features/__pycache__/
-src/ipython_notebooks/.ipynb_checkpoints/
+
+# unwanted files
+src/team_comm_tools/features/lexicons/liwc_lexicons/
+src/team_comm_tools/features/lexicons/liwc_lexicons_small_test/
+src/team_comm_tools/features/lexicons/certainty.txt
+src/team_comm_tools/modules/
+src/team_comm_tools/output/*
+src/team_comm_tools/ipython_notebooks/.ipynb_checkpoints/
 tests/ipython_notebooks/.ipynb_checkpoints/
 tests/data/vector_data/
 tests/test.log
@@ -22,4 +50,9 @@ src/data/vectors/sentiment
 src/features/lexicons/certainty.txt
 examples/vector_data/*
 examples/output/*
-node_modules/
+node_modules/
+
+# testing
+/output
+/vector_data
+test.py
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,5 @@
+recursive-include src/team_comm_tools/features/lexicons *
+include src/mypackage/features/assets/*
+include README.md
+include LICENSE
+include requirements.txt
diff --git a/README.md b/README.md
@@ -1,55 +1,115 @@
+[![Testing Features](https://github.com/Watts-Lab/team_comm_tools/workflows/Testing%20Features/badge.svg)](https://github.com/Watts-Lab/team_comm_tools/actions?query=workflow:"Testing+Features")
+[![GitHub release](https://img.shields.io/github/release/Watts-Lab/team_comm_tools?include_prereleases=&sort=semver&color=blue)](https://github.com/Watts-Lab/team_comm_tools/releases/)
+[![License](https://img.shields.io/badge/License-MIT-blue)](#license)
+
 # The Team Communication Toolkit
-The Team Communication Toolkit is a research project and Python package that aims to make it easier for social scientists to explore text-based conversational data.
+The Team Communication Toolkit is a Python package that makes it easy for social scientists to analyze and understand *text-based communication data*. Our aim is to facilitate seamless analyses of conversational data --- especially among groups and teams! --- by providing a single interface for researchers to generate and explore dozens of research-backed conversational features.
 
-## Getting Started
+We are a research project created by the [Computational Social Science Lab at UPenn](https://css.seas.upenn.edu/) and funded by the [Wharton AI and Analytics Initiative](https://ai-analytics.wharton.upenn.edu/).
 
-If you are new to this repository, welcome! Please follow the steps below to get started.
+<div align="center">
 
-### Step 1: Clone the Repo
-First, clone this repository into your local development environment: 
+[![View - Home Page](https://img.shields.io/badge/View_site-GH_Pages-2ea44f?style=for-the-badge)](https://teamcommtools.seas.upenn.edu/)
 
-```
-git clone https://github.com/Watts-Lab/team_comm_tools.git
-```
+[![View - Documentation](https://img.shields.io/badge/view-Documentation-blue?style=for-the-badge)](https://conversational-featurizer.readthedocs.io/en/latest/ "Go to project documentation")
 
-### Step 2: Download Dependencies
-Second, we *strongly* recommend using a virtual environment to install the dependencies required for the project.
-The dependencies of the project are listed in `src/requirements.txt`: https://github.com/Watts-Lab/team_comm_tools/blob/main/src/requirements.txt
+The Team Communication Toolkit is an academic project and is intended to be used for academic purposes only.
 
-**Python Version**: We recommend >= `python3.11` when running this repository.
+</div>
 
-#### Run Initial Scripts for Dependencies
-Before starting the featurizer, you need to run the following to obtain dependencies for the project:
+# Getting Started
 
+To use our tool, please ensure that you have Python >= 3.10 installed and a working version of [pip](https://pypi.org/project/pip/), which is Python's package installer. Then, in your local environment, run the following:
+
+```sh
+pip install team_comm_tools
 ```
-python3 -m spacy download en_core_web_sm
-```
-```
-import nltk
-nltk.download('nps_chat')
-nltk.download('punkt')
+
+This command will automatically install our package and all required dependencies.
+
+## Troubleshooting
+
+In the event that some dependency installations fail (for example, you may get an error that `en_core_web_sm` from Spacy is not found, or that there is a missing NLTK resource), please run this simple one-line command in your terminal, which will force the installation of Spacy and NLTK dependencies:
+
+```sh
+download_resources
 ```
 
-### Step 3: Run the Featurizer
-At this point, you should be ready to run the featurizer! Navigate to the `examples` folder, and use the following command:
+If you encounter a further issue in which the 'wordnet' package from NLTK is not found, it may be related to a known bug in NLTK in which the wordnet package does not unzip automatically. If this is the case, please follow the instructions to manually unzip it, documented in [this thread](https://github.com/nltk/nltk/issues/3028).
+
+## Import Recommendations: Virtual Environment and Pip
+
+**We strongly recommend using a virtual environment in Python to run the package.** We have several specific dependency requirements. One important one is that we are currently only compatible with numpy < 2.0.0 because [numpy 2.0.0 and above](https://numpy.org/devdocs/release/2.0.0-notes.html#changes) made significant changes that are not compatible with other dependencies of our package. As those dependencies are updated, we will support later versions of numpy.
+
+**We also strongly recommend ensuring that your version of pip is up-to-date (>=24.0).** There have been reports of users having trouble downloading dependencies (specifically, the Spacy package) with older versions of pip. If you get an error with downloading `en_core_web_sm`, we recommend updating pip.
+
 
+## Using the FeatureBuilder
+After you import the package and install dependencies, you can then use our tool in your Python script as follows:
+
+```python
+from team_comm_tools import FeatureBuilder
 ```
-python3 featurize.py
+
+*Note*: PyPI treats hyphens and underscores equally, so `pip install team_comm_tools` and `pip install team-comm-tools` are equivalent. However, Python does NOT treat them equally, and **you should use underscores when you import the package, like this: `from team_comm_tools import FeatureBuilder`**.
+
+Once you import the tool, you will be able to declare a FeatureBuilder object, which is the heart of our tool. Here is some sample syntax:
+
+```python
+# this section of code declares a FeatureBuilder object
+my_feature_builder = FeatureBuilder(
+ input_df = my_pandas_dataframe,
+ # this means there's a column in your data called 'conversation_id' that uniquely identifies a conversation
+ conversation_id_col = "conversation_id", 
+ # this means there's a column in your data called 'speaker_id' that uniquely identifies a speaker
+ speaker_id_col = "speaker_id",
+ # this means there's a column in your data called 'message' that contains the content you want to featurize
+ message_col = "message",
+ # this means there's a column in your data called 'timestamp' that contains the time associated with each message; we also accept a list of (timestamp_start, timestamp_end), in case your data is formatted in that way.
+ timestamp_col= "timestamp",
+ # this is where we'll cache things like sentence vectors; this directory doesn't have to exist; we'll create it for you!
+ vector_directory = "./vector_data/",
+ # give us names for the utterance (chat), speaker (user), and conversation-level outputs
+ output_file_path_chat_level = "./my_output_chat_level.csv", 
+ output_file_path_user_level = "./my_output_user_level.csv",
+ output_file_path_conv_level = "./my_output_conversation_level.csv",
+ # if true, this will combine successive turns by the same speaker.
+ turns = False,
+ # these features depend on sentence vectors, so they take longer to generate on larger datasets. Add them in manually if you are interested in adding them to your output!
+ custom_features = [ 
+ "(BERT) Mimicry",
+ "Moving Mimicry",
+ "Forward Flow",
+ "Discursive Diversity"
+ ],
+)
+
+# this line of code runs the FeatureBuilder on your data
+my_feature_builder.featurize(col="message")
 ```
-This calls the `featurizer.py` file, which declares a FeatureBuilder object for different dataset of interest, and featurizes them using our framework. The `featurize.py` file provides an end-to-end worked example of how you can declare a FeatureBuilder and call it on data; equally, you can replace this file with any file / notebook of your choosing, as long as you import the FeatureBuilder module.
 
-## Contributing Code and Automated Unit Testing
-If you would like to contribute to the repository, we have implemented a [Pull Request Template](https://github.com/Watts-Lab/team_comm_tools/blob/main/.github/pull_request_template.md) with a basic checklist that you should consider when adding code (e.g., improving documentation or developing a new feature).
+### Data Format
+We accept input data in the format of a Pandas DataFrame. Your data needs to have three (3) required input columns and one optional column.
+
+1. A **conversation ID**, 
+2. A **speaker ID**, 
+3. A **message/text input**, which contains the content that you want to get featurized;
+4. (Optional) a **timestamp**. This is not necessary for generating features, but behaviors related to the conversation's pace (for example, the average delay between messages; the "burstiness" of a conversation) cannot be measured without it.
+
+### Featurized Outputs: Levels of Analysis
 
-We have also implemented automated unit testing of all code (which runs upon every push to GitHub), allowing us to ensure that new features function as expected and do not break any previous features. The points below highlight key steps to using our automated test suite.
+Notably, not all communication features are made equal, as they can be defined at different levels of analysis. For example, a single utterance ("you are great!") may be described as a "positive statement." An individual who makes many such utterances may be described as a "positive person." Finally, the entire team may enjoy a "positive conversation," an interaction in which everyone speaks positively to each other. In this way, the same concept of positivity can be applied to three levels: 
 
-1. Draft test inputs (`conversation_num`, `speaker`, `message`) and expected outputs for your feature. 
+1. The **utterance**,
+2. The **speaker**, and
+3. The **conversation**
 
-- For example, "This is a test message." should return 5 for `num_words` at the chat level (note that `conversation_num` and `speaker` have no effect on the ultimate result, so they can be chosen arbitrarily).
-- Testing a conversation level feature, say `discursive_diversity`, requires a series of chats rather than just one chat. For example, "This is a test message." (speaker 1), "This is a test message." (speaker 1), "This is a test message." (speaker 2), "This is a test message." (speaker 2), within the same conversation, should return 0. Note that the `conversation_num` for each new test should be distinct from all previous `conversation_num`, even if the feature being tested is different.
+**We generate a separate output file for each level.** When you declare a FeatureBuilder, you will need to specify an output path for each level of analysis.
 
-2. Once you have test inputs, add each CHAT (and its associated conversation_num and speaker) as a separate row in either `test_chat_level.csv` or `test_conv_level.csv`, within `./src/testing/data/cleaned_data`. The format of the CSV is as follows: `id, conversation_num, speaker_nickname, message, expected_column, expected_value`, where `expected_column` is the feature name (i.e. num_words).
+For more information, please refer to the [Introduction on our Read the Docs Page](https://conversational-featurizer.readthedocs.io/en/latest/intro.html#intro).
 
-3. Push all your changes to GitHub, including feature development and test dataset additions. Go under the "Actions" tab in the toolbar. Notice there's a new job running called "Testing-Features". A green checkmark at the conclusion of this job indicates all new tests have passed. A red cross means some test has failed. Navigate to the uploaded "Artifact" (near the bottom of the status page) for list of failed tests and their associated inputs/outputs.
+# Learn More
+Please visit our website, [https://teamcommtools.seas.upenn.edu/](https://teamcommtools.seas.upenn.edu/), for general information about our project and research. For more detailed documentation on our features and examples, please visit our [Read the Docs Page](https://conversational-featurizer.readthedocs.io/en/latest/).
 
-4. Debug and iterate!
+# Becoming a Contributor
+If you would like to make pull requests to this open-sourced repository, please read our [GitHub Repo Getting Started Guide](/github_repo_getting_started.md). We welcome new feature contributions or improvements to our framework.
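
Reviewer note: the three levels of analysis described in the README changes above (utterance, speaker, conversation) can be illustrated with a minimal standard-library sketch. This is not the package's implementation and does not call `FeatureBuilder`. The toy rows, the `count_words` helper, and the choice of the mean as the aggregate are all assumptions made for illustration. Only the column names `conversation_id`, `speaker_id`, and `message` come from the README example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data; column names follow the README's FeatureBuilder example.
rows = [
    {"conversation_id": "c1", "speaker_id": "alice", "message": "you are great!"},
    {"conversation_id": "c1", "speaker_id": "bob", "message": "thanks, so are you"},
    {"conversation_id": "c1", "speaker_id": "alice", "message": "nice work"},
    {"conversation_id": "c2", "speaker_id": "carol", "message": "hello"},
]


def count_words(message):
    """Illustrative utterance-level feature: whitespace-separated word count."""
    return len(message.split())


# Level 1 (utterance): one feature value per message.
utterance_level = [{**row, "num_words": count_words(row["message"])} for row in rows]

# Group the utterance-level values by speaker and by conversation.
by_speaker = defaultdict(list)
by_conversation = defaultdict(list)
for row in utterance_level:
    by_speaker[(row["conversation_id"], row["speaker_id"])].append(row["num_words"])
    by_conversation[row["conversation_id"]].append(row["num_words"])

# Level 2 (speaker): average of each speaker's utterances within a conversation.
speaker_level = {key: mean(values) for key, values in by_speaker.items()}

# Level 3 (conversation): average over all utterances in the conversation.
conversation_level = {key: mean(values) for key, values in by_conversation.items()}

print(utterance_level)
print(speaker_level)
print(conversation_level)
```

Each of the three dictionaries or lists corresponds to one of the separate output files the README describes. The real package computes many more features and may aggregate them differently.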
diff --git a/docs/.readthedocs.yaml b/docs/.readthedocs.yaml
@@ -15,9 +15,14 @@ build:
  # rust: "1.64"
  # golang: "1.19"
  jobs:
- pre_install: # Stuff in src/requirements.txt dependso on en_core_web_sm, which in turn depends on spacy
+ pre_install: # Stuff in src/requirements.txt depends on en_core_web_sm, which in turn depends on spacy
  - pip install spacy==3.7.2 
  - bash -c "python3 -m spacy download en_core_web_sm"
+ post_install: # Install NLTK resources after the install step
+ - python3 -m nltk.downloader nps_chat
+ - python3 -m nltk.downloader punkt
+ - python3 -m nltk.downloader stopwords
+ - python3 -m nltk.downloader wordnet
 
 # Build documentation in the "docs/" directory with Sphinx
 sphinx:
@@ -34,4 +39,4 @@ sphinx:
 python:
  install:
  - requirements: ./docs/requirements.txt
- - requirements: ./src/requirements.txt
+ - requirements: ./requirements.txt
diff --git a/docs/build/doctrees/environment.pickle b/docs/build/doctrees/environment.pickle
diff --git a/docs/build/doctrees/examples.doctree b/docs/build/doctrees/examples.doctree
diff --git a/docs/build/doctrees/feature_builder.doctree b/docs/build/doctrees/feature_builder.doctree
diff --git a/docs/build/doctrees/features/basic_features.doctree b/docs/build/doctrees/features/basic_features.doctree
diff --git a/docs/build/doctrees/features/burstiness.doctree b/docs/build/doctrees/features/burstiness.doctree
diff --git a/docs/build/doctrees/features/certainty.doctree b/docs/build/doctrees/features/certainty.doctree
diff --git a/docs/build/doctrees/features/discursive_diversity.doctree b/docs/build/doctrees/features/discursive_diversity.doctree
diff --git a/docs/build/doctrees/features/fflow.doctree b/docs/build/doctrees/features/fflow.doctree
diff --git a/docs/build/doctrees/features/get_all_DD_features.doctree b/docs/build/doctrees/features/get_all_DD_features.doctree
diff --git a/docs/build/doctrees/features/get_user_network.doctree b/docs/build/doctrees/features/get_user_network.doctree
diff --git a/docs/build/doctrees/features/hedge.doctree b/docs/build/doctrees/features/hedge.doctree
diff --git a/docs/build/doctrees/features/index.doctree b/docs/build/doctrees/features/index.doctree
diff --git a/docs/build/doctrees/features/info_exchange_zscore.doctree b/docs/build/doctrees/features/info_exchange_zscore.doctree
diff --git a/docs/build/doctrees/features/information_diversity.doctree b/docs/build/doctrees/features/information_diversity.doctree
diff --git a/docs/build/doctrees/features/keywords.doctree b/docs/build/doctrees/features/keywords.doctree
diff --git a/docs/build/doctrees/features/lexical_features_v2.doctree b/docs/build/doctrees/features/lexical_features_v2.doctree
diff --git a/docs/build/doctrees/features/named_entity_recognition_features.doctree b/docs/build/doctrees/features/named_entity_recognition_features.doctree
diff --git a/docs/build/doctrees/features/other_lexical_features.doctree b/docs/build/doctrees/features/other_lexical_features.doctree
diff --git a/docs/build/doctrees/features/politeness_features.doctree b/docs/build/doctrees/features/politeness_features.doctree
diff --git a/docs/build/doctrees/features/politeness_v2.doctree b/docs/build/doctrees/features/politeness_v2.doctree
diff --git a/docs/build/doctrees/features/politeness_v2_helper.doctree b/docs/build/doctrees/features/politeness_v2_helper.doctree
diff --git a/docs/build/doctrees/features/question_num.doctree b/docs/build/doctrees/features/question_num.doctree
diff --git a/docs/build/doctrees/features/readability.doctree b/docs/build/doctrees/features/readability.doctree
diff --git a/docs/build/doctrees/features/reddit_tags.doctree b/docs/build/doctrees/features/reddit_tags.doctree
diff --git a/docs/build/doctrees/features/temporal_features.doctree b/docs/build/doctrees/features/temporal_features.doctree
diff --git a/docs/build/doctrees/features/textblob_sentiment_analysis.doctree b/docs/build/doctrees/features/textblob_sentiment_analysis.doctree
diff --git a/docs/build/doctrees/features/turn_taking_features.doctree b/docs/build/doctrees/features/turn_taking_features.doctree
diff --git a/docs/build/doctrees/features/user_centroids.doctree b/docs/build/doctrees/features/user_centroids.doctree
diff --git a/docs/build/doctrees/features/variance_in_DD.doctree b/docs/build/doctrees/features/variance_in_DD.doctree
diff --git a/docs/build/doctrees/features/within_person_discursive_range.doctree b/docs/build/doctrees/features/within_person_discursive_range.doctree
diff --git a/docs/build/doctrees/features/word_mimicry.doctree b/docs/build/doctrees/features/word_mimicry.doctree
diff --git a/docs/build/doctrees/features_conceptual/TEMPLATE.doctree b/docs/build/doctrees/features_conceptual/TEMPLATE.doctree
diff --git a/docs/build/doctrees/features_conceptual/content_word_accommodation.doctree b/docs/build/doctrees/features_conceptual/content_word_accommodation.doctree
diff --git a/docs/build/doctrees/features_conceptual/function_word_accommodation.doctree b/docs/build/doctrees/features_conceptual/function_word_accommodation.doctree
diff --git a/docs/build/doctrees/features_conceptual/index.doctree b/docs/build/doctrees/features_conceptual/index.doctree
diff --git a/docs/build/doctrees/features_conceptual/mimicry_bert.doctree b/docs/build/doctrees/features_conceptual/mimicry_bert.doctree
diff --git a/docs/build/doctrees/features_conceptual/moving_mimicry.doctree b/docs/build/doctrees/features_conceptual/moving_mimicry.doctree
diff --git a/docs/build/doctrees/features_conceptual/positivity_bert.doctree b/docs/build/doctrees/features_conceptual/positivity_bert.doctree
diff --git a/docs/build/doctrees/index.doctree b/docs/build/doctrees/index.doctree
diff --git a/docs/build/doctrees/intro.doctree b/docs/build/doctrees/intro.doctree
diff --git a/docs/build/doctrees/utils/assign_chunk_nums.doctree b/docs/build/doctrees/utils/assign_chunk_nums.doctree
diff --git a/docs/build/doctrees/utils/calculate_chat_level_features.doctree b/docs/build/doctrees/utils/calculate_chat_level_features.doctree
diff --git a/docs/build/doctrees/utils/calculate_conversation_level_features.doctree b/docs/build/doctrees/utils/calculate_conversation_level_features.doctree
diff --git a/docs/build/doctrees/utils/calculate_user_level_features.doctree b/docs/build/doctrees/utils/calculate_user_level_features.doctree
diff --git a/docs/build/doctrees/utils/check_embeddings.doctree b/docs/build/doctrees/utils/check_embeddings.doctree
diff --git a/docs/build/doctrees/utils/gini_coefficient.doctree b/docs/build/doctrees/utils/gini_coefficient.doctree
diff --git a/docs/build/doctrees/utils/preload_word_lists.doctree b/docs/build/doctrees/utils/preload_word_lists.doctree
diff --git a/docs/build/doctrees/utils/preprocess.doctree b/docs/build/doctrees/utils/preprocess.doctree
diff --git a/docs/build/doctrees/utils/summarize_features.doctree b/docs/build/doctrees/utils/summarize_features.doctree
diff --git a/docs/build/doctrees/utils/zscore_chats_and_conversation.doctree b/docs/build/doctrees/utils/zscore_chats_and_conversation.doctree
diff --git a/docs/build/html/.buildinfo b/docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 75f637156aeb7a84b151cb277f439962
+config: 9a01a2cd3d4384710101b4a99edd7683
 tags: 645f666f9bcd5a90fca523b33c5a78b7