Releases: CornellNLP/ConvoKit
ConvoKit Version 3.0.1
We are excited to announce the release of ConvoKit 3.0.1, which focuses on bug fixes, adding new datasets, and dependency upgrades. Key updates include:
- Fixed issue with ConvoKit's download method that prevented datasets from being downloaded to the configured directory.
- Fixed support for downloading non-corpus objects.
- Updated the conversational forecasting transformer to make it more flexible.
- Added five new datasets, with documentation available on our documentation site.
- Addressed compatibility issues related to NumPy by building against NumPy 2.0+ and upgrading dependency packages accordingly.
We address some potential issues, especially with NumPy, on our Troubleshooting page. If you encounter any issues, feel free to join our Discord community for more support, or submit an issue on GitHub. Thank you!
Note that we no longer support Python 3.8 (EOL) or 3.9 (not supported by NumPy 2.0.0+).
You can refer to the following pull requests for more details:
Fixing bugs:
New datasets:
Dependency packages:
Contributors:
- Kaixiang Zhang (Sean)
- Ethan Xia
- Yash Chatha
- Laerdon Yah-Sung Kim
- Jonathan P. Chang
ConvoKit Version 3.0.0
We're excited to announce the public release of ConvoKit 3.0!
The new version of ConvoKit now supports MongoDB as a backend choice for working with corpus data. This update provides several benefits, such as taking advantage of MongoDB's lazy loading to handle extremely large corpora, and ensuring resilience to unexpected crashes by continuously writing all changes to the database.
To learn more about using MongoDB as a backend choice, refer to our documentation at https://convokit.cornell.edu/documentation/storage_options.html.
Database Backend
Historically, ConvoKit has let you work with conversational data directly in program memory through the Corpus class, with long-term storage provided by dumping the contents of a Corpus to disk in JSON format. This paradigm works well for distributing and storing static datasets, and for workflows that compute over some or all of the data in a short period and optionally store the results on disk. For example, ConvoKit distributes the datasets included with the library in JSON format, which you can load into program memory to explore and compute with.
In ConvoKit version 3.0.0, we introduce a new option for working with conversational data: the MongoDB backend. Consider a use case where you want to collect conversational data over a long time period and maintain a persistent representation of the dataset even if your data collection program unexpectedly crashes. In the memory backend paradigm, this would require regularly dumping your corpus to JSON files, incurring repeated, expensive write operations. With the new database backend, by contrast, all your data is automatically saved to the database for long-term storage as it is added to the corpus.
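To make the tradeoff concrete, here is a minimal stdlib-only sketch (not ConvoKit code; all names are illustrative) of the checkpointing pattern that the memory backend forces on long-running collection jobs, and that the database backend eliminates:

```python
import json
import os
import tempfile

# Illustrative sketch: with an in-memory backend, utterances accumulate in a
# dict, and durability requires periodically rewriting the whole corpus to
# disk as JSON. A crash loses everything since the last checkpoint.
utterances = {}
dump_path = os.path.join(tempfile.mkdtemp(), "utterances.json")

def add_utterance(utt_id, speaker, text, dump_every=2):
    utterances[utt_id] = {"speaker": speaker, "text": text}
    if len(utterances) % dump_every == 0:
        with open(dump_path, "w") as f:
            json.dump(utterances, f)  # full rewrite; cost grows with corpus size

add_utterance("u0", "alice", "hello")
add_utterance("u1", "bob", "hi")  # triggers a checkpoint

with open(dump_path) as f:
    on_disk = json.load(f)
# Both utterances survived because a checkpoint just happened. A database
# backend instead persists each write as it occurs, with no dump step.
```

With the database backend, this periodic-dump logic disappears: every change to the corpus is written through to MongoDB as it happens.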
Documentation
Please refer to this database setup document to set up a MongoDB database, and this storage document for a further explanation of how the database backend option works.
Tests
Updated tests to include db_mode testing.
Examples
Updated examples to include demonstration of db_mode usage.
Bug Fixes
- Fixed issue where `corpus.utterances` throws an error in politenessAPI, as it should call `corpus.iter_utterances()` instead. Corpus items should not access their private variables and should use the public "getters" for access.
- Fixed bug in `coordination.py` for the usage of metadata mutability.
- Fixed issue in Pairer with `pair_mode` set to `maximize` causing the pairing function to return an integer, which caused an error in pairing objects.
Breaking Changes
- Modified `ConvoKit.Metadata` to disallow any mutation of metadata fields, implemented by returning a deepcopy of the metadata field storage every time a field is accessed. This is intended to align behavior between the memory and DB modes. #197
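The deepcopy-on-access pattern behind this breaking change can be sketched in a few lines of plain Python (an illustrative toy class, not ConvoKit's actual metadata implementation):

```python
from copy import deepcopy

# Illustrative sketch of deepcopy-on-access metadata: reads return a copy,
# so in-place mutation of the returned value never touches stored state.
class Metadata:
    def __init__(self):
        self._storage = {}

    def __setitem__(self, key, value):
        self._storage[key] = value  # writes go through the setter

    def __getitem__(self, key):
        # Returning a deepcopy means callers cannot mutate storage in place,
        # mirroring DB mode, where each read materializes a fresh object.
        return deepcopy(self._storage[key])

meta = Metadata()
meta["tags"] = ["question"]
meta["tags"].append("rhetorical")       # mutates only the copy: silently lost
assert meta["tags"] == ["question"]
meta["tags"] = meta["tags"] + ["rhetorical"]  # reassignment is required
```

The practical consequence for users: in-place mutation of a metadata field (e.g. appending to a list stored in metadata) no longer persists; reassign the field instead.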
Change Log
Added:
- Added DB backend mode to allow working with corpora using a database as a supporting backend. #175 #184
- Extended `__init__` in `model/corpus.py` with parameters for DB functionality. #175
- Updated `model/backendMapper` to separate memory and DB transactions. #175
- Introduced a new layer of abstraction between Corpus components (Utterance, Speaker, Conversation, ConvoKitMeta) and concrete data mapping. Data mapping is now handled by a BackendMapper instance variable in the Corpus. #169
Changed:
- Modified files in the ConvoKit model to support both memory mode and DB mode backends. #175
- Removed deprecated arguments and functions from ConvoKit model. #176
- Updated demo examples that referenced older versions of ConvoKit objects. #192
Fixed:
- Fixed usage of the mutability of metadata within `coordination.py`. #197
- Fixed issue in the Pairer module when `pair_mode` was set to `maximize`, causing the pairing function to return an integer and subsequently leading to an error. #197
- Fixed issue that caused `corpus.utterances` to throw an error within politenessAPI. #170
- Fixed FightingWords to allow overlapping classes. #189
Python Version Requirement Update:
- With Python 3.7 having reached EOL (end of life) on June 27, 2023, ConvoKit now requires Python 3.8 or above.
ConvoKit version 2.5.3
The v2.5.2 release added support for Chinese politeness strategy extraction; ConvoKit's politenessStrategies now supports three politeness strategy collections covering two languages.
The v2.5.3 release fixes a minor bug that occurs when using TextParser with spaCy > 3.2.0.
ConvoKit version 2.5.2
This release adds support for Chinese politeness strategy extraction. Currently, ConvoKit's politenessStrategies supports three politeness strategy collections covering two languages.
ConvoKit version 2.5.1
This release includes a new method, `from_pandas`, in the Corpus class that should simplify the Corpus creation process. It generates a ConvoKit corpus from pandas DataFrames of speakers, utterances, and conversations. A notebook demonstrating the use of this method can be found here.
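As a rough illustration, the DataFrames that `from_pandas` consumes might look like the following. The column names below follow ConvoKit's utterance fields (id, speaker, conversation_id, reply_to, timestamp, text), but treat the exact required columns and index conventions as an assumption and consult the documentation or notebook for the authoritative schema:

```python
import pandas as pd

# Illustrative input DataFrames for Corpus.from_pandas (column names assumed
# from ConvoKit's utterance fields; check the docs for the exact schema).
utterances_df = pd.DataFrame([
    {"id": "u0", "speaker": "alice", "conversation_id": "c0",
     "reply_to": None, "timestamp": 0, "text": "Hello!"},
    {"id": "u1", "speaker": "bob", "conversation_id": "c0",
     "reply_to": "u0", "timestamp": 1, "text": "Hi there."},
])
speakers_df = pd.DataFrame([{"id": "alice"}, {"id": "bob"}]).set_index("id")
conversations_df = pd.DataFrame([{"id": "c0"}]).set_index("id")

# Hypothetical usage (requires convokit to be installed):
# from convokit import Corpus
# corpus = Corpus.from_pandas(utterances_df, speakers_df, conversations_df)
```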
ConvoKit version 2.5
This release contains an implementation of the Expected Conversational Context Framework, and associated demos.
ConvoKit version 2.4
This release describes changes that have been implemented as part of the v2.4 release.
Public-facing functionality
ConvoKitMatrix and Vectors
Vectors and Matrices now get first-class treatment in ConvoKit. Vector data can now be stored in a ConvoKitMatrix object that is integrated with the Corpus and its objects, allowing for straightforward access from Corpus component objects, user-friendly display of vectors data, and more. Read our introduction to vectors for more details.
Accordingly, we have re-implemented the relevant Transformers that were already using array or vector-like data to leverage this new data structure, namely:
- PromptTypes
- HyperConvo
- BoWTransformer
- BoWClassifier - now renamed to VectorClassifier
- PairedBoW - now renamed to PairedVectorClassifier
The last two Transformers can now be used for any general vector data, as opposed to just bag-of-words vector data.
Metadata deletion
We have implemented a formal way to delete metadata attributes from a Corpus component object. Prior to this, metadata attributes were deleted from objects individually -- leading to possible inconsistencies between the ConvoKitIndex (which tracks what metadata attributes currently exist) and the Corpus component objects. To rectify this, we now disallow deletion of metadata attributes from objects individually. Such deletion should instead be carried out using the Corpus method `delete_metadata()`.
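The consistency problem that corpus-level deletion solves can be sketched in plain Python (an illustrative toy class, not ConvoKit's implementation):

```python
# Illustrative sketch: per-object deletion can leave the index claiming an
# attribute exists while some objects no longer carry it; a corpus-level
# delete removes the attribute everywhere AND from the index atomically.
class TinyCorpus:
    def __init__(self, utterances):
        self.utterances = utterances   # id -> metadata dict
        self.index = set()             # attributes the "index" believes exist
        for meta in utterances.values():
            self.index.update(meta)

    def delete_metadata(self, attr):
        # Remove the attribute from every object and from the index together,
        # so the two views can never disagree.
        for meta in self.utterances.values():
            meta.pop(attr, None)
        self.index.discard(attr)

corpus = TinyCorpus({"u0": {"toxicity": 0.1}, "u1": {"toxicity": 0.9}})
corpus.delete_metadata("toxicity")
# Deleting from a single utterance's dict instead would have left "toxicity"
# in corpus.index while only some objects still carried it.
```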
Other changes
- FightingWords and BoWTransformer now have default `text_func` values for the three main component types: utterance, speaker, and conversation.
- `corpus.iterate_by()` is now deprecated.
- The API of PromptTypes has been modified: rather than selecting the types of prompt and response utterances to use in the constructor, we now give users the option to select prompts and responses as arguments to the `fit` and `transform` calls.
Other internal changes
- In light of SIGDIAL 2020, we have a new video introduction and Jupyter notebook tutorial introducing new users to ConvoKit.
- ConvoKitIndex now tracks a list of class types for each metadata attribute, instead of a single class type. This will lead to changes in `index.json` during dumps of any currently existing corpora, but causes no compatibility issues when loading existing corpora.
- We updated the following demos that make use of Vectors and PromptTypes: PromptTypes and Predicting conversations gone awry
ConvoKit version 2.3.2
This release describes changes that have happened since the v2.3 release, and includes changes from both v2.3.1 and v2.3.2.
Functionality
Naming changes
- `Utterance.root` has been renamed to `Utterance.conversation_id`
- `User` has been renamed to `Speaker`. Functions with 'user' in the name have been renamed accordingly.
- `User.name` has been renamed to `Speaker.id`

(Backwards compatibility will be maintained for all the deprecated attributes and functions.)
Corpus
- Corpus now allows users to generate `pandas` DataFrames for its internal components using `get_conversations_dataframe()`, `get_utterances_dataframe()`, and `get_speakers_dataframe()`.
- `Conversation` objects have a `get_chronological_speaker_list()` method for getting a chronological list of conversation participants.
- `Conversation`'s `print_conversation_structure()` method has a new argument `limit` for limiting the number of utterances displayed.
Transformers
- New `invalid_val` argument for `HyperConvo` that automatically replaces NaN values with the value specified in `invalid_val`.
- `FightingWords.summarize()` now provides labelled plots.
Bug fixes
- Fixed minor bug in `download()` when downloading Reddit corpora.
- Fixed bugs in `HyperConvo` that were causing NaN warnings and incorrect calculations. Fixed minor bug that was causing HyperConvo annotations to not be JSON-serializable.
- Fixed bug in `Classifier` and `BoWClassifier` that was causing inconsistent behaviour for compressed vs. uncompressed vector metadata.
Other changes
- Warnings in ConvoKit for deprecation have been made more consistent.
- We now have continuous integration for pushes and pull requests! Thanks to @mwilbz for helping set this up.
ConvoKit version 2.3
Functionality
Transformers new summarize() functionality
Some Transformers now have a summarize() function that summarizes the annotated corpus (i.e. annotated by a transform() call) in a way that gives the user a high-level view / interpretation of the annotated metadata.
New Transformers
We introduce several new Transformers: Classifier, Bag-of-Words Classifier, Ranker, Pairer, Paired Prediction, Paired Bag-of-Words Prediction, Fighting Words, and (Conversational) Forecaster (with variants: Bag-of-Words and CRAFT).
New TextProcessor
We introduce TextCleaner, which does text cleaning for online text data. This cleaner depends on the clean-text package.
Enhanced Conversation functionality
- Conversation.check_integrity() can be used to check if a conversation has a valid and intact reply-to chain (i.e. only one root utterance, every utterance specified by reply-to exists, etc)
- Conversation.print_conversation_structure() is a way of pretty-printing a Conversation's thread structure (whether displaying just its utterances' ids, texts, or other details is customizable)
- Conversation.get_chronological_utterance_list() provides a list of the Conversation's utterances sorted from earliest to latest timestamp
Tree operations
- Conversation.traverse() allows for Conversations to be traversed as a tree structure, e.g. breadth-first, depth-first, pre-order, post-order. Specifically, traverse() returns an iterator of Utterances or UtteranceNodes (a wrapper class for working with Utterances in a conversational tree setting)
- Conversation allows for subtree extraction using any arbitrary utterance in the Conversation as the new root
- Conversation.get_root_to_leaf_paths() returns all the root to leaf paths in the conversation tree
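The kinds of traversals listed above can be illustrated with a short plain-Python sketch (not ConvoKit's `traverse()` implementation; the reply-to data is made up) of walking a conversation's reply-to tree breadth-first and depth-first:

```python
from collections import defaultdict, deque

# Illustrative reply-to chain: u0 is the root; u1 and u2 reply to u0; u3 replies to u1.
reply_to = {"u0": None, "u1": "u0", "u2": "u0", "u3": "u1"}

# Invert reply-to pointers into a child list, as a conversation tree requires.
children = defaultdict(list)
root = None
for utt, parent in reply_to.items():
    if parent is None:
        root = utt
    else:
        children[parent].append(utt)

def bfs(node):
    """Breadth-first order: visit each depth level before descending."""
    order, queue = [], deque([node])
    while queue:
        cur = queue.popleft()
        order.append(cur)
        queue.extend(sorted(children[cur]))
    return order

def dfs_preorder(node):
    """Depth-first pre-order: visit a node, then fully explore each subtree."""
    order = [node]
    for child in sorted(children[node]):
        order.extend(dfs_preorder(child))
    return order

assert bfs(root) == ["u0", "u1", "u2", "u3"]
assert dfs_preorder(root) == ["u0", "u1", "u3", "u2"]
```

Conversation.traverse() provides these orderings directly, yielding Utterances or UtteranceNode wrappers instead of bare ids.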
Other changes
Public-facing interface changes
- All Corpus objects now support a full set of all possible object iterators (e.g. User.iter_utterances() or Corpus.iter_users()) with selector functions (i.e. filters that select for the corpus object to be generated)
- Corpus has new methods for checking for the presence of corpus objects, e.g. corpus.has_utterance(), corpus.has_conversation(), corpus.has_user()
- A random User / Utterance / Conversation can be obtained from a Corpus with corpus.random_user() / corpus.random_utterance() / corpus.random_conversation()
- User objects now have ids, not names. Corpus.get_usernames() and User.name are deprecated (in favor of Corpus.get_user_ids() and User.id respectively) and print a warning when used.
- Corpora can be mutated to only include specific Conversations by using Corpus.filter_conversations_by()
- Corpus filtering by utterance is no longer supported to avoid encouraging Corpus mutations that break Conversation reply-to chains. Corpus.filter_utterances_by() is now deprecated and no longer usable.
- Corpus object (i.e. User, Utterance, Conversation) ids and metadata keys must now be strings or None. It used to be that any Hashable object could be used, but this posed problems for corpus dumping to and loading from jsons.
- Deletion of a metadata key for one object results in deletion of that metadata key for all objects of that object type
- Corpus.dump() automatically increments the version number of the Corpus by 1.
- Corpus.download() now has a use_local boolean parameter that allows offline users to skip the online check for a new dataset version and uses the local version by default.
- Fixed a bug where specified conversation and user metadata were not getting excluded correctly during Corpus initialisation step
- __str__ is now implemented to provide a concise human-readable string display of the Corpus object (that hides private variables)
- Fixed some bugs with Hypergraph motif counting
Internal changes
- Corpus initialisation and dumping have been heavily refactored to improve future maintainability.
- There is a new CorpusObject parent class that User, Utterance, and Conversation inherit from. This parent class implements some shared functionality for all Corpus objects.
- Corpus now uses a ConvokitIndex object to correctly track the metadata state of itself and its Corpus objects. Previously, this index was computed on the spot when Corpus.dump() was called, and referred to when loading a Corpus. However, any changes to a loaded Corpus object would not update the internal index of the Corpus, meaning the index could be inconsistent with the Corpus state.
- Corpus objects (Corpus, User, Utterance, Conversation) all use a ConvokitMeta object instead of a simple dict() for their metadata. This change is necessary to ensure that updates to the metadata (key additions / deletions) are reflected in ConvokitIndex. However, because ConvokitMeta inherits from the dict class, there is no change to how users should work with the .meta attribute.
- Users and Utterances now have 'owner' attributes to indicate the Corpus they belong to. This change is necessary for the maintaining of a consistent index. (Conversations have always had this attribute.)
- Introduces optional dependencies on the clean-text and torch packages for sanitizing text under the FightingWords Transformer and running a neural network as part of the Forecaster-CRAFT Transformer respectively.
- A single script for running all existing test suites has been created to speed up testing before deployment: tests/run_all_tests.py
ConvoKit version 2.2 update
Updates to various parts of ConvoKit:
Text processing
Added support for creating Transformers that compute utterance attributes. Also updated support for dependency-parsing text. An example of how this new functionality can be used is found here.
Corpus
Added functionality to:
- support loading and storage of auxiliary data
- handle vector representations
- organize users' activities within conversations
- build dataframes containing attributes of various objects
Prompt types
Updated the code used to compute prompt types and phrasing motifs, deprecating the old QuestionTypology module. An example of how the updated code is used can be found here and here.
User Conversation Diversity
Updated code used to compute linguistic divergence.
Other
Added support for pipelining, and some limited support for computing per-utterance attributes.