
Decoupling language model computes from within Surprise #187

Open
wants to merge 9 commits into master
Conversation

@TushaarGVS commented Dec 9, 2022

Description

This change decouples the language model computations from the Surprise transformer, so as to enable more flexibility in choosing among several other language models, including KenLM, and in speeding up these computations. It introduces a LanguageModel base class and implements two LanguageModel subclasses: ConvoKitLanguageModel, which refactors the existing cross-entropy function, and Kenlm, which wraps the KenLM model. Furthermore, the computation of surprise has been thread-parallelized, and the language model evaluation functions themselves have also been thread-parallelized for runtime efficiency (although this gain is quite minimal, since we use threads and Python's GIL disallows extensive thread-parallelism; any observed speed-up can be attributed to file I/O and NumPy operations releasing the GIL, and this is to be explored further).
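
To illustrate the thread-parallelization idea, here is a minimal sketch using concurrent.futures; the function and argument names below are hypothetical, not the actual ConvoKit internals:

from concurrent.futures import ThreadPoolExecutor

def score_all(language_model, contexts, targets, n_jobs=4):
    # Threads help mainly when the underlying calls release the GIL
    # (e.g., file I/O or NumPy); pure-Python work sees little speed-up.
    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        scores = list(executor.map(language_model.evaluate, contexts, targets))
    return scores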

The demos/surprise_demo.ipynb and demos/tennis_demo.ipynb demo files have been updated to reflect the changes made. To preserve backward compatibility as far as possible, only the smooth argument has been removed from the Surprise class signature; the fit and transform methods still run with the same arguments (use kwargs to pass arguments pertaining to the added functionality). Use LanguageModel.config to view the configuration of an individual LanguageModel instance, and LanguageModel.type to view the type of the language model.
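
For instance (a hedged sketch; the exact value returned by .type is an assumption):

convokit_lm = convokit.ConvoKitLanguageModel(smooth=True, n_jobs=1)
print(convokit_lm.type)    # the language model type, e.g., something like 'convokit_lm'
print(convokit_lm.config)  # the configuration of this LanguageModel instance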

Other changes include code modularization, the addition of utility modules, and other code formatting. TQDM imports have been modified so that progress bars render correctly in notebooks, and additional (disappearing) TQDM bars have been added to track the progress of surprise computations.
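
The import pattern in question is presumably along these lines (a sketch, assuming the standard tqdm.auto idiom):

from tqdm.auto import tqdm  # renders correctly in both notebooks and terminals

for _ in tqdm(range(100), leave=False):  # leave=False yields a disappearing bar
    pass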


Motivation and context

This change was made to enable flexibility in language modeling. Currently, we use cross-entropy as a measure of how surprising conversations are; however, cross-entropy need not be the only choice. One may instead employ perplexity, or KenLM log-probability scores. In this regard, decoupling the surprise (language model) computations is important, and this change attempts to achieve exactly that.


How has this been tested?

The code has been tested by comparing this implementation against an earlier one: with the same preprocessing, the same results were observed on the tennis dataset (working at the utterance level). Furthermore, the cross-entropy refactoring has been tested using the old ConvoKit surprise demo, and (with some randomness in choosing the context and target) the outputs are about the same, i.e., almost the same ordering of most and least surprising involvements. In the tennis demo, with the use of cross-entropy, the values remained the same.


Usage

Create a new LanguageModel object as follows:

convokit_lm = convokit.ConvoKitLanguageModel(smooth=True, n_jobs=1)  # default language model
kenlm = convokit.Kenlm(kenlm_path='/Users/tushaar/kenlm', 
                       model_filename='kenlm_surprise',
                       trained_model_filepath='/Users/tushaar/Desktop/kenlm_models/surprise.arpa', 
                       is_persistent=True)

Use the created LanguageModel when transforming as follows:

surprise.transform(corpus, language_model=kenlm, eval_type='cross_entropy')
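
Putting the two together, an end-to-end flow might look like the following (a hedged sketch; the model_key_selector argument is taken from the existing Surprise API and may differ here):

surprise = convokit.Surprise(model_key_selector=lambda utt: 'corpus')
surprise = surprise.fit(corpus)
corpus = surprise.transform(corpus, language_model=kenlm, eval_type='cross_entropy')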

Pending

Documentation and final testing (estimated: December 15, 2022).

This commit attempts to decouple the perplexity computations from
the transformer, so as to enable more flexibility in choosing
among several other language models, including KenLM. Furthermore,
this change implements two perplexity modules: the CrossEntropy
class, which refactors the existing cross-entropy function, and
the KenlmPerplexity class, which uses the KenLM model. The
computation of surprise has been thread-parallelized, and the
perplexity functions themselves have also been thread-parallelized
for runtime efficiency.

The surprise_demo has been updated to reflect the changes made. To
preserve backward compatibility as far as possible, only the
'smooth' argument has been removed from the Surprise class
signature; the fit and transform methods still run with the same
arguments (use kwargs to pass arguments pertaining to the added
functionality).

Use Perplexity.config to view the config of individual perplexity
classes, and Perplexity.type to view the perplexity type.

Other changes include code modularization, the addition of utility
modules, and other code formatting. TQDM imports have been modified
so that progress bars render correctly in notebooks, and additional
(disappearing) TQDM bars have been added to track the progress of
surprise computations.
This change adds type hints to all the functions in the Surprise
transformer. Further, it includes a minor correction to maintain a
separate surprise_scores dictionary per object type.
The change has been tested by comparing to an earlier KenLM
implementation, and with the same preprocessing, the same results
were observed on the tennis dataset (working at the utterance
level). Further, the cross-entropy refactoring has been tested
using the old ConvoKit surprise demo, and (with some randomness in
choosing the context and target) the outputs are about the same,
i.e., almost the same ordering of most and least surprising
involvements. In the tennis demo, the values remained the same.
This change makes minor corrections to the type hints of specific
functions that were added after the first typing commit.
@jpwchang
Collaborator

Thanks for the contribution! There are a few changes needed to get everything working with our github infrastructure:

  • The KenLM code introduces a new dependency on (presumably) the kenlm python package. This dependency needs to be added to setup.py so that it gets automatically installed alongside convokit; right now the automated tests cannot run since this dependency is missing.
  • Please automatically format the new code with the Black formatter to maintain consistency with the style guidelines. Instructions can be found in the contribution guidelines.

In addition to the above, I have a few further minor suggestions related to style and consistency:

  • In the abstract class specification for LanguageModel, the "main" function is called cross_entropy. But this is misleading since, as you note, the whole point of abstracting the language model is to allow the use of metrics other than cross entropy. This leads to the extremely confusing situation where the KenLM subclass implements cross_entropy, even though the implementation is (presumably) not actually computing cross entropy. We should therefore use a more generic name, such as "score".
  • I'm not a big fan of the name ConvoKitLanguageModel since it's a bit too nonspecific. In general, we should try to name the classes such that they are self-descriptive; i.e., a user could make an educated guess at what the class does by looking at the name. Since the purpose of the class is to implement the basic cross-entropy score that used to be built in to Surprise, I might suggest renaming the class to CrossEntropyLanguageModel (after having resolved the other issue relating to the use of the term "cross entropy" as mentioned above).
  • This pull request adds two new files file_utils.py and utils.py; I wonder if we should just merge their contents into the top-level util.py that already exists in ConvoKit. This would reduce redundancy, and having taken a look at the new files I can easily imagine their functions being useful in places outside of Surprise
  • I've noticed that the naming of private methods in the new code seems inconsistent: some use double underscores and others use a single underscore. We should standardize on a single naming convention; single underscore is used elsewhere in the code, so we should stick with that.

Thanks again for the contribution, hopefully these asks aren't too much. Let me know if anything is unclear!

@cristiandnm
Contributor

cristiandnm commented Dec 14, 2022 via email

@TushaarGVS
Author

Thanks for the extensive review, Jonathan, I will address the comments and post my questions before I push the final version of my changes for a PR.

@TushaarGVS
Author

TushaarGVS commented Dec 16, 2022

In the abstract class specification for LanguageModel, the "main" function is called cross_entropy. But this is misleading since, as you note, the whole point of abstracting the language model is to allow the use of metrics other than cross entropy. This leads to the extremely confusing situation where the KenLM subclass implements cross_entropy, even though the implementation is (presumably) not actually computing cross entropy. We should therefore use a more generic name such as "score"

Jonathan, the "main" function is called evaluate(..., eval_type='') that takes eval_type as one of the arguments, where the user can specify the name of their evaluation function, and then that function gets called using eval_fn = getattr(self, eval_type) in evaluate(...). The function cross_entropy is a dummy function that is shown in LanguageModel class, and it really needn't be implemented by the user, if not used (hence only a RuntimeError, which throws an error only upon usage).

Also, KenLM returns a log-likelihood score, which is essentially the cross-entropy that we compute; hence, both our model and KenLM compute cross-entropy.
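
For reference, converting a KenLM log10 score into a per-token (natural-log) cross-entropy is a one-liner (a sketch using the kenlm Python bindings; the .arpa path, the example sentence, and the bos/eos settings are assumptions):

import math
import kenlm

model = kenlm.Model('surprise.arpa')
sentence = 'what a surprising rally'
log10_prob = model.score(sentence, bos=True, eos=True)  # total log10 probability
cross_entropy = -log10_prob * math.log(10) / len(sentence.split())  # nats per token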

This pull request adds two new files file_utils.py and utils.py; I wonder if we should just merge their contents into the top-level util.py that already exists in ConvoKit. This would reduce redundancy, and having taken a look at the new files I can easily imagine their functions being useful in places outside of Surprise

This is a great idea; I will merge the utility files into the main util.py file.

I'm not a big fan of the name ConvoKitLanguageModel since it's a bit too nonspecific. In general, we should try to name the classes such that they are self-descriptive; i.e., a user could make an educated guess at what the class does by looking at the name. Since the purpose of the class is to implement the basic cross-entropy score that used to be built in to Surprise, I might suggest renaming the class to CrossEntropyLanguageModel (after having resolved the other issue relating to the use of the term "cross entropy" as mentioned above).

It was initially called CrossEntropy, but based on my answer above, it seems too restrictive to call it CrossEntropyLanguageModel. I think we should come up with a better name.

I've noticed that the naming of private methods in the new code seems inconsistent: some use double underscores and others use a single underscore. We should standardize on a single naming convention; single underscore is used elsewhere in the code, so we should stick with that.

I have used _ to indicate semi-private variables (variables that aren't supposed to be accessed outside the class) and __ for truly private variables (those that can't be accessed outside the class even if you wanted to, at least not without _class_name__variable_name). I would like to leave that as is, because I think it is good programming practice; but if this inconsistency is a big issue, let me know and I will change it accordingly.
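
To illustrate the distinction, here is a standalone Python example of the single-underscore convention and of double-underscore name mangling:

class Example:
    def __init__(self):
        self._semi_private = 1  # convention only; still accessible from outside
        self.__private = 2      # name-mangled by Python to _Example__private

example = Example()
print(example._semi_private)      # works, though the underscore signals "internal"
print(example._Example__private)  # the only way to reach __private from outside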

This change formats the code with the existing Black formatter,
and it deletes the utility files created for the Surprise
transformer, merging their contents into the main utils file.
@TushaarGVS
Author

Thanks Jonathan, very good points. One note: I would like to not force KenLM as a requirement for convokit. I think of LM modules like forecaster modules; so, the same way we don't force tensorflow for convokit, we should not force KenLM. This should be documented properly (as I believe we do for the CRAFT forecaster module).

I have addressed this, Cristian. The code now installs kenlm only when convokit is installed with pip install convokit[kenlm].
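
A minimal sketch of how such an optional extra is typically declared in setup.py (the actual file carries many more fields, and no version pin is implied):

from setuptools import setup, find_packages

setup(
    name='convokit',
    packages=find_packages(),
    extras_require={'kenlm': ['kenlm']},  # enables: pip install convokit[kenlm]
)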

This commit includes extensive documentation of all the modules
concerning the Surprise transformer, LanguageModel, Kenlm, and
ConvoKitLanguageModel classes, and the newly added util functions.
@TushaarGVS
Author

I have finished documenting all the changes made; all the classes and functions now have extensive documentation. Also, following Jonathan's comments, I have moved the utility functions to the main utils file. Following your comment, I have ensured that kenlm is not a direct dependency of convokit; instead, it can be installed with pip install convokit[kenlm].

Additionally, I have implemented a perplexity() function for both language model classes (ConvoKitLanguageModel and Kenlm), in addition to cross_entropy(). I have also run both demos with the latest changes to ensure that I have not introduced any breaking changes.
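
For context, perplexity is just the exponential of the (natural-log) cross-entropy, so the two evaluation types should agree up to that transform (an illustration of the relation, not the actual method bodies):

import math

def perplexity_from_cross_entropy(cross_entropy: float) -> float:
    # perplexity = exp(H) when H is the per-token cross-entropy in nats
    return math.exp(cross_entropy)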

This change includes unit tests to test the functionality of the
Surprise transformer, LanguageModel (using nltk.lm), and the added
ConvoKitLanguageModel. The newly added tests were verified to ensure
that they run as intended.