Segmenters


This portion of the documentation outlines the various segmenter classes available in the library.

TransformerWordSegmenter

Description

The TransformerWordSegmenter is the main segmenter class for hashtag segmentation in our library. You can optionally re-rank the segmentations it produces using a reranker model. The strategy for combining segmenter and reranker scores is defined by the ensembler, the third component of the TransformerWordSegmenter.

In simple terms, the segmenter explores a subset of potential segmentations by implementing a beamsearch algorithm, while the reranker generates scores for the best segmentations identified by the segmenter. This architecture is described in our associated paper.

Our research indicates that masked language models are effective as rerankers, but not as segmenters. For example, a suitable configuration could be to use a gpt2 model for segmentation and a bert model for reranking:

from hashformers import TransformerWordSegmenter as WordSegmenter
ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    segmenter_device="cuda",
    reranker_device="cuda",
    segmenter_gpu_batch_size=1000,
    reranker_gpu_batch_size=1000,
    reranker_model_name_or_path="bert-base-cased",
    reranker_model_type="masked"
)

Note: The segmenter_gpu_batch_size and reranker_gpu_batch_size parameters set the batch size on the CPU if segmenter_device and reranker_device are set to cpu instead of cuda. These arguments retain their original names for backward compatibility with previous library versions.
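
For instance, a minimal sketch of a CPU-only configuration (the batch size below is an illustrative value, not a recommendation):

from hashformers import TransformerWordSegmenter as WordSegmenter

# Despite the "gpu" in the parameter name, this value sets the CPU batch
# size because segmenter_device is "cpu".
ws_cpu = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    segmenter_device="cpu",
    segmenter_gpu_batch_size=100
)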

Using the segment Method

The segment method can be called directly by providing a list of hashtags. However, you can also adjust its keyword arguments to control segmentation speed or the balance between the segmenter and the reranker:

segmentations = ws.segment(
    [
        "#weneedanationalpark",
        "#icecold"
    ],
    topk=20,
    steps=13,
    alpha=0.222,
    beta=0.111,
    use_reranker=True,
    return_ranks=False
)

The function of each of these keyword arguments is explained in the following subsections.

Beamsearch Keyword Arguments: topk and steps

The topk parameter determines how many candidate segmentations are passed to the next iteration of the beamsearch tree. For example, if topk is set to 20, at most the 20 best candidate segmentations will be passed to the next step. Each iteration of the beamsearch algorithm introduces a space at every possible location where one hasn't been inserted yet.

The steps parameter defines the maximum depth of the beamsearch tree, i.e., the maximum number of spaces that will be inserted in a hashtag.

Understanding your dataset's characteristics allows you to adjust topk and steps to accelerate hashtag segmentation, as illustrated in the sketch after the list below.

  • Ideally, topk should match the length of the largest hashtag in your dataset. However, you can set it to a lower value, accepting the risk of overlooking some correct candidates.

  • steps should be equivalent to the maximum expected number of spaces in a hashtag. Ideally, it should equal topk and the length of the largest hashtag in your dataset. Yet, you can choose lower values, particularly if you are certain that your data's hashtags will contain a specific maximum number of words.
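
As a hedged sketch of this kind of tuning, suppose we know every hashtag in our dataset has at most five words (the values below are assumptions about the data, not recommended defaults):

# At most five words means at most four spaces, so steps can be lowered
# to 4 to speed up the beamsearch.
segmentations = ws.segment(
    ["#icecold", "#weneedanationalpark"],
    topk=20,  # candidate segmentations kept at each beamsearch iteration
    steps=4   # at most four spaces inserted per hashtag
)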

Ensembler Keyword Arguments: alpha and beta

These arguments are used when a reranker is enabled in the ws.segment method:

  • alpha: This argument assigns the weight for the predictions generated by the segmenter.
  • beta: This argument assigns the weight for the predictions generated by the reranker.

Both alpha and beta can take any values between 0 and 1. If not specified, default values are set such that the segmenter's weight is twice that of the reranker (alpha = 0.222, beta = 0.111). Optimal values can be identified through a grid search on a validation set.
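
For instance, a minimal grid search sketch, where val_hashtags, val_gold, and the accuracy helper are hypothetical placeholders for your own validation data and metric, not hashformers APIs:

import itertools

def accuracy(predictions, gold):
    # Fraction of hashtags segmented exactly as in the gold standard.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

best_score, best_alpha, best_beta = -1.0, None, None
for alpha, beta in itertools.product([0.1, 0.2, 0.3], [0.05, 0.1, 0.15]):
    predictions = ws.segment(val_hashtags, alpha=alpha, beta=beta, use_reranker=True)
    score = accuracy(predictions, val_gold)
    if score > best_score:
        best_score, best_alpha, best_beta = score, alpha, beta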

Other keyword arguments

  • use_reranker

If use_reranker is set to True and a reranker was initialized with the TransformerWordSegmenter object, it will be used during segmentation.

  • return_ranks

The ws.segment method can also return the ranks of the candidate segmentations. To get these ranks, set return_ranks to True. The method then returns a dictionary containing the ranks, the DataFrames used, and the final segmentations. A lower score indicates a better segmentation.

These ranks are useful if you want to combine the ranks from the segmenter and the reranker in a way that differs from the default Top2Ensembler used by the hashformers library. For example, you could convert two or more ranks into the TREC run format and then combine them in different ways using the trectools library.

An alternative to this procedure is to replace the ensembler used by TransformerWordSegmenter with your own custom ensembler, as described in the next section.

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

ranks = ws.segment(
    hashtag_list,
    use_reranker=True,
    return_ranks=True
)

# Segmenter rank, a dataframe
segmenter_rank_df = ranks.segmenter_rank

# Reranker rank, a dataframe
reranker_rank_df = ranks.reranker_rank

You can expect segmenter_rank_df and reranker_rank_df to be pandas DataFrames in the format shown below:

|     | characters            | segmentation               | score     |
| --- | --------------------- | -------------------------- | --------- |
| 0   | latinosinthedeepsouth | latinos in the deep south  | 50.041458 |
| 1   | latinosinthedeepsouth | latino s in the deep south | 53.423897 |
| 2   | latinosinthedeepsouth | latinosin the deep south   | 53.662689 |
| 3   | latinosinthedeepsouth | la tinos in the deep south | 54.122768 |
| 4   | latinosinthedeepsouth | latinos in the deepsouth   | 54.437469 |
| ... | ...                   | ...                        | ...       |
| 905 | weneedanationalpark   | weneed anatio nalpark      | 80.100243 |
| 906 | weneedanationalpark   | weneedanati onalpa rk      | 80.674561 |
| 907 | weneedanationalpark   | weneedanat ionalpa rk      | 81.096085 |
| 908 | weneedanationalpark   | weneedanat ionalpar k      | 82.248749 |
| 909 | weneedanationalpark   | weneedanat iona lpark      | 82.592850 |

910 rows × 3 columns
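
As a hedged sketch of the trectools approach mentioned above (the column mapping assumes the dataframe layout shown in the table, and the TrecRun construction is an assumption about the trectools API rather than a documented recipe):

from trectools import TrecRun, fusion

def to_trec_run(rank_df, system_name):
    # Map the rank dataframe onto the standard TREC run columns.
    df = rank_df.rename(columns={"characters": "query", "segmentation": "docid"})
    df["q0"] = "Q0"
    # Lower scores are better, so rank candidates by ascending score.
    df["rank"] = df.groupby("query")["score"].rank(method="first").astype(int)
    df["system"] = system_name
    run = TrecRun()
    run.run_data = df[["query", "q0", "docid", "rank", "score", "system"]]
    return run

seg_run = to_trec_run(segmenter_rank_df, "segmenter")
rer_run = to_trec_run(reranker_rank_df, "reranker")

# Combine both runs with reciprocal rank fusion.
fused_run = fusion.reciprocal_rank_fusion([seg_run, rer_run])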

Customization

Customizing MiniconsLM

It is possible that you want to calculate log-likelihoods in a way that is different from what is implemented by default in the library. In that case, you can follow the steps below:

  • Initialize TransformerWordSegmenter with both segmenter and reranker model types set to None.

  • Write a class that inherits from MiniconsLM and override the default get_batch_scores(self, batch) method with your own implementation. You can check the implementation of the MiniconsLM class in the hashformers source code.

  • After that, simply use the set_segmenter and (optionally) set_reranker methods in your TransformerWordSegmenter object.

Let's say, for instance, that we want to calculate the log-likelihood of each hashtag after adding the prefix "This is a twitter hashtag: ". By default, hashformers does not add any prefix to the hashtags, but we can customize the MiniconsLM class and make our own custom PrefixedMiniconsLM.

We do this by replacing calls to the scorer's sequence_score method with calls to its partial_score method, as documented in the minicons library.

from hashformers import TransformerWordSegmenter as WordSegmenter
# The import path for MiniconsLM below is assumed from the library's layout.
from hashformers.beamsearch.minicons_lm import MiniconsLM
import warnings

class PrefixedMiniconsLM(MiniconsLM):

    def __init__(self, model_name_or_path, device='cuda', gpu_batch_size=20, model_type='IncrementalLMScorer'):
        super().__init__(model_name_or_path, device=device, gpu_batch_size=gpu_batch_size, model_type=model_type)

    def get_batch_scores(self, batch):
        # Prepend the same prefix to every hashtag in the batch.
        prefixes = ["This is a twitter hashtag: "] * len(batch)

        if self.model_type == 'IncrementalLMScorer':
            # Causal models score the full concatenated prefix + hashtag string.
            batch = [x + y for x, y in zip(prefixes, batch)]
            return self.incremental_sequence_score(batch)
        elif self.model_type == 'MaskedLMScorer':
            # partial_score conditions on the prefix and scores only the hashtag.
            return self.scorer.partial_score(prefixes, batch, reduction=lambda x: -x.sum(0).item())
        elif self.model_type == 'Seq2SeqScorer':
            return self.scorer.partial_score(prefixes, batch, source_format='blank')
        else:
            warnings.warn(f"Model type {self.model_type} not implemented. Assuming reduction = lambda x: -x.sum(0).item()")
            return self.scorer.partial_score(prefixes, batch, reduction=lambda x: -x.sum(0).item())

ws = WordSegmenter(
    segmenter_model_type=None,
    reranker_model_type=None
)

segmenter_minicons_lm = PrefixedMiniconsLM(
    "gpt2", 
    device="cuda", 
    gpu_batch_size=1000, 
    model_type="IncrementalLMScorer"
)

reranker_minicons_lm = PrefixedMiniconsLM(
    "bert-base-cased", 
    device="cuda", 
    gpu_batch_size=1000, 
    model_type="MaskedLMScorer"
)

ws.set_segmenter(segmenter_minicons_lm)

ws.set_reranker(reranker_minicons_lm)

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

segmentations = ws.segment(hashtag_list)

Customizing the Ensembler

You can replace the default Top2Ensembler in a TransformerWordSegmenter by calling ws.set_ensembler(ensembler).

Your ensembler should implement a run method that takes as inputs segmenter_run, reranker_run, and alpha and beta keyword arguments.

segmenter_run and reranker_run are objects of the class ProbabilityDictionary. An object of this class can be initialized with ProbabilityDictionary(dictionary) for a dictionary that has hashtags as keys and scores as values.

The output of your run method should also be a ProbabilityDictionary.
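
For instance, a minimal ProbabilityDictionary can be built from a plain dictionary (the scores below are made up):

from hashformers.beamsearch.data_structures import ProbabilityDictionary

# Candidate segmentations mapped to scores (lower is better).
example_run = ProbabilityDictionary({
    "we need a national park": 50.0,
    "weneed a national park": 62.3
})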

For demonstration purposes, let's implement a RandomEnsembler class that randomly takes scores from the segmenter and the reranker:

from hashformers.beamsearch.data_structures import ProbabilityDictionary
import numpy as np
from hashformers import TransformerWordSegmenter as WordSegmenter

class RandomEnsembler(object):
    def __init__(self):
        pass

    def run(self, segmenter_run, reranker_run, alpha=0.222, beta=0.111):
        segmenter_run_dict = segmenter_run.dictionary
        reranker_run_dict = reranker_run.dictionary

        # Normalize alpha and beta into probabilities that sum to 1.
        alpha_weight = alpha / (alpha + beta)
        beta_weight = 1 - alpha_weight
        weights = [alpha_weight, beta_weight]

        ensemble_dict = {}
        for key, value in segmenter_run_dict.items():
            # Draw 0 (keep the segmenter score) with probability alpha_weight,
            # or 1 (take the reranker score) with probability beta_weight.
            if not np.random.choice(2, 1, p=weights)[0]:
                ensemble_dict[key] = value
            else:
                ensemble_dict[key] = reranker_run_dict.get(key, value)

        return ProbabilityDictionary(ensemble_dict)

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    reranker_model_name_or_path="bert-base-cased",
    reranker_model_type="masked"
)

random_ensembler = RandomEnsembler()
ws.set_ensembler(random_ensembler)

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

segmentations = ws.segment(hashtag_list)

TweetSegmenter

Description

The TweetSegmenter is designed mostly for convenience, for when you don't yet have your own pipeline for extracting hashtags from tweets.

It will extract hashtags from tweets for you and return the tweet text with the hashtags segmented.

Usage

from hashformers import TransformerWordSegmenter as WordSegmenter

from hashformers import TweetSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental"
)

ts = TweetSegmenter(word_segmenter=ws)

tweets = ["Love love love all these people ️ ️ ️ #friends #bff #celebrate #blessed #sundayfunday",
"In the zone | : @user #colorsworldwide #RnBOnly @ The Regent"
]

segmented_tweets = ts.segment(tweets)

Customization

Preprocessing and Word Segmenter keyword arguments

Although the default preprocessing arguments are quite reasonable, it is also possible to control some of the preprocessing keyword arguments for the TweetSegmenter, such as lowercasing all hashtags by passing the key-value pair lower: True in the dictionary supplied to the preprocessing_kwargs argument.

You can also control the segmenter_kwargs that will be passed to the word segmenter supplied to the tweet segmenter during initialization (e.g. a TransformerWordSegmenter).

p_kwargs = {
    "hashtag_token": None,
    "lower": False,
    "separator": " ",
    "hashtag_character": "#"
}

s_kwargs = {
    "topk": 20,
    "steps": 13,
    "alpha": 0.222,
    "beta": 0.111,
    "use_reranker": True,
    "return_ranks": False
}

segmented_tweets = ts.segment(
    tweets,
    regex_flag=0,
    preprocessing_kwargs=p_kwargs,
    segmenter_kwargs=s_kwargs
)

Matcher Customization

Finally, you can also change the matcher that is used to extract hashtags from the tweets. The default TwitterTextMatcher is efficient for most use cases. However, suppose you want to remove any hashtags of 10 or more characters from your tweets and skip segmenting them. You can insert this preprocessing step in your own matcher and then pass it to the TweetSegmenter.

from ttp import ttp
from hashformers import TransformerWordSegmenter as WordSegmenter
from hashformers import TweetSegmenter

class TrimmedTwitterTextMatcher(object):
    
    def __init__(self):
        """
        Initializes TwitterTextMatcher object with a Parser from ttp module.
        """
        self.parser = ttp.Parser()
    
    def __call__(self, tweets):
        """
        Makes the TrimmedTwitterTextMatcher instance callable. It parses the
        given tweets and returns their hashtags, dropping long ones.

        Args:
            tweets: A list of strings, where each string is a tweet.

        Returns:
            A list of hashtags shorter than 10 characters for each tweet.
        """
        output = []
        hashtags = [self.parser.parse(x).tags for x in tweets]
        for hashtag_list in hashtags:
            # Filter each tweet's hashtags, keeping only the short ones.
            output.append([x for x in hashtag_list if len(x) < 10])
        return output

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental"
)

matcher = TrimmedTwitterTextMatcher()

ts = TweetSegmenter(word_segmenter=ws, matcher=matcher)

tweets = ["Love love love all these people ️ ️ ️ #friends #bff #celebrate #blessed #sundayfunday",
"In the zone | : @user #colorsworldwide #RnBOnly @ The Regent"
]

segmented_tweets = ts.segment(tweets)

RegexWordSegmenter

The RegexWordSegmenter is a hashtag segmentation tool that utilizes regex rules for segmentation. While the primary emphasis of our library is on transformer model-based hashtag segmentation, this class serves as a practical fallback or baseline solution.

The RegexWordSegmenter allows you to provide a custom list of regex rules. These rules are applied in the order they are supplied, meaning the first rule will be applied before the second, and so forth. By default, if no list of regex rules is provided, the segmenter segments based on uppercase characters.

from hashformers import RegexWordSegmenter as WordSegmenter

ws = WordSegmenter(regex_rules=[r'([A-Z]+)', r'(\d+)'])

hashtags = ["#3DModel", "#2023Graduation"]

segmentations = ws.segment(hashtags)

In the code snippet above, the first rule splits the hashtag at every capital letter, and the second rule separates any string of digits from the rest of the hashtag. The output for the #3DModel hashtag would be "3 D Model" and for #2023Graduation, it would be "2023 Graduation".
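
If you omit the regex_rules argument entirely, the segmenter falls back to its default rule and splits on uppercase characters, as noted above. A minimal sketch of that default behavior:

from hashformers import RegexWordSegmenter as WordSegmenter

# With no rules supplied, segmentation splits on uppercase characters,
# so "#HelloWorld" is expected to become "Hello World".
ws_default = WordSegmenter()
segmentations = ws_default.segment(["#HelloWorld"])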