# Segmenters
This portion of the documentation outlines the various segmenter classes available in the library.
The `TransformerWordSegmenter` is the main segmenter class for performing hashtag segmentation in our library. You can optionally re-rank the segmentations produced by the segmenter in `TransformerWordSegmenter` using a reranker model. The strategy for combining segmenter and reranker scores is defined by the ensembler, the third component of the `TransformerWordSegmenter`.
In simple terms, the segmenter explores a subset of potential segmentations by implementing a beamsearch algorithm, while the reranker generates scores for the best segmentations identified by the segmenter. This architecture is described in our associated paper.
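As a rough illustration (a sketch only, not the library's actual implementation), the beamsearch over space insertions looks like the snippet below, where `score` stands in for a language model that assigns lower scores to better segmentations:

```python
# Illustration only: beamsearch over space insertions.
# `score` is a placeholder for a language model scoring function
# (lower is better, matching the library's convention).

def candidates(segmentation):
    """Yield every segmentation obtained by inserting one more space."""
    for i in range(1, len(segmentation)):
        if segmentation[i] != " " and segmentation[i - 1] != " ":
            yield segmentation[:i] + " " + segmentation[i:]

def beamsearch(hashtag, score, topk=20, steps=13):
    """Keep the topk best candidates (lowest score) at each depth."""
    beam = [hashtag]
    for _ in range(steps):
        expanded = {c for s in beam for c in candidates(s)}
        if not expanded:
            break
        beam = sorted(set(beam) | expanded, key=score)[:topk]
    return beam
```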
Our research indicates that masked language models are effective as rerankers, but not as segmenters. For example, a suitable configuration could be to use a `gpt2` model for segmentation and a `bert` model for reranking:
```python
from hashformers import TransformerWordSegmenter as WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    segmenter_device="cuda",
    reranker_device="cuda",
    segmenter_gpu_batch_size=1000,
    reranker_gpu_batch_size=1000,
    reranker_model_name_or_path="bert-base-cased",
    reranker_model_type="masked"
)
```
Note: The `segmenter_gpu_batch_size` and `reranker_gpu_batch_size` parameters set the batch size on the CPU if `segmenter_device` and `reranker_device` are set to `cpu` instead of `cuda`. These arguments retain their original names for backward compatibility with previous library versions.
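For instance, a CPU-only setup could look like the sketch below; the batch sizes here are illustrative and may need tuning for your hardware:

```python
# Same configuration as above, but running entirely on the CPU.
# Batch sizes are illustrative; tune them for your hardware.
ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    segmenter_device="cpu",
    reranker_device="cpu",
    segmenter_gpu_batch_size=100,
    reranker_gpu_batch_size=100,
    reranker_model_name_or_path="bert-base-cased",
    reranker_model_type="masked"
)
```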
The `segment` method can be called directly by providing a list of hashtags. However, you can also adjust its keyword arguments to modify the segmentation speed or the balance between the segmenter and the reranker:
```python
segmentations = ws.segment([
        "#weneedanationalpark",
        "#icecold"
    ],
    topk=20,
    steps=13,
    alpha=0.222,
    beta=0.111,
    use_reranker=True,
    return_ranks=False
)
```
The function of each of these keyword arguments is explained in the following subsections.
The `topk` parameter determines how many candidate segmentations to pass to the next iteration of the beamsearch tree. For example, if `topk` is set to 20, a maximum of 20 best candidate segmentations will be passed to the next step. Each iteration of the beamsearch algorithm introduces a space at every possible location where a space hasn't been inserted yet.
The `steps` parameter defines the maximum depth of the beamsearch tree, i.e., the maximum number of spaces that will be inserted into a hashtag.
Understanding your dataset's characteristics allows you to adjust `topk` and `steps` to accelerate hashtag segmentation; a sketch of deriving both values from the data follows the list below.
- Ideally, `topk` should match the length of the largest hashtag in your dataset. However, you can set it to a lower value, accepting the risk of overlooking some correct candidates.
- `steps` should be equivalent to the maximum expected number of spaces in a hashtag. Ideally, it should equal `topk` and the length of the largest hashtag in your dataset. Yet, you can choose lower values, particularly if you are certain that your data's hashtags will contain a specific maximum number of words.
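For example, assuming `hashtags` is your list of hashtags, both values could be derived from the data itself:

```python
# Sketch: derive topk and steps from the longest hashtag in the dataset.
max_length = max(len(hashtag.lstrip("#")) for hashtag in hashtags)

segmentations = ws.segment(
    hashtags,
    topk=max_length,
    steps=max_length
)
```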
The following arguments are used when a reranker is utilized in the `ws.segment` method:
- `alpha`: this argument assigns the weight for the predictions generated by the segmenter.
- `beta`: this argument assigns the weight for the predictions generated by the reranker.
Both `alpha` and `beta` can take any value between 0 and 1. If not specified, default values are set such that the segmenter's weight is twice that of the reranker (`alpha = 0.222`, `beta = 0.111`). Optimal values can be identified through a grid search on a validation set.
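For instance, a minimal grid search could look like the sketch below, assuming `val_hashtags` and `val_gold` are a hypothetical validation set of hashtags and their gold segmentations:

```python
import itertools

# Hypothetical validation set: hashtags and their gold segmentations.
val_hashtags = ["#weneedanationalpark", "#icecold"]
val_gold = ["we need a national park", "ice cold"]

best_accuracy, best_params = 0.0, None
for alpha, beta in itertools.product([0.1, 0.2, 0.3], [0.05, 0.1, 0.15]):
    predictions = ws.segment(val_hashtags, alpha=alpha, beta=beta)
    # Depending on your hashformers version, the list of segmentations
    # may instead be available as `predictions.output`.
    correct = sum(p == g for p, g in zip(predictions, val_gold))
    accuracy = correct / len(val_gold)
    if accuracy > best_accuracy:
        best_accuracy, best_params = accuracy, (alpha, beta)
```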
### `use_reranker`
If `use_reranker` is set to `True`, the reranker, if initialized with the `TransformerWordSegmenter` object, will be used during the segmentation.
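For example, to run the segmenter alone and skip reranking even when a reranker was initialized:

```python
segmentations = ws.segment(hashtag_list, use_reranker=False)
```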
### `return_ranks`
The `ws.segment` method can also return the ranks of the different segmentations. To get these ranks, set `return_ranks` to `True`. The method then returns a dictionary containing the ranks, the DataFrame used, and the final segmentations. A lower score indicates a better segmentation.
These ranks are useful if you want to combine the ranks from the segmenter and the reranker in a way that differs from the default `Top2Ensembler` used by the hashformers library. For example, you could convert two or more ranks into the TREC format and then combine them in different ways using the trectools library (a sketch follows the example below).
An alternative to this procedure is to replace the ensembler used by `TransformerWordSegmenter` with your own custom ensembler, as described in the next section.
```python
hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

ranks = ws.segment(
    hashtag_list,
    use_reranker=True,
    return_ranks=True
)

# Segmenter rank, a dataframe
segmenter_rank_df = ranks.segmenter_rank

# Reranker rank, a dataframe
reranker_rank_df = ranks.reranker_rank
```
You can expect to see a `segmenter_rank_df` and a `reranker_rank_df` in the format of the pandas DataFrame below:
```
     characters             segmentation                     score
0    latinosinthedeepsouth  latinos in the deep south    50.041458
1    latinosinthedeepsouth  latino s in the deep south   53.423897
2    latinosinthedeepsouth  latinosin the deep south     53.662689
3    latinosinthedeepsouth  la tinos in the deep south   54.122768
4    latinosinthedeepsouth  latinos in the deepsouth     54.437469
..   ...                    ...                          ...
905  weneedanationalpark    weneed anatio nalpark        80.100243
906  weneedanationalpark    weneedanati onalpa rk        80.674561
907  weneedanationalpark    weneedanat ionalpa rk        81.096085
908  weneedanationalpark    weneedanat ionalpar k        82.248749
909  weneedanationalpark    weneedanat iona lpark        82.592850

910 rows × 3 columns
```
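As an illustration of the TREC-based workflow mentioned above, the sketch below writes such a rank DataFrame in the standard TREC run format; `write_trec_run` is a hypothetical helper of our own, not part of hashformers:

```python
def write_trec_run(rank_df, path, run_name):
    """Hypothetical helper: write a rank DataFrame in the TREC run format,
    i.e. one `query Q0 document rank score run_name` line per row."""
    with open(path, "w") as f:
        for characters, group in rank_df.groupby("characters"):
            # Lower scores are better in hashformers, so sort ascending
            # and negate the score: TREC expects higher to mean better.
            group = group.sort_values("score")
            for rank, row in enumerate(group.itertuples(), start=1):
                doc_id = row.segmentation.replace(" ", "_")
                f.write(f"{characters} Q0 {doc_id} {rank} {-row.score} {run_name}\n")

write_trec_run(segmenter_rank_df, "segmenter.run", "segmenter")
write_trec_run(reranker_rank_df, "reranker.run", "reranker")
```

The resulting run files can then be loaded and fused with the trectools library.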
It is possible that you want to calculate log-likelihoods in a way that is different from what is implemented by default in the library. In that case, you can follow the steps below:
- Initialize `TransformerWordSegmenter` with both segmenter and reranker model types initialized to `None`.
- Write a class that inherits from `MiniconsLM` and make your own implementation of the method `get_batch_scores(self, batch)`, overriding the default method. You can check the implementation for the `MiniconsLM` class here.
- After that, simply use the `set_segmenter` and (optionally) `set_reranker` methods on your `TransformerWordSegmenter` object.
Let's say, for instance, that I want to calculate the log-likelihood of each hashtag after adding the prefix `"This is a twitter hashtag: "`. By default, hashformers does not add any prefix to the hashtags, but we can customize the `MiniconsLM` class and make our own custom `PrefixedMiniconsLM`.
This is done by replacing the calls to the scorer's `sequence_score` method with calls to its `partial_score` method, as documented here in the minicons library.
```python
import warnings

from hashformers import TransformerWordSegmenter as WordSegmenter
# Adjust this import path to your hashformers version if needed.
from hashformers.beamsearch.minicons_lm import MiniconsLM


class PrefixedMiniconsLM(MiniconsLM):

    def __init__(self, model_name_or_path, device='cuda', gpu_batch_size=20, model_type='IncrementalLMScorer'):
        super().__init__(model_name_or_path, device=device, gpu_batch_size=gpu_batch_size, model_type=model_type)

    def get_batch_scores(self, batch):
        prefixes = ["This is a twitter hashtag: "] * len(batch)
        if self.model_type == 'IncrementalLMScorer':
            batch = [x + y for x, y in zip(prefixes, batch)]
            return self.incremental_sequence_score(batch)
        elif self.model_type == 'MaskedLMScorer':
            return self.scorer.partial_score(prefixes, batch, reduction=lambda x: -x.sum(0).item())
        elif self.model_type == 'Seq2SeqScorer':
            return self.scorer.partial_score(prefixes, batch, source_format='blank')
        else:
            warnings.warn(f"Model type {self.model_type} not implemented. Assuming reduction = lambda x: -x.sum(0).item()")
            return self.scorer.partial_score(prefixes, batch, reduction=lambda x: -x.sum(0).item())


ws = WordSegmenter(
    segmenter_model_type=None,
    reranker_model_type=None
)

segmenter_minicons_lm = PrefixedMiniconsLM(
    "gpt2",
    device="cuda",
    gpu_batch_size=1000,
    model_type="IncrementalLMScorer"
)

reranker_minicons_lm = PrefixedMiniconsLM(
    "bert-base-cased",
    device="cuda",
    gpu_batch_size=1000,
    model_type="MaskedLMScorer"
)

ws.set_segmenter(segmenter_minicons_lm)
ws.set_reranker(reranker_minicons_lm)

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

segmentations = ws.segment(hashtag_list)
```
You can replace the default `Top2Ensembler` in a `TransformerWordSegmenter` by calling `ws.set_ensembler(ensembler)`.
Your ensembler should implement a `run` method that takes as inputs `segmenter_run` and `reranker_run`, and the `alpha` and `beta` keyword arguments.
`segmenter_run` and `reranker_run` are objects of the class `ProbabilityDictionary`. An object of this class can be initialized with `ProbabilityDictionary(dictionary)` for a dictionary that has hashtags as keys and scores as values.
The output of your `run` method should also be a `ProbabilityDictionary`.
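For instance, a run could be built by hand from candidate segmentations and their scores (the values below are made up, just to illustrate the structure):

```python
from hashformers.beamsearch.data_structures import ProbabilityDictionary

# Made-up scores; lower is better.
segmenter_run = ProbabilityDictionary({
    "we need a national park": 52.38,
    "we need a nationalpark": 57.91
})
```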
For demonstration purposes, let's implement a `RandomEnsembler` class that randomly takes scores from the segmenter and the reranker:
```python
import numpy as np

from hashformers import TransformerWordSegmenter as WordSegmenter
from hashformers.beamsearch.data_structures import ProbabilityDictionary


class RandomEnsembler(object):

    def __init__(self):
        pass

    def run(self, segmenter_run, reranker_run, alpha=0.222, beta=0.111):
        segmenter_run_dict = segmenter_run.dictionary
        reranker_run_dict = reranker_run.dictionary
        # Normalize the weights so that they sum to 1.
        alpha_weight = alpha / (alpha + beta)
        beta_weight = 1 - alpha_weight
        weights = [alpha_weight, beta_weight]
        ensemble_dict = {}
        for key, value in segmenter_run_dict.items():
            # Pick the segmenter score with probability alpha_weight,
            # otherwise fall back to the reranker score for the same key.
            if not np.random.choice(2, 1, replace=False, p=weights)[0]:
                ensemble_dict[key] = value
            else:
                ensemble_dict[key] = reranker_run_dict.get(key, value)
        return ProbabilityDictionary(ensemble_dict)


ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental",
    reranker_model_name_or_path="bert-base-cased",
    reranker_model_type="masked"
)

random_ensembler = RandomEnsembler()
ws.set_ensembler(random_ensembler)

hashtag_list = [
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark"
]

segmentations = ws.segment(hashtag_list)
```
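A more practical custom ensembler could instead combine the two scores as a weighted sum, in the spirit of the alpha/beta weighting described earlier. Below is a sketch; `WeightedSumEnsembler` is our own name, not a class shipped with hashformers:

```python
class WeightedSumEnsembler(object):

    def run(self, segmenter_run, reranker_run, alpha=0.222, beta=0.111):
        segmenter_run_dict = segmenter_run.dictionary
        reranker_run_dict = reranker_run.dictionary
        ensemble_dict = {}
        for key, segmenter_score in segmenter_run_dict.items():
            # Fall back to the segmenter score when the reranker
            # did not score this candidate. Both scores are
            # lower-is-better, so their weighted sum is too.
            reranker_score = reranker_run_dict.get(key, segmenter_score)
            ensemble_dict[key] = alpha * segmenter_score + beta * reranker_score
        return ProbabilityDictionary(ensemble_dict)


ws.set_ensembler(WeightedSumEnsembler())
```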
The tweet segmenter is designed mostly for convenience if you don't yet have your own pipeline for extracting hashtags from tweets. The `TweetSegmenter` will extract hashtags from tweets for you and return the tweet text with the segmented hashtags.
```python
from hashformers import TransformerWordSegmenter as WordSegmenter
from hashformers import TweetSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental"
)

ts = TweetSegmenter(word_segmenter=ws)

tweets = [
    "Love love love all these people ️ ️ ️ #friends #bff #celebrate #blessed #sundayfunday",
    "In the zone | : @user #colorsworldwide #RnBOnly @ The Regent"
]

segmented_tweets = ts.segment(tweets)
```
Although the default preprocessing arguments are quite reasonable, it is also possible to control some of the preprocessing keyword arguments for the `TweetSegmenter`, such as setting all hashtags to lowercase by passing the key-value pair `"lower": True` in the dictionary of the `preprocessing_kwargs` argument.
You can also control the `segmenter_kwargs` that will be passed to the word segmenter model that was supplied to the tweet segmenter during initialization (e.g., a `TransformerWordSegmenter`).
```python
p_kwargs = {
    "hashtag_token": None,
    "lower": False,
    "separator": " ",
    "hashtag_character": "#"
}

s_kwargs = {
    "topk": 20,
    "steps": 13,
    "alpha": 0.222,
    "beta": 0.111,
    "use_reranker": True,
    "return_ranks": False
}

segmented_tweets = ts.segment(
    tweets,
    regex_flag=0,
    preprocessing_kwargs=p_kwargs,
    segmenter_kwargs=s_kwargs
)
```
Finally, you can also change the matcher that is used to extract hashtags from the tweets. The default `TwitterTextMatcher` will be efficient for most use cases. However, let's say, for instance, that you want to remove any hashtags with 10 or more characters from your tweets and that you don't want to segment them. You can insert this preprocessing step in your matcher and then pass it to the `TweetSegmenter`.
```python
from ttp import ttp

from hashformers import TransformerWordSegmenter as WordSegmenter
from hashformers import TweetSegmenter


class TrimmedTwitterTextMatcher(object):

    def __init__(self):
        """
        Initializes the matcher with a Parser from the ttp module.
        """
        self.parser = ttp.Parser()

    def __call__(self, tweets):
        """
        Makes the TrimmedTwitterTextMatcher instance callable. It parses the
        given tweets and returns their tags.

        Args:
            tweets: A list of strings, where each string is a tweet.

        Returns:
            A list of hashtags for each tweet.
        """
        output = []
        hashtags = [self.parser.parse(x).tags for x in tweets]
        for hashtag_list in hashtags:
            # Keep only the hashtags with fewer than 10 characters.
            output.append(
                [x for x in hashtag_list if len(x) < 10]
            )
        return output


ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    segmenter_model_type="incremental"
)

matcher = TrimmedTwitterTextMatcher()
ts = TweetSegmenter(word_segmenter=ws, matcher=matcher)

tweets = [
    "Love love love all these people ️ ️ ️ #friends #bff #celebrate #blessed #sundayfunday",
    "In the zone | : @user #colorsworldwide #RnBOnly @ The Regent"
]

segmented_tweets = ts.segment(tweets)
```
The `RegexWordSegmenter` is a hashtag segmentation tool that utilizes regex rules for segmentation. While the primary emphasis of our library is on transformer-based hashtag segmentation, this class serves as a practical fallback or baseline solution.

The `RegexWordSegmenter` allows you to provide a custom list of regex rules. These rules are applied in the order they are supplied, meaning the first rule will be applied before the second, and so forth. By default, if no list of regex rules is provided, the segmenter segments based on uppercase characters.
```python
from hashformers import RegexWordSegmenter as WordSegmenter

ws = WordSegmenter(regex_rules=[r'([A-Z]+)', r'\d+'])

hashtags = ["#3DModel", "#2023Graduation"]
segmentations = ws.segment(hashtags)
```
In the code snippet above, the first rule splits the hashtag at every capital letter, and the second rule separates any string of digits from the rest of the hashtag. The output for the `#3DModel` hashtag would be `"3 D Model"` and for `#2023Graduation`, it would be `"2023 Graduation"`.