I added some sequences of repeated characters as user-defined tokens to a Unigram model. Now, when tokenizing with sampling, I get unexpected behavior as I increase nbest_size. I believe this is a bug. Can you please confirm and suggest a workaround?
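For context, our_tokenizer is a sentencepiece.SentencePieceProcessor built roughly as below. The corpus path, vocab size, and the exact list of user-defined symbols are placeholders for illustration, not my real setup.

import sentencepiece as spm

# Train a Unigram model with runs of repeated characters registered as
# user-defined symbols (placeholder corpus path, vocab size, and symbol list).
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='our_model',
    model_type='unigram',
    vocab_size=8000,
    user_defined_symbols=['++++++++', '++++++++++++++++'],
)

our_tokenizer = spm.SentencePieceProcessor(model_file='our_model.model')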
Below, my string is '+' repeated 16 times. When calling with nbest_size=2, I get the two most plausible segmentations: '+' repeated 16 times (which is itself a token), and '+' repeated 8 times twice, once with and once without the leading meta space, as shown below.
import pprint
from collections import Counter

LONG_STR = '++++++++++++++++'  # '+' repeated 16 times

def print_nbest(mystr, n):
    # Sample 100 tokenizations and count how often each segmentation appears.
    # our_tokenizer is the SentencePieceProcessor loaded above.
    possible_tokenization = []
    for _ in range(100):
        tokenization = our_tokenizer.encode_as_pieces(mystr, enable_sampling=True, nbest_size=n)
        possible_tokenization.append(" ".join(tokenization))
    pprint.pprint(Counter(possible_tokenization))

print_nbest(LONG_STR, 2)
# Result: Counter({'▁++++++++ ++++++++': 54, '▁++++++++++++++++': 46})
Now, when calling the same function with a larger nbest_size, these "top two" tokenizations fall much further down the list. This does not make sense to me: if each user-defined symbol is added with the same high probability, then segmentations with more tokens should be less likely, since the piece probabilities multiply.
Furthermore, varying alpha between 0.01 and 0.99 does not lead to a predictable change in the peakiness of the sampling distribution. Observe the results of setting alpha=0.01 and alpha=0.99 for four runs of 100 samples each; the only difference between the two function calls is alpha.
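For reference, here is what I would expect, assuming the sampler draws a candidate segmentation x from the nbest list with probability roughly proportional to P(x)^alpha, as in the subword-regularization scheme. The log-probabilities below are made up for illustration and are not taken from my model.

import numpy as np

# Hypothetical unigram log-probabilities (NOT from the real model) for three
# candidate segmentations of '+' * 16, made of 1, 2, and 4 pieces respectively.
candidate_logprobs = np.array([-5.0, -10.0, -20.0])  # fewer pieces -> higher log P(x)

def expected_sampling_dist(logprobs, alpha):
    # Weight each candidate by P(x)^alpha, i.e. a softmax over alpha * log P(x).
    scores = alpha * logprobs
    scores -= scores.max()  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

for alpha in (0.01, 0.99):
    print(alpha, expected_sampling_dist(candidate_logprobs, alpha).round(3))

# With alpha=0.01 the candidates should be sampled almost uniformly; with
# alpha=0.99 the single-piece segmentation should dominate. That is the
# change in peakiness I expected to see, but do not.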
Finally, if I request an nbest_size of 1000, sampling fails:
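The call that fails is just the same helper with nbest_size=1000:

print_nbest(LONG_STR, 1000)  # sampling fails here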