AK-47, F-35, how tokenized? #13
Comments
I'm thinking either However, when it gets to plural terms, it is very strange to do something like Therefore, I would go with the entire term being tokenized as one. Thoughts? |
fwiw, EWT and LDC do the opposite. i for one am strongly against AK - 47 but unfortunately they don't seem to want to go with a "specific name" exemption for hyphens |
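Based on that conversation, didn't they end up agreeing that for compounds where a hyphen is inherent to the name, e.g. AK-47 or Jae-hoon, it should be kept as one name? That's what I took away from skimming, at least. I'm not a fan of separating it into AK - 47, so if there's no conclusion from that thread, what do you think about us taking our own lead on this and tokenizing these cases as one token? |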
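To make the two options for something like AK-47 concrete, here is a minimal sketch of both behaviours side by side. It is an illustration only, not this project's tokenizer: the `tokenize` function, the `split_hyphens` flag, and the `DESIGNATOR` pattern (uppercase letters, a hyphen, digits, optional plural "s") are all hypothetical, and the pattern would not catch names like Jae-hoon, which would need a separate exemption list.

```python
import re

# Assumption (not from the thread): a "designator" is one or more uppercase
# letters, a hyphen, then digits, optionally followed by a plural "s".
DESIGNATOR = r"[A-Z]+-\d+s?"
WORD = r"\w+"
PUNCT = r"[^\w\s]"


def tokenize(text, split_hyphens=False):
    """Split text into tokens.

    With split_hyphens=False, designators such as AK-47 stay as one token.
    With split_hyphens=True, every hyphen becomes its own token.
    """
    if split_hyphens:
        pattern = f"{WORD}|{PUNCT}"
    else:
        # The designator alternative is tried first so it wins over WORD.
        pattern = f"{DESIGNATOR}|{WORD}|{PUNCT}"
    return re.findall(pattern, text)


if __name__ == "__main__":
    sentence = "He flew an F-35 and carried AK-47s."
    print(tokenize(sentence))                      # one token per designator
    print(tokenize(sentence, split_hyphens=True))  # hyphens split off
```

Running both modes over the same sentence shows how the split variant handles a plural form, which may help when comparing against whatever the runtime tokenizer produces.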
on the one hand, yes, but on the other hand, our tokenizer won't agree with
the pieces at runtime. maybe that shouldn't be a consideration
…On Thu, Aug 22, 2024, 9:28 AM Alex Shan ***@***.***> wrote:
Based on that conversation, didn't they end up agreeing that for compounds
where a hyphen is inherent to the name, e.g. AK-47 or Jae-hoon, it should
be kept as one name? That's what I took away from skimming, at least.
I'm not a fan of separating it into AK - 47, so if there's no conclusion
from that thread, what do you think about us taking our own lead on this
and tokenizing these cases as one token?
kind of weird seeing it as
on four separate lines