AK-47, F-35, how tokenized? #13

Open
AngledLuffa opened this issue Aug 17, 2024 · 4 comments

@AngledLuffa
Contributor

kind of weird seeing it as

F
-
35
s

on four separate lines

@SecroLoL
Collaborator

I'm thinking either one token each (AK-47, F-35) or split at the hyphen (AK - 47).

However, when it gets to plural terms, it is very strange to end up with something like
AK - 47 s, compared to AK-47s as one token.

Therefore, I would go with the entire term being tokenized as one. Thoughts?
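
To make the two options concrete, here is a minimal sketch in plain Python. It is not this project's tokenizer or any treebank's official rule. The `DESIGNATION` pattern and both function names are hypothetical, and the pattern only covers ASCII "letters-hyphen-digits" names with an optional plural s.

```python
import re

# Hypothetical pattern (an assumption for illustration only):
# letters, a hyphen, digits, and an optional plural "s", e.g. AK-47s.
DESIGNATION = r"[A-Za-z]+-\d+s?"

def tokenize_whole(text):
    """Keep hyphenated designations such as AK-47s as a single token."""
    # Try the designation first, then fall back to words and punctuation.
    return re.findall(DESIGNATION + r"|\w+|[^\w\s]", text)

def tokenize_split(text):
    """Split at hyphens and at letter/digit boundaries, e.g. AK - 47 s."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)

print(tokenize_whole("Two F-35s and an AK-47."))
print(tokenize_split("F-35s"))
```

The second call gives the four-token result described at the top of this issue, while the first keeps F-35s and AK-47 whole. Names like Jae-hoon are not covered by this pattern and would need a separate rule.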

@AngledLuffa
Contributor Author

fwiw, EWT and LDC do the opposite and split at the hyphen. I for one am strongly against AK - 47, but unfortunately they don't seem to want to go with a "specific name" exemption for hyphens

UniversalDependencies/UD_English-EWT#204

@SecroLoL
Collaborator

Based on that conversation, didn't they end up agreeing that for compounds where a hyphen is inherent to the name, e.g. AK-47 or Jae-hoon, it should be kept as one name? That's what I took away from skimming, at least.

I'm not a fan of separating it into AK - 47, so if there's no conclusion from that thread, what do you think about us taking our own lead on this and tokenizing these cases as one token?

@AngledLuffa
Contributor Author

AngledLuffa commented Aug 22, 2024 via email
