-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Tokens #93
Added Tokens #93
Conversation
The other split helpers we have don't work for capture groups. We had to resort to raw `NSRegularExpression`s
We need some custom behaviour that's not in the config :(
func preTokenize(text: String, firstSection: Bool) -> [String] | ||
func preTokenize(texts: [String], firstSection: Bool) -> [String] | ||
func callAsFunction(texts: [String], firstSection: Bool) -> [String] | ||
func callAsFunction(text: String, firstSection: Bool) -> [String] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I had to change this public protocol so that MetaspacePreTokenizer
can conditionally work depending on whether we are handling the first split. This is necessary because the input text is split by the added tokens first, and then all the chunks go through the tokenization pipeline. Do you think this will break something in your codebases @ZachNagengast @awni?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidkoski wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any easy way to provide a default value so it's optional? In any case WhisperKit doesn't use these directly so I'd expect it to be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A default value is provided in a protocol extension, but the protocol definition itself cannot have default values, unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine for mlx-swift. I wonder if this should be an enum instead of a bool? I don't know if there are other "contexts" that might be used...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call @davidkoski, I did check the transformers.js port and didn't see any other context options, but better to play it safe.
@@ -277,6 +276,42 @@ extension String { | |||
return result | |||
} | |||
|
|||
/// This version supports capture groups, wheres the one above doesn't | |||
func split(by captureRegex: NSRegularExpression) -> [String] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bit annoying, it's just another version of the previous split
helper we wrote that works with capture groups.
/// Class overrides for custom behaviour | ||
/// Not to be confused with the TokenizerModel classes defined in TokenizerModel | ||
static let tokenizerClasses: [String : PreTrainedTokenizer.Type] = [ | ||
"LlamaTokenizer": LlamaPreTrainedTokenizer.self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be tokenizer subclasses now (not just tokenizer models)
Thanks for the quick fix @pcuenca 🚀 🤗 |
import Hub | ||
|
||
class AddedTokensTests: XCTestCase { | ||
func testPhiAddedEnd() async throws { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test now passes, and it produces the same results as Python.
The CI workers are down, but all tests pass locally. I'll wait a few hours to merge unless you think it's urgent. Also cc @DanThePutzer, @MatthewWaller in case they want to test or review. |
Fixes #92