Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Tokens #93

Merged
merged 6 commits into from
Apr 28, 2024
Merged

Added Tokens #93

merged 6 commits into from
Apr 28, 2024

Conversation

pcuenca
Copy link
Member

@pcuenca pcuenca commented Apr 26, 2024

Fixes #92

pcuenca added 4 commits April 26, 2024 19:50
The other split helpers we have don't work for capture groups.
We had to resort to raw `NSRegularExpression`s
We need some custom behaviour that's not in the config :(
Comment on lines 12 to 15
func preTokenize(text: String, firstSection: Bool) -> [String]
func preTokenize(texts: [String], firstSection: Bool) -> [String]
func callAsFunction(texts: [String], firstSection: Bool) -> [String]
func callAsFunction(text: String, firstSection: Bool) -> [String]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I had to change this public protocol so that MetaspacePreTokenizer can conditionally work depending on whether we are handling the first split. This is necessary because the input text is split by the added tokens first, and then all the chunks go through the tokenization pipeline. Do you think this will break something in your codebases @ZachNagengast @awni?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidkoski wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any easy way to provide a default value so it's optional? In any case WhisperKit doesn't use these directly so I'd expect it to be fine

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A default value is provided in a protocol extension, but the protocol definition itself cannot have default values, unfortunately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine for mlx-swift. I wonder if this should be an enum instead of a bool? I don't know if there are other "contexts" that might be used...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call @davidkoski, I did check the transformers.js port and didn't see any other context options, but better to play it safe.

@@ -277,6 +276,42 @@ extension String {
return result
}

/// This version supports capture groups, wheres the one above doesn't
func split(by captureRegex: NSRegularExpression) -> [String] {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a bit annoying, it's just another version of the previous split helper we wrote that works with capture groups.

/// Class overrides for custom behaviour
/// Not to be confused with the TokenizerModel classes defined in TokenizerModel
static let tokenizerClasses: [String : PreTrainedTokenizer.Type] = [
"LlamaTokenizer": LlamaPreTrainedTokenizer.self
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be tokenizer subclasses now (not just tokenizer models)

@awni
Copy link

awni commented Apr 26, 2024

Thanks for the quick fix @pcuenca 🚀 🤗

import Hub

class AddedTokensTests: XCTestCase {
func testPhiAddedEnd() async throws {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now passes, and it produces the same results as Python.

@pcuenca
Copy link
Member Author

pcuenca commented Apr 26, 2024

The CI workers are down, but all tests pass locally. I'll wait a few hours to merge unless you think it's urgent.

Also cc @DanThePutzer, @MatthewWaller in case they want to test or review.

@pcuenca pcuenca merged commit 0a606f5 into main Apr 28, 2024
1 check failed
@pcuenca pcuenca deleted the added-tokens branch April 28, 2024 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure added tokens are supported
4 participants