-
Notifications
You must be signed in to change notification settings - Fork 149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Load encoded file if BPE is nil in TextUnsupervised #490
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the complexity issues are caused by the BytePairEncoder
implementation, it would be good to find a way to cache at that level, which would impact everything that references it. That would be my preference, but as a shortcut for benchmarks and quick prototyping, I can see how this PR makes sense (even with a bpe cache).
This approach effectively short-circuits the byte pair encoder entirely, by storing a preprocessed dataset. If you want to go this route for now, it would be worth storing this file in GCS (we might want to consider hosting the data files as well) and adding the path to TextUnsupervisedVariantDetails
, rather than passing it in. This warrants reconfiguring the init()
methods. For example, the first one creates an empty bpe, which isn't really useful for anything. Instead, this preprocessed dataset could be loaded, which would make the bpe optional.
Great point:) Updated with EncodedWikiText2Details. |
@@ -57,9 +57,19 @@ public struct TextUnsupervised { | |||
var fileExtension = "tgz" | |||
} | |||
|
|||
private struct EncodedWikiText2Details: TextUnsupervisedVariantDetails { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a new variant, how about replacing filename
with rawFilename
and adding var encodedFilename
? That way, the bpe nil check occurs closer to the embedding()
call, which is a bit clearer IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍Added encodedFilename as optional in newer change. Does it LG?
Store encoded data in files with one integer per line. Read them using a custom function instead of the deprecated NSArray(contentsOf:). Simplify precondition.
/// - Parameter batchSize: number of sequences in a batch. | ||
/// - Parameter sequenceLength: number of characters in a sequence. | ||
/// - Parameter documentCount: number of documents to proceed. (Refer func readCSV() to see how | ||
/// a text file is chunked into documents.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation is a big improvement 👍 Thank you!!
Text encoding with Byte Pair Encoder has O(N*N) time complexity https://github.com/tensorflow/swift-models/blob/master/Support/Text/BytePairEncoder.swift#L101. Encoding for full WikiText2 training and testing dataset can take ~100 minutes.
We wanted to make the cache optional where if the encoded text provided then load them directly.