
Load encoded file if BPE is nil in TextUnsupervised #490

Merged: 6 commits, May 9, 2020
Conversation

xihui-wu
Contributor

@xihui-wu xihui-wu commented May 6, 2020

Text encoding with the Byte Pair Encoder has O(N*N) time complexity (https://github.com/tensorflow/swift-models/blob/master/Support/Text/BytePairEncoder.swift#L101). Encoding the full WikiText2 training and test datasets can take ~100 minutes.

We want to make the BPE step optional: if pre-encoded text is provided, load it directly instead of re-encoding.
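As a rough illustration of the intent (a hedged sketch, not the PR's implementation; the function and parameter names are placeholders, and the encoder is modeled as an optional closure so the example stays self-contained):

```swift
// Hedged sketch of the intended flow: if no encoder is available, fall back to
// token IDs that were encoded offline instead of running the slow BPE pass.
func loadTokenIDs(
    rawText: String,
    encode: ((String) -> [Int32])?,   // stands in for the BytePairEncoder
    preEncodedIDs: [Int32]?           // stands in for the cached encoded file
) -> [Int32] {
    if let encode = encode {
        return encode(rawText)        // slow path: O(N*N) BPE encoding
    }
    // Fast path: the text was already encoded offline; use it directly.
    return preEncodedIDs ?? []
}
```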

@xihui-wu xihui-wu requested a review from texasmichelle May 6, 2020 04:35
Member

@texasmichelle texasmichelle left a comment

Since the complexity issues are caused by the BytePairEncoder implementation, it would be good to find a way to cache at that level, which would impact everything that references it. That would be my preference, but as a shortcut for benchmarks and quick prototyping, I can see how this PR makes sense (even with a bpe cache).

This approach effectively short-circuits the byte pair encoder entirely, by storing a preprocessed dataset. If you want to go this route for now, it would be worth storing this file in GCS (we might want to consider hosting the data files as well) and adding the path to TextUnsupervisedVariantDetails, rather than passing it in. This warrants reconfiguring the init() methods. For example, the first one creates an empty bpe, which isn't really useful for anything. Instead, this preprocessed dataset could be loaded, which would make the bpe optional.

@xihui-wu
Contributor Author

xihui-wu commented May 7, 2020

> Since the complexity issues are caused by the BytePairEncoder implementation, it would be good to find a way to cache at that level, which would impact everything that references it. That would be my preference, but as a shortcut for benchmarks and quick prototyping, I can see how this PR makes sense (even with a bpe cache).
>
> This approach effectively short-circuits the byte pair encoder entirely, by storing a preprocessed dataset. If you want to go this route for now, it would be worth storing this file in GCS (we might want to consider hosting the data files as well) and adding the path to TextUnsupervisedVariantDetails, rather than passing it in. This warrants reconfiguring the init() methods. For example, the first one creates an empty bpe, which isn't really useful for anything. Instead, this preprocessed dataset could be loaded, which would make the bpe optional.

Great point:) Updated with EncodedWikiText2Details.

@@ -57,9 +57,19 @@ public struct TextUnsupervised {
var fileExtension = "tgz"
}

private struct EncodedWikiText2Details: TextUnsupervisedVariantDetails {
Member

Instead of adding a new variant, how about replacing filename with rawFilename and adding var encodedFilename? That way, the bpe nil check occurs closer to the embedding() call, which is a bit clearer IMO.
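As a hedged, standalone sketch of that suggestion (this struct does not claim to match the repo's TextUnsupervisedVariantDetails protocol, and all field values are illustrative, not taken from the PR):

```swift
// Illustrative only: keep the raw text filename and add an optional
// pre-encoded filename. When encodedFilename is present and no BPE is
// supplied, the loader can read token IDs directly instead of encoding text.
struct VariantFiles {
    var rawFilename: String
    var encodedFilename: String?   // nil means "no pre-encoded file available"
    var fileExtension: String
}

let wikiText2 = VariantFiles(
    rawFilename: "wikitext-2",
    encodedFilename: "wikitext-2-encoded",  // assumed name, not from the PR
    fileExtension: "tgz")
```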

Contributor Author

👍 Added encodedFilename as an optional in the newer change. Does it look good?

@xihui-wu xihui-wu changed the title Load encoded text from cache if available for TextUnsupervised Load encoded file if BPE is nil in TextUnsupervised May 8, 2020
@xihui-wu xihui-wu requested a review from texasmichelle May 8, 2020 16:02
xihui-wu and others added 2 commits May 8, 2020 14:20
Store encoded data in files with one integer per line.
Read them using a custom function instead of the deprecated NSArray(contentsOf:).
Simplify precondition.
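A minimal sketch of such a reader, assuming the format described in the commit message (one integer token ID per line); the function name is hypothetical and this is not the PR's code:

```swift
import Foundation

/// Hypothetical reader for a file that stores one integer token ID per line,
/// used in place of the deprecated NSArray(contentsOf:).
func readTokenIDs(fromFileAt url: URL) throws -> [Int32] {
    let contents = try String(contentsOf: url, encoding: .utf8)
    return contents
        .split(whereSeparator: { $0.isNewline })
        .compactMap { Int32($0.trimmingCharacters(in: .whitespaces)) }
}
```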
/// - Parameter batchSize: number of sequences in a batch.
/// - Parameter sequenceLength: number of characters in a sequence.
/// - Parameter documentCount: number of documents to process. (Refer to func readCSV() to see
///   how a text file is chunked into documents.)
Member

This documentation is a big improvement 👍 Thank you!!
