
Load encoded file if BPE is nil in TextUnsupervised #490

Merged: 6 commits, May 9, 2020
Conversation

xihui-wu
Contributor

@xihui-wu xihui-wu commented May 6, 2020

Text encoding with the Byte Pair Encoder has O(N*N) time complexity (https://github.com/tensorflow/swift-models/blob/master/Support/Text/BytePairEncoder.swift#L101). Encoding the full WikiText2 training and test datasets can take ~100 minutes.

We want to make the BPE step optional: if pre-encoded text is provided, load it directly instead of re-encoding.
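As a rough illustration of the intent (a hedged sketch, not the PR's implementation; the function and parameter names are placeholders, and the encoder is modeled as an optional closure so the example stays self-contained):

```swift
// Hedged sketch of the intended flow: if no encoder is available, fall back to
// token IDs that were encoded offline instead of running the slow BPE pass.
func loadTokenIDs(
    rawText: String,
    encode: ((String) -> [Int32])?,   // stands in for the BytePairEncoder
    preEncodedIDs: [Int32]?           // stands in for the cached encoded file
) -> [Int32] {
    if let encode = encode {
        return encode(rawText)        // slow path: O(N*N) BPE encoding
    }
    // Fast path: the text was already encoded offline; use it directly.
    return preEncodedIDs ?? []
}
```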

@xihui-wu xihui-wu requested a review from texasmichelle May 6, 2020 04:35
Member

@texasmichelle texasmichelle left a comment

Since the complexity issues are caused by the BytePairEncoder implementation, it would be good to find a way to cache at that level, which would impact everything that references it. That would be my preference, but as a shortcut for benchmarks and quick prototyping, I can see how this PR makes sense (even with a bpe cache).

This approach effectively short-circuits the byte pair encoder entirely, by storing a preprocessed dataset. If you want to go this route for now, it would be worth storing this file in GCS (we might want to consider hosting the data files as well) and adding the path to TextUnsupervisedVariantDetails, rather than passing it in. This warrants reconfiguring the init() methods. For example, the first one creates an empty bpe, which isn't really useful for anything. Instead, this preprocessed dataset could be loaded, which would make the bpe optional.

@xihui-wu
Contributor Author

xihui-wu commented May 7, 2020

> Since the complexity issues are caused by the BytePairEncoder implementation, it would be good to find a way to cache at that level, which would impact everything that references it. That would be my preference, but as a shortcut for benchmarks and quick prototyping, I can see how this PR makes sense (even with a bpe cache).
>
> This approach effectively short-circuits the byte pair encoder entirely, by storing a preprocessed dataset. If you want to go this route for now, it would be worth storing this file in GCS (we might want to consider hosting the data files as well) and adding the path to TextUnsupervisedVariantDetails, rather than passing it in. This warrants reconfiguring the init() methods. For example, the first one creates an empty bpe, which isn't really useful for anything. Instead, this preprocessed dataset could be loaded, which would make the bpe optional.

Great point:) Updated with EncodedWikiText2Details.

@@ -57,9 +57,19 @@ public struct TextUnsupervised {
var fileExtension = "tgz"
}

private struct EncodedWikiText2Details: TextUnsupervisedVariantDetails {
Member

Instead of adding a new variant, how about replacing filename with rawFilename and adding var encodedFilename? That way, the bpe nil check occurs closer to the embedding() call, which is a bit clearer IMO.
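As a hedged, standalone sketch of that suggestion (this struct does not claim to match the repo's TextUnsupervisedVariantDetails protocol, and all field values are illustrative, not taken from the PR):

```swift
// Illustrative only: keep the raw text filename and add an optional
// pre-encoded filename. When encodedFilename is present and no BPE is
// supplied, the loader can read token IDs directly instead of encoding text.
struct VariantFiles {
    var rawFilename: String
    var encodedFilename: String?   // nil means "no pre-encoded file available"
    var fileExtension: String
}

let wikiText2 = VariantFiles(
    rawFilename: "wikitext-2",
    encodedFilename: "wikitext-2-encoded",  // assumed name, not from the PR
    fileExtension: "tgz")
```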

Contributor Author

👍 Added encodedFilename as an optional in the newer change. Does it look good?

@xihui-wu xihui-wu changed the title Load encoded text from cache if available for TextUnsupervised Load encoded file if BPE is nil in TextUnsupervised May 8, 2020
@xihui-wu xihui-wu requested a review from texasmichelle May 8, 2020 16:02
xihui-wu and others added 2 commits May 8, 2020 14:20
Store encoded data in files with one integer per line.
Read them using a custom function instead of the deprecated NSArray(contentsOf:).
Simplify precondition.
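A minimal sketch of such a reader, assuming the format described in the commit message (one integer token ID per line); the function name is hypothetical and this is not the PR's code:

```swift
import Foundation

/// Hypothetical reader for a file that stores one integer token ID per line,
/// used in place of the deprecated NSArray(contentsOf:).
func readTokenIDs(fromFileAt url: URL) throws -> [Int32] {
    let contents = try String(contentsOf: url, encoding: .utf8)
    return contents
        .split(whereSeparator: { $0.isNewline })
        .compactMap { Int32($0.trimmingCharacters(in: .whitespaces)) }
}
```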
/// - Parameter batchSize: number of sequences in a batch.
/// - Parameter sequenceLength: number of characters in a sequence.
/// - Parameter documentCount: number of documents to process. (Refer to func readCSV() to see
///   how a text file is chunked into documents.)
Member

This documentation is a big improvement 👍 Thank you!!
