Enabling SST2 dataset usage in fbcode (#1426)
* include pytorch 1.5.0-rc1 for CI test

* bump up the version

* Set up ShipIt

fbshipit-source-id: bb7d2eb52240c7223b57c3c9624e61d116e77e39

* Re-sync with internal repository (#749)

* 20200429 pytorch/text import

Summary: [20:45:34: cpuhrsch@devvm3140 pytorch]$ ./fb_build/import_text.sh

Reviewed By: pbelevich

Differential Revision: D21320577

fbshipit-source-id: ac2148b9f0d58e5538443c879845bfb4f6ca7202

* 20200430 torchtext import script to include additional meta files

Summary: ./fb_build/import_text.sh

Reviewed By: zhangguanheng66

Differential Revision: D21343124

fbshipit-source-id: c08ecad2cc6f439fa40130aeaf91383be9403fe8

* torchtext flake8, github, travis metafiles

Summary: See title

Reviewed By: pbelevich

Differential Revision: D21344211

fbshipit-source-id: a8bcf7f3ab9bb2c2853e27f612e82caa341d3651

* Import torchtext 20200520 and update build

Summary: Import torchtext up to #786

Reviewed By: cpuhrsch

Differential Revision: D21483116

fbshipit-source-id: bc8ab38db9dc9ce4a8734ca8ea991c20e4ef0882

* Import torchtext 20200528

Summary:
Import up to #798
Addresses T67599333

Reviewed By: zhangguanheng66

Differential Revision: D21764935

fbshipit-source-id: f44d1db637799f2e95f420a8099fbf19545c7cbd

* 20200604 torchtext github import

Summary: Import from github master

Reviewed By: zhangguanheng66

Differential Revision: D21886238

fbshipit-source-id: a8f098e299466dd1701fe7ceb6a97c2a2fc54b9d

* Import torchtext 20200605

Summary: Import from github master

Reviewed By: zhangguanheng66

Differential Revision: D21907519

fbshipit-source-id: f22370d97796da5f2cb9f76f506c80f18fefea7f

* Back out "Import torchtext 20200605"

Summary: Original commit changeset: f22370d97796

Reviewed By: zhangguanheng66

Differential Revision: D21964222

fbshipit-source-id: c316836596fc3e232e63abc59e172f237b551cc5

* Import torchtext 2020/06/22

Summary: Import from github torchtext/master

Reviewed By: zhangguanheng66, cpuhrsch

Differential Revision: D22168183

fbshipit-source-id: 7d96ade64f18942d9bd19437011be2f65f0b2a5e

* Fix torch.testing._internal module not found

Reviewed By: Nayef211

Differential Revision: D22315715

fbshipit-source-id: 6b8b8544b0aa458cf5e7e9ca380d0dc85c98189f

* Import torchtext 2020/07/07

Summary: Import from github torchtext/master

Reviewed By: cpuhrsch

Differential Revision: D22420576

fbshipit-source-id: 4d2c19d7f1db8f698894ca406c1c44b2ad8e0506

* remediation of S205607

fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac

* remediation of S205607

fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3

* Import torchtext 2020/07/21

Summary: Import from github torchtext/master

Reviewed By: zhangguanheng66

Differential Revision: D22641140

fbshipit-source-id: 8190692d059a937e25c5f93506581086f389c291

* Remove .python3 markers

Reviewed By: ashwinp-fb

Differential Revision: D22955630

fbshipit-source-id: f00ef17a905e4c7cd9196c8924db39f9cdfe8cfa

* Import torchtext 2020/08/06

Summary: Import from github torchtext/master

Reviewed By: zhangguanheng66

Differential Revision: D22989210

fbshipit-source-id: 083464e188b758a8746123f4dd2197cc7edc4bc4

* Import torchtext 2020/08/18

Summary: Import from github torchtext/master

Reviewed By: cpuhrsch

Differential Revision: D23190596

fbshipit-source-id: 1568a25a5bd6431bcef3c6539f64a3ab1f5bccd7

* Import torchtext from 8aecbb9

Reviewed By: hudeven

Differential Revision: D23451795

fbshipit-source-id: 73e6130c16716919c77862cef4ca4c8048428670

* Import torchtext 9/4/2020

Reviewed By: Nayef211

Differential Revision: D23539397

fbshipit-source-id: 88dce59418a3071cbc9e944cf0a4cf2117d7d9f7

* Import github torchtext on 9/9/2020

Reviewed By: cpuhrsch

Differential Revision: D23616189

fbshipit-source-id: 365debc987326145eead7456ed48517fe55cac96

* Add property support for ScriptModules (#42390)

Summary:
Pull Request resolved: pytorch/pytorch#42390

**Summary**
This commit extends support for properties to include
ScriptModules.

**Test Plan**
This commit adds a unit test that has a ScriptModule with
a user-defined property.

`python test/test_jit_py3.py TestScriptPy3.test_module_properties`

Test Plan: Imported from OSS

Reviewed By: eellison, mannatsingh

Differential Revision: D22880298

Pulled By: SplitInfinity

fbshipit-source-id: 74f6cb80f716084339e2151ca25092b6341a1560

* sync with OSS torchtext 9/15/20

Reviewed By: cpuhrsch

Differential Revision: D23721167

fbshipit-source-id: 13b32091c422a3ed0ae299595d69a7afa7136638

* Import Github torchtext on 9/28/2020

Reviewed By: cpuhrsch

Differential Revision: D23962265

fbshipit-source-id: 0d042878fe9119aa725e982ab7d5e96e7c885a59

* Enable @unused syntax for ignoring properties (#45261)

Summary:
Pull Request resolved: pytorch/pytorch#45261

**Summary**
This commit enables `unused` syntax for ignoring
properties. Ignoring properties is more intuitive with this feature enabled.
`ignore` is not supported for two reasons. Class type properties cannot be
executed in Python the way an `ignored` function can, because they exist only
as TorchScript types. Module properties that cannot be scripted are not added
to the `ScriptModule` wrapper, so they may already execute in Python.

**Test Plan**
This commit updates the existing unit tests for class type and module
properties to test properties ignored using `unused`.

Test Plan: Imported from OSS

Reviewed By: navahgar, Krovatkin, mannatsingh

Differential Revision: D23971881

Pulled By: SplitInfinity

fbshipit-source-id: 8d3cc1bbede7753d6b6f416619e4660c56311d33

* Import Github torchtext on 10/11/2020

Reviewed By: cpuhrsch

Differential Revision: D24242037

fbshipit-source-id: 605d81412c320373f1158c51dbb120e7d70d624d

* make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API (#47322)

Summary:
Pull Request resolved: pytorch/pytorch#47322

Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API.

I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This includes several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:

For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
   m.def("opName", fn_name);
}
```

For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.

```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:

I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`.

There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API.

There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D24714803

Pulled By: bdhirsh

fbshipit-source-id: c809aad8a698db3fd0d832f117f833e997b159e1

* Revert D24714803: make duplicate def() calls an error in the dispatcher. Updating all fb operators to use the new dispatcher registration API

Differential Revision:
D24714803

Original commit changeset: c809aad8a698

fbshipit-source-id: fb2ada65f9fc00d965708d202bd9d050f13ef467

* Import torchtext on Nov 20, 2020

Summary:
Import torchtext on the commit of 633548a

allow-large-files

Reviewed By: cpuhrsch

Differential Revision: D25127691

fbshipit-source-id: 3a617f5f4849df452f8a102a77ce11a1bce5af1f

* Updating all call-sites of the legacy dispatcher registration API in fbcode to the new API. (#48178)

Summary:
Pull Request resolved: pytorch/pytorch#48178

I migrated all call sites that used the legacy dispatcher registration API (RegisterOperators()) to use the new API (TORCH_LIBRARY...). I found all call-sites by running `fbgs RegisterOperators()`. This includes several places, including other OSS code (nestedtensor, torchtext, torchvision). A few things to call out:

For simple ops that only had one registered kernel without a dispatch key, I replaced them with:
```
TORCH_LIBRARY_FRAGMENT(ns, m) {
   m.def("opName", fn_name);
}
```

For ops that registered to a specific dispatch key / had multiple kernels registered, I registered the common kernel (math/cpu) directly inside a `TORCH_LIBRARY_FRAGMENT` block, and registered any additional kernels from other files (e.g. cuda) in a separate `TORCH_LIBRARY_IMPL` block.

```
// cpu file
TORCH_LIBRARY_FRAGMENT(ns, m) {
  m.def("opName(schema_inputs) -> schema_outputs");
  m.impl("opName", torch::dispatch(c10::DispatchKey::CPU, TORCH_FN(cpu_kernel)));
}

// cuda file
TORCH_LIBRARY_IMPL(ns, CUDA, m) {
  m.impl("opName", torch::dispatch(c10::DispatchKey::CUDA, TORCH_FN(cuda_kernel)));
}
```
Special cases:

I found a few ops that used a (legacy) `CPUTensorId`/`CUDATensorId` dispatch key. Updated those to use CPU/CUDA; this seems safe because the keys are aliased to one another in `DispatchKey.h`.

There were a handful of ops that registered a functor (function class) to the legacy API. As far as I could tell we don't allow this case in the new API, mainly because you can accomplish the same thing more cleanly with lambdas. Rather than delete the class I wrote a wrapper function on top of the class, which I passed to the new API.

There were a handful of ops that were registered only to a CUDA dispatch key. I put them inside a TORCH_LIBRARY_FRAGMENT block, and used a `def()` and `impl()` call like in case two above.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056090

Pulled By: bdhirsh

fbshipit-source-id: 8f868b45f545e5da2f21924046e786850eba70d9

* Import torchtext from github into fbcode on 1/11/2021

Reviewed By: cpuhrsch

Differential Revision: D25873762

fbshipit-source-id: 0d34d36aeb8e7e2ce72fcf345c5e7e713ef3663c

* Import torchtext from github #1121 d56fffe

Summary: Import torchtext from github #1121 d56fffe

Reviewed By: zhangguanheng66

Differential Revision: D25976268

fbshipit-source-id: 81589f8988a54cc12f17f0a6f298a915e829a830

* Import the hidden files in torchtext github repo

Reviewed By: mthrok

Differential Revision: D26001386

fbshipit-source-id: f822f0f32232d3006ef629937520dee6c0faf414

* add a newline mark to config.yml file (#1128)

Reviewed By: zhangguanheng66

Differential Revision: D26369003

fbshipit-source-id: 09ca48f9705d8663b06e6a329a6b64b24f9c148e

* Replace model with full name when spacy load is used (#1140)

Reviewed By: zhangguanheng66

Differential Revision: D26369005

fbshipit-source-id: b1e6b5d77810bb8f67d14b8a1c7ec0a9f4831cab

* Fix the num_lines argument of the setup_iter func in RawTextIterableDataset (#1142)

Reviewed By: zhangguanheng66

Differential Revision: D26368999

fbshipit-source-id: 4b50e5d9e5fbdf633e8b3f0072223eed050af793

* Fix broken CI tests due to spacy 3.0 release (#1138)

Reviewed By: zhangguanheng66

Differential Revision: D26368998

fbshipit-source-id: 84e883562a9a3d0fe47b54823b22f7b2cd82fca4

* Switch data_select in dataset signature to split (#1143)

Reviewed By: zhangguanheng66

Differential Revision: D26369006

fbshipit-source-id: 608f42fa180db9ebcfaaeadc6b8cdd29393262af

* Add offset arg in the raw text dataset (#1145)

Reviewed By: zhangguanheng66

Differential Revision: D26368996

fbshipit-source-id: 52741015139c302b7b0ddf8c8f50ab45a609fd2f

* switch to_ivalue to __prepare_scriptable__ (#1080)

Reviewed By: zhangguanheng66

Differential Revision: D26368995

fbshipit-source-id: 0352c04e422c835350bd42df35d4054d543fee36

* Pass an embedding layer to the constructor of the BertModel class (#1135)

Reviewed By: zhangguanheng66

Differential Revision: D26369001

fbshipit-source-id: f5a67a2a812d568073505ec4d181f6e418eb4a3f

* add __next__ method to RawTextIterableDataset (#1141)

Reviewed By: zhangguanheng66

Differential Revision: D26368997

fbshipit-source-id: f5ef78f5f4a224db497f47f774eaddedd0498b4b

* Add func to count the total number of parameters in a model (#1134)

Reviewed By: zhangguanheng66

Differential Revision: D26369000

fbshipit-source-id: c687c0f0c2697dbd9c17a79a1291a2e279bbd1b8

* Retire the legacy code in torchtext library and fix the dependency of the downstream libraries

Summary: This diff is doing: 1) move the legacy code in torchtext to the legacy folder; 2) for the downstream libraries in fbcode, if they are using the legacy code, add "legacy" to the path.

Reviewed By: cpuhrsch

Differential Revision: D23718437

fbshipit-source-id: 1660868aaa95ac6555ad6793dda5ce02a9acdc08

* Sync torchtext GH<->fbcode until GH commit 1197514

Summary: Import recent torchtext changes up until GH commit 1197514

Reviewed By: zhangguanheng66

Differential Revision: D26824967

fbshipit-source-id: fc4be4f94a8f748ce2ed5e776e30a42422cbcab9

* 20210304[2] Sync torchtext GH<->fbcode until GH commit 2764143

Summary: Sync up until commit in title

Reviewed By: zhangguanheng66

Differential Revision: D26829429

fbshipit-source-id: a059a36d83b3803dfed9198d0e474e0e75f94f17

* 20210308 Sync torchtext GH <-> fbcode

Summary: Import latest GH changes

Reviewed By: zhangguanheng66

Differential Revision: D26888371

fbshipit-source-id: cc27f51fd89ad86b8bcfb8f286ad874ab01b1fd6

* Re-name raw_datasets.json file with jsonl extension

Reviewed By: cpuhrsch

Differential Revision: D26923978

fbshipit-source-id: c87c7776445e05d452f6b38244bf4cdaba45bdec

* 20210329 Sync torchtext up to GH commit eb5e39d

Summary: Sync torchtext up to GH commit eb5e39d

Reviewed By: parmeet

Differential Revision: D27400885

fbshipit-source-id: 1f8f92ca42ba36d070db6740b3bb4c148f69586b

* Import torchtext #1267 93b03e4

Summary:
Imported latest from github Master
PR#1267

Reviewed By: cpuhrsch

Differential Revision: D27503970

fbshipit-source-id: 853ff895ba42b1feb7442abe1c87478e43d62e5b

* Import torchtext #1266 ba0bf52

Summary: Import torchtext from github

Reviewed By: parmeet

Differential Revision: D27803909

fbshipit-source-id: 9cb0f15858b1417cb5868d5651513eb2df998fbe

* Import torchtext #1287 fab63ed

Reviewed By: parmeet

Differential Revision: D27922562

fbshipit-source-id: 3c18cd9e2583e03471461ad8a22ac6b0ceb596a2

* Import torchtext #1293 d2a0776

Summary: Importing torchtext from github for regular sync.

Reviewed By: cpuhrsch

Differential Revision: D27983819

fbshipit-source-id: 5806421d788afaa872f5320b5f4cbcd913e103ea

* Import torchtext #1291 0790ce6

Reviewed By: parmeet

Differential Revision: D28101664

fbshipit-source-id: a8643b3ecf85de2cb815dcfa5789a4a5d246d80f

* adding __contains__ method to experimental vocab (#1297)

Reviewed By: cpuhrsch

Differential Revision: D28111696

fbshipit-source-id: fef195941492493a399adb37339cfa64795e22a0

* Import torchtext #1292 ede6ce6

Summary: This diff syncs torchtext GH with fbcode

Reviewed By: cpuhrsch

Differential Revision: D28321356

fbshipit-source-id: 7736f0d100941627b58424911a1329b1ce66c123

* Added APIs for default index and removed unk token (#1302)

Reviewed By: parmeet

Differential Revision: D28478153

fbshipit-source-id: bfcaffe8fe48e96d8df454f7df0d25ec39d5d4a6

* Swapping experimental Vocab and retiring current Vocab into legacy (#1289)

Summary: allow-large-files to commit wikitext103_vocab.pt

Reviewed By: cpuhrsch

Differential Revision: D28478152

fbshipit-source-id: c2a871439f054024b95c05f7664a84028aacaca3

* Import torchtext #1313 36e33e2

Summary: Importing from Github

Reviewed By: cpuhrsch

Differential Revision: D28572929

fbshipit-source-id: 2e7b00aadeda6ab0596ef23295f41c5b0fa246e7

* Adding API usage logging

Summary: Adding API usage logging for Vocab module

Reviewed By: colin2328

Differential Revision: D28585537

fbshipit-source-id: 38975b523fb597412fbcb18ef831bfb4834cb420

* Import torchtext #1314 99557ef

Reviewed By: parmeet

Differential Revision: D28683381

fbshipit-source-id: 7bfbf445dd512f0ce21c34096cf3f08332d90138

* Import torchtext #1325 57a1df3

Reviewed By: NicolasHug

Differential Revision: D28994054

fbshipit-source-id: 4c679f56ef37b18f6d2acaaaed8518facbeaa41c

* Import torchtext #1328 ca514f6

Summary: Import torchtext #1328 ca514f6

Reviewed By: NicolasHug

Differential Revision: D29120370

fbshipit-source-id: 229586f3470bd61bfb2f6a390d79e45d4eae3b4d

* up the priority of numpy array comparisons in self.assertEqual (#59067) (#1340)

* Re-sync with internal repository (#1343)

* up the priority of numpy array comparisons in self.assertEqual (#59067)

Summary:
Fixes pytorch/pytorch#58988.

Pull Request resolved: pytorch/pytorch#59067

Reviewed By: jbschlosser

Differential Revision: D28986642

Pulled By: heitorschueroff

fbshipit-source-id: 3ef2d26b4010fc3519d0a1a020ea446ffeb46ba0

* Import torchtext #1300 0435df1

Summary: Import torchtext #1300 0435df1

Reviewed By: parmeet

Differential Revision: D29371832

fbshipit-source-id: 624280ddfa787a4e7628e60fa673cb9df0a66641

* Import torchtext #1345 8cf471c

Summary: Import from github

Reviewed By: hudeven

Differential Revision: D29441995

fbshipit-source-id: 27731ce2714c16180d11bfb26af5d5a2dba408b1

* Import torchtext #1352 7ab50af

Summary: Import from github

Reviewed By: NicolasHug

Differential Revision: D29537684

fbshipit-source-id: 25b1fc1e6d9f930e83f5f2939788b90b083aeaa2

* Enabling torchtext datasets access via manifold and iopath

Summary:
We would like to add and access torchtext datasets on manifold. This diff unifies dataset download from external links and, for internal access, through manifold. This is enabled via the iopath package.

The main idea is to plug download hooks into the download_from_url function. The download hooks delegate the download to the appropriate path handler. In OSS we have enabled download via https and Google Drive. Internally, we replace the download hook to download data from manifold.

We have created a _download_hooks.py file under the /fb/ folder, which replaces the corresponding file in OSS. The file under the /fb/ folder converts the http/https URL paths into the corresponding manifold paths and downloads the data from there.
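
The delegation described in this summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the torchtext or iopath code: `DownloadHooks`, `register`, `fetch`, both handler functions, and the `manifold://example_bucket` prefix are hypothetical names invented for the example, and the handlers return strings instead of performing real downloads.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# Hypothetical handler type: takes a URL and returns a description of the fetch.
Handler = Callable[[str], str]


class DownloadHooks:
    """Toy registry that routes a URL to a handler based on its scheme."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, scheme: str, handler: Handler) -> None:
        self._handlers[scheme] = handler

    def fetch(self, url: str) -> str:
        scheme = urlparse(url).scheme
        if scheme not in self._handlers:
            raise ValueError(f"no handler registered for scheme {scheme!r}")
        return self._handlers[scheme](url)


def https_handler(url: str) -> str:
    # Open-source behaviour in this sketch: use the public URL unchanged.
    return f"https download: {url}"


def internal_handler(url: str) -> str:
    # Assumed mapping: keep the URL path and swap the host for an internal prefix.
    path = urlparse(url).path
    return f"internal download: manifold://example_bucket{path}"


# The open-source build registers the public handler ...
oss_hooks = DownloadHooks()
oss_hooks.register("https", https_handler)

# ... while the internal build registers a replacement for the same scheme.
internal_hooks = DownloadHooks()
internal_hooks.register("https", internal_handler)
```

Calling `oss_hooks.fetch(url)` and `internal_hooks.fetch(url)` with the same public URL then takes two different routes without any change to the calling code, which is the property the summary relies on.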

Reviewed By: hudeven

Differential Revision: D28892389

fbshipit-source-id: 3b66544dd2345075e2e7c524f344db04aa2a24e3

* Import torchtext #1361 05cb992

Summary: Import from github

Reviewed By: hudeven

Differential Revision: D29856211

fbshipit-source-id: 6332f9bdf3cf4eef572c5423db15101ea904d825

* Import torchtext #1365 c57b1fb

Summary: Import torchtext #1365 c57b1fb

Reviewed By: parmeet

Differential Revision: D29940816

fbshipit-source-id: 6b2495b550a7e6b6110b0df12de51a87b0d31c1c

* Moving Roberta building blocks to torchtext

Summary: This is the first step in moving Roberta Model from pytext_lib into PyTorch Text Library. Here we moved the Roberta building blocks into pytorch/text/fb/nn/modules. The code-base is organized according to WIP document https://docs.google.com/document/d/1c0Fs-v97pndLrT3bdfGRGeUeEC38UcDpibvgOXkbS-g/edit#heading=h.3ybcf0ic42yp

Reviewed By: hudeven

Differential Revision: D29671800

fbshipit-source-id: d01daa99e0a5463716660722381db9a0eeb083f8

* Enabling torchtext availability in @mode/opt

Summary:
More details on context and solution: D29973934

Note that this implementation relies on overriding the behavior of the _init_extention() function. This is in the same spirit as overriding the behavior of the download hooks to accommodate the changes needed to enable functionality in fbcode.

Reviewed By: mthrok

Differential Revision: D30494836

fbshipit-source-id: b2b015263fa1bca2ef4d4214909e469df3fbe327

* Import torchtext #1382 aa12e9a

Summary: Import torchtext #1382 aa12e9a

Reviewed By: parmeet

Differential Revision: D30584905

fbshipit-source-id: fba23cd19f31fc7826114dd2eb402c8f7b0553df

* Simplify cpp extension initialization process

Summary: Simplifying the cpp extension initialization process by following torchaudio's implementation in D30633316

Reviewed By: mthrok

Differential Revision: D30652618

fbshipit-source-id: f80ac150fa50b1edc22419b21412f64e77064c5d

* fixed bug with incorrect variable name in dataset_utils.py

Summary:
- ValueError was outputting `fn` instead of `func`
- Similar fix done in torchdata https://github.com/facebookexternal/torchdata/pull/167

Reviewed By: ejguan

Differential Revision: D31149667

fbshipit-source-id: 2c1228287d513895f8359cb97935252f0087d738

* Import torchtext #1410 0930843

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D31745899

fbshipit-source-id: e4ac5c337bcbd1a8809544add7679dd3da242999

* Import torchtext #1406 1fb2aed

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D31762288

fbshipit-source-id: f439e04f903d640027660cb969d6d9e00e7ed4a0

* Import from github 10/18/21

Summary: Syncing torchtext github main branch to fbcode

Reviewed By: parmeet

Differential Revision: D31841825

fbshipit-source-id: 9c1a05295e6557ff411e56eb719cb439d5c424ba

* Import torchtext #1420 0153ead

Summary: Import latest from github

Reviewed By: Nayef211

Differential Revision: D31871772

fbshipit-source-id: 989f5a453ef7680592df27e4174f465d11a2fbf8

* Import torchtext #1421 bcc1455

Summary: Syncing torchtext github main branch to fbcode

Reviewed By: parmeet

Differential Revision: D31873514

fbshipit-source-id: 1a964a67ce7ee73f5acf3a1e3f8118028c2dd46e

* Enable OSS torchtext XLMR Base/Large model on fbcode

Summary:
Enable access to open-source torchtext XLMR base/large implementation by:
1) Uploading models/transform weights on manifold
2) Patching public URL with manifold URL (similar to what we have for datasets)

Note that we didn't enable model tests, since it takes relatively long to download the huge model weights from manifold. We will rely on open-source signals when making changes to the model implementation, and we need to ensure that any update to the weights on the AWS cloud is also replicated on manifold.

Reviewed By: hudeven

Differential Revision: D31844166

fbshipit-source-id: 62a4e9a3a8580ab93c3beb3af69be7361f1cc937

* enabling SST2 dataset usage in fbcode

Summary:
Enable access to open-source torchtext SST2 dataset by:
- Uploading SST2 dataset on manifold
- Swapping public URL with manifold URL in fbcode by implementing a dummy `HttpReader` wrapper class
   - The wrapper class does URL mapping and calls `IoPathFileLoaderDataPipe` on the manifold URL
- Enabled SST2Dataset unit tests within fbcode
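
A wrapper of the kind described in this list might look roughly like the sketch below. It is an assumption-laden illustration rather than the internal implementation: `URL_MAP`, its example URLs, and `FakeInternalLoader` (a stand-in for the internal file loader) are hypothetical, and no network or file access takes place.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical public-to-internal URL table; the real entries are not given here.
URL_MAP: Dict[str, str] = {
    "https://example.com/SST-2.zip": "internal://example_bucket/SST-2.zip",
}


class FakeInternalLoader:
    """Stand-in for an internal file loader; yields (url, payload) pairs."""

    def __init__(self, urls: Iterable[str]) -> None:
        self.urls = list(urls)

    def __iter__(self) -> Iterator[Tuple[str, bytes]]:
        for url in self.urls:
            yield url, f"contents of {url}".encode()


class HttpReader:
    """Wrapper that keeps the public reader's name but reads internal URLs."""

    def __init__(self, source: Iterable[str]) -> None:
        # Map each public URL to its internal counterpart; unknown URLs pass through.
        self._loader = FakeInternalLoader(URL_MAP.get(url, url) for url in source)

    def __iter__(self) -> Iterator[Tuple[str, bytes]]:
        return iter(self._loader)
```

Because the wrapper keeps the name `HttpReader` and stays iterable, dataset code that iterates over `HttpReader(urls)` does not need to know which variant it received.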

Reviewed By: parmeet

Differential Revision: D31876606

fbshipit-source-id: fdde14a67cce835da216b296e1a0024e1d1fc7a9

* Fixed importing is_module_available

Co-authored-by: Guanheng Zhang <zhangguanheng@devfair0197.h2.fair>
Co-authored-by: Christian Puhrsch <cpuhrsch@devfair0129.h2.fair>
Co-authored-by: cpuhrsch <cpuhrsch@fb.com>
Co-authored-by: Moto Hira <moto@fb.com>
Co-authored-by: George Guanheng Zhang <zhangguanheng@fb.com>
Co-authored-by: Stanislau Hlebik <stash@fb.com>
Co-authored-by: Andres Suarez <asuarez@fb.com>
Co-authored-by: Meghan Lele <meghanl@fb.com>
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
Co-authored-by: Vasilis Vryniotis <vvryniotis@fb.com>
Co-authored-by: Jeff Hwang <jeffhwang@fb.com>
Co-authored-by: Parmeet Singh Bhatia <parmeetbhatia@fb.com>
Co-authored-by: Artyom Astafurov <asta@fb.com>
Co-authored-by: Nicolas Hug <nicolashug@fb.com>
Co-authored-by: Heitor Schueroff <heitorschueroff@fb.com>
Co-authored-by: Facebook Community Bot <facebook-github-bot@users.noreply.github.com>
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Co-authored-by: Vincent Quenneville-Belair <vincentqb@fb.com>
Co-authored-by: Yao-Yuan Yang <yyyang@fb.com>
Co-authored-by: nayef211 <n63ahmed@edu.uwaterloo.ca>
21 people authored Oct 27, 2021
1 parent 1c3bce2 commit 4be2792
Showing 2 changed files with 9 additions and 4 deletions.
4 changes: 4 additions & 0 deletions torchtext/_download_hooks.py
@@ -3,6 +3,10 @@
 from tqdm import tqdm
 # This is to allow monkey-patching in fbcode
 from torch.hub import load_state_dict_from_url  # noqa
+from torchtext._internal.module_utils import is_module_available
+
+if is_module_available("torchdata"):
+    from torchdata.datapipes.iter import HttpReader  # noqa F401


 def _stream_response(r, chunk_size=16 * 1024):
9 changes: 5 additions & 4 deletions torchtext/experimental/datasets/sst2.py
@@ -10,10 +10,11 @@
 )

 if is_module_available("torchdata"):
-    from torchdata.datapipes.iter import (
-        HttpReader,
-        IterableWrapper,
-    )
+    from torchdata.datapipes.iter import IterableWrapper
+    # we import HttpReader from _download_hooks so we can swap out public URLs
+    # with internal URLs when the dataset is used within Facebook
+    from torchtext._download_hooks import HttpReader
+

 NUM_LINES = {
     "train": 67349,

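The two hunks above combine an optional-dependency guard with a re-export that can be swapped out. The sketch below illustrates both ideas with the standard library only. It rests on assumptions: the body of `is_module_available` is a guess at how such a helper could be written, `public_reader`, `internal_reader`, `hooks`, and `load` are hypothetical, and the stand-in module is patched at call time, whereas in the commit the dataset module imports the name from `_download_hooks`.

```python
import importlib.util
import types


def is_module_available(*modules: str) -> bool:
    """Return True only if every named top-level module can be located."""
    return all(importlib.util.find_spec(m) is not None for m in modules)


def public_reader(url: str) -> str:
    return f"public:{url}"


def internal_reader(url: str) -> str:
    return f"internal:{url}"


# Stand-in for a hooks module that re-exports the reader under one name.
hooks = types.ModuleType("example_hooks")
if is_module_available("json"):  # "json" stands in for an optional dependency
    hooks.HttpReader = public_reader


def load(url: str) -> str:
    # Look the reader up on the hooks module at call time,
    # so that a replaced attribute is picked up.
    return hooks.HttpReader(url)
```

Assigning `hooks.HttpReader = internal_reader` afterwards changes what `load` uses while `load` itself stays untouched.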