Fix loading pre-trained word vectors #99

matt-peters · 2017-08-23T00:26:46Z

Fixes #98 and adds a test for correctness.

nelson-liu · 2017-08-23T01:04:08Z

Thanks for the PR and issue, @matt-peters! It'd be good to run this on CI -- could you rename the test file to test_vocab.py and put the test in a test case?

jekbradbury · 2017-08-23T03:09:56Z

This looks right to me. @bmccann can you confirm that the shadowing with i was unintentional and this PR fixes it?
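For readers unfamiliar with this class of bug, here is a minimal sketch of how reusing a loop variable named i can silently corrupt row indexing. It is an illustration only and is not the code changed in this PR; the function names, `tokens`, and `tables` are all hypothetical.

```python
def fill_rows_buggy(tokens, tables):
    """Hypothetical example: the inner loop rebinds the outer index `i`."""
    rows = [None] * len(tokens)
    for i, token in enumerate(tokens):
        parts = []
        for i, table in enumerate(tables):  # shadows the outer `i`
            parts.extend(table[token])
        # `i` now holds the inner loop's last value, not the token index.
        rows[i] = parts
    return rows


def fill_rows_fixed(tokens, tables):
    """Same logic, but the inner loop no longer touches `i`."""
    rows = [None] * len(tokens)
    for i, token in enumerate(tokens):
        parts = []
        for table in tables:
            parts.extend(table[token])
        rows[i] = parts
    return rows
```

With three tokens and two tables, the buggy version keeps overwriting row 1 and leaves the other rows unset, which is why a correctness test on actual values catches it while a shape check would not.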

bmccann · 2017-08-23T03:22:31Z

Confirmed!

matt-peters · 2017-08-23T05:04:03Z

Addressed the PR comments. Now that the test is running in Travis, the build is failing for other reasons. In the 2.X build, it's due to os.makedirs not supporting the exist_ok kwarg. The 3.X build was killed by Travis for an unknown reason, perhaps because it is running into a hard Travis limit like disk or memory usage or build time.

nelson-liu · 2017-08-23T05:32:08Z

> In the 2.X build, it's due to os.makedirs not supporting the exist_ok kwarg.

I've always made os.makedirs version invariant by using:

if not os.path.exists(path):
    os.makedirs(path)
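One caveat with the check-then-create pattern is that another process can create the directory between the two calls. A minimal sketch of an alternative that also runs on both 2.X and 3.X is below; `ensure_dir` is a hypothetical helper name, not something in this repository.

```python
import errno
import os


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Only swallow the error when the directory is already there.
        # Anything else (permissions, a regular file at `path`) is re-raised.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

Calling `ensure_dir(path)` twice in a row is safe, which is the behaviour exist_ok=True gives on 3.X.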

> The 3.X build was killed by Travis for an unknown reason, perhaps because it is running into a hard Travis limit like disk or memory usage or build time.

I doubt it's disk or build time, as I've run longer builds involving downloads of larger datasets -- my bet would be on memory consumption when downloading or unzipping...

nelson-liu · 2017-08-23T06:06:56Z

So after digging a bit deeper by looking through verbose test logs from Travis runs on my own fork, the issue seems to be memory consumption while loading the vectors from disk. It's also quite surprising how slow that part of the code is (at least judging from the tqdm output on Travis) -- is that due to repeatedly decoding binary data to UTF-8, and is that really necessary?

bmccann · 2017-08-23T06:42:16Z

> In the 2.X build, it's due to os.makedirs not supporting the exist_ok kwarg. The 3.X build was killed by Travis for an unknown reason, perhaps because it is running into a hard Travis limit like disk or memory usage or build time.

Sorry about this. It should have been caught in the tests on the original PR, but I didn't realize that Travis didn't actually run test/vocab.py. I suppose this means that the other dataset test files were also not run outside of my local environment.

matt-peters · 2017-08-23T15:53:09Z

The Travis build is now passing. I created and checked in a small test GloVe file, which the test uses to avoid downloading and processing the full file every time the build runs. Longer term, it would be better to host a smaller test file remotely so that the build also exercises the download logic.
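As a rough illustration of the small-fixture idea, the sketch below writes a two-line file in the GloVe text layout (a token followed by space-separated floats) and parses it back, decoding each line from UTF-8 once. The helper names and the fixture contents are assumptions made for this example; they are not the file or loader added in this PR.

```python
import io
import os
import tempfile


def write_tiny_fixture(path):
    """Write a hypothetical two-token, three-dimensional vector file."""
    rows = [u"the 0.1 0.2 0.3", u"caf\u00e9 -1.0 0.0 1.0"]
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"\n".join(rows) + u"\n")


def load_text_vectors(path):
    """Parse a GloVe-style text file into a dict of token -> list of floats."""
    vectors = {}
    with io.open(path, "rb") as f:
        for raw in f:
            # Decode the whole line once, then split on single spaces.
            parts = raw.decode("utf-8").rstrip().split(u" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


if __name__ == "__main__":
    fixture = os.path.join(tempfile.mkdtemp(), "tiny_vectors.txt")
    write_tiny_fixture(fixture)
    print(load_text_vectors(fixture))
```

A fixture this small keeps memory and run time negligible on CI, at the cost of not covering the download and unzip path.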

jekbradbury · 2017-08-23T16:39:08Z

Thanks all, this looks good (especially the minimal glove test). I believe the glove-loading logic is the fastest it can be in pure Python (utf-8 conversion has to happen somewhere), but I’d be happy to be proven wrong!

matt-peters added 2 commits August 22, 2017 17:25

Fix loading pre-trained word vectors

4cef6a0

Formatting

d1901fb

Address PR comments

101b791

matt-peters added 3 commits August 23, 2017 06:58

Make os.makedirs 2.X compatible

4ad6a94

Add a small test glove file

9bbf7a5

Formatting

fc93a28

jekbradbury merged commit 8b672e6 into pytorch:master Aug 23, 2017
