Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix loading pre-trained word vectors #99

Merged
merged 6 commits into from
Aug 23, 2017

Conversation

matt-peters
Copy link
Contributor

Fixes #98 and adds a test for correctness.

@nelson-liu
Copy link
Contributor

Thanks for the PR and issue @matt-peters ! It'd be good to run this on CI -- could you rename the test file to test_vocab.py / put the test in a test case?

@jekbradbury
Copy link
Contributor

This looks right to me. @bmccann can you confirm that the shadowing with i was unintentional and this PR fixes it?

@bmccann
Copy link
Contributor

bmccann commented Aug 23, 2017

Confirmed!

@matt-peters
Copy link
Contributor Author

Addressed the PR comments. Now that the test is running in Travis, the build is failing for other reasons. In the 2.X build, it's due to os.makedirs not supporting the exist_ok kwarg. The 3.X build was killed by Travis for an unknown reason, perhaps because it is running into a hard Travis limit like disk or memory usage or build time.

@nelson-liu
Copy link
Contributor

In the 2.X build, it's due to os.makedirs not supporting the exist_ok kwarg.

I've always made os.makedirs version invariant by using:

if not os.path.exists(path):
    os.makedirs(path)

The 3.X build was killed by Travis for an unknown reason, perhaps because it is running into a hard Travis limit like disk or memory usage or build time.

I doubt it's disk or build time, as I've run longer builds involving downloads of larger datasets -- my bet would be on memory consumption when downloading or unzipping...

@nelson-liu
Copy link
Contributor

so after digging a bit deeper by looking through verbose test logs from travis runs on my own fork, the issue seems to be in memory consumption while loading the vectors from disk. It's also quite surprising how slow that part of the code is (at least from looking at tqdm output on travis) -- is that due to the repeated decodes to utf-8 from binary / is that really necessary?

@bmccann
Copy link
Contributor

bmccann commented Aug 23, 2017

In the 2.X build, it's due to os.makedirs not supporting the exist_ok kwarg. The 3.X build was killed by Travis for an unknown reason, perhaps because it is running into a hard Travis limit like disk or memory usage or build time.

Sorry about this. It should have been caught in the tests on the original PR, but I didn't realize that Travis didn't actually run test/vocab.py. I suppose this means that the other dataset test files were also not run outside of my local environment.

@matt-peters
Copy link
Contributor Author

The travis build is now passing. I created and checked in a small test GloVe file used in the test to avoid downloading and processing the full file every time the build runs. Longer term, it's better to host a smaller test file remotely to also exercise the download logic during the build.

@jekbradbury
Copy link
Contributor

Thanks all, this looks good (especially the minimal glove test). I believe the glove-loading logic is the fastest it can be in pure Python (utf-8 conversion has to happen somewhere), but I’d be happy to be proven wrong!

@jekbradbury jekbradbury merged commit 8b672e6 into pytorch:master Aug 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants