-
Notifications
You must be signed in to change notification settings - Fork 811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix loading pre-trained word vectors #99
Fix loading pre-trained word vectors #99
Conversation
Thanks for the PR and issue @matt-peters ! It'd be good to run this on CI -- could you rename the test file to test_vocab.py / put the test in a test case? |
This looks right to me. @bmccann can you confirm that the shadowing with |
Confirmed! |
Addressed the PR comments. Now that the test is running in Travis, the build is failing for other reasons. In the 2.X build, it's due to |
I've always made
I doubt it's disk or build time, as I've run longer builds involving downloads of larger datasets -- my bet would be on memory consumption when downloading or unzipping... |
so after digging a bit deeper by looking through verbose test logs from travis runs on my own fork, the issue seems to be in memory consumption while loading the vectors from disk. It's also quite surprising how slow that part of the code is (at least from looking at tqdm output on travis) -- is that due to the repeated decodes to utf-8 from binary / is that really necessary? |
Sorry about this. It should have been caught in the tests on the original PR, but I didn't realize that Travis didn't actually run test/vocab.py. I suppose this means that the other dataset test files were also not run outside of my local environment. |
The travis build is now passing. I created and checked in a small test GloVe file used in the test to avoid downloading and processing the full file every time the build runs. Longer term, it's better to host a smaller test file remotely to also exercise the download logic during the build. |
Thanks all, this looks good (especially the minimal glove test). I believe the glove-loading logic is the fastest it can be in pure Python (utf-8 conversion has to happen somewhere), but I’d be happy to be proven wrong! |
Fixes #98 and adds a test for correctness.