Memory problem in building wiki2vec model via gensim #7
Yeah, this is due to the vocabulary size. If you are only interested in getting the entity vectors, then @phdowling has a gensim branch for that, which applies the min_count filter to anything that is not an entity vector. Otherwise, reduce your vocabulary by either:
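As a rough illustration of the entity-only route (a sketch under assumptions, not the fork referenced above): wiki2vec-style models prefix entity tokens with DBPEDIA_ID/, so one option is to load the full model once and save only those vectors. The paths below are hypothetical, and gensim 4.x names are assumed.

```python
# Sketch: keep only entity vectors from an already-trained wiki2vec model.
# Assumes gensim 4.x and the wiki2vec convention that entity tokens start with
# "DBPEDIA_ID/"; paths are illustrative, and this still needs enough RAM to
# load the full model once.
from gensim.models import Word2Vec, KeyedVectors

kv = Word2Vec.load("en_1000_no_stem/en.model").wv    # full model -> word vectors

entity_keys = [k for k in kv.index_to_key if k.startswith("DBPEDIA_ID/")]
entity_vecs = [kv[k] for k in entity_keys]

entities_only = KeyedVectors(vector_size=kv.vector_size)
entities_only.add_vectors(entity_keys, entity_vecs)
entities_only.save("en_entities_only.kv")             # much smaller to reload later
```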
Exactly, I want to have just the entity vectors. What do I have to do?
So I think the best you can do at the moment is to use this gensim fork (the develop branch): https://github.com/piskvorky/gensim/. That fork contains some changes which will help you deal with the vocab size. One thing: depending on your current setup (Linux or OS X), you might want to pay attention to how gensim is compiled with Cython, so that when gensim runs it makes use of all your cores. Give it a go and let us know if it goes alright.
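A quick way to check the Cython point above (attribute name as in the gensim releases from around this time; treat this as a sketch):

```python
# If gensim's compiled word2vec routines failed to build, training silently
# falls back to a much slower pure-Python path. FAST_VERSION == -1 means the
# fallback is in use; anything >= 0 means the compiled code is active.
from gensim.models.word2vec import FAST_VERSION
print("FAST_VERSION:", FAST_VERSION)

# When training, pass workers= so the compiled code can use all your cores, e.g.
#   Word2Vec(sentences, workers=8, ...)
```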
Hi everyone, I have the same issue with the memory error. I am trying to increase min_count to get rid of the error, but nothing is working. Any thoughts? Is there a way to reduce the dimensions from 1000 to maybe 300?
@jesuisnicolasdavid if that is literally the code you are running, then changing min_count will probably not help you. You're calling the load method, which doesn't train a new model; it simply loads an existing one. My guess is the existing model simply doesn't fit into RAM. The min_count parameter applies if you're training a new model; more specifically, it filters out words that don't occur frequently enough. How big is the file you're trying to load, and how much RAM does your machine have?
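To make the load-versus-train distinction concrete, a minimal sketch (gensim 4.x parameter names; paths and values are illustrative):

```python
from gensim.models import Word2Vec

# Loading: this just unpickles an already-trained model; min_count has no
# effect here, and the whole saved model must fit into RAM.
model = Word2Vec.load("en_1000_no_stem/en.model")

# Training: min_count only matters here, where it drops rare tokens *before*
# the (vocab_size x dimensions) weight matrices are allocated, e.g.:
#   Word2Vec(sentences, vector_size=300, min_count=50, workers=4)
```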
So the file is 9 GB. I tried to run the model on a first computer with a Titan X and 16 GB of RAM; the model allocates all the RAM and falls into a memory error before even getting to the GPU. Then I tried the same code on a second computer with two GTX 980s and 64 GB of RAM: the wiki2vec model alone takes 20 GB. Then I ran into a GPU memory error with Theano through Keras, which said:
But I think I will move this question to a Theano issue :)
Is this the model provided in the torrent? I've loaded it successfully on a 16 GB machine.
Is there a way to turn the 1000 dimensions of the pre-trained model into 300 dimensions?
Not that I'm aware of. You can always generate 300-dimensional vectors yourself; it should only take some hours.
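A rough sketch of what generating your own 300-dimensional vectors could look like, assuming you already have a wiki2vec-style preprocessed corpus (one document per line); the file names and hyperparameters here are assumptions, not the project's exact settings:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("en_wiki_readable.txt")   # hypothetical corpus file
model = Word2Vec(
    sentences,
    vector_size=300,   # `size=300` on pre-4.0 gensim
    window=10,
    min_count=10,
    workers=8,
)
model.save("en_300.model")
```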
Yeah, I don't think there's an easy way to soundly change the dimensionality of the vectors. You might be able to lower the RAM requirements by actually throwing away part of the vocabulary, i.e. loading fewer vectors, but this might also be quite hard if you're dealing with a raw numpy file and have no machine that can actually load it.
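One concrete way to "load fewer vectors", assuming you have (or can export) the vectors in word2vec .bin/.txt format rather than the pickled gensim model: gensim's loader takes a limit argument that reads only the first N vectors, which are usually the most frequent words. The file name below is illustrative.

```python
from gensim.models import KeyedVectors

# Read only the first 500k vectors; this caps RAM for the vector matrix at
# roughly 500,000 * dimensions * 4 bytes.
kv = KeyedVectors.load_word2vec_format("en_1000.bin", binary=True, limit=500000)
```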
Thanks guys, I will try to generate a 300-dimensional model on my own. I'm still wondering in what cases 1000 dimensions can be useful?
@jesuisnicolasdavid have you been successful in creating a 300-dimensional vocabulary?
This is probably solved in the newest gensim version. @vondiplo it's worth giving that a try ^
Hi,
Did you have memory problems loading the trained wiki2vec model in gensim?
I trained with size=500, window=10, min_count=10 on the latest English Wikipedia dump, which produced a 13 GB wiki2vec model. When loading it in gensim I get a MemoryError.
Do you have any idea how much memory I need?
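For a rough answer, back-of-the-envelope arithmetic helps more than an exact figure. A sketch with assumed numbers (the real vocabulary size depends on your dump and min_count):

```python
# float32 = 4 bytes per value. A full trained Word2Vec model keeps roughly two
# matrices of shape (vocab_size, dimensions): the word vectors plus the hidden
# layer (syn1/syn1neg), on top of the vocabulary dict overhead.
vocab_size = 3_000_000      # assumed: words + entities surviving min_count=10
dimensions = 500
per_matrix = vocab_size * dimensions * 4
print(per_matrix / 1e9, "GB per matrix")      # ~6 GB
print(2 * per_matrix / 1e9, "GB for both")    # ~12 GB, before dict overhead
```

Under those assumptions a 13 GB saved model plausibly needs well over 16 GB of RAM to load comfortably.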