Migrating from Gensim 3.x to 4
Gensim 4.0 is compatible with older releases (3.8.3 and prior) for the most part. Your existing stored models and code will continue to work in 4.0, except:
- Gensim 4.0+ is Python 3 only. See the Gensim & Compatibility policy page for supported Python 3 versions.
- The *2Vec-related classes (Word2Vec, FastText, & Doc2Vec) have undergone significant internal refactoring for clarity, consistency, efficiency & maintainability. They train much faster and consume less RAM (see 4.0 benchmarks).
The specific *2Vec changes:
1. size ctor parameter is now consistently vector_size:
model = Word2Vec(size=100, …) # 🚫
model = FastText(size=100, …) # 🚫
model = Doc2Vec(size=100, …) # 🚫
model = Word2Vec(vector_size=100, …) # 👍
model = FastText(vector_size=100, …) # 👍
model = Doc2Vec(vector_size=100, …) # 👍
2. iter ctor parameter is now consistently epochs:
model = Word2Vec(iter=5, …) # 🚫
model = FastText(iter=5, …) # 🚫
model = Doc2Vec(iter=5, …) # 🚫
model = Word2Vec(epochs=5, …) # 👍
model = FastText(epochs=5, …) # 👍
model = Doc2Vec(epochs=5, …) # 👍
Before, the iter name was used to match the original word2vec implementation. But epochs is more standard and descriptive, plus iter clashes with Python's built-in iter().
3. index2word became index_to_key:
random_word = random.choice(model.wv.index2word) # 🚫
random_word = random.choice(model.wv.index_to_key) # 👍
This unifies the terminology: these models map keys to vectors (not just words or entities to vectors).
4. vocab dict became key_to_index for looking up a key's integer index, or get_vecattr() and set_vecattr() for other per-key attributes:
rock_idx = model.wv.vocab["rock"].index # 🚫
rock_cnt = model.wv.vocab["rock"].count # 🚫
vocab_len = len(model.wv.vocab) # 🚫
rock_idx = model.wv.key_to_index["rock"] # 👍
rock_cnt = model.wv.get_vecattr("rock", "count") # 👍
vocab_len = len(model.wv) # 👍
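Since set_vecattr() is mentioned but not shown above, here is a minimal sketch of the get/set pair (the attribute name my_weight is purely illustrative):
model.wv.set_vecattr("rock", "my_weight", 0.5) # store any per-key attribute
rock_weight = model.wv.get_vecattr("rock", "my_weight") # read it back: 0.5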
5. L2-normalized vectors are now computed dynamically, on request. The full numpy array of "normalized vectors" is no longer stored in memory:
normed_vector = model.wv.vectors_norm[model.wv.vocab["rock"].index] # 🚫
normed_vector = model.wv.get_vector("rock", norm=True) # 👍
all_normed_vectors = model.wv.get_normed_vectors() # still works, but now creates a new array on each call!
This allows Gensim 4.0.0 to be much more memory efficient than Gensim <4.0.
6. no more vocabulary and trainables attributes; properties previously there have been moved back to the model:
out_weights = model.trainables.syn1neg # 🚫
min_count = model.vocabulary.min_count # 🚫
out_weights = model.syn1neg # 👍
min_count = model.min_count # 👍
7. methods like most_similar(), wmdistance(), doesnt_match(), similarity(), & others moved to KeyedVectors. These methods moved from the full model (Word2Vec, Doc2Vec, FastText) object to its .wv subcomponent (of type KeyedVectors) many releases ago:
w2v_model.most_similar(word) # 🚫
w2v_model.most_similar_cosmul(word) # 🚫
w2v_model.wmdistance(wordlistA, wordlistB) # 🚫
w2v_model.similar_by_word(word) # 🚫
w2v_model.similar_by_vector(word) # 🚫
w2v_model.doesnt_match(wordlist) # 🚫
w2v_model.similarity(wordA, wordB) # 🚫
w2v_model.n_similarity(wordlistA, wordlistB) # 🚫
w2v_model.evaluate_word_pairs(wordpairs) # 🚫
w2v_model.accuracy(questions) # 🚫
w2v_model.log_accuracy(section) # 🚫
w2v_model.wv.most_similar(word) # 👍
w2v_model.wv.most_similar_cosmul(word) # 👍
w2v_model.wv.wmdistance(wordlistA, wordlistB) # 👍
w2v_model.wv.similar_by_word(word) # 👍
w2v_model.wv.similar_by_vector(word) # 👍
w2v_model.wv.doesnt_match(wordlist) # 👍
w2v_model.wv.similarity(wordA, wordB) # 👍
w2v_model.wv.n_similarity(wordlistA, wordlistB) # 👍
w2v_model.wv.evaluate_word_pairs(wordpairs) # 👍
w2v_model.wv.evaluate_word_analogies(questions) # 👍
w2v_model.wv.log_accuracy(section) # 👍
Most generally, if any call on a full model (Word2Vec, Doc2Vec, FastText) object only needs the word vectors to calculate its response, and you encounter a "has no attribute" error in Gensim 4.0.0+, make the call on the contained KeyedVectors object instead.
In addition, wmdistance now normalizes vectors to unit length by default:
# 🚫 BEFORE
model.init_sims(replace=True) # 🚫 First normalize all embedding vectors.
distance = model.wmdistance(wordlistA, wordlistB) # 🚫 Then compute WMD distance.
# 👍 Now in 4.0+
distance = model.wv.wmdistance(wordlistA, wordlistB) # 👍 WMD distance over normalized embedding vectors.
distance = model.wv.wmdistance(wordlistA, wordlistB, norm=False) # 👍 WMD over non-normalized vectors.
8. removed the on_batch_begin and on_batch_end training callbacks. These two callbacks had muddled semantics, confused users and introduced race conditions. Use on_epoch_begin and on_epoch_end instead. Gensim 4.0 now ignores these two functions entirely, even if implementations for them are present.
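For example, a per-epoch callback might look like this (a minimal sketch; EpochLogger is an illustrative name, not a Gensim class):
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    def __init__(self):
        self.epoch = 0
    def on_epoch_begin(self, model):
        print(f"Epoch {self.epoch} starting")
    def on_epoch_end(self, model):
        print(f"Epoch {self.epoch} finished")
        self.epoch += 1

model = Word2Vec(sentences, vector_size=100, epochs=5, callbacks=[EpochLogger()]) # sentences: your corpus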
9. the Doc2Vec.docvecs attribute became Doc2Vec.dv, and it's now a standard KeyedVectors object, so it has all the standard attributes and methods of KeyedVectors (but no specialized properties like vectors_docs):
random_doc_id = np.random.randint(doc2vec_model.docvecs.count) # 🚫
document_vector = doc2vec_model.docvecs["some_document_tag"] # 🚫
all_docvecs = doc2vec_model.docvecs.vectors_docs # 🚫
random_doc_id = np.random.randint(len(doc2vec_model.dv)) # 👍
document_vector = doc2vec_model.dv["some_document_tag"] # 👍
all_docvecs = doc2vec_model.dv.vectors # 👍
Because the vectors for document tags are now in a standard KeyedVectors, prior specific-to-Doc2Vec accessors like doctag_syn0, vectors_docs, or index_to_doctag are no longer supported; the analogous generic accessors should be used instead:
all_docvecs = doc2vec_model.docvecs.doctag_syn0 # 🚫
all_docvecs = doc2vec_model.docvecs.vectors_docs # 🚫
doctag = doc2vec_model.docvecs.index_to_doctag[n] # 🚫
all_docvecs = doc2vec_model.dv.vectors # 👍
doctag = doc2vec_model.dv.index_to_key[n] # 👍
10. testing a key's presence in the vocabulary changed:
"night" in model.wv.vocab # 🚫
"night" in model.wv.key_to_index # 👍
Of course, even OOV words have vectors in FastText (assembled from vectors of their character ngrams), so the following is not a good way to test the presence of a vector:
"no_such_word" in model.wv # 🚫 always returns True for FastText!
model.wv["no_such_word"] # returns a vector even for OOV words
The following notes are for advanced users, who were using or extending the Gensim internals more deeply, perhaps relying on protected / private attributes.
- A key change is the creation of a unified KeyedVectors class for working with sets-of-vectors, that's reused for both word-vectors and doc-vectors, both when these are a subcomponent of the full algorithm models (for training) and when they are separate vector-sets (for lighter-weight re-use). Thus, this unified class shares the same (& often improved) convenience methods & implementations.
- One notable internal implementation change means that performing the usual similarity operations no longer requires the creation of a 2nd full cache of unit-normalized vectors, via the .init_sims() method, stored in the .vectors_norm property. That used to involve a noticeable delay on 1st use, much higher memory use, and extra complications when attempting to deploy/share vectors among multiple processes.
- A number of errors and inefficiencies in the FastText implementation have been corrected. Model size in memory and when saved to disk will be much smaller, and using FastText as if it were Word2Vec, by disabling character n-grams (with max_n=0), should be as fast & performant as vanilla Word2Vec.
- When supplying a Python iterable corpus to instance-initialization, build_vocab(), or train(), the parameter name is now corpus_iterable, to reflect the central expectation (that it is an iterable) and for correspondence with the corpus_file alternative. The prior model-specific names for this parameter, like sentences or documents, were overly-specific, and sometimes led users to the mistaken belief that such input must be precisely natural-language sentences. See the sketch just below this list.
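For example, the two-step setup with the new parameter name (a minimal sketch; the toy corpus & min_count=1 are just for illustration):
from gensim.models import Word2Vec

sentences = [["hello", "world"], ["machine", "learning", "rocks"]] # toy corpus
model = Word2Vec(vector_size=100, epochs=5, min_count=1)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences, total_examples=model.corpus_count, epochs=model.epochs)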
If you're unsure or getting unexpected results, let us know at the Gensim mailing list.
The gensim.models.phrases.Phraser class was renamed to FrozenPhrases, to be more explicit in its intent, and easier to tell apart from its chunkier parent Phrases:
phrases = Phrases(corpus)
phraser = Phraser(phrases) # 🚫
phrases = Phrases(corpus)
frozen_phrases = phrases.freeze() # 👍
Note that phrases (collocation detection, multi-word expressions) have been pretty much rewritten from scratch for Gensim 4.0, and are more efficient and flexible now overall.
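Applying the frozen phrase detector still works via item-lookup on a token list; a quick sketch (the toy corpus & lowered thresholds are illustrative only):
from gensim.models.phrases import Phrases

corpus = [["new", "york", "city"], ["new", "york", "times"]] # toy corpus
phrases = Phrases(corpus, min_count=1, threshold=0.1)
frozen_phrases = phrases.freeze()
print(frozen_phrases[["new", "york", "city"]]) # e.g. ['new_york', 'city'], if the bigram scores above threshold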
The gensim.summarization module was removed entirely. Despite its general-sounding name, the module will not satisfy the majority of use cases in production and is likely to waste people's time. See this Github ticket for more motivation behind this.
Also removed: a rarely used contributed module, of poor quality in both its code and documentation.
The similarities.index module was renamed to similarities.annoy. The original name was too broad. Now it's clearer this module employs the Annoy kNN library, while there's also similarities.nmslib etc.
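If you were importing the Annoy indexer class from the old location, update the import accordingly (assuming the AnnoyIndexer class, which moved with the module):
from gensim.similarities.index import AnnoyIndexer # 🚫
from gensim.similarities.annoy import AnnoyIndexer # 👍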
These wrappers of 3rd party libraries required too much effort. There were no volunteers to maintain and support them properly in Gensim.
If your work depends on any of the modules below, feel free to copy it out of Gensim 3.8.3 (the last release where they appear), and extend & maintain it yourself.
The removed submodules are:
- gensim.models.wrappers.dtmmodel
- gensim.models.wrappers.ldamallet
- gensim.models.wrappers.ldavowpalwabbit
- gensim.models.wrappers.varembed
- gensim.models.wrappers.wordrank
- gensim.sklearn_api.atmodel
- gensim.sklearn_api.d2vmodel
- gensim.sklearn_api.ftmodel
- gensim.sklearn_api.hdp
- gensim.sklearn_api.ldamodel
- gensim.sklearn_api.ldaseqmodel
- gensim.sklearn_api.lsimodel
- gensim.sklearn_api.phrases
- gensim.sklearn_api.rpmodel
- gensim.sklearn_api.text2bow
- gensim.sklearn_api.tfidf
- gensim.sklearn_api.w2vmodel
- gensim.viz