Poor separation of concerns in fasttext design #2312

mpenkov · 2018-12-29T00:44:11Z

The architecture consists of several classes:

FastTextKeyedVectors (embeddings)
FastTextTrainables (neural network)
FastText (the actual model)

The separation of concerns between the classes is poor. For example, the FastTextTrainables neural network knows far too much about the implementation details of FastTextKeyedVectors embeddings. Here is a concrete example (full code here):

            wv.vectors_vocab = empty((len(wv.vocab), wv.vector_size), dtype=REAL)
            self.vectors_vocab_lockf = ones((len(wv.vocab), wv.vector_size), dtype=REAL)

            wv.vectors_ngrams = empty((self.bucket, wv.vector_size), dtype=REAL)
            self.vectors_ngrams_lockf = ones((self.bucket, wv.vector_size), dtype=REAL)

            wv.hash2index = {}
            wv.buckets_word = {}
            ngram_indices = []
            for word, vocab in wv.vocab.items():
                buckets = []
                for ngram in _compute_ngrams(word, wv.min_n, wv.max_n):
                    ngram_hash = _ft_hash(ngram) % self.bucket
                    if ngram_hash not in wv.hash2index:
                        wv.hash2index[ngram_hash] = len(ngram_indices)
                        ngram_indices.append(ngram_hash)
                    buckets.append(wv.hash2index[ngram_hash])
                wv.buckets_word[vocab.index] = np.array(buckets, dtype=np.uint32)
            wv.num_ngram_vectors = len(ngram_indices)

The code above belongs to FastTextTrainables, yet it writes directly to attributes of FastTextKeyedVectors. To do so, it has to know which attributes FastTextKeyedVectors has and how they relate to each other.

Ideally, such code should live in the FastTextKeyedVectors class. In practice, this may not be that simple, because some of the code may be common to both classes. Identifying such areas (concerns) and splitting them cleanly between the classes would improve the fasttext design significantly.
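
To make the proposal concrete, here is a rough sketch of what the split could look like. It is not the actual gensim code: SketchKeyedVectors, SketchTrainables, init_ngrams, compute_ngrams and toy_hash are hypothetical names, the two helpers are simplified stand-ins for _compute_ngrams and _ft_hash, the vocab is reduced to a plain word-to-index dict, and bucket is assumed to move onto the embeddings object. The idea is that the embeddings class builds its own n-gram tables, and the trainables only allocate their lock factors from the resulting shapes.

```python
import numpy as np

REAL = np.float32


def compute_ngrams(word, min_n, max_n):
    """Simplified stand-in for _compute_ngrams: character n-grams of '<word>'."""
    extended = "<" + word + ">"
    return [
        extended[i:i + n]
        for n in range(min_n, min(len(extended), max_n) + 1)
        for i in range(len(extended) - n + 1)
    ]


def toy_hash(ngram):
    """Stand-in for _ft_hash (FNV-1a style); for illustration only."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


class SketchKeyedVectors:
    """Hypothetical embeddings class that owns its n-gram bookkeeping."""

    def __init__(self, vocab, vector_size, min_n, max_n, bucket):
        self.vocab = vocab  # assumption: plain dict of word -> row index
        self.vector_size = vector_size
        self.min_n = min_n
        self.max_n = max_n
        self.bucket = bucket  # assumption: bucket count lives here

    def init_ngrams(self):
        """Allocate the vector tables and build the hash -> index mapping."""
        # zeros instead of empty, so the sketch is deterministic
        self.vectors_vocab = np.zeros((len(self.vocab), self.vector_size), dtype=REAL)
        self.vectors_ngrams = np.zeros((self.bucket, self.vector_size), dtype=REAL)

        self.hash2index = {}
        self.buckets_word = {}
        for word, index in self.vocab.items():
            buckets = []
            for ngram in compute_ngrams(word, self.min_n, self.max_n):
                ngram_hash = toy_hash(ngram) % self.bucket
                # setdefault assigns the next free index to an unseen hash
                buckets.append(self.hash2index.setdefault(ngram_hash, len(self.hash2index)))
            self.buckets_word[index] = np.array(buckets, dtype=np.uint32)
        self.num_ngram_vectors = len(self.hash2index)


class SketchTrainables:
    """Hypothetical trainables class: it only manages its own lock factors."""

    def init_ngrams_weights(self, wv):
        wv.init_ngrams()  # the embeddings set themselves up
        self.vectors_vocab_lockf = np.ones(wv.vectors_vocab.shape, dtype=REAL)
        self.vectors_ngrams_lockf = np.ones(wv.vectors_ngrams.shape, dtype=REAL)


wv = SketchKeyedVectors({"cat": 0, "cats": 1}, vector_size=4, min_n=3, max_n=4, bucket=50)
trainables = SketchTrainables()
trainables.init_ngrams_weights(wv)
```

With this layout, the trainables never touch hash2index or buckets_word, so the embeddings class is free to change how it stores n-grams without breaking the training code. Whether bucket really belongs on the embeddings is one of the open questions mentioned above.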

mpenkov added the "fasttext" label (Issues related to the FastText model) Dec 29, 2018

mpenkov self-assigned this Dec 29, 2018

menshikh-iv added the "difficulty hard" label (Hard issue: required deep gensim understanding & high python/cython skills) Dec 29, 2018

mpenkov mentioned this issue Jan 7, 2019

Fix critical issues in FastText #2313

Merged

menshikh-iv closed this as completed in #2313 Jan 11, 2019
