Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor separation of concerns in fasttext design #2312

Closed
mpenkov opened this issue Dec 29, 2018 · 0 comments · Fixed by #2313
Closed

Poor separation of concerns in fasttext design #2312

mpenkov opened this issue Dec 29, 2018 · 0 comments · Fixed by #2313
Assignees
Labels
difficulty hard Hard issue: required deep gensim understanding & high python/cython skills fasttext Issues related to the FastText model

Comments

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 29, 2018

The architecture consists of several classes:

  • FastTextKeyedVectors (embeddings)
  • FastTextTrainables (neural network)
  • FastText (the actual model)

The separation of concerns between the classes is poor. For example, the FastTextTrainables neural network knows far too much about the implementation details of FastTextKeyedVectors embeddings. Here is a concrete example (full code here):

            wv.vectors_vocab = empty((len(wv.vocab), wv.vector_size), dtype=REAL)
            self.vectors_vocab_lockf = ones((len(wv.vocab), wv.vector_size), dtype=REAL)

            wv.vectors_ngrams = empty((self.bucket, wv.vector_size), dtype=REAL)
            self.vectors_ngrams_lockf = ones((self.bucket, wv.vector_size), dtype=REAL)

            wv.hash2index = {}
            wv.buckets_word = {}
            ngram_indices = []
            for word, vocab in wv.vocab.items():
                buckets = []
                for ngram in _compute_ngrams(word, wv.min_n, wv.max_n):
                    ngram_hash = _ft_hash(ngram) % self.bucket
                    if ngram_hash not in wv.hash2index:
                        wv.hash2index[ngram_hash] = len(ngram_indices)
                        ngram_indices.append(ngram_hash)
                    buckets.append(wv.hash2index[ngram_hash])
                wv.buckets_word[vocab.index] = np.array(buckets, dtype=np.uint32)
            wv.num_ngram_vectors = len(ngram_indices)

The above code is part of the FastTextTrainables, but it's writing to attributes of FastTextKeyedVectors. It knows about what the attributes of FastTextKeyedVectors are, and how they are related.

Ideally, such code should be in the FastTextKeyedVectors class. In practice, this may not be as simple, because there may be code common to both classes there. Identifying such areas (concerns), splitting them, and separating the concerns would improve the fasttext design significantly.

@mpenkov mpenkov added the fasttext Issues related to the FastText model label Dec 29, 2018
@mpenkov mpenkov self-assigned this Dec 29, 2018
@menshikh-iv menshikh-iv added the difficulty hard Hard issue: required deep gensim understanding & high python/cython skills label Dec 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty hard Hard issue: required deep gensim understanding & high python/cython skills fasttext Issues related to the FastText model
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants