Glove model - Draft Pull Request #189

Closed
Conversation

raphaelsty (Member) commented Dec 9, 2019

Hi ☺️,

I know that everyone is waiting for GloVe, but this work is still in progress. This draft pull request is linked to issue #148.

Remarks on the model:
For simplicity and transparency in the model, I did not convert the tokens into indices. I use the words directly as dictionary keys.

It's possible to initialize the word weights with pre-trained ones thanks to the `load` method.

I have a few questions:
How can I avoid the lambda function in `collections.defaultdict`? Is this really a problem here?

Do you think that the output of `transform_one()` and `fit_transform_one()` is the right one? I mean, a dictionary of dictionaries: `{'weather': {0: 1.281993..., 1: 1.148658...}}`

Remarks on the code:
I still have to benchmark the model to make sure that the results are good and that I have not made any mistakes. I will find compute resources to run it on Wikipedia as the authors did.

Feel free to review my code and give me feedback.

Raphaël

__all__ = ['Glove']


class Glove(base.Transformer, vectorize.VectorizerMixin):
Member:

Can you call it GloVe?

self.t = t
self.window_size = window_size
self.cooccurrences = defaultdict(float)
self.weights = defaultdict(
Member:

Yeah, you need to use a function or `functools.partial`. The issue with a lambda is that it means the class can't be pickled, which is important :)
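For illustration, a minimal sketch of the `functools.partial` approach, assuming the weights map each token to a `defaultdict(float)` of dimension values (an assumption based on the output format shown above):

import functools
import pickle
from collections import defaultdict

# A lambda default_factory cannot be pickled because it has no importable
# name; functools.partial of named callables can.
weights = defaultdict(functools.partial(defaultdict, float))

weights['weather'][0] += 1.281993
print(pickle.loads(pickle.dumps(weights))['weather'][0])  # 1.281993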

"""
# Extract the document's words as a list of tokens:
tokens = self.tokenizer(self.preprocess(self._get_text(tokens)))
return {token: self.weights[token] for token in tokens}
Member:

I think we should return the average of the word embeddings. I'm not sure what most people do, but I guess averaging is the way to go.
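For illustration, a minimal sketch of dimension-wise averaging, assuming each entry of the weights is a dict mapping dimension index to value (the `average_embedding` helper here is hypothetical, not part of the PR):

def average_embedding(tokens, weights, n_components):
    # Average each embedding dimension over the document's tokens.
    vectors = [weights[token] for token in tokens]
    return {
        i: sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(n_components)
    }

weights = {'nice': {0: 0.5, 1: 1.0}, 'weather': {0: 1.5, 1: 2.0}}
print(average_embedding(['nice', 'weather'], weights, n_components=2))
# {0: 1.0, 1: 1.5}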

path (str): path to the weights of the pre-trained model.

"""
file = open(path, 'r')
Member:

What format is this function assuming the file is in? If the format has a name could you please specify it in the comment? :)
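For reference, the pre-trained vectors the GloVe authors distribute use a plain-text format: one word per line, followed by its vector components separated by spaces. A minimal sketch of a loader assuming that format (the `load_glove_txt` name is hypothetical):

def load_glove_txt(path):
    # Parse the GloVe text format: `<word> <v1> <v2> ... <vd>` per line.
    weights = {}
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            word, *values = line.rstrip().split(' ')
            weights[word] = {i: float(v) for i, v in enumerate(values)}
    return weights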
