Glove model - Draft Pull Request #189
Conversation
creme/decomposition/glove.py
Outdated
__all__ = ['Glove']


class Glove(base.Transformer, vectorize.VectorizerMixin):
Can you call it `GloVe`?
creme/decomposition/glove.py
Outdated
self.t = t
self.window_size = window_size
self.cooccurrences = defaultdict(float)
self.weights = defaultdict(
Yeah, you need to use a function or `functools.partial`. The issue with `lambda` is that it means the class can't be pickled, which is important :)
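For context, here is a minimal sketch of the pickling issue and of the `functools.partial` fix. The helper name `_init_vector` and the vector size of 2 are illustrative, not taken from the PR.

```python
import functools
import pickle
import random
from collections import defaultdict


def _init_vector(n_components):
    """Return a small random vector used as the default embedding."""
    return [random.uniform(-0.5, 0.5) for _ in range(n_components)]


# A lambda as the default factory makes the whole defaultdict unpicklable:
weights_lambda = defaultdict(lambda: _init_vector(2))
try:
    pickle.dumps(weights_lambda)
except (pickle.PicklingError, AttributeError) as exc:
    print('lambda default factory cannot be pickled:', exc)

# functools.partial wraps a module-level function, so it pickles fine:
weights_partial = defaultdict(functools.partial(_init_vector, 2))
weights_partial['weather']  # triggers the default factory
restored = pickle.loads(pickle.dumps(weights_partial))
assert restored == weights_partial
```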
creme/decomposition/glove.py
Outdated
""" | ||
# Extracts words of the document as a list of words: | ||
tokens = self.tokenizer(self.preprocess(self._get_text(tokens))) | ||
return {token: self.weights[token] for token in tokens} |
I think we should return the average of the word embeddings. I'm not sure what most people do, but I guess averaging is the way to go.
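As an illustration, here is a minimal sketch of what averaging could look like, assuming each token maps to a `{dimension: value}` dict as in the example output discussed below; the function name `average_embeddings` is hypothetical.

```python
from collections import defaultdict


def average_embeddings(tokens, weights):
    """Return the element-wise mean of the tokens' embedding vectors."""
    totals = defaultdict(float)
    for token in tokens:
        for dim, value in weights[token].items():
            totals[dim] += value
    return {dim: total / len(tokens) for dim, total in totals.items()}


weights = {
    'sunny': {0: 1.0, 1: 2.0},
    'weather': {0: 3.0, 1: 4.0},
}
print(average_embeddings(['sunny', 'weather'], weights))
# {0: 2.0, 1: 3.0}
```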
creme/decomposition/glove.py
Outdated
path (str): path of the weights of the pretrained model.

"""
file = open(path, 'r')
What format is this function assuming the file is in? If the format has a name, could you please specify it in the comment? :)
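If the file is the standard pretrained GloVe text format (one word per line, followed by its space-separated float components, as in `glove.6B.50d.txt`), a loader could look like the sketch below. The function name `load_pretrained` is hypothetical, and whether the PR's `load` method expects exactly this format is an assumption.

```python
def load_pretrained(path):
    """Read vectors from a GloVe-style text file: `word v1 v2 ... vn` per line."""
    weights = {}
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.rstrip().split(' ')
            word, values = parts[0], parts[1:]
            weights[word] = {i: float(v) for i, v in enumerate(values)}
    return weights
```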
Hi ☺️,
I know that everyone is waiting for GloVe, but this work is still in progress. This draft pull request is linked to issue #148.
Remarks on the model:
For reasons of simplicity and transparency in the model, I did not convert the tokens into an index. I used the words directly as identifiers for the dictionaries.
It's possible to initialize the weights of the words with pre-trained ones thanks to the `load` method.

I have a few questions:

- How can I avoid the lambda function in `collections.defaultdict`? Is this really a problem here?
- Do you think that the output of `transform_one()` and `fit_transform_one()` is the right one? I mean, a dictionary of dictionaries: `{'weather': {0: 1.281993..., 1: 1.148658...}}`
Remarks on the code:
I have to benchmark the model to make sure that the results are good and that I have not made any mistakes. I will find computing resources to run it on Wikipedia, as the authors did.
Feel free to review my code and give me feedback.
Raphaël