Adding Clustering to Flair #2573
@OatsProduction thanks a lot for adding this! On first look-through a few points:
Thanks a lot for adding this feature @OatsProduction. Some points regarding this PR:
- Can you remove all files not related to this PR? That would make it easier to review.
- I have only reviewed KMeans so far. Over the next few days I will try to help with our conventions regarding naming, required trainers, and so on.
Thanks for reviewing this PR. Its current status:
I opened this PR mainly to get it started and off my to-do list, so this is a WIP branch.
EM clustering is now done and functional and can be reviewed. BIRCH is almost done and can be reviewed soon.
Clustering refactorings
improved the TUTORIAL_12_CLUSTERING.md
How do you think I should add the evaluation datasets that are needed, such as the StackOverflow dataset? Also, my idea for saving/loading the model:
The integration of the sklearn clustering algorithms with flair is done.
https://scikit-learn.org/stable/modules/model_persistence.html
The StackOverflow dataset comes from https://github.com/jacoxu/StackOverflow
I added a StackOverflow corpus, but my current implementation fails. @whoisjones can you look at this one?
@OatsProduction regarding your question: see my review comments. There are also some other remarks :)
added evaluation method
improved the tutorial
Thanks @OatsProduction - looks good but some things:
- The signatures are a bit counterintuitive. Why is the ClusteringModel instantiated with the corpus? The corpus is only needed for the fit() and evaluate() methods, so it would be better to pass the corpus to those methods. So instead of
corpus = TREC_6(memory_mode='full').downsample(0.05)
model = KMeans(n_clusters=6)
clustering_model = ClusteringModel(
    model=model,
    corpus=corpus,
    label_type="question_class",
    embeddings=embeddings
)
# fit the model
clustering_model.fit()
# evaluate the model
clustering_model.evaluate()
it should be:
corpus = TREC_6(memory_mode='full').downsample(0.05)
model = KMeans(n_clusters=6)
clustering_model = ClusteringModel(
    model=model,
    label_type="question_class",
    embeddings=embeddings
)
# fit the model on a corpus
clustering_model.fit(corpus)
# evaluate the model on a corpus
clustering_model.evaluate(corpus)
- The loading is counterintuitive as it requires a different clustering method to already be initialized. Better would be a static method:
# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")
# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')
# predict for sentence
model.predict(sentence)
# print sentence with prediction
print(sentence)
- Small thing but it would be nice to use label names instead of numbers in the STACKOVERFLOW corpus
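The first two review points (corpus passed to fit() and evaluate() rather than the constructor, and loading through a static method) can be illustrated with a small standard-library sketch. Everything in it is hypothetical and is not Flair's implementation: `ClusteringSketch`, `MeanSplitClusterer` (a toy stand-in for an sklearn-style estimator), `text_length` (a stand-in for an embedding), the use of purity as the evaluation score, and pickle as the storage format are all assumptions made for illustration.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def text_length(text):
    """Hypothetical stand-in for a document embedding: one number per text."""
    return float(len(text))


class MeanSplitClusterer:
    """Toy stand-in for an sklearn-style estimator with fit/predict."""

    def fit(self, values):
        self.threshold_ = sum(values) / len(values)
        return self

    def predict(self, values):
        # Cluster 0: at or below the mean seen during fit; cluster 1: above it.
        return [int(v > self.threshold_) for v in values]


class ClusteringSketch:
    """Hypothetical wrapper showing the suggested signatures (not Flair's class)."""

    def __init__(self, model, embed):
        # No corpus here: the wrapper only holds the algorithm and the embedding.
        self.model = model
        self.embed = embed

    def fit(self, corpus):
        self.model.fit([self.embed(text) for text, _ in corpus])

    def predict(self, texts):
        return self.model.predict([self.embed(text) for text in texts])

    def evaluate(self, corpus):
        # Purity (an assumed metric): share of items whose gold label is the
        # majority label of the cluster they were assigned to.
        clusters = self.predict([text for text, _ in corpus])
        labels_per_cluster = {}
        for cluster, (_, label) in zip(clusters, corpus):
            labels_per_cluster.setdefault(cluster, Counter())[label] += 1
        majority = sum(c.most_common(1)[0][1] for c in labels_per_cluster.values())
        return majority / len(corpus)

    def save(self, model_file):
        Path(model_file).write_bytes(pickle.dumps(self))

    @staticmethod
    def load(model_file):
        # Static method: no pre-initialized clustering object is needed to load.
        return pickle.loads(Path(model_file).read_bytes())


# Tiny made-up corpus of (text, gold label) pairs.
corpus = [
    ("hi", "short"),
    ("ok", "short"),
    ("a much longer sentence", "long"),
    ("another long sentence here", "long"),
]

clustering_model = ClusteringSketch(model=MeanSplitClusterer(), embed=text_length)
clustering_model.fit(corpus)
purity = clustering_model.evaluate(corpus)

with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "clustering_model.pkl"
    clustering_model.save(model_file)
    loaded = ClusteringSketch.load(model_file)

predictions = loaded.predict(["hey", "a considerably longer question text"])
```

With this shape, the same fitted object can be evaluated on several corpora, and a saved model can be restored without first constructing a different clustering algorithm.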
better labels for corpus STACKOVERFLOW
I addressed every remark in the code. This PR needs another review.
@OatsProduction the code from the tutorial throws an error during the predict method:
from sklearn.cluster import KMeans
from flair.data import Sentence
from flair.datasets import TREC_6
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.models import ClusteringModel
embeddings = SentenceTransformerDocumentEmbeddings()
# store all embeddings in memory which is required to perform clustering
corpus = TREC_6(memory_mode='full').downsample(0.05)
clustering_model = ClusteringModel(model=KMeans(n_clusters=6), embeddings=embeddings)
# fit the model on a corpus
clustering_model.fit(corpus)
# save the model
clustering_model.save(model_file="clustering_model.pt")
# load saved clustering model
model = ClusteringModel.load(model_file="clustering_model.pt")
# make example sentence
sentence = Sentence('Getting error in manage categories - not found for attribute "navigation _ column"')
# predict for sentence
model.predict(sentence)
# print sentence with prediction
print(sentence)
Can you fix it so that the tutorial code works? Two other things:
Some comments on the issues before:
Thanks @OatsProduction - I'll merge this and take care of the flake errors.
This PR is the result of study work for @alanakbik. It adds three clustering algorithms to Flair.