Inconsistent results on different machines #559
Update: Seems like […]
Strange, I believe that should not happen if you have set a `random_state` in the UMAP model.
Hey, I am suffering from the same issue right now, but in an even more extreme way: I am getting topics that are represented by empty strings.
This can happen if the documents themselves are empty or if extremely low c-TF-IDF scores are assigned to specific words. In those cases, topics represented by empty strings could be generated. I believe this happened more often in v0.9.4 than in v0.10.0.
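A minimal pre-filter along those lines (a sketch; `raw_docs` is a hypothetical name for the raw input list):

```python
# A minimal sketch: remove empty or whitespace-only documents before fitting,
# since they can produce topics represented by empty strings.
raw_docs = ["a normal document", "", "   ", "another document"]  # placeholder data
docs = [doc for doc in raw_docs if doc and doc.strip()]
```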
With respect to the random results, that can depend on a few more things than just the `random_state` of the UMAP model.
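For reference, a minimal sketch of how a fixed integer seed can be passed to the UMAP step inside BERTopic; the other parameter values are assumed here to mirror BERTopic's defaults:

```python
from umap import UMAP
from bertopic import BERTopic

# Build a seeded UMAP model and hand it to BERTopic so the dimensionality
# reduction step is deterministic on a given machine.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_model)
```

Note that, as discussed further down in this thread, this pins the result on one machine but does not guarantee identical output across OSes.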
Thank you! It seems to have indeed been a problem with BERTopic installing an old version (0.9.2). I wonder why, since it was installed from a pip requirements file with no specific version specified.
@OmriPi Glad to hear that it solved your issue!
So after retrying with an up-to-date version, I'm no longer getting empty topics, but I'm still getting inconsistent results between different OSes. In particular, the clustering is different: on my local macOS machine 3 topics are created, while in the Docker container running Linux 4 topics are created, but one of the topics appears twice (for example, two clusters named "good"). The hierarchy graph shows they're so close together that they should be merged. I don't understand why they aren't, especially with the sentences containing them being exactly the same. And why, despite setting a `random_state`, the results still differ between machines.
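For the near-duplicate clusters specifically, one possible workaround (not necessarily what was used here) is to merge them explicitly after fitting; recent BERTopic versions expose `merge_topics` for this, and the topic ids below are hypothetical:

```python
# Assumes `topic_model` is a fitted BERTopic instance and `docs` is the list
# of documents it was fitted on. The ids (1, 2) are hypothetical; look the
# duplicated "good" topics up via topic_model.get_topic_info() first.
topic_model.merge_topics(docs, topics_to_merge=[1, 2])
print(topic_model.get_topic_info())
```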
Did you make sure that your environment, aside from the OSes, is exactly the same between devices? Also, could you create a reproducible example for me to look at?
Yes, the environment is exactly the same: a clean installation from a requirements file, the same Python version, etc. As far as I can tell, the only difference is that the Docker container on the remote machine runs Linux, while locally I'm working on macOS. I have also seen this issue reported for UMAP, which seems to be the root cause: lmcinnes/umap#153. I will try to give a generic example that is as close as possible to my case:
The dataset consists of 3 sentences, each repeated 67 times (201 sentences in total).
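The three sentences themselves did not survive in the thread, so the sketch below rebuilds only the shape of the dataset with placeholder sentences:

```python
from bertopic import BERTopic

# Placeholder sentences; the originals were not preserved in the thread.
sentences = [
    "the service was really good",
    "the delivery arrived very late",
    "the product broke after a week",
]
docs = sentences * 67  # 201 documents, 67 copies of each sentence

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```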
Reading through the issues you refer to, it indeed seems that some changes may happen across different OSes. Unfortunately, I also do not have a solution for this other than using a containerization tool like Docker.
Do note that the same requirements file may lead to different versions of the packages if the OS differs, as wheels can be built differently depending on the OS that you use.
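A quick way to verify that the resolved package versions actually match across machines is to print them on both sides and diff the output; a small sketch:

```python
from importlib.metadata import version

# Print the installed versions of the packages most relevant to BERTopic's
# reproducibility; run this on both machines and compare the output.
for pkg in ("bertopic", "umap-learn", "hdbscan", "numpy", "scikit-learn"):
    print(pkg, version(pkg))
```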
Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!
Hey everyone! I have run into the same issue, and on the same machine (iMac running Mojave) running on the same dataset at different times: this is even using `random_state=np.random.RandomState(42)` as has been suggested elsewhere. Attached is an example output using the same input data run two different times. This is a shame, because in my limited testing UMAP outperforms t-SNE, but if we can't get the same results from session to session it limits the usefulness.
Facing the same issue
Hi @MaartenGr, I am using BERTopic version 0.15.0 and fixing the seed of UMAP too, but each time I run it I get a different topic model output, with +/- 10 extra topics compared to the previous run. I am attaching the code that I am using. Would you mind helping me solve the issue?

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
ctfidf_model = ClassTfidfTransformer()
representation_model = KeyBERTInspired()
topic_modelA = BERTopic(...)  # remaining arguments truncated in the original post
topicsA, probsA = topic_modelA.fit_transform(dfA1['SENTENCE'])
```
@sauravwel If you are using the exact same environment/machine/OS between two runs, then setting the random state should fix the output. More specifically, I would advise simply setting an integer for `random_state` in the UMAP model.
Unfortunately, it does not @MaartenGr. Also, I already tried setting an integer for `random_state`, but I faced the same issue. Using UMAP in BERTopic gives me the best model for my use case, but it has a reproducibility issue, so I am thinking of dropping the idea and moving to more conservative models. Any last thoughts on how we could fix it? I guess it's a major issue for many of those who commented above, since we are not able to replicate the same results between development and production environments given that everything else remains the same.
@sauravwel Sorry to hear that the environments are not the same. I believe that, due to the stochastic nature of UMAP, the random initialization might be different between OSes, since they handle that differently. I might be mistaken though. Just to be sure I understand your use case: why would you re-train the exact same model between development and production if you can simply save and load the model? If the model you get in development is exactly what you are looking for, you can just save that model and load it in your production environment. In contrast, if the parameters you get in development work for your use case and you have different data in production, then differences would already be expected, so I guess that would not be the case.
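A sketch of the save-and-load workflow being suggested, with placeholder document lists:

```python
from bertopic import BERTopic

docs = ["..."]      # training documents (placeholder)
new_docs = ["..."]  # documents seen in production (placeholder)

# Development: train once and persist the fitted model to disk.
topic_model = BERTopic()
topic_model.fit(docs)
topic_model.save("my_topic_model")

# Production: load the identical fitted model instead of re-training.
loaded_model = BERTopic.load("my_topic_model")
topics, probs = loaded_model.transform(new_docs)
```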
Sorry, you misunderstood: in my case the environment is exactly the same. As for why I am re-training the model: in the banking world, a model validator tries to replicate the model while keeping the environment the same. It has to do with model regulation. @MaartenGr, I was under the assumption that the randomness is due to some stochastic nature that remains even after fixing the random state of UMAP. But there is something else: when I change the dimensionality reduction model from UMAP to PCA (`dim_model = PCA(n_components=0.7, random_state=24)`), keeping the remaining code above the same, the output again changes! Is there any possibility of randomness coming from the HDBSCAN model? I checked the documentation, but it is not mentioned. What are your thoughts?
@sauravwel Hmmm, it might then be related to HDBSCAN. I do not believe there should be any randomness there, but perhaps you have stumbled upon an edge case. It might make sense to isolate the dimensionality reduction step from the clustering step and re-run only the latter (see the sketch after this comment):
That way, you can test out whether the changes in output are a result of HDBSCAN. The thing is, and I have tested this quite often: if you set a `random_state` for UMAP, the output should be reproducible within the same environment.
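The concrete steps did not survive in the thread, but one way to run such a test, as a sketch: reduce the embeddings once with a seeded UMAP, then cluster the same reduced embeddings several times and compare the labels.

```python
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN

# Stand-in for real sentence embeddings (placeholder data).
embeddings = np.random.RandomState(0).rand(200, 384)

# Reduce once with a fixed seed so the input to HDBSCAN is identical each run.
reduced = UMAP(n_components=5, random_state=42).fit_transform(embeddings)

# Cluster the same reduced embeddings several times; if the labels differ
# between runs, the variability comes from the clustering step.
runs = [HDBSCAN(min_cluster_size=10).fit_predict(reduced) for _ in range(3)]
print(all(np.array_equal(runs[0], labels) for labels in runs[1:]))
```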
Hey!
Recently I found out that my code is giving me different results when running on local (M1 MacBook), local Docker, or k8s Docker containers. I do use a `random_seed` for `UMAP`; however, `20newsgroups` is also behaving differently and returning not exactly the same results. From my code's perspective the difference is quite big: on localhost the script generated ~150 topics and on the cloud just around 40 (even the initial set of topics was similar but not exact). I double-checked whether the same data goes in and tried to set different `numpy` random seeds, but nothing really happened. I tested this with `python:3.7` and `python:3.9`, as well as `bertopic` versions `0.10.0` and `0.9.4`. Any idea or experience how to make the results the same across different platforms?