
Inconsistent results on different machines #559

Closed
marekargalas opened this issue Jun 10, 2022 · 20 comments

Comments

marekargalas commented Jun 10, 2022

Hey!

Recently I found out that my code gives different results when running locally (M1 MacBook), in local Docker, or in k8s Docker containers. I do set a random seed for UMAP, yet even the 20newsgroups example behaves differently and does not return exactly the same results. From my code's perspective the difference is quite big: on localhost the script generated ~150 topics and in the cloud only around 40 (even the initial set of topics was similar but not identical).
I double-checked that the same data goes in and tried setting different numpy random seeds, but nothing really changed.
I tested this with python:3.7 and python:3.9, as well as bertopic versions 0.10.0 and 0.9.4.

Any idea or experience with how to make results the same across different platforms?

marekargalas (Author) commented

Update:

Seems like UMAP is indeed the problem: when I used PCA instead, the results are the same across machines/platforms. Any ideas on how to tackle this?
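
For reference, a minimal sketch of what the PCA swap can look like (docs here stands for the input documents; BERTopic should accept any dimensionality reduction model that implements fit/transform in place of UMAP):

from sklearn.decomposition import PCA
from bertopic import BERTopic

# PCA is deterministic for a fixed setup, so the reduced embeddings (and thus
# the clusters) should match across machines, unlike UMAP.
dim_model = PCA(n_components=5)

topic_model = BERTopic(umap_model=dim_model)
topics, probs = topic_model.fit_transform(docs)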

MaartenGr (Owner) commented

Strange, I believe that should not happen if you have set a random_state. Could you share the code you have been using for instantiating BERTopic? Perhaps we can find the culprit in there.


OmriPi commented Jun 14, 2022

Hey, I am suffering from the same issue right now, but in an even more extreme way:
I created a dataset of 3 sentences repeating 200 times, expecting to get 3 clusters. On my local machine (macOS) I get 3 clusters and they all look great. However, on a remote machine running Docker on Linux, the same test gives a different and very strange result: the clustering itself is done correctly (meaning I get 3 clusters and each is assigned the correct samples), but one of the clusters has no topic words representing it; they are all just empty '' strings. After reading around, I realize the randomness is due to UMAP being inconsistent across different operating systems (and I didn't really find a way around it), but how is it possible that a cluster has no topic words and they're all empty? Is that a bug, or is a proper cluster with an empty name a legitimate output?
Since it only happens on the remote machine, my ability to debug it is very limited.
This is the code I run:
vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=stopwords)
umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0, metric='cosine', random_state=42, low_memory=False)
model = BERTopic(umap_model=umap_model, nr_topics='auto', min_topic_size=30, vectorizer_model=vectorizer, calculate_probabilities=True, verbose=True)
and the sentence which fails to get a topic is "Not good enough!"
On my local machine the topic "good" is extracted, but in the Docker container running on Linux the topic is '' (though a cluster is created).
Any advice or way to fix it?
@MaartenGr

MaartenGr (Owner) commented

@OmriPi

but how is it possible that a cluster has no topic words and they're all empty? Is that a bug? Or is a proper cluster with empty name a legitimate output?

This can happen if the documents themselves are empty or if extremely low c-TF-IDF scores are assigned to specific words. In those cases, topics represented by empty strings could be generated. I believe this happened more often in v0.9.4 than in v0.10.0.

On my local machine the topic "good" is extracted, but on the docker which runs on linux, the topic is '' (but a cluster is created).
Any advice or way to fix it?

With respect to the random results, that can depend on a few more things than just the random_state in UMAP. For example, if you are using different operating systems, then the package versions that you are using might also differ. Making sure that you are in the exact same environment, at least with respect to the packages used, helps to reproduce the results that you have had.


OmriPi commented Jun 15, 2022

@MaartenGr

With respect to the random results, that can depend on a few more things than just the random_state in UMAP. For example, if you are using different operating systems, then the package versions that you are using might also differ. Making sure that you are in the exact same environment, at least with respect to the packages used, helps to reproduce the results that you have had.

Thank you! It does indeed seem to have been a problem with an old version of BERTopic (0.9.2) being installed. I wonder why, since it was installed from a pip requirements file with no specific version specified.
Forcing version 0.10.0 seems to fix the problem.
Thank you for this incredible package!
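
For anyone hitting the same thing, pinning the version explicitly in the requirements file avoids silently falling back to an older release, e.g.:

bertopic==0.10.0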

MaartenGr (Owner) commented

@OmriPi Glad to hear that it solved your issue!


OmriPi commented Jun 20, 2022

So after retrying with an up-to-date version, I'm no longer getting empty topics, but I'm still getting inconsistent results between different OSes. In particular, the clustering is different: on my local macOS machine 3 topics are created, while in the Docker container running Linux 4 topics are created, and one of the topics appears twice (for example, two clusters named "good"). The hierarchy graph shows they are so close together that they should be merged. I don't understand why they aren't, especially since the sentences containing them are exactly the same, and why the results still differ between machines despite setting a random_state...

MaartenGr (Owner) commented

Did you make sure that your environment, aside from the OSes, is exactly the same between devices? Also, could you create a reproducible example for me to look at?


OmriPi commented Jun 21, 2022

Yes, the environment is exactly the same: a clean installation from a requirements file, the same Python version, etc. As far as I can tell, the only difference is that the Docker container on the remote machine runs Linux, while locally I'm working on macOS. I have also seen this issue reported in UMAP, which seems to be the root cause:

lmcinnes/umap#153
lmcinnes/umap#183
lmcinnes/umap#158

I will try to give a generic example which is as close as can be to my case:

vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words=stopwords)
umap_model = UMAP(n_neighbors=30, n_components=5, min_dist=0.0, metric='cosine', random_state=42,
                           low_memory=False)
model = BERTopic(umap_model=umap_model, nr_topics='auto', min_topic_size=30, vectorizer_model=vectorizer,
               calculate_probabilities=True, verbose=True)
topics, probs = model.fit_transform(dataset)

The dataset consists of 3 sentences:

"This is such a great product!"
"I love Leeroy!"
"Not good enough!"
"This is such a great product!"
"I love Leeroy!"
"Not good enough!"
.
.
.

Each sentence repeats 67 times (total 201 sentences).
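A minimal way to build that toy dataset (names are illustrative) is:

sentences = ["This is such a great product!", "I love Leeroy!", "Not good enough!"]
dataset = sentences * 67  # 201 documents, cycling through the 3 sentences
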
Results on my machine are 3 clusters ["love", "good", "product"] (top word for each cluster).
Results on the remote machine have 4 clusters ["love", "good", "product", "good"], where sentences of the 3rd type up to line 100 are assigned cluster #1, and sentences after line 100 are assigned cluster #3.
If there is any more useful info I could add please let me know, I really hope this can be solved!
Thanks!

MaartenGr (Owner) commented

Reading through the issues you refer to, it indeed seems that some changes may happen across different OSes. Unfortunately, I also do not have a solution for this other than using a containerization tool like Docker.

Yes, the environment is exactly the same: a clean installation from a requirements file, the same Python version, etc.

Do note that the same requirements file may lead to different versions of the packages if the OS differs, as wheels can be built differently depending on the OS that you use.
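
One quick way to check this (a small Python snippet, independent of BERTopic) is to print the resolved versions on both machines and compare the output:

import importlib.metadata as md

# Distribution names as published on PyPI; run this on both machines and diff the output.
for pkg in ["bertopic", "umap-learn", "hdbscan", "numpy", "scikit-learn", "numba"]:
    print(pkg, md.version(pkg))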

MaartenGr (Owner) commented

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

samarthsarin commented

Hey everyone!
Any update on this issue of different results on different OSes? I am also seeing different results between Windows and Linux.


pkstys commented Feb 21, 2023

I have run into the same issue on the same machine (an iMac running Mojave), running the same dataset at different times:
Python version: 3.9.12 (v3.9.12:b28265d7e6, Mar 23 2022, 18:17:11)
numpy version: 1.23.5
scikit-learn version: 1.2.0
numba version: 0.56.4
umap version: 0.5.3

This is even when using random_state=np.random.RandomState(42), as has been suggested elsewhere. Attached is an example output using the same input data run at two different times. This is a shame because in my limited testing UMAP outperforms t-SNE, but if we can't get the same results from session to session it limits its usefulness.
Is there a solution?
Peter.
UMAP.pdf

bhavaygg commented

Facing the same issue


sauravwel commented Nov 5, 2023

Hi @MaartenGr, I am using BERTopic version 0.15.0 and fixing the seed of UMAP too. But each time I run it I get a different topic model output, with +/- 10 extra topics compared to the previous run. I am attaching the code that I am using. Would you mind helping me solve the issue?

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddingsS = embedding_model.encode(dfA1['SENTENCE'], show_progress_bar=True)

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine',
                  random_state=np.random.RandomState(42))
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
hdbscan_model = HDBSCAN(min_cluster_size=40, metric='manhattan', prediction_data=True,
                        cluster_selection_method='leaf')

ctfidf_model = ClassTfidfTransformer()

representation_model = KeyBERTInspired()

topic_modelA = BERTopic(
    embedding_model=embedding_model,            # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model,  # Step 6 - (Optional) Fine-tune topic representations
    # min_topic_size=40,  # Reduce certain threshold as outlier increased previously
    calculate_probabilities=True,
    verbose=True,
)

topicsA, probsA = topic_modelA.fit_transform(dfA1['SENTENCE'])

MaartenGr (Owner) commented

@sauravwel If you are using the exact same environment/machine/OS between two runs, then setting the random state should fix the output. More specifically, I would advise simply setting an integer for random_state and not np.random since I am not entirely sure how that would be handled.
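
For example, something like this (same parameters as your snippet, only the seed changed):

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine',
                  random_state=42)  # plain integer seed instead of np.random.RandomState(42)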

sauravwel commented

Unfortunately, it is not, @MaartenGr. Also, I already tried an integer for random_state, but I faced the same issue.

Using UMAP in BERTopic gives me the best model for my use case. But it has a reproducibility issue, so I am thinking of dropping the idea and moving to more conservative models. Any last thoughts on how we should fix it? I guess it's a major issue for many who commented above, as we are not able to replicate the same results between development and production environments even when everything else remains the same.

MaartenGr (Owner) commented

@sauravwel Sorry to hear that the environments are not the same. I believe due to the stochastic nature of UMAP, the random initialization might be different between OS since they handle that differently. I might be mistaken though.

Just to be sure I understand your use case, why would you re-train the exact same model between development and production if you can simply save and load the model? If the model you get in development is exactly what you are looking for, you can just save that model and load it in your production environment.

In contrast, if the parameters you get in development work for your use case and you have different data in production, then differences would already be expected so I guess that would not be the case.
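
For the save-and-load route, a minimal sketch (the path and new_docs are illustrative):

# In development, after fitting:
topic_model.save("my_topic_model")

# In production:
from bertopic import BERTopic
loaded_model = BERTopic.load("my_topic_model")
new_topics, new_probs = loaded_model.transform(new_docs)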

sauravwel commented

Sorry, you misunderstood: in my case the environment is exactly the same. As for why I am re-training the model, it is because, in the banking world, a model validator tries to replicate the model while keeping the environment the same... It has something to do with model regulation.

@MaartenGr, I was under the assumption that the randomness is due to some stochastic behavior that remains even after fixing the random state of UMAP. But there is something else: when I change the dimensionality reduction model from UMAP to PCA (dim_model = PCA(n_components=0.7, random_state=24)), keeping the rest of the code above the same, the output changes again! Is there any possibility of randomness coming from the HDBSCAN model? I checked the documentation, but it is not mentioned. What are your thoughts?

MaartenGr (Owner) commented

@sauravwel Hmmm, it might then be related to HDBSCAN. I do not believe there should be any randomness there, but perhaps you have stumbled upon an edge case. It might make sense to do the following:

  • Keep using PCA for demonstration purposes
  • Remove the hdbscan_model and uncomment min_topic_size

That way, you can test out whether the changes in output are a result of HDBSCAN.
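
In code, that test could look roughly like this (reusing the objects from your snippet above, with the custom HDBSCAN model removed and min_topic_size re-enabled so BERTopic creates its default clustering model):

from sklearn.decomposition import PCA

dim_model = PCA(n_components=0.7, random_state=24)  # deterministic reduction for the test

topic_modelA = BERTopic(
    embedding_model=embedding_model,
    umap_model=dim_model,                  # PCA instead of UMAP
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
    min_topic_size=40,                     # replaces the custom hdbscan_model
    calculate_probabilities=True,
    verbose=True,
)

topicsA, probsA = topic_modelA.fit_transform(dfA1['SENTENCE'])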

The thing is, and I have tested this quite often, if you set umap_model with a random_state in the same environment (same package versions, OS, and dependency versions), then it should definitely give you the same output. So either the environments are not the same (which you mentioned they are) or indeed another model is giving issues.
