
Fix hierarchical_topics(...) when the distances between three clusters are the same #1929

Merged
merged 9 commits
Jun 13, 2024
7 changes: 6 additions & 1 deletion bertopic/_bertopic.py
@@ -54,7 +54,7 @@
from bertopic.cluster._utils import hdbscan_delegator, is_supported_hdbscan
from bertopic._utils import (
MyLogger, check_documents_type, check_embeddings_shape,
check_is_fitted, validate_distance_matrix
check_is_fitted, validate_distance_matrix, get_unique_distances
)
import bertopic._save_utils as save_utils

@@ -979,6 +979,11 @@ def hierarchical_topics(self,
# Use the 1-D condensed distance matrix as an input instead of the raw distance matrix
Z = linkage_function(X)

# Ensure that the distances between clusters are unique; otherwise, flattening the hierarchy with
# `sch.fcluster(...)` would produce incorrect values for "Topics" for these clusters
if len(Z[:, 2]) != len(np.unique(Z[:, 2])):
Z[:, 2] = get_unique_distances(Z[:, 2])

# Calculate basic bag-of-words to be iteratively merged later
documents = pd.DataFrame({"Document": docs,
"ID": range(len(docs)),
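The guard above matters because `sch.fcluster` cuts the dendrogram at merge heights, and tied heights can make the cut assign inconsistent topic labels. A minimal standalone sketch (an illustration, not code from this PR) of how ties arise in a linkage matrix:

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Three equally spaced 1-D points: both single-linkage merges happen at
# height 1.0, so the distance column Z[:, 2] contains a duplicate.
X = np.array([[0.0], [1.0], [2.0]])
Z = sch.linkage(X, method="single")

# This is the same duplicate check the patch performs before perturbing.
has_ties = len(Z[:, 2]) != len(np.unique(Z[:, 2]))
print(has_ties)  # True
```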
44 changes: 44 additions & 0 deletions bertopic/_utils.py
@@ -1,6 +1,7 @@
import numpy as np
import pandas as pd
import logging
from typing import Union
from collections.abc import Iterable
from scipy.sparse import csr_matrix
from scipy.spatial.distance import squareform
@@ -147,3 +148,46 @@ def validate_distance_matrix(X, n_samples):
raise ValueError("Distance matrix cannot contain negative values.")

return X


def get_unique_distances(dists: np.ndarray, noise_max=1e-7) -> np.ndarray:
    """Check if consecutive elements in the distance array are the same. If so, add a small
    noise to one of them so that the array does not contain duplicates.

    Arguments:
        dists: distance array, sorted in increasing order.
        noise_max: the maximal magnitude of the noise to be added.

    Returns:
        Unique distances sorted in increasing order, preserving the original ordering.

    Raises:
        ValueError: If the distance array is not sorted in increasing order.
    """

    def get_next_diff_value(array: np.ndarray, ix: int) -> Union[float, None]:
        """Get the next value in `array` that differs from `array[ix]`."""
        for j in range(ix + 1, array.shape[0]):
            if array[j] != array[ix]:
                return array[j]
        return None

    if not np.all(np.diff(dists) >= 0):
        raise ValueError("The distances must be sorted in increasing order")

    dists_cp = dists.copy()

    for i in range(dists.shape[0] - 1):
        if dists[i] == dists[i + 1]:
            next_unique_dist = get_next_diff_value(dists, i)
            # If there is no different distance further in the array, `next_unique_dist` is set
            # to be slightly larger than the current (also the maximal) distance in the array.
            next_unique_dist = dists[i] + noise_max if next_unique_dist is None else next_unique_dist

            # When the added noise is smaller than `curr_max_noise`, the order is preserved.
            # `dists_cp` must be used since it contains the noise-added values.
            curr_max_noise = min(noise_max, next_unique_dist - dists_cp[i])
            dists_cp[i + 1] = np.random.uniform(
                low=dists_cp[i] + curr_max_noise / 2,
                high=dists_cp[i] + curr_max_noise
            )
    return dists_cp
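For intuition, the perturbation idea can be sketched in a standalone, simplified form (a hypothetical variant for illustration, not the PR's exact code): walk the sorted array and lift any element that does not strictly exceed its predecessor.

```python
import numpy as np

def make_distances_unique(dists: np.ndarray, noise_max: float = 1e-7) -> np.ndarray:
    # Hypothetical simplified variant of the idea above: compare each element
    # against its (possibly already perturbed) predecessor and, on a tie or
    # inversion, lift it by a small random amount so the result is strictly
    # increasing while staying within noise_max of the original value.
    out = dists.astype(float).copy()
    for i in range(1, len(out)):
        if out[i] <= out[i - 1]:
            out[i] = out[i - 1] + np.random.uniform(noise_max / 2, noise_max)
    return out

d = make_distances_unique(np.array([0.0, 0.0, 0.5, 1.0, 1.0]))
```

The result keeps the same length and order as the input, but every value is distinct.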
17 changes: 16 additions & 1 deletion tests/test_utils.py
@@ -1,7 +1,7 @@
import pytest
import logging
import numpy as np
from bertopic._utils import check_documents_type, check_embeddings_shape, MyLogger
from bertopic._utils import check_documents_type, check_embeddings_shape, MyLogger, get_unique_distances


def test_logger():
@@ -32,3 +32,18 @@ def test_check_embeddings_shape():
embeddings = np.array([[1, 2, 3],
[2, 3, 4]])
check_embeddings_shape(embeddings, docs)


def test_make_unique_distances():
    def check_dists(dists, noise_max: float):
        unique_dists = get_unique_distances(np.array(dists, dtype=float), noise_max=noise_max)
        assert len(unique_dists) == len(dists), "The number of elements must be the same"
        assert len(dists) == len(np.unique(unique_dists)), "The distances must be unique"

    check_dists([0, 0, 0.5, 0.75, 1, 1], noise_max=1e-7)
Owner:
Have you checked the actual values of the updated distance list? When I run it, I get the following updated values:

[0.00000000e+00 8.32483552e-08 5.00000000e-01 7.50000000e-01
 1.00000000e+00 2.00000008e+00]

The last value is twice as big, which should not happen. I have a feeling the code for get_unique_distances could be simplified a bit. What about simply doing something like this:

def get_unique_distances(dists):
    increment =  np.random.uniform(low=1e-5, high=1e-6)
    last_val = -float('inf')
    return [last_val := max(dist, last_val + increment) for dist in dists]

my_list = [0, 0, 0, 0.5, 0.75, 1, 1]
get_unique_distances(my_list)
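One detail in the snippet above: `np.random.uniform(low=1e-5, high=1e-6)` appears to have its bounds swapped (NumPy does not raise on `low > high`, it just samples from the reversed interval). A sketch of the same one-pass idea with the bounds in conventional order (an illustration, not the merged code):

```python
import numpy as np

def unique_monotone(dists, noise_max=1e-5):
    # One pass over the sorted distances: each output is either the input
    # value or the previous output plus a small random increment,
    # whichever is larger, so the result is strictly increasing.
    increment = np.random.uniform(low=1e-6, high=noise_max)
    last = -float("inf")
    out = []
    for d in dists:
        last = max(d, last + increment)
        out.append(last)
    return np.array(out)

res = unique_monotone([0, 0, 0, 0.5, 0.75, 1, 1])
```

Values that already strictly exceed their predecessor pass through unchanged; only ties get bumped.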

Contributor Author:

This is a nice simplification.

Are we ok with changing distances that do not have a duplicate?

E.g. check_dists([0, 0, 0, 0, 0, 0, 0, 1e-7], noise_max=1e-7) changes the last value; otherwise the distances would not be in increasing order.

I had a bug in the code (should assign and not add), that's why the last value was 2.00000008e+00.

Owner:

Are we ok with changing distances that do not have a duplicate?

Hmmm, my preference would indeed be to keep them as is as long as it requires no more than one or two lines of code. I would like to simplify this as much as possible.

Contributor Author:

I simplified the code. Please have a look and let me know if you have any ideas.

Owner:

Thanks for the changes! I just tested it a bunch of times and it all looks good to me. Thanks for simplifying the code. I'll re-run the workflow to check whether everything passes. If it does, I will go ahead and merge the PR.

Owner:

The tests failed, but I believe that's because you used list[float], which is not supported in Python 3.8. Removing that should make the tests pass, I think.

Contributor Author:

Ah, yeah, you are right! I just changed it! Thank you!


# test whether the distances remain sorted in ascending order when the noise is extremely high
check_dists([0, 0, 0, 0.5, 0.75, 1, 1], noise_max=20)

# test whether the distances are sorted in ascending order when the distances are all the same
check_dists([0, 0, 0, 0, 0, 0, 0], noise_max=1e-7)