Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix hierarchy viz and handle any form of distance matrix #1173

Merged
merged 3 commits into from
Apr 15, 2023

Conversation

elashrry
Copy link
Contributor

@elashrry elashrry commented Apr 8, 2023

This fixes #923, fixes #1063 , and fixes #1021 which are just duplicates of the first one.

I added a utility function to make sure the distance matrix computed by a given distance_function is a condensed 1-D array. If not, the function convert it and return the result. It can be used directly as I did in _bertopic.pyL877:

X = validate_distance_matrix(X, embeddings.shape[0])

or to wrap around a function as I did in _hierarchy.py#L132-L133:

distance_function_viz = lambda x: validate_distance_matrix(
        distance_function(x), embeddings.shape[0])

I also added this information in the docstring to let the user know what is possible. I didn't find things to changes in the tutorials.

I added a testing notebook to test the code provided in the mentioned issues. (to be deleted after reviewing the PR)

Let me know if you have any comments!

Copy link
Owner

@MaartenGr MaartenGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and the extensive docs, greatly appreciated! A few minor things to change before it will be merged (after CI has passed).

bertopic/_utils.py Outdated Show resolved Hide resolved
bertopic/_utils.py Outdated Show resolved Hide resolved
bertopic/_utils.py Outdated Show resolved Hide resolved
bertopic/plotting/_hierarchy.py Outdated Show resolved Hide resolved
Comment on lines +142 to +144
# Make sure its entries are non-negative
if np.any(X < 0):
raise ValueError("Distance matrix cannot contain negative values.")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether this is necessary as this issue is handled and raised within Scipy already. Perhaps remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the squareform from SciPy doesn't raise error if the entries are non-negative.

X = np.array([1, -2, 3,])
squareform(X)

Output:

array([[ 0,  1, -2],
       [ 1,  0,  3],
       [-2,  3,  0]])

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true but the linkage function will give the error but it's alright to keep it here.

@elashrry
Copy link
Contributor Author

I just pushed the changes you requested. I don't know why the test is failing. Let me know if I can help!

@MaartenGr
Copy link
Owner

The workflow is failing due to an issue with the testing pipeline. It automatically pulls pandas 2.0 which has a slight change that breaks some specific tests, unfortunately. Having said that, if it runs successfully locally for you, then I think it would be alright to merge it.

@elashrry
Copy link
Contributor Author

elashrry commented Apr 15, 2023

Great, it runs successfully on my computer.

@MaartenGr MaartenGr merged commit e6bb030 into MaartenGr:master Apr 15, 2023
@MaartenGr
Copy link
Owner

Awesome, thanks for the work on this PR! It is highly appreciated.

@elashrry
Copy link
Contributor Author

You are welcome! and thank you for this nice library!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants