Adding similarity column in the group_similar_strings output #14
Comments
The deduplicated name can be seen as a kind of "group identifier": all strings that have the same deduplicated_name belong to the same group. All strings in that group are similar to each other, and the group identifier is just a "random" string within the group of similar strings. So it's not necessarily clear which similarity to pick.
For example, suppose you have 3 similar strings with the following similarities: string_a - string_b - 0.80 The deduplicated name will be "string_a", but for the entry "string_c", for example, do you pick 0.99? That means the low similarity of 0.75 will be lost. It is also possible to have another string (string_d) with similarity 0.99 to string_c, but 0.74 to string_a. If your cutoff value is 0.75, there will be no recorded similarity between string_a and string_d, but string_d will still be in the same group.
Another possibility is to show, for each entry, the lowest similarity it has with any string in the group. I think this might give a better indication of how similar a string is to the rest of its group. I think this is possible to do with some hacking.
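The "lowest similarity per entry" idea can be sketched in plain Python. This is only an illustration, not the library's implementation: the function name `group_with_min_similarity` is hypothetical, and the pair values other than 0.80, 0.99 (string_c to string_d) and 0.74 are assumed for the sake of the example. Only pairs at or above the cutoff are kept, so the "lowest" value is the lowest among the retained matches.

```python
CUTOFF = 0.75

# Hypothetical pairwise cosine similarities (illustrative values only).
pairs = {
    ("string_a", "string_b"): 0.80,
    ("string_a", "string_c"): 0.75,  # assumed value
    ("string_b", "string_c"): 0.99,  # assumed value
    ("string_c", "string_d"): 0.99,
    ("string_a", "string_d"): 0.74,  # below the cutoff, so it is dropped
}


def group_with_min_similarity(pairs, cutoff):
    """Return {name: (group_identifier, lowest_retained_similarity)}."""
    kept = {p: s for p, s in pairs.items() if s >= cutoff}

    # Union-find: strings connected by a retained pair share a group.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in kept:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Use the alphabetically first root as the group identifier
            # so the result is deterministic.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    # Lowest retained similarity each string has with any other string.
    lowest = {}
    for (a, b), sim in kept.items():
        for name in (a, b):
            lowest[name] = min(lowest.get(name, 1.0), sim)

    return {name: (find(name), lowest[name]) for name in sorted(lowest)}


result = group_with_min_similarity(pairs, CUTOFF)
for name, (identifier, sim) in result.items():
    print(name, identifier, sim)
```

Note that string_d ends up in the group of string_a only through string_c, and its lowest retained similarity is 0.99 because the 0.74 pair never passes the cutoff.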
Hi, thank you for your prompt reply and your detailed explanation; I appreciate it. Yes, I clearly understand the tricky part regarding which random group identifier to pick, and also the fact that the low similarities will be lost. I had not thought about this before. In your example, if string_a is selected as the group identifier, I would therefore pick 0.99 for the similarity of string_c, but then lose the other similarities related to string_c.
1. Thinking about it again, for simplicity, in my use case having the similarity value of each string within a group together with its group identifier would be great for now. How I see it:
2. Also, I like your approach of tracking and showing the lowest similarity of each string within the group. In that case, I cannot see it being part of the same single data frame returned by the group_similar_strings function, as in point 1 above. Is it going to be in a separate, second data frame? Also, if we consider the same example, would this data frame be something like a 3-dimensional data frame with 10 rows? Something like this?
Sorry, I really have no idea where to start, which is why I drew these tables to make it clear in my own mind as well. Could you please guide me on how I can hack the code to get these results? Thank you again for your time.
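Point 1 above (one similarity per string, measured against its group identifier) could look roughly like the following sketch. It assumes the groups and the retained pairwise similarities are already available as plain dictionaries; the names `similarity_to_identifier`, `group_of` and `retained_pairs` are hypothetical and not part of the library, and the values marked as assumed are illustrative.

```python
# Retained pairwise similarities (at or above the cutoff); illustrative values.
retained_pairs = {
    ("string_a", "string_b"): 0.80,
    ("string_a", "string_c"): 0.75,  # assumed value
    ("string_b", "string_c"): 0.99,  # assumed value
    ("string_c", "string_d"): 0.99,
}

# Hypothetical grouping result: every string maps to its deduplicated name.
group_of = {
    "string_a": "string_a",
    "string_b": "string_a",
    "string_c": "string_a",
    "string_d": "string_a",
}


def similarity_to_identifier(group_of, pairs):
    """Return rows of (name, similarity_to_identifier, deduplicated_name)."""
    # Make the lookup symmetric so the order within a pair does not matter.
    lookup = {}
    for (a, b), sim in pairs.items():
        lookup[(a, b)] = sim
        lookup[(b, a)] = sim

    rows = []
    for name, identifier in group_of.items():
        if name == identifier:
            sim = 1.0  # a string is identical to itself
        else:
            # None when the pair was never retained (below the cutoff).
            sim = lookup.get((name, identifier))
        rows.append((name, sim, identifier))
    return rows


rows = similarity_to_identifier(group_of, retained_pairs)
for row in rows:
    print(row)
```

The row for string_d carries no similarity at all, which is the case described earlier in the thread: it belongs to the group only through string_c, so there is no retained value against string_a to report.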
Hi,
Thank you for this amazing code; it has been working great so far in my use case.
How can I add the computed cosine similarity values to the output of the group_similar_strings function?
The output I am trying to produce is a pandas.Series containing the duplicated names together with their respective cosine similarity values relative to the deduplicated_name.
So it would be something like this:
Line Number | Company Name | Company CIK Key | Similarity | deduplicated_name
Could you please help?
Thank you.