Matching for process.extract and fuzz.ratio produce different results #129

gunthercox · 2016-08-13T12:47:43Z

Hi, I have a test case that I would like to ask about. It appears that two versions of what I believe to be logically equivalent code are returning different results.

First, this is the test data that I am using for both cases. I have a unicode string, string, and a list of unicode strings, collection. Note that string is equal to collection[1].

string = u'⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄'

collection = [
    u'¶ ∑ ∞ ∫ π ∈ ℝ² ∖ ⩆ ⩇ ⩈ ⩉ ⩊ ⩋ ⪽ ⪾ ⪿ ⫀ ⫁ ⫂ ⋒ ⋓',
    u'⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄',
    u'∠ ∡ ⦛ ⦞ ⦟ ⦢ ⦣ ⦤ ⦥ ⦦ ⦧ ⦨ ⦩ ⦪ ⦫ ⦬ ⦭ ⦮ ⦯ ⦓ ⦔ ⦕ ⦖ ⟀',
    u'∫ ∬ ∭ ∮ ∯ ∰ ∱ ∲ ∳ ⨋ ⨌ ⨍ ⨎ ⨏ ⨐ ⨑ ⨒ ⨓ ⨔ ⨕ ⨖ ⨗ ⨘ ⨙ ⨚ ⨛ ⨜',
    u'≁ ≂ ≃ ≄ ⋍ ≅ ≆ ≇ ≈ ≉ ≊ ≋ ≌ ⩯ ⩰ ⫏ ⫐ ⫑ ⫒ ⫓ ⫔ ⫕ ⫖',
    u'¬ ⫬ ⫭ ⊨ ⊭ ∀ ∁ ∃ ∄ ∴ ∵ ⊦ ⊬ ⊧ ⊩ ⊮ ⊫ ⊯ ⊪ ⊰ ⊱ ⫗ ⫘',
    u'∧ ∨ ⊻ ⊼ ⊽ ⋎ ⋏ ⟑ ⟇ ⩑ ⩒ ⩓ ⩔ ⩕ ⩖ ⩗ ⩘ ⩙ ⩚ ⩛ ⩜ ⩝ ⩞ ⩟ ⩠ ⩢',
]

For my first test, I will use the extractOne method to try to find the closest match in the list.

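# -*- coding: utf-8 -*-
from __future__ import print_function  # print() behaves the same on Python 2.7 and 3.x

from fuzzywuzzy import process

# Find the closest match using extractOne's default scorer and processor.
closest_match, ratio = process.extractOne(
    string,
    collection
)

print(closest_match, ratio)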

This returns the string from collection[0] with a 0 as the matching ratio. This does not appear to be correct because collection[1] is an exact match to the input.

>>> ¶ ∑ ∞ ∫ π ∈ ℝ² ∖ ⩆ ⩇ ⩈ ⩉ ⩊ ⩋ ⪽ ⪾ ⪿ ⫀ ⫁ ⫂ ⋒ ⋓ 0

For my second test, I use fuzz and a for loop to find the closest match.

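# -*- coding: utf-8 -*-
from __future__ import print_function  # print() behaves the same on Python 2.7 and 3.x

from fuzzywuzzy import fuzz

max_ratio = -1
closest_match = None

# Score every item directly with fuzz.ratio and keep the highest-scoring one.
for item in collection:
    ratio = fuzz.ratio(string, item)

    if ratio > max_ratio:
        max_ratio = ratio
        closest_match = item

print(closest_match, max_ratio)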

This returns the correct matching string and a match ratio of 100. This seems like the correct behavior.

>>> ⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄ 100

So, am I misunderstanding the purpose of the process.extract methods, or is there an intentional design difference between how these two methods select a result?

Notes:

This behavior occurs consistently in Python 2.7, 3.4, and 3.5.
I believe this may be related to "extractOne is not giving the best result" #122.


DavidCEllis · 2016-10-12T17:13:13Z

The issue is that process.extract uses fuzz.WRatio as its scorer by default. WRatio in turn runs utils.full_process on both the query and the items in collection, which turns your unicode string into an empty string. By design, comparisons involving an empty string return 0.

In[2]: from fuzzywuzzy import process, fuzz, utils
In[3]: utils.full_process(u'⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄')
Out[3]: 
''
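
To see why the string collapses, here is a minimal sketch of what full_process appears to do, assuming it replaces every non-alphanumeric character with whitespace, then lowercases and strips the result. approx_full_process is a hypothetical stand-in written for illustration, not the library's own code.

```python
# -*- coding: utf-8 -*-
import re


def approx_full_process(s):
    """Hypothetical approximation of utils.full_process (an assumption,
    not fuzzywuzzy's actual implementation)."""
    # Replace every character that is not a letter, digit or underscore
    # with a space. Mathematical operators such as u'⊂' are symbols,
    # so each of them becomes a space.
    s = re.sub(r"\W", " ", s, flags=re.UNICODE)
    # Lowercase, then trim the leading and trailing whitespace.
    return s.lower().strip()


query = u'⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄'
first_choice = u'¶ ∑ ∞ ∫ π ∈ ℝ² ∖ ⩆ ⩇ ⩈ ⩉ ⩊ ⩋ ⪽ ⪾ ⪿ ⫀ ⫁ ⫂ ⋒ ⋓'

# Only symbols and spaces: nothing is left after stripping.
print(repr(approx_full_process(query)))

# Contains at least one letter (the Greek letter pi), so it is not empty.
print(repr(approx_full_process(first_choice)))
```

Under this assumption, any string made up entirely of symbols and spaces is reduced to an empty string, while a string containing even one letter keeps something to compare against.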

Interestingly, the first item in your list is the only item that full_process does not turn into an empty string. This leads to an interesting issue if you use a different scorer.

In[15]: process.extract(string, collection, scorer=fuzz.ratio, limit=10)
Out[15]: 

[('¶ ∑ ∞ ∫ π ∈ ℝ² ∖ ⩆ ⩇ ⩈ ⩉ ⩊ ⩋ ⪽ ⪾ ⪿ ⫀ ⫁ ⫂ ⋒ ⋓', 12),
 ('⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄', 0),
 ('∠ ∡ ⦛ ⦞ ⦟ ⦢ ⦣ ⦤ ⦥ ⦦ ⦧ ⦨ ⦩ ⦪ ⦫ ⦬ ⦭ ⦮ ⦯ ⦓ ⦔ ⦕ ⦖ ⟀', 0),
 ('∫ ∬ ∭ ∮ ∯ ∰ ∱ ∲ ∳ ⨋ ⨌ ⨍ ⨎ ⨏ ⨐ ⨑ ⨒ ⨓ ⨔ ⨕ ⨖ ⨗ ⨘ ⨙ ⨚ ⨛ ⨜', 0),
 ('≁ ≂ ≃ ≄ ⋍ ≅ ≆ ≇ ≈ ≉ ≊ ≋ ≌ ⩯ ⩰ ⫏ ⫐ ⫑ ⫒ ⫓ ⫔ ⫕ ⫖', 0),
 ('¬ ⫬ ⫭ ⊨ ⊭ ∀ ∁ ∃ ∄ ∴ ∵ ⊦ ⊬ ⊧ ⊩ ⊮ ⊫ ⊯ ⊪ ⊰ ⊱ ⫗ ⫘', 0),
 ('∧ ∨ ⊻ ⊼ ⊽ ⋎ ⋏ ⟑ ⟇ ⩑ ⩒ ⩓ ⩔ ⩕ ⩖ ⩗ ⩘ ⩙ ⩚ ⩛ ⩜ ⩝ ⩞ ⩟ ⩠ ⩢', 0)]

This occurs because process.extract still runs full_process on the 'choices' but does not run it on the 'query'. I think this is a bug and will submit it shortly.

If you do want to use process.extract, the way to do it is to bypass full_process by passing your own processor. Luckily, this is possible.

In[18]: process.extract(string, collection, scorer=fuzz.ratio, processor=lambda x: x, limit=10)
Out[18]: 

[('⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄', 100),
 ('¶ ∑ ∞ ∫ π ∈ ℝ² ∖ ⩆ ⩇ ⩈ ⩉ ⩊ ⩋ ⪽ ⪾ ⪿ ⫀ ⫁ ⫂ ⋒ ⋓', 48),
 ('≁ ≂ ≃ ≄ ⋍ ≅ ≆ ≇ ≈ ≉ ≊ ≋ ≌ ⩯ ⩰ ⫏ ⫐ ⫑ ⫒ ⫓ ⫔ ⫕ ⫖', 48),
 ('¬ ⫬ ⫭ ⊨ ⊭ ∀ ∁ ∃ ∄ ∴ ∵ ⊦ ⊬ ⊧ ⊩ ⊮ ⊫ ⊯ ⊪ ⊰ ⊱ ⫗ ⫘', 48),
 ('∠ ∡ ⦛ ⦞ ⦟ ⦢ ⦣ ⦤ ⦥ ⦦ ⦧ ⦨ ⦩ ⦪ ⦫ ⦬ ⦭ ⦮ ⦯ ⦓ ⦔ ⦕ ⦖ ⟀', 47),
 ('∧ ∨ ⊻ ⊼ ⊽ ⋎ ⋏ ⟑ ⟇ ⩑ ⩒ ⩓ ⩔ ⩕ ⩖ ⩗ ⩘ ⩙ ⩚ ⩛ ⩜ ⩝ ⩞ ⩟ ⩠ ⩢', 45),
 ('∫ ∬ ∭ ∮ ∯ ∰ ∱ ∲ ∳ ⨋ ⨌ ⨍ ⨎ ⨏ ⨐ ⨑ ⨒ ⨓ ⨔ ⨕ ⨖ ⨗ ⨘ ⨙ ⨚ ⨛ ⨜', 44)]

This correctly gives your full string as the 100% match.
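
The asymmetry described above can be reproduced without fuzzywuzzy installed. This is a rough sketch under stated assumptions: approx_full_process, approx_ratio and rank are hypothetical names invented for illustration, difflib.SequenceMatcher stands in for fuzz.ratio, and the processor is applied to the choices only, as described above.

```python
# -*- coding: utf-8 -*-
import re
from difflib import SequenceMatcher


def approx_full_process(s):
    # Assumption: non-alphanumerics become spaces, then lowercase and strip.
    return re.sub(r"\W", " ", s, flags=re.UNICODE).lower().strip()


def approx_ratio(a, b):
    # Stand-in for fuzz.ratio (not the library's code). An empty string on
    # either side scores 0, mirroring the "by design" behaviour noted above.
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def rank(query, choices, processor):
    # The processor is applied to each choice but NOT to the query.
    scored = [(c, approx_ratio(query, processor(c))) for c in choices]
    # sorted() is stable, so tied scores keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = u'⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄'
choices = [
    u'¶ ∑ ∞ ∫ π ∈ ℝ² ∖ ⩆ ⩇ ⩈ ⩉ ⩊ ⩋ ⪽ ⪾ ⪿ ⫀ ⫁ ⫂ ⋒ ⋓',
    u'⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋ ⊄ ⊅ ⫅ ⫆ ⫋ ⫌ ⫃ ⫄ ⫇ ⫈ ⫉ ⫊ ⟃ ⟄',
    u'∠ ∡ ⦛ ⦞ ⦟ ⦢ ⦣ ⦤ ⦥ ⦦ ⦧ ⦨ ⦩ ⦪ ⦫ ⦬ ⦭ ⦮ ⦯ ⦓ ⦔ ⦕ ⦖ ⟀',
]

# Choices are processed, the query is not: the exact match scores 0.
default_like = rank(query, choices, approx_full_process)

# Identity processor: choices are compared as-is, the exact match wins.
identity = rank(query, choices, lambda x: x)

print(default_like[0])
print(identity[0])
```

With the processing step in place, the exact match is compared as an empty string and so can never win. With the identity processor, it is compared unchanged and scores 100.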

gunthercox mentioned this issue Aug 13, 2016

Correct unicode equality issue in python 2.7 gunthercox/ChatterBot#222

Merged

This was referenced Oct 29, 2016

Strange results that depends on sort and case #141

Open

Clarify default behaviour of extract / Add tests for matching strings #142

Merged

gunthercox closed this as completed Sep 21, 2018
