-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Utility function to check if numbers in one string exists in another #260
Utility function to check if numbers in one string exists in another #260
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, please check the comments below.
src/tests/unit_tests/test_util.py
Outdated
result = "Labore deserunt 12.45 $45,000 sit velit nulla. Sint ipsum reprehenderit sint cupidatat amet est id anim exercitation fugiat adipisicing elit. Id est dolore minim magna occaecat aute. Est dolore culpa laborum non esse nostrud." | ||
check_result = check_if_number_exist(source ,result , 'jack.com') | ||
assert check_result['number_exisit'] == False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides asserting the existence, we also want to check the numbers extracted are correct...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a separate test specifically for that.
src/sherpa_ai/utils.py
Outdated
if data not in source_numbers: | ||
message.append(f"{data} is not mentioned in the {source_link}. ") | ||
if len(message)>0: | ||
return {"number_exisit": False , "messages":message} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo number_exisit => number_exists
@@ -300,3 +304,19 @@ def check_url(url): | |||
else: | |||
return True | |||
|
|||
def extract_numbers_from_text(text): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add unit tests to assure yourself this regex pattern works against the types of numbers you are expecting to match
pattern = r"\d+\.\d+|\d+\,\d+|\d+" | ||
matches = re.findall(pattern, text) | ||
|
||
return matches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add newline after return matches
and delete newline before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some more comments mostly for the new tests. Please take a look.
numbers_in_source_data = ['12.45','9', '45,000'] | ||
print(extracted_number) | ||
if len(numbers_in_source_data) != len(extracted_number): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just use assert len(numbers_in_source_data) == len(extracted_number)
without using if-else
. It will fail right away if the length does not match
assert False , "failed to extract a number" | ||
else: | ||
for number in extracted_number: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To compare the two lists with potential duplicated elements. An easy way is to use Counter
. You just need
from collections import Counter
assert Counter(extracted_number) == Counter(numbers_in_source_data)
extracted_number = extract_numbers_from_text(source_data) | ||
numbers_in_source_data = ['12.45','9', '45,000'] | ||
print(extracted_number) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove these print statements. If the assertions are done correctly, the content of these lists will be printout due to failed assertions
|
||
def test_extract_numbers_from_text(source_data): | ||
extracted_number = extract_numbers_from_text(source_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extracted_numbers
def test_extract_numbers_from_text_pass(source_data, correct_result_data): | ||
check_result = check_if_number_exist(source_data ,correct_result_data) | ||
assert check_result['number_exists'] == True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert check_result['number_exists']
def test_extract_numbers_from_text_fails(source_data, incorrect_result_data): | ||
check_result = check_if_number_exist(incorrect_result_data , source_data) | ||
assert check_result['number_exists'] == False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert not check_result['number_exists']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new line for the file. Maybe we can take a look together to set up a formatter in the IDE?
message += numbers + ", " | ||
message += f"not mentioned in the {source_link}. \n" | ||
return {"number_exists": False , "messages":message} | ||
return {"number_exists": True , "messages":message} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a new line at the end
Number exisitance checking utility
This pull request introduces a utility function designed to determine whether the "source" set of text data contains the numbers referenced in the "result" set.
This utility function will serve as a temporary solution for verifying whether the numerical results from the language model originated from scraped data rather than being hallucinated.