Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utility function to check if numbers in one string exists in another #260

Conversation

Eyobyb
Copy link
Collaborator

@Eyobyb Eyobyb commented Dec 11, 2023

Number exisitance checking utility

This pull request introduces a utility function designed to determine whether the "source" set of text data contains the numbers referenced in the "result" set.

This utility function will serve as a temporary solution for verifying whether the numerical results from the language model originated from scraped data rather than being hallucinated.

Copy link
Collaborator

@20001LastOrder 20001LastOrder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, please check the comments below.

src/sherpa_ai/utils.py Outdated Show resolved Hide resolved
result = "Labore deserunt 12.45 $45,000 sit velit nulla. Sint ipsum reprehenderit sint cupidatat amet est id anim exercitation fugiat adipisicing elit. Id est dolore minim magna occaecat aute. Est dolore culpa laborum non esse nostrud."
check_result = check_if_number_exist(source ,result , 'jack.com')
assert check_result['number_exisit'] == False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides asserting the existence, we also want to check the numbers extracted are correct...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a separate test specifically for that.

if data not in source_numbers:
message.append(f"{data} is not mentioned in the {source_link}. ")
if len(message)>0:
return {"number_exisit": False , "messages":message}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo number_exisit => number_exists

src/sherpa_ai/utils.py Outdated Show resolved Hide resolved
@@ -300,3 +304,19 @@ def check_url(url):
else:
return True

def extract_numbers_from_text(text):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit tests to assure yourself this regex pattern works against the types of numbers you are expecting to match

pattern = r"\d+\.\d+|\d+\,\d+|\d+"
matches = re.findall(pattern, text)

return matches
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add newline after return matches and delete newline before

Copy link
Collaborator

@20001LastOrder 20001LastOrder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some more comments mostly for the new tests. Please take a look.

numbers_in_source_data = ['12.45','9', '45,000']
print(extracted_number)
if len(numbers_in_source_data) != len(extracted_number):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use assert len(numbers_in_source_data) == len(extracted_number) without using if-else. It will fail right away if the length does not match

assert False , "failed to extract a number"
else:
for number in extracted_number:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To compare the two lists with potential duplicated elements. An easy way is to use Counter. You just need

from collections import Counter
assert Counter(extracted_number) == Counter(numbers_in_source_data)

extracted_number = extract_numbers_from_text(source_data)
numbers_in_source_data = ['12.45','9', '45,000']
print(extracted_number)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these print statements. If the assertions are done correctly, the content of these lists will be printout due to failed assertions


def test_extract_numbers_from_text(source_data):
extracted_number = extract_numbers_from_text(source_data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted_numbers

def test_extract_numbers_from_text_pass(source_data, correct_result_data):
check_result = check_if_number_exist(source_data ,correct_result_data)
assert check_result['number_exists'] == True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert check_result['number_exists']

def test_extract_numbers_from_text_fails(source_data, incorrect_result_data):
check_result = check_if_number_exist(incorrect_result_data , source_data)
assert check_result['number_exists'] == False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert not check_result['number_exists']

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a new line for the file. Maybe we can take a look together to set up a formatter in the IDE?

message += numbers + ", "
message += f"not mentioned in the {source_link}. \n"
return {"number_exists": False , "messages":message}
return {"number_exists": True , "messages":message}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a new line at the end

@Eyobyb Eyobyb closed this Jan 11, 2024
@Eyobyb Eyobyb mentioned this pull request Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme-qa Quality assurance work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants