Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity validation #298

Closed

Conversation

Eyobyb
Copy link
Collaborator

@Eyobyb Eyobyb commented Feb 19, 2024

Your checklist for this pull request

Thank you for submitting a pull request! To speed up the review process, please follow this checklist:

  • My Pull Request is small and focused on one topic so it can be reviewed easily
  • My code follows the style guidelines of this project (make format)
  • Commit messages are detailed
  • I have performed a self-review of my code
  • I commented hard-to-understand parts of my code
  • I updated the documentation (docstrings, /docs)
  • My changes generate no new warnings (or explain any new warnings and why they're ok)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass when I run pytest tests (offline mode)

Additional steps for code with networking dependencies:

This is entity validation, designed to check if entities from the context exist in the answer. If they do not, the system will reprompt to add those entities. There are three stages of entity validation: first, string comparison from the list of entities extracted using spaCy; second, using metrics; and third, utilizing language model (LLM)

pervious pr #295
issue #274

@Eyobyb Eyobyb requested review from oshoma and 20001LastOrder and removed request for oshoma February 19, 2024 02:44
@Eyobyb Eyobyb changed the title entity validation = Entity validation Feb 19, 2024
Copy link
Collaborator

@oshoma oshoma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Eyobyb and @20001LastOrder please coordinate with each other on this Entity Validation PR (#298) and the Hydra Config PR #289.

PR #289 contains changes to number validation and configuration settings like TEMPERATURE. Those changes need to be be picked up and carried forward in this PR as well.

There are two options:

  1. Merge Entity Validation Entity validation #298 into main first, then rebase Hydra Config Entity validation #298 on main (or on Entity validation #298).
  2. Merge Hydra Config into main first, then rebase Entity Validation on main (or on Hydra Config).

return {"entity_exist": True, "messages": message}


def check_entities_match(result: str, source: str, stage: int):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #37dbe54687ca60b1b6c505f594513ceace244927 in the Hydra Config PR for a reworked version of number verification. The return values are simplified, and sets are used to implement the core logic. Please review that pattern and apply it to this function so that the two are consistent.

Copy link
Collaborator Author

@Eyobyb Eyobyb Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we talked about it and decided to move this to the Hydra Config after merging it with the main branch.

result = text_similarity_by_metrics(check_entity, source_entity)
assert result["entity_exist"] is False
expected_message = "remember to address these entities pear, grape, kiwi, in final the answer."
print(result["messages"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove all the print() statements throughout this PR

result = extract_entities(text)
assert result == ["The United Nations", "NORP"]

def test_extract_entities_without_entities():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also test extract_entities with None and a blank string ""

return filtered_entities


def json_extractor(text):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_extractor => json_from_text

)

instruction = f"""
I have a question and an answer. I want you to confirm if the entities from the question has been in some form inside the answer.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question and an answer. I want you to confirm whether the entities from the question are mentioned in some form within the answer.

error_entity.append(entity)
if len(error_entity) > 0:
for entity in error_entity:
message += entity + ", "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message += ", ".join(error_entity)

"""

source_entity = extract_entities(source)
check_entity = extract_entities(result)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rework this stage variable. You are implementing a state machine here. So give the states meaningful names, e.g. TEXT_SIMILARITY_BASIC = 0, TEXT_SIMILARITY_BY_METRICS = 1, TEXT_SIMILARITY_BY_LLM = 2. Something like that.

If function callers must know what these states mean then define them at the top of this file so they are obvious.

Alternatively, is there a way to hide all of this from function callers? It seems like implementation detail to me, but I haven't been part of the design conversation.

from sherpa_ai.utils import extract_entities
from nltk.metrics import jaccard_distance

@pytest.mark.parametrize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooray for parametrize! 🥇

result_entities = extract_entities(results[0].content)
for entitity in expected_entities:
set_a = set(entitity.split()) # Convert each string in a to a set of words
match_found = any( 1 - jaccard_distance(set_a, result_entity.split()) >= 0.7 for result_entity in result_entities)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you are using 0.7, and elsewhere in the code there is a 0.75 value. are they meant to be the same? if so, define a constant and use it (import it) in the test.

@@ -189,3 +194,95 @@ def test_extract_numbers_from_text_fails(source_data, incorrect_result_data):
# test aganist a text which don't have the same numers as the source
check_result = check_if_number_exist(incorrect_result_data, source_data)
assert not check_result["number_exists"]

def test_json_extractor_valid_json():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice tests.
these tests could also be made more concise with parametrize.

remember to check the failure cases and corner cases like None and ""

@oshoma
Copy link
Collaborator

oshoma commented Feb 22, 2024

@Eyobyb and @20001LastOrder please coordinate with each other on this Entity Validation PR (#298) and the Hydra Config PR #289. PR #289 contains changes to number validation and configuration settings like TEMPERATURE. Those changes need to be be picked up and carried forward in this PR as well.

There are two options:

  1. Merge Entity Validation Entity validation #298 into main first, then rebase Hydra Config Entity validation #298 on main (or on Entity validation #298).
  2. Merge Hydra Config into main first, then rebase Entity Validation on main (or on Hydra Config).

Noting that Hydra Config PR #289 is now merged into main. @Eyobyb please rebase your Entity Validation work on main.

@Eyobyb Eyobyb mentioned this pull request Feb 29, 2024
11 tasks
@Eyobyb Eyobyb closed this Feb 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants