Add tutorial for citation validation and fix a bug in it #371

Merged 4 commits on May 9, 2024

Conversation

20001LastOrder (Collaborator) commented May 9, 2024

Description

  • Add the tutorial for citation validation.
  • Remove a breaking configuration in the document search tutorial due to the change in BaseAction
  • Fix a bug related to citation validation not using the correct resources

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Maintenance
  • New release

Checklists

To speed up the review process, please follow these checklists:

Development

  • The Pull Request is small and focused on one topic
  • Lint rules pass locally (make format && make lint)
  • The code changed/added as part of this pull request has been covered with tests
  • All tests related to the changed code pass in development (make test)
  • The changes generate no new warnings (or explain any new warnings and why they're ok)
  • Commit messages are detailed
  • Changed code is self-explanatory and/or I added comments
  • I updated the documentation (docstrings, /docs)
    See the testing guidelines for help on tests, especially those involving web services.

Code review

  • This pull request has a descriptive title and information useful to a reviewer. There may be a screenshot or screencast attached.
  • I have performed a self-review of my code
  • Issue from task tracker has a link to this pull request

💔 Thank you for submitting a pull request!


The `DocumentSearch` action inherits from the `BaseAction` class, which has a method `add_resources` that can be used to add a citation to the response. The `add_resources` method takes a list of dictionaries; each dictionary should contain the following keys:

- `Document`: Content of the resource.
Contributor

Is this specifically the chunk that was placed in the context? If so, perhaps clarify that.

Collaborator Author

I added some description to it.


The above example shows how to add citations to the Google search action. However, sometimes we may also want to add citations to the responses from the document search action. In this case, we need to manually add the citation to the response.

The `DocumentSearch` action inherits from the `BaseAction` class, which has a method `add_resources` that can be used to add a citation to the response. The `add_resources` method takes a list of dictionaries; each dictionary should contain the following keys:
Contributor

Hmm, I feel like this should be "add_source", not "add_resource". Like, you're adding the "source" of the insight, right?

Collaborator Author

I think that, since the method takes both the source of the document and the document content, we should keep the name as add_resource.
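
For readers following the naming discussion, here is a minimal sketch of the kind of input `add_resources` is described as taking: a list of dictionaries that carry both the document content and where it came from. This does not use Sherpa's real `BaseAction`. `SketchAction` is a hypothetical stand-in, and the `Source` key name is an assumption, since the excerpt above only shows the `Document` key.

```python
from dataclasses import dataclass, field


@dataclass
class SketchAction:
    """Hypothetical stand-in for an action; NOT Sherpa's real BaseAction."""

    resources: list = field(default_factory=list)

    def add_resources(self, resources):
        """Store each resource after checking it carries the expected keys."""
        for resource in resources:
            # "Document" appears in the tutorial excerpt; "Source" is an assumed key name.
            missing = {"Document", "Source"} - set(resource)
            if missing:
                raise KeyError(f"resource is missing keys: {sorted(missing)}")
            # Copy the dict so later changes by the caller do not affect stored data.
            self.resources.append(dict(resource))


action = SketchAction()
action.add_resources(
    [
        {
            "Document": "Text of the chunk that was placed in the context.",
            "Source": "doc:chunk_5",
        },
    ]
)
print(action.resources[0]["Source"])
```

Because each entry carries the content as well as its origin, a name that covers both (rather than only the source) matches the data being passed, which is the argument made in the reply above.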


Ask me a question: What is data leakage
2024-05-09 00:24:57.552 | INFO | sherpa_ai.agents.base:run:70 - Action selected: ('DocumentSearch', {'query': 'What is data leakage'})
Data leakage refers to the potential for data to be unintentionally exposed or disclosed to unauthorized parties [1](doc:chunk_5), [3](doc:chunk_45). In the context provided, data leakage is discussed in relation to the presence of inter-dataset code duplication and the implications for the evaluation of language models in software engineering research [1](doc:chunk_5). It is highlighted as a potential threat that researchers need to consider when working with pre-training and fine-tuning datasets for language models [1](doc:chunk_5). By acknowledging the risk of data leakage due to code duplication, researchers can enhance the robustness of their evaluation methodologies and improve the validity of their results [1](doc:chunk_5).
Contributor

"chunk_5" etc. is what you're calling "source" above? Does that need to be a unique ID? I'm assuming that in some cases we would want the front end to show the text of the chunk when this link is clicked.

Collaborator Author

I added instructions to output the chunk table so that one can check the chunk associated with each chunk ID.
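
As a rough illustration of how a front end could go from a citation link to the chunk text, the sketch below extracts citations shaped like `[1](doc:chunk_5)` from a response and looks each ID up in a chunk table. The table here is a hypothetical plain dictionary with placeholder text, and `resolve_citations` is an illustrative helper, not part of Sherpa. The lookup only works if each chunk ID is unique within the table, which is the concern raised in the question above.

```python
import re

# Matches citations shaped like "[1](doc:chunk_5)", as in the sample output above.
CITATION_PATTERN = re.compile(r"\[(\d+)\]\((doc:chunk_\d+)\)")


def resolve_citations(response, chunk_table):
    """Map every chunk ID cited in the response to its text (None if unknown)."""
    resolved = {}
    for _number, chunk_id in CITATION_PATTERN.findall(response):
        # Repeated citations of the same chunk collapse into one entry.
        resolved[chunk_id] = chunk_table.get(chunk_id)
    return resolved


# Hypothetical chunk table; the texts are placeholders, not real document content.
chunk_table = {
    "doc:chunk_5": "<text of chunk 5>",
    "doc:chunk_45": "<text of chunk 45>",
}
response = (
    "Data leakage refers to ... [1](doc:chunk_5), [3](doc:chunk_45). "
    "It is highlighted as a potential threat ... [1](doc:chunk_5)."
)
cited = resolve_citations(response, chunk_table)
for chunk_id, text in cited.items():
    print(chunk_id, "->", text)
```

A chunk ID that is cited but absent from the table maps to `None`, which gives a simple way to flag citations that cannot be traced back to a stored chunk.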

@20001LastOrder 20001LastOrder requested a review from amirfz May 9, 2024 19:50
@amirfz amirfz merged commit 862b6c4 into Aggregate-Intellect:main May 9, 2024
1 check passed
@20001LastOrder 20001LastOrder deleted the citation_tutorial branch December 19, 2024 04:43
2 participants