Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tutorial for PDF reader #334

Merged
merged 8 commits into from
Apr 15, 2024

Conversation

20001LastOrder
Copy link
Collaborator

Your checklist for this pull request

Thank you for submitting a pull request! To speed up the review process, please follow this checklist:

  • My Pull Request is small and focused on one topic so it can be reviewed easily
  • My code follows the style guidelines of this project (make format)
  • Commit messages are detailed
  • I have performed a self-review of my code
  • I commented hard-to-understand parts of my code
  • I updated the documentation (docstrings, /docs)
  • My changes generate no new warnings (or explain any new warnings and why they're ok)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass when I run pytest tests (offline mode)

Additional steps for code with networking dependencies:

Description

Describe your pull request here.

What does this PR implement or fix? Explain.

If this PR resolves any currently open issues then mention them like this: Closes #xxx.
Github will close such issues automatically when your PR is merged into main.

Any relevant logs, error output, etc?

Any other comments? For example, will other contributors need to install new libraries via poetry install after picking up these changes?

💔Thank you!

@20001LastOrder 20001LastOrder requested a review from amirfz April 7, 2024 05:08

.. code-block:: bash

pip install sherpa_ai
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like sherpa_ai is dependent on a specific version of python (>3.9?) if that's true, then we should include that note somewhere. the above command gave me an error to upgrade my python.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh! this took me a long time to fix. had to reinstall the whole anaconda. it kept giving me a circular error that I needed to upgrade conda first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we create a separate issue for a docker file that contains the right env?

docs/How_To/Tutorials/document_reader.rst Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Outdated Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Outdated Show resolved Hide resolved
@20001LastOrder 20001LastOrder requested a review from amirfz April 8, 2024 03:54
Copy link
Collaborator

@oshoma oshoma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments for text changes.

Also running into a few problems here. See following transcript from the end of the tutorial after adding the validation.

Issues:

  1. citation_validation is dying with a bug
  2. The hugging face warnings make the tutorial harder to understand. Can we get rid of these? I've seen them at other times and wondered if we are doing something wrong, but haven't dug in.
  3. Repeated warnings "Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3" . I'm not sure where these are coming from or whether they are benign.
Ask me a question: what is the PDF about?
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2024-04-08 11:59:47.833 | INFO     | sherpa_ai.agents.base:run:69 - Action selected: ('DocumentSearch', {'query': 'PDF about'})
Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3
2024-04-08 11:59:49.418 | INFO     | sherpa_ai.agents.base:run:69 - Action selected: ('DocumentSearch', {'query': 'PDF about'})
Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3
Traceback (most recent call last):
  File "/Users/oshoma/dev/sherpa-tutorial/main.py", line 35, in <module>
    result = qa_agent.run()
             ^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/agents/base.py", line 100, in run
    result = self.validate_output()
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/agents/base.py", line 115, in validate_output
    validation_result = validation.process_output(result, self.belief)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/output_parsers/citation_validation.py", line 189, in process_output
    resources = self.resources_from_belief(belief)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/output_parsers/citation_validation.py", line 161, in resources_from_belief
    resources.extend(action.meta[-1])
                     ~~~~~~~~~~~^^^^

docs/How_To/Tutorials/document_reader.rst Outdated Show resolved Hide resolved
To create the PDF reader, there are three main components to define:
1. A text embedding tool to convert text and queries into vectors (We will use the SentenceTransformer library)
2. A vector database to store the text embeddings of the PDF file (We will use Chroma in-memory vector database)
3. An customized Sherpa Action to enable searching the vector database and can be used by an agent from Sherpa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A customized Sherpa Action to enable searching the vector database and be used by an agent from Sherpa

1. A text embedding tool to convert text and queries into vectors (We will use the SentenceTransformer library)
2. A vector database to store the text embeddings of the PDF file (We will use Chroma in-memory vector database)
3. An customized Sherpa Action to enable searching the vector database and can be used by an agent from Sherpa
4. A question answering agent to answer questions about the content of the PDF file (We will use the QAAgent from Sherpa)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. A question-answering agent to answer questions about the content of the PDF file. For this we will use the QAAgent built into Sherpa.

docs/How_To/Tutorials/document_reader.rst Outdated Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Outdated Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Show resolved Hide resolved
docs/How_To/Tutorials/document_reader.rst Outdated Show resolved Hide resolved
- ${doc_search}


The _target_ keys tell Sherpa which classes to use to instantiate various objects.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This renders as a hyperlink. It should be formatted as _target_

@artificialfintelligence
Copy link
Contributor

artificialfintelligence commented Apr 9, 2024

@oshoma @20001LastOrder Some initial comments:

  1. Step 2. Create a folder for storing your tutorial code and PDF files. and then further down under “Define the custom action”: Create a folder for this tutorial. In this folder, create a file called actions.py and add the following code. It sounds a bit confusing because it seems we already created a folder "for the tutorial". Maybe change the second "folder" to "subfolder".
  2. In the agent_config for qa_agent: description: You are a Question answering assistant helping users to find answers to the text. I think this should be “…find answers to to their questions” instead of “to the text”? Or maybe “from the text” was meant? (This works as is, but still “find answers to the text” sounds strange.)

And just a few minor issues:
1a) Under "Define the custom action", below the code, in the line below, "name" should also be enclosed in backticks:
name and args: These properties describe the action to agents that want to use it.

1b) Also under "Defining the agent configuration", below the code, in the line below, “name” should also be enclosed in backticks:
name and description: These are used to describe the agent when it is executing the task.

2) In the line Finally, to view more detailed logs, you can set the log level to debug by changing the LOG_LEVEL environment variable to the .env file, this should be “in the .env file”

3) …first try to collection relevant information… should be “… to collect relevant information”

4) In the sentence "The _target_ keys tell Sherpa which classes to use...", _target_ is rendered as a hyperlink in ReStructured Text. It should be written as `_target_` (with backticks).

@artificialfintelligence
Copy link
Contributor

artificialfintelligence commented Apr 10, 2024

No further comments. I was able to follow this tutorial and implement it in a fresh venv without hitting any roadblocks. Very well-written and easy to follow. I just encountered a known issue with citation validation at the very end.

@amirfz
Copy link
Contributor

amirfz commented Apr 10, 2024

@20001LastOrder since the citation validation util is throwing an error most of the time re #306 can we use a different verification util as part of the tutorial that is more reliable (at least until the bug is fixed)? if not, let's remove that part of the tutorial for now since the tutorial should be focused on what works robustly.

@20001LastOrder
Copy link
Collaborator Author

I think I addressed all the comments. Waiting for approval from @oshoma and @amirfz to merge.

amirfz
amirfz previously approved these changes Apr 15, 2024
@oshoma oshoma enabled auto-merge (squash) April 15, 2024 20:09
@oshoma oshoma merged commit 38a2fc3 into Aggregate-Intellect:main Apr 15, 2024
1 check passed
@20001LastOrder 20001LastOrder deleted the pdf_tutorial branch April 16, 2024 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants