Add tutorial for PDF reader #334

20001LastOrder · 2024-04-07T05:08:10Z

Your checklist for this pull request

Thank you for submitting a pull request! To speed up the review process, please follow this checklist:

My Pull Request is small and focused on one topic so it can be reviewed easily
My code follows the style guidelines of this project (make format)
Commit messages are detailed
I have performed a self-review of my code
I commented hard-to-understand parts of my code
I updated the documentation (docstrings, /docs)
My changes generate no new warnings (or explain any new warnings and why they're ok)
I have added tests that prove my fix is effective or that my feature works
All tests pass when I run pytest tests (offline mode)

Additional steps for code with networking dependencies:

I followed the offline and online testing guidelines
All tests pass when I run pytest tests --external_api (online mode, making network calls)

Description

Describe your pull request here.

What does this PR implement or fix? Explain.

If this PR resolves any currently open issues then mention them like this: Closes #xxx.
Github will close such issues automatically when your PR is merged into main.

Any relevant logs, error output, etc?

Any other comments? For example, will other contributors need to install new libraries via poetry install after picking up these changes?

💔Thank you!

amirfz · 2024-04-07T08:35:53Z

docs/How_To/Tutorials/document_reader.rst

+
+.. code-block:: bash
+
+    pip install sherpa_ai


looks like sherpa_ai is dependent on a specific version of python (>3.9?) if that's true, then we should include that note somewhere. the above command gave me an error to upgrade my python.

sigh! this took me a long time to fix. had to reinstall the whole anaconda. it kept giving me a circular error that I needed to upgrade conda first

maybe we create a separate issue for a docker file that contains the right env?

docs/How_To/Tutorials/document_reader.rst

oshoma

I added comments for text changes.

Also running into a few problems here. See following transcript from the end of the tutorial after adding the validation.

Issues:

citation_validation is dying with a bug
The hugging face warnings make the tutorial harder to understand. Can we get rid of these? I've seen them at other times and wondered if we are doing something wrong, but haven't dug in.
Repeated warnings "Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3" . I'm not sure where these are coming from or whether they are benign.

Ask me a question: what is the PDF about?
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2024-04-08 11:59:47.833 | INFO     | sherpa_ai.agents.base:run:69 - Action selected: ('DocumentSearch', {'query': 'PDF about'})
Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3
2024-04-08 11:59:49.418 | INFO     | sherpa_ai.agents.base:run:69 - Action selected: ('DocumentSearch', {'query': 'PDF about'})
Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3
Traceback (most recent call last):
  File "/Users/oshoma/dev/sherpa-tutorial/main.py", line 35, in <module>
    result = qa_agent.run()
             ^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/agents/base.py", line 100, in run
    result = self.validate_output()
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/agents/base.py", line 115, in validate_output
    validation_result = validation.process_output(result, self.belief)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/output_parsers/citation_validation.py", line 189, in process_output
    resources = self.resources_from_belief(belief)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/output_parsers/citation_validation.py", line 161, in resources_from_belief
    resources.extend(action.meta[-1])
                     ~~~~~~~~~~~^^^^

docs/How_To/Tutorials/document_reader.rst

oshoma · 2024-04-08T15:17:54Z

docs/How_To/Tutorials/document_reader.rst

+To create the PDF reader, there are three main components to define:
+1. A text embedding tool to convert text and queries into vectors (We will use the SentenceTransformer library)
+2. A vector database to store the text embeddings of the PDF file (We will use Chroma in-memory vector database)
+3. An customized Sherpa Action to enable searching the vector database and can be used by an agent from Sherpa


A customized Sherpa Action to enable searching the vector database and be used by an agent from Sherpa

oshoma · 2024-04-08T15:18:30Z

docs/How_To/Tutorials/document_reader.rst

+1. A text embedding tool to convert text and queries into vectors (We will use the SentenceTransformer library)
+2. A vector database to store the text embeddings of the PDF file (We will use Chroma in-memory vector database)
+3. An customized Sherpa Action to enable searching the vector database and can be used by an agent from Sherpa
+4. A question answering agent to answer questions about the content of the PDF file (We will use the QAAgent from Sherpa)


A question-answering agent to answer questions about the content of the PDF file. For this we will use the QAAgent built into Sherpa.

docs/How_To/Tutorials/document_reader.rst

artificialfintelligence · 2024-04-09T23:51:35Z

docs/How_To/Tutorials/document_reader.rst

+            - ${doc_search}
+
+
+The _target_ keys tell Sherpa which classes to use to instantiate various objects.


This renders as a hyperlink. It should be formatted as _target_

artificialfintelligence · 2024-04-09T23:59:31Z

@oshoma @20001LastOrder Some initial comments:

Step 2. Create a folder for storing your tutorial code and PDF files. and then further down under “Define the custom action”: Create a folder for this tutorial. In this folder, create a file called actions.py and add the following code. It sounds a bit confusing because it seems we already created a folder "for the tutorial". Maybe change the second "folder" to "subfolder".
In the agent_config for qa_agent: description: You are a Question answering assistant helping users to find answers to the text. I think this should be “…find answers to to their questions” instead of “to the text”? Or maybe “from the text” was meant? (This works as is, but still “find answers to the text” sounds strange.)

And just a few minor issues:
1a) Under "Define the custom action", below the code, in the line below, "name" should also be enclosed in backticks:
name and args: These properties describe the action to agents that want to use it.

1b) Also under "Defining the agent configuration", below the code, in the line below, “name” should also be enclosed in backticks:
name and description: These are used to describe the agent when it is executing the task.

2) In the line Finally, to view more detailed logs, you can set the log level to debug by changing the LOG_LEVEL environment variable to the .env file, this should be “in the .env file”

3) …first try to collection relevant information… should be “… to collect relevant information”

4) In the sentence "The _target_ keys tell Sherpa which classes to use...", _target_ is rendered as a hyperlink in ReStructured Text. It should be written as `_target_` (with backticks).

artificialfintelligence · 2024-04-10T00:39:49Z

No further comments. I was able to follow this tutorial and implement it in a fresh venv without hitting any roadblocks. Very well-written and easy to follow. I just encountered a known issue with citation validation at the very end.

amirfz · 2024-04-10T13:18:01Z

@20001LastOrder since the citation validation util is throwing an error most of the time re #306 can we use a different verification util as part of the tutorial that is more reliable (at least until the bug is fixed)? if not, let's remove that part of the tutorial for now since the tutorial should be focused on what works robustly.

20001LastOrder · 2024-04-15T04:27:54Z

I think I addressed all the comments. Waiting for approval from @oshoma and @amirfz to merge.

oshoma · 2024-04-15T20:08:09Z

I created these for the open issues:

[FEATURE] Resolve HuggingFace/tokenizers warnings #339
[BUG] Repeated warnings "Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3" #340

20001LastOrder requested a review from amirfz April 7, 2024 05:08

amirfz requested changes Apr 7, 2024

View reviewed changes

20001LastOrder requested a review from amirfz April 8, 2024 03:54

oshoma requested changes Apr 8, 2024

View reviewed changes

artificialfintelligence reviewed Apr 9, 2024

View reviewed changes

20001LastOrder and others added 7 commits April 15, 2024 00:27

Add tutorial for PDF reader

6ade01f

Add more content to the tutorial

9726e5e

Update tutorial

dc6c8d5

Add more details to tutorial text

18006b1

Changes to the tutorial

f4a50c4

Add tutorial in index

fdb8523

update onboarding checklist

9571d67

20001LastOrder force-pushed the pdf_tutorial branch from 6e69eae to 9571d67 Compare April 15, 2024 04:27

amirfz previously approved these changes Apr 15, 2024

View reviewed changes

Fix typo

2f468cf

oshoma dismissed amirfz’s stale review via 2f468cf April 15, 2024 19:04

oshoma enabled auto-merge (squash) April 15, 2024 20:09

oshoma approved these changes Apr 15, 2024

View reviewed changes

oshoma merged commit 38a2fc3 into Aggregate-Intellect:main Apr 15, 2024
1 check passed

20001LastOrder deleted the pdf_tutorial branch April 16, 2024 15:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tutorial for PDF reader #334

Add tutorial for PDF reader #334

20001LastOrder commented Apr 7, 2024

amirfz Apr 7, 2024

amirfz Apr 7, 2024

amirfz Apr 7, 2024

oshoma left a comment

oshoma Apr 8, 2024

oshoma Apr 8, 2024

artificialfintelligence Apr 9, 2024

artificialfintelligence commented Apr 9, 2024 •

edited

Loading

artificialfintelligence commented Apr 10, 2024 •

edited

Loading

amirfz commented Apr 10, 2024

20001LastOrder commented Apr 15, 2024

oshoma commented Apr 15, 2024

		- ${doc_search}


		The _target_ keys tell Sherpa which classes to use to instantiate various objects.

Add tutorial for PDF reader #334

Add tutorial for PDF reader #334

Conversation

20001LastOrder commented Apr 7, 2024

Your checklist for this pull request

Description

amirfz Apr 7, 2024

Choose a reason for hiding this comment

amirfz Apr 7, 2024

Choose a reason for hiding this comment

amirfz Apr 7, 2024

Choose a reason for hiding this comment

oshoma left a comment

Choose a reason for hiding this comment

oshoma Apr 8, 2024

Choose a reason for hiding this comment

oshoma Apr 8, 2024

Choose a reason for hiding this comment

artificialfintelligence Apr 9, 2024

Choose a reason for hiding this comment

artificialfintelligence commented Apr 9, 2024 • edited Loading

artificialfintelligence commented Apr 10, 2024 • edited Loading

amirfz commented Apr 10, 2024

20001LastOrder commented Apr 15, 2024

oshoma commented Apr 15, 2024

artificialfintelligence commented Apr 9, 2024 •

edited

Loading

artificialfintelligence commented Apr 10, 2024 •

edited

Loading