-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tutorial for PDF reader #334
Conversation
|
||
.. code-block:: bash | ||
|
||
pip install sherpa_ai |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like sherpa_ai is dependent on a specific version of python (>3.9?) if that's true, then we should include that note somewhere. the above command gave me an error to upgrade my python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sigh! this took me a long time to fix. had to reinstall the whole anaconda. it kept giving me a circular error that I needed to upgrade conda first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we create a separate issue for a docker file that contains the right env?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added comments for text changes.
Also running into a few problems here. See following transcript from the end of the tutorial after adding the validation.
Issues:
- citation_validation is dying with a bug
- The hugging face warnings make the tutorial harder to understand. Can we get rid of these? I've seen them at other times and wondered if we are doing something wrong, but haven't dug in.
- Repeated warnings "Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3" . I'm not sure where these are coming from or whether they are benign.
Ask me a question: what is the PDF about?
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2024-04-08 11:59:47.833 | INFO | sherpa_ai.agents.base:run:69 - Action selected: ('DocumentSearch', {'query': 'PDF about'})
Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3
2024-04-08 11:59:49.418 | INFO | sherpa_ai.agents.base:run:69 - Action selected: ('DocumentSearch', {'query': 'PDF about'})
Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3
Traceback (most recent call last):
File "/Users/oshoma/dev/sherpa-tutorial/main.py", line 35, in <module>
result = qa_agent.run()
^^^^^^^^^^^^^^
File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/agents/base.py", line 100, in run
result = self.validate_output()
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/agents/base.py", line 115, in validate_output
validation_result = validation.process_output(result, self.belief)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/output_parsers/citation_validation.py", line 189, in process_output
resources = self.resources_from_belief(belief)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/oshoma/miniconda3/lib/python3.11/site-packages/sherpa_ai/output_parsers/citation_validation.py", line 161, in resources_from_belief
resources.extend(action.meta[-1])
~~~~~~~~~~~^^^^
To create the PDF reader, there are three main components to define: | ||
1. A text embedding tool to convert text and queries into vectors (We will use the SentenceTransformer library) | ||
2. A vector database to store the text embeddings of the PDF file (We will use Chroma in-memory vector database) | ||
3. An customized Sherpa Action to enable searching the vector database and can be used by an agent from Sherpa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A customized Sherpa Action to enable searching the vector database and be used by an agent from Sherpa
1. A text embedding tool to convert text and queries into vectors (We will use the SentenceTransformer library) | ||
2. A vector database to store the text embeddings of the PDF file (We will use Chroma in-memory vector database) | ||
3. An customized Sherpa Action to enable searching the vector database and can be used by an agent from Sherpa | ||
4. A question answering agent to answer questions about the content of the PDF file (We will use the QAAgent from Sherpa) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- A question-answering agent to answer questions about the content of the PDF file. For this we will use the QAAgent built into Sherpa.
- ${doc_search} | ||
|
||
|
||
The _target_ keys tell Sherpa which classes to use to instantiate various objects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This renders as a hyperlink. It should be formatted as _target_
@oshoma @20001LastOrder Some initial comments:
And just a few minor issues: 1b) Also under "Defining the agent configuration", below the code, in the line below, “name” should also be enclosed in backticks: 2) In the line 3) 4) In the sentence "The _target_ keys tell Sherpa which classes to use...", |
No further comments. I was able to follow this tutorial and implement it in a fresh venv without hitting any roadblocks. Very well-written and easy to follow. I just encountered a known issue with citation validation at the very end. |
@20001LastOrder since the citation validation util is throwing an error most of the time re #306 can we use a different verification util as part of the tutorial that is more reliable (at least until the bug is fixed)? if not, let's remove that part of the tutorial for now since the tutorial should be focused on what works robustly. |
6e69eae
to
9571d67
Compare
Your checklist for this pull request
Thank you for submitting a pull request! To speed up the review process, please follow this checklist:
make format
)/docs
)pytest tests
(offline mode)Additional steps for code with networking dependencies:
pytest tests --external_api
(online mode, making network calls)Description
Describe your pull request here.
What does this PR implement or fix? Explain.
If this PR resolves any currently open issues then mention them like this:
Closes #xxx
.Github will close such issues automatically when your PR is merged into
main
.Any relevant logs, error output, etc?
Any other comments? For example, will other contributors need to install new libraries via
poetry install
after picking up these changes?💔Thank you!