The aim of this project is to provide a knowledge base that can be queried in natural language, using an indexed set of content from files, Confluence and Zendesk alongside GPT-3 models.
We will accomplish this by fetching all the content from a given directory, Confluence spaces and Zendesk domain, and indexing the content found under the subheadings of each page in the Space/Domain into a CSV. The CSV can then be passed through the OpenAI embeddings model to produce an embedding vector for each piece of content.
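As an illustration of indexing a page by its subheadings, here is a minimal sketch that uses only the standard library. It assumes the page arrives as an HTML string; `SectionParser` and `index_page` are hypothetical names for this example and are not taken from index_content.py.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionParser(HTMLParser):
    """Collects (heading, text) pairs from an HTML page, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.sections = []       # finished (heading, body) pairs
        self._heading = ""       # heading of the section being built
        self._body = []          # text fragments of the section being built
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()        # a new heading closes the previous section
            self._heading = ""
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data
        else:
            self._body.append(data)

    def _flush(self):
        # Collapse runs of whitespace and skip sections with no body text.
        body = " ".join(" ".join(self._body).split())
        if body:
            self.sections.append((self._heading.strip(), body))
        self._body = []

    def close(self):
        super().close()
        self._flush()            # emit the final section


def index_page(html):
    """Return a list of (heading, content) rows for one page (hypothetical helper)."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Each returned pair could then become one CSV row, with the page title and URL added alongside it.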
Once we have an embeddings database of the content, we can use an embeddings similarity search to retrieve the best context for a given question, and supply this as the context when submitting the question to the GPT-3 model. Further work can then be done to make the script and indexed content accessible through interfaces such as Slack.
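The retrieval step can be sketched roughly as follows. This assumes cosine similarity as the comparison measure, which is a common choice for embeddings but is not confirmed as what this repo uses; `cosine_similarity` and `best_contexts` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar rather than dividing by zero
    return dot / (norm_a * norm_b)


def best_contexts(question_embedding, indexed, top_n=1):
    """Return the top_n contents most similar to the question.

    `indexed` is a list of (content, embedding) pairs (assumed layout).
    """
    ranked = sorted(
        indexed,
        key=lambda row: cosine_similarity(question_embedding, row[1]),
        reverse=True,
    )
    return [content for content, _ in ranked[:top_n]]
```

A linear scan like this is fine for a few thousand rows; a larger index would be the point at which a vector database becomes worthwhile.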
Imagine a user posing the question:
Question: Can I still see my completion if I delete my enrolment?
Using the OpenAI embeddings API, we would return the "embedding" for this question and then compare it against the embeddings generated from our knowledge bank (an indexed Confluence Space, e.g. STRM).
In this case, we would find the following answer from our FAQs
We would then use this content as the context to build a better prompt when querying the GPT-3 model:
The same question posed through this repo:
$ python ask_question.py --allow_hallucinations True --question "Can I still see my completion if I delete my enrolment?"
Question: Can I still see my completion if I delete my enrolment?
Answer: No, if you delete all enrolments for a user, then the completions ARE NOT shown on the leaderboard. More info: https://learninglocker.atlassian.net/wiki/spaces/STRM/pages/1014333441/FAQs
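To make the "context plus question" idea concrete, here is a minimal sketch of how a prompt might be assembled. The instruction wording and the `build_prompt` helper are assumptions for illustration only; the prompt actually used by ask_question.py may differ.

```python
def build_prompt(question, context, allow_hallucinations=False):
    """Combine retrieved context and the user's question into one prompt string."""
    if allow_hallucinations:
        # Looser mode: the model may draw on its own knowledge as well.
        instruction = "Answer the question, using the context below where it helps."
    else:
        # Strict mode: keep the model inside the retrieved context.
        instruction = (
            "Answer the question using only the context below. "
            "If the answer is not in the context, say \"I don't know.\""
        )
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Ending the prompt with "Answer:" nudges a completion model to continue with the answer itself rather than restating the question.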
Execute index_content.py to generate a CSV in ./output/default/contents.csv containing content from the first 10 pages of both the STRM and LL spaces, and all the content from the learningpool.zendesk.com domain:
python index_content.py --spaces=STRM LL --max_pages=10 --zendesk learningpool --out ./output/default/contents.csv
--spaces
Space separated list of Confluence Spaces to index. Not required; defaults to STRM
--zendesk
Configure the zendesk domain to pull content from. Not required; defaults to learningpool
--out
Specify the output file. Not required; defaults to ./output/default/contents.csv
--max_pages
Max pages pulled per Confluence space. Not required; defaults to 1000. Recommend using low numbers for initial testing
--min_tokens
Minimum number of tokens a piece of content must contain; anything shorter is excluded. Defaults to 20 (approx 60 characters)
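The effect of --min_tokens can be sketched as a simple filter. This example is an assumption-laden illustration: the `content` key and both function names are hypothetical, and the default counter uses the rough "3 characters per token" ratio implied by the note above instead of a real tokenizer.

```python
def approx_token_count(text):
    """Rough estimate: about 3 characters per token (20 tokens is approx 60 characters)."""
    return len(text) // 3


def filter_rows(rows, min_tokens=20, count_tokens=approx_token_count):
    """Keep only rows whose content has at least min_tokens tokens.

    `rows` is a list of dicts with a "content" key (assumed layout).
    Pass a real tokenizer-based counter as `count_tokens` for accurate counts.
    """
    return [row for row in rows if count_tokens(row["content"]) >= min_tokens]
```

Swapping in a tokenizer-backed `count_tokens` changes only the counting, not the filtering logic.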
CSVs placed in the csv_input directory are iterated over and imported.
Content can now be imported from files using the following options:
--input
is not required; defaults to ./input
. Points to a folder to ingest CSVs from. Rows should be in the format 'heading,answers,answers,...'
--use_csv_dirs
is not required; defaults to False. If True, uses the folder structure (./input/product/area.csv) to prefill the title (e.g. product) and subtitle (e.g. area) for imported CSVs. Otherwise prompts for each file found
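As a rough sketch of the --use_csv_dirs behaviour and the row format described above, the following shows one way the title and subtitle could be derived from a path and how a 'heading,answers,answers,...' row could be split. `titles_from_path` and `parse_rows` are hypothetical helpers, not functions from this repo.

```python
import csv
import io
from pathlib import Path


def titles_from_path(csv_path, input_dir="./input"):
    """Map ./input/product/area.csv to ("product", "area")."""
    relative = Path(csv_path).relative_to(Path(input_dir))
    parts = relative.with_suffix("").parts
    if len(parts) != 2:
        raise ValueError(f"expected <title>/<subtitle>.csv, got {relative}")
    return parts[0], parts[1]


def parse_rows(handle):
    """Yield (heading, [answers...]) for each non-empty CSV row."""
    for row in csv.reader(handle):
        if not row:
            continue
        heading, *answers = row
        # Drop empty trailing cells left by ragged rows.
        yield heading, [answer for answer in answers if answer]
```

For example, `titles_from_path("./input/product/area.csv")` gives the title and subtitle that would otherwise be prompted for.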
Outputs embeddings to a file in ./output/default/embeddings.csv
python create_embeddings.py --file ./output/default/contents.csv --out ./output/default/embeddings.csv
--file
is not required; defaults to ./output/default/contents.csv
--out
is not required; defaults to ./output/default/embeddings.csv
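The shape of this step can be sketched as "read each content row, embed it, write the vector out". The column names (`title`, `heading`, `content`, `embedding`) and the `write_embeddings` helper are assumptions for illustration, and the embedding call is passed in as a plain function so the sketch does not depend on any particular OpenAI client version.

```python
import csv
import io
import json


def write_embeddings(contents, out, embed):
    """Read content rows from `contents` and write one embedding row per input row.

    `contents` and `out` are open text file objects; `embed` is any callable
    that maps a string to a list of floats (for example a wrapper around the
    OpenAI embeddings API). Returns the number of rows written.
    """
    reader = csv.DictReader(contents)
    writer = csv.writer(out)
    writer.writerow(["title", "heading", "embedding"])
    count = 0
    for row in reader:
        vector = embed(row["content"])
        # Store the vector as a JSON list so it survives the CSV round trip.
        writer.writerow([row["title"], row["heading"], json.dumps(vector)])
        count += 1
    return count
```

Keeping `embed` injectable also makes it easy to add batching or retry logic around the API call later.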
--dir
is not required. It defaults to the ./output/default/ directory.
Inside here it will look for a contents.csv and embeddings.csv
python ask_question.py --question "How much does an elephant weigh?"
Pass in a question using the --question
flag
--embeddings
is not required; defaults to embeddings
and loads ./output/embeddings.csv
--show_prompt
can be used to output the full prompt used to answer the question
--imagine
can be used to provide a looser prompt
--experiment_hyde
can be used to enable Hypothetical Document Embeddings mode. This will generate a hypothetical answer to the original question, and use the embeddings from that answer to search through the contents. Credit to https://arxiv.org/pdf/2212.10496.pdf via https://twitter.com/mathemagic1an/status/1615378778863157248
--stream
can be used to stream the output from GPT directly to the terminal in realtime
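The Hypothetical Document Embeddings flow behind --experiment_hyde can be outlined in a few lines. This is a sketch of the idea only: `hyde_search` and its three callable parameters are hypothetical stand-ins for the model call, the embeddings call and the similarity search.

```python
def hyde_search(question, generate_answer, embed, search):
    """Search the index using the embedding of a hypothetical answer.

    generate_answer: str -> str, asks the model to draft an answer with no context
    embed:           str -> list[float], returns an embedding vector
    search:          list[float] -> list[str], returns the closest indexed contents
    """
    # 1. Draft an answer; it may be wrong, but it should look like the real one.
    hypothetical = generate_answer(question)
    # 2. Embed the draft instead of the question.
    vector = embed(hypothetical)
    # 3. Retrieve real content that sits close to the draft in embedding space.
    return search(vector)
```

The retrieved contents are then used as context for the real answer, exactly as in the normal flow; only the vector used for the search changes.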
The script can be configured to listen to all messages mentioning a Slack bot/user and reply to the user in the thread. It uses the Real Time Messaging (RTM) Slack client to listen to all incoming messages visible to the bot and replies directly in the thread of the message where the bot is @mentioned.
You will need to configure a bot with the bot
and chat:write:bot
scopes. Note that the bot
scope is deprecated but is still the recommended approach for RTM clients. More info: https://api.slack.com/rtm
Ensure you have the bot's OAuth key configured as SLACK_BOT_API_KEY
and the bot's ID in SLACK_BOT_ID
.
To start the script in Slack listening mode, use:
python ask_question.py --slack
The user may add additional commands to their question in Slack:
[--show_prompt]
: Returns the full prompt to the user
[--imagine]
: Enables the hallucination mode (looser prompts)
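One way to separate the question from these inline commands is sketched below. It assumes the bot mention arrives in Slack's `<@USERID>` form and accepts each flag with or without the square brackets; `parse_slack_message` is a hypothetical helper, not the repo's own parser.

```python
import re

KNOWN_FLAGS = ("--show_prompt", "--imagine")


def parse_slack_message(text, bot_id):
    """Split a Slack message into (question, flags).

    `flags` maps each known flag name (without dashes) to True or False.
    """
    # Remove the @mention of the bot itself.
    text = text.replace(f"<@{bot_id}>", " ")
    flags = {}
    for flag in KNOWN_FLAGS:
        # Match "--flag" or "[--flag]", but not a longer word such as "--flagged".
        pattern = r"\[?" + re.escape(flag) + r"(?!\w)\]?"
        flags[flag.lstrip("-")] = re.search(pattern, text) is not None
        text = re.sub(pattern, " ", text)
    question = " ".join(text.split())  # collapse leftover whitespace
    return question, flags
```

For a message such as `<@U123> How much does an elephant weigh? --show_prompt`, this yields the bare question plus `show_prompt` set to True.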
- Python 3.x
- Check requirements.txt for pip modules
You will require an OpenAI account, with your OpenAI API key set in an OPENAI_API_KEY
environment variable.
The script uses the Atlassian API to fetch pages from Confluence. Ensure the following ENV variables are included in your system:
CONFLUENCE_USERNAME=your confluence username
CONFLUENCE_API_KEY=your confluence api key
Instructions on generating a Confluence API token: https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/
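A quick way to confirm the environment is set up before running any of the scripts is sketched below. `missing_env_vars` is a hypothetical helper; the variable names are the ones listed in this README.

```python
import os

REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "CONFLUENCE_USERNAME", "CONFLUENCE_API_KEY")


def missing_env_vars(environ=None, required=REQUIRED_ENV_VARS):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("Environment looks complete.")
```

When running in Slack mode, SLACK_BOT_API_KEY and SLACK_BOT_ID could be added to the `required` tuple in the same way.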
- Index a Confluence Space
- Parse through all pages within a space
- Return out all content, stripped of HTML and indexed by heading/subheading
- Count tokens in each indexed row of content
- Use tokenizer from GPT2TokenizerFast (tokenizers)
- Discard content where too small (<40?)
- See https://github.com/openai/openai-cookbook/blob/main/examples/fine-tuned_qa/olympics-1-collect-data.ipynb for example
- Parse indexed content through embeddings API to generate vectors
- Create script to take input question from user and return a useful answer!
- Search indexed content using question's embedding and select most similar content
- Apply it as context back to GPT3 (davinci?) alongside original question to attempt to answer from knowledge base
- Create a Slack bot that can take a question, feed it to the script and return the answer
- Rewrite the Slack bot to use a more modern protocol (e.g. WebSocket)
- Provide an endpoint version for AWS Lambda/API Gateway
- Utilise a Vector Database
- Index multiple spaces into single embeddings database
- Index other knowledge sources
- Slack support channels
- Academy: https://learningpool.zendesk.com/api/v2/help_center/en-us/articles.csv
- CSV
- Fine-tune the model rather than just finding and providing context (although this is still a recommended approach)
The transformers module, used to count the tokens in content, requires Rust to compile.
Solution: Install Rust
See reference issue here: delip/PyTorchNLPBook#14
Solution: Execute this in the Python terminal:
import nltk
nltk.download('punkt')
A lot of the work to date has been inspired by the notebooks and cookbooks provided by OpenAI: