Content indexing & embeddings for Q&A through GPT-3

The aim of this project is to provide a naturally queryable knowledge base using an indexed set of content from files, Confluence and Zendesk alongside GPT-3 models.

We accomplish this by fetching all of the content from a given directory, the configured Confluence spaces and a Zendesk domain, and indexing the content found under the subheadings of each page of the space/domain into a CSV. The CSV can then be passed through the OpenAI embeddings model to produce a vector embedding for each of the contexts.

Once we have an embeddings database of the content, we can use an embeddings similarity search to retrieve the best context for a given question, and supply that context when submitting the question to the GPT-3 model. Further work can then be done to make the script and indexed content accessible through interfaces such as Slack.
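
A rough sketch of that retrieval step (illustrative only; it assumes the pre-1.0 openai Python client and a dictionary of pre-computed context embeddings, and the function names are not the repo's actual code):

import numpy as np
import openai

def embed(text):
    # text-embedding-ada-002 is an assumption; any OpenAI embeddings model works
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def best_context(text, context_embeddings):
    # cosine-similarity search; ada-002 vectors are unit length, so a dot product suffices
    query = embed(text)
    return max(context_embeddings, key=lambda c: float(np.dot(query, context_embeddings[c])))

The best-matching context is then included in the prompt sent to the completions endpoint alongside the original question.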

Example use case

Imagine a user posing the question:

Question: Can I still see my completion if I delete my enrolment?

Using the OpenAI embeddings API, we retrieve the embedding for this question and compare it against the embeddings generated from our knowledge bank (an indexed Confluence space, e.g. STRM).

In this case, we would find the following answer in our FAQs:

What happens to completions if you expire or delete an enrolment? If you expire all enrolments for a user, then their completions are retained in the LX leaderboards but the user does not count against active users licences. If you delete all enrolments for a user, then the completions ARE NOT shown on the leaderboard and of course, they do not count as an active user. But if you re-enrol the user, then their completion is resurfaced too.

We would then use this answer as context to build a better prompt when querying the GPT-3 model.

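
For illustration, the assembled prompt might look roughly like the following (a hypothetical template; the repo's actual prompt may differ):

# faq_context holds the FAQ answer retrieved above
prompt = (
    "Answer the question as truthfully as possible using the provided context.\n\n"
    f"Context:\n{faq_context}\n\n"
    "Q: Can I still see my completion if I delete my enrolment?\nA:"
)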

Example running in code

The same question posed through this repo:

$ python ask_question.py --allow_hallucinations True --question "Can I still see my completion if I delete my enrolment?"

Question: Can I still see my completion if I delete my enrolment?

Answer: No, if you delete all enrolments for a user, then the completions ARE NOT shown on the leaderboard. More info: https://learninglocker.atlassian.net/wiki/spaces/STRM/pages/1014333441/FAQs


Runtime example

Generate CSV of Spaces, Zendesk and custom CSV content

Example:

Execute index_content.py to generate a CSV of content from the first 10 pages of both the STRM and LL spaces, plus all of the content from the learningpool.zendesk.com domain, into ./output/default/contents.csv

python index_content.py --spaces=STRM LL --max_pages=10 --zendesk learningpool --out ./output/default/contents.csv

--spaces Space-separated list of Confluence spaces to index. Not required; defaults to STRM

--zendesk Configure the Zendesk domain to pull content from. Not required; defaults to learningpool

--out Specify the output file. Not required; defaults to ./output/default/contents.csv

--max_pages Max pages pulled per Confluence space. Not required; defaults to 1000. Recommend using low numbers for initial testing

--min_tokens Minimum number of tokens a piece of content must contain to be included; shorter content is excluded. Defaults to 20 (approx 60 characters)
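
As an illustrative sketch of the indexing and filtering idea (not the repo's actual code), sections shorter than --min_tokens would be dropped before being written to the contents CSV; token counts here use the GPT-2 tokenizer from transformers, the module mentioned under Known issues below:

import csv
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def write_contents(rows, out_path="./output/default/contents.csv", min_tokens=20):
    # rows are (title, heading, content) tuples; the column layout is an assumption
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "heading", "content", "tokens"])
        for title, heading, content in rows:
            n_tokens = len(tokenizer.encode(content))
            if n_tokens < min_tokens:
                continue  # mirrors the --min_tokens filter
            writer.writerow([title, heading, content, n_tokens])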

CSV import

Any CSVs placed in the input directory (see --input below) will be iterated over and imported.

Content can now be imported from files using the following options:

--input is not required; defaults to ./input. Points to a folder to ingest CSVs from. Rows should be in the format 'heading,answers,answers,...'

--use_csv_dirs is not required; defaults to False. If True, uses the folder structure (./input/product/area.csv) to prefill the title (e.g. product) and subtitle (e.g. area) for imported CSVs. Otherwise prompts for each file found
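
A sketch of that import convention (illustrative; the repo's actual loader may differ):

import csv
from pathlib import Path

def load_csv_inputs(input_dir="./input"):
    for path in Path(input_dir).rglob("*.csv"):
        # with --use_csv_dirs, ./input/product/area.csv gives title="product", subtitle="area"
        title, subtitle = path.parent.name, path.stem
        with open(path, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue
                # rows are 'heading,answers,answers,...'
                heading, answers = row[0], row[1:]
                yield title, subtitle, heading, " ".join(answers)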

Generate embeddings from output file

Outputs embeddings to a file at ./output/default/embeddings.csv

python create_embeddings.py --file ./output/default/contents.csv --out ./output/default/embeddings.csv

--file is not required; defaults to ./output/default/contents.csv

--out is not required; defaults to ./output/default/embeddings.csv
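
A minimal sketch of this step (illustrative; it assumes pandas, the pre-1.0 openai client and a "content" column, none of which are confirmed by the repo):

import openai
import pandas as pd

def create_embeddings(in_path="./output/default/contents.csv",
                      out_path="./output/default/embeddings.csv"):
    df = pd.read_csv(in_path)
    # embed each row's content; the model choice is an assumption
    df["embedding"] = df["content"].apply(
        lambda text: openai.Embedding.create(
            model="text-embedding-ada-002", input=text
        )["data"][0]["embedding"]
    )
    df.to_csv(out_path, index=False)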

Ask it a question!

--dir is not required. It defaults to the ./output/default/ directory. Inside here it will look for a contents.csv and embeddings.csv

Direct Q&A mode

python ask_question.py --question "How much does an elephant weigh?"

Pass in a question using the --question flag

--embeddings is not required; defaults to embeddings and loads ./output/embeddings.csv

--show_prompt can be used to output the full prompt used to answer the question

--imagine can be used to provide a looser prompt

--experiment_hyde can be used to enable Hypothetical Document Embeddings (HyDE) mode. This will generate a hypothetical answer to the original question, and use the embeddings from that answer to search through the contents (see the sketch at the end of this list). Credit to https://arxiv.org/pdf/2212.10496.pdf via https://twitter.com/mathemagic1an/status/1615378778863157248

--stream can be used to stream the output from GPT directly to the terminal in realtime
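
A rough sketch of the HyDE idea behind --experiment_hyde (illustrative only; it reuses the embed/best_context helpers sketched earlier and assumes the pre-1.0 openai client):

import openai

def hyde_best_context(question, context_embeddings):
    # 1. Generate a hypothetical answer; it may be wrong, which is fine for retrieval
    hypothetical = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Write a short passage that answers the question:\n{question}\n",
        max_tokens=150,
    )["choices"][0]["text"]
    # 2. Embed the hypothetical answer and use it for the similarity search
    return best_context(hypothetical, context_embeddings)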

Slack listening mode

The script can be configured to listen to all messages mentioning a Slack bot/user and reply in the thread to the user. It uses the Real Time Messaging (RTM) Slack client to listen to all incoming messages visible to the bot and replies directly in the thread of any message where the bot is @mentioned.


You will need to configure a bot with the bot and chat:write:bot scopes. Note that the bot scope is deprecated but is still the recommended approach for RTM clients. More info: https://api.slack.com/rtm

Ensure you have the bot's OAuth token configured as SLACK_BOT_API_KEY and the bot's ID in SLACK_BOT_ID.
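
A minimal sketch of the listening loop (illustrative; it assumes the slack_sdk RTM v2 client and a hypothetical answer_question helper wrapping the Q&A flow, neither of which is necessarily what the repo uses):

import os
from slack_sdk.rtm_v2 import RTMClient

rtm = RTMClient(token=os.environ["SLACK_BOT_API_KEY"])
BOT_ID = os.environ["SLACK_BOT_ID"]

@rtm.on("message")
def handle(client, event):
    if f"<@{BOT_ID}>" not in event.get("text", ""):
        return  # only respond when the bot is @mentioned
    question = event["text"].replace(f"<@{BOT_ID}>", "").strip()
    answer = answer_question(question)  # hypothetical helper
    client.web_client.chat_postMessage(
        channel=event["channel"],
        text=answer,
        thread_ts=event.get("thread_ts", event["ts"]),  # reply in the message's thread
    )

rtm.start()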

To start the script in Slack listening mode, use...

python ask_question.py --slack


The user may add additional commands to their question in Slack:

[--show_prompt]: Returns the full prompt to the user

[--imagine]: Enables the hallucination mode (looser prompts)

Requirements

OpenAI

You will need an OpenAI account, with your OpenAI API key set in an OPENAI_API_KEY environment variable.

Atlassian/Confluence

The script uses the Atlassian API to fetch pages from Confluence. Ensure the following environment variables are set on your system:

CONFLUENCE_USERNAME=your confluence username

CONFLUENCE_API_KEY=your confluence api key

Instructions on generating a Confluence API token: https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/
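
A hedged sketch of the Confluence fetch (assuming the atlassian-python-api client; the repo may use a different wrapper, and the base URL is taken from the FAQ link earlier in this README):

import os
from atlassian import Confluence

confluence = Confluence(
    url="https://learninglocker.atlassian.net",
    username=os.environ["CONFLUENCE_USERNAME"],
    password=os.environ["CONFLUENCE_API_KEY"],  # the API token is passed as the password
    cloud=True,
)

# fetch up to --max_pages pages from a space, including page bodies
pages = confluence.get_all_pages_from_space("STRM", start=0, limit=10, expand="body.storage")
for page in pages:
    print(page["title"])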

Todo

Stretch goals

Known issues and workarounds

Can't find Rust compiler

The transformers module, used to count the tokens of content, requires Rust to compile.

Solution: Install Rust

Resource punkt not found

See reference issue here: delip/PyTorchNLPBook#14

Solution: Execute this in the Python terminal:

import nltk
nltk.download('punkt')

References

A lot of the work to date has been inspired by the notebooks and cookbooks provided by OpenAI.
