feat(knowledge AI): Knowledge AI feature and bug fixes #104

Merged · 7 commits · Aug 21, 2023
4 changes: 3 additions & 1 deletion .env-dist
@@ -1 +1,3 @@
-ENVIRONMENT=
+ENVIRONMENT=
+MAX_NUMBER_OF_TOKENS=
+KNOWLEDGE_AI_STORE_DIR=
8 changes: 4 additions & 4 deletions .github/workflows/pr-title-checker.yml
@@ -11,12 +11,12 @@ on:
       - unlabeled

 jobs:
-  pull_request:
+  check:
     name: PR Title Checker
     runs-on: ubuntu-latest
     steps:
-      - uses: thehanimo/pr-title-checker@v1.3.4
+      - uses: thehanimo/pr-title-checker@v1.4.0
         with:
-          GITHUB_TOKEN: ${{ secrets.GH_TOKEN }}
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           pass_on_octokit_error: false
-          configuration_path: '.github/pr-title-checker-config.json'
+          configuration_path: .github/pr-title-checker-config.json
6 changes: 4 additions & 2 deletions .gitignore
@@ -1,10 +1,12 @@
 node_modules
 build
 agent
+knowledgeAIStore
 .history
 coverage/
 .nyc_output
-config.json
 .env
 playbooks.json
-playbookRunResults.json
+playbookRunResults.json
+config.**.json
+config.json
5 changes: 5 additions & 0 deletions .snyk
@@ -0,0 +1,5 @@
# Snyk (<https://snyk.io>) policy file, patches or ignores known vulnerabilities

version: v1.22.1
ignore: {}
patch: {}
24 changes: 24 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,27 @@
# [1.3.0](https://github.com/Cognigy/Cognigy-CLI/compare/v1.2.2...v1.3.0) (2023-04-24)


### Features

* **playbooks:** run playbooks ([04986be](https://github.com/Cognigy/Cognigy-CLI/commit/04986bea068721ccc49a6e7b21875f595631ccd3))
* **playbooks:** updated readme ([51847fc](https://github.com/Cognigy/Cognigy-CLI/commit/51847fcd27eaa3f828178655212441425789a546))

## [1.2.2](https://github.com/Cognigy/Cognigy-CLI/compare/v1.2.1...v1.2.2) (2023-03-14)


### Bug Fixes

* **project:** triggering a new release ([9e81c80](https://github.com/Cognigy/Cognigy-CLI/commit/9e81c800c5f00ce05ce446f1205791ea4e9ff9d3))

## [1.2.1](https://github.com/Cognigy/Cognigy-CLI/compare/v1.2.0...v1.2.1) (2023-03-14)


### Bug Fixes

* package.json & package-lock.json to reduce vulnerabilities ([c23fe7e](https://github.com/Cognigy/Cognigy-CLI/commit/c23fe7e036f18802437bf4ec347e5286a7e2d160))
* receive snapshot as ArrayBuffer ([a672133](https://github.com/Cognigy/Cognigy-CLI/commit/a672133c639f834625b2d37d1b4b72765c14423e))
* **snyk vulnerabilities:** snyk vulnerability ([3e48f41](https://github.com/Cognigy/Cognigy-CLI/commit/3e48f41d41bd502afb20bb041d66b4cacdad063b))

## [1.2.1](https://github.com/Cognigy/Cognigy-CLI/compare/v1.2.0...v1.2.1) (2023-03-14)


123 changes: 123 additions & 0 deletions KNOWLEDGE-AI-README.md
@@ -0,0 +1,123 @@
# CognigyAI KnowledgeAI Introduction

This tool manages data for the knowledge search feature. It can create and delete knowledge stores, as well as ingest data into them and delete data from them.

## Usage

### Background

This section aims to give you a brief idea of the underlying data model and how to make use of it in your Cognigy.AI project.

In the backend, the term `knowledgeAI store` refers to an entity that references the `KnowledgeAI sources` stored in your project. Each knowledgeAI store can reference one or more sources and serves as a higher-level element for organizing the ingested documents in your project. Each project can have one or more knowledgeAI stores.

The term `KnowledgeAI source` refers to the actual file, which contains one or more `chunks`. Each chunk within a document is ingested as a separate object into the database, while the document's URL is used as the reference in the knowledge store.

Currently, only Cognigy ctext (`.ctxt`) files can be ingested; their paragraphs are separated by a blank line, as in the following example:

```
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Aenean mi nulla, fermentum id finibus nec, lacinia nec ipsum. Nullam rhoncus augue in magna vulputate, ac porttitor justo posuere. Integer at risus ut libero scelerisque vehicula a eget sapien. Integer feugiat nulla leo, a elementum arcu consequat at. In hac habitasse platea dictumst. Ut ut sem condimentum, tempus enim vel, maximus est. Suspendisse commodo interdum ullamcorper. In pulvinar quam ut elementum tempus. Maecenas feugiat risus ac magna tincidunt maximus. Ut vestibulum congue elit ac finibus. Vivamus aliquet auctor risus, vel euismod felis pulvinar sit amet. Cras nec molestie enim, in ultricies justo. Integer a pretium dui. Cras in bibendum velit, a laoreet metus.

Vestibulum orci enim, rutrum nec quam in, iaculis hendrerit eros. Maecenas ultrices, felis at luctus fringilla, elit risus auctor erat, sit amet posuere nunc augue sed elit. Nam tempus ipsum magna, et semper ipsum rhoncus condimentum. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Aliquam feugiat vehicula magna. Praesent magna mi, lobortis et dolor mattis, tempus malesuada lacus. Nam ut eros vitae metus iaculis tristique.
```
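In this example, each of the three paragraphs would be ingested as its own chunk, since chunks are delimited by the blank lines.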

### How to ingest documents

To ingest documents, follow these steps for each project in which you want to use knowledge search:

1. Create a knowledge store that is bound to your project:

```bash
cognigy knowledge-ai create source --projectId <projectId> --language <languageCode> --name <nameOfYourKnowledgeStore> --description <descriptionOfYourKnowledgeStore>
```

To easily access your projectId, create a new agent and navigate to the dashboard. Copy the first ID from the URL in your browser; in the following example, it is enclosed in asterisks:

> <https://dev.cognigy.ai/agent/**64467681d8170fe52ead079d**/42467681d8170f859aad079f>

You can tell that the command succeeded when a knowledge store object is printed to the terminal.

This command will write a file `./knowledgeStore_<nameOfYourKnowledgeStore>.json`. You can use it for ingestion or as a reference for the IDs.
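For illustration only, such a file might look roughly like the following; the exact field set is an assumption based on the parameters above, so treat the file written in your own project as authoritative:

```
{
  "_id": "64aa0b2c9f1e3d4a5b6c7d8e",
  "name": "my-knowledge-store",
  "description": "FAQ documents for my project",
  "language": "en-US",
  "projectId": "64467681d8170fe52ead079d"
}
```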

2. Ingest your document or a whole directory containing documents:

```bash
cognigy knowledge-ai ingest --projectId <projectId> --language <languageCode> --knowledgeStoreId 64467681d8170fe52ead079d --input <pathToFileOrDirectory>
```

Replace the value of `--knowledgeStoreId` with the `_id` field returned by the previous command. For `--input`, you can give a path pointing to a single `.ctxt` file or to a directory; in the latter case, the CLI tool ingests every `.ctxt` file from that directory. Currently, it does not read files located within nested sub-directories.

Alternatively, you can use the written `./knowledgeStore_<nameOfYourKnowledgeStore>.json` file by referencing the store name in the command. You then no longer have to provide the `projectId`, `language`, and `knowledgeStoreId` parameters.

```bash
cognigy knowledge-ai ingest --name <nameOfYourKnowledgeStore> --input <pathToFileOrDirectory>
```
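Putting both steps together, a hypothetical end-to-end session might look like this; the project ID, language code, store name, and paths are placeholders rather than real values:

```bash
# 1. Create a knowledge store bound to the project (all values are placeholders)
cognigy knowledge-ai create source --projectId 64467681d8170fe52ead079d --language en-US \
  --name product-faq --description "Product FAQ documents"

# 2. Ingest every .ctxt file in a local directory into that store, relying on
#    the knowledgeStore_product-faq.json file written by step 1
cognigy knowledge-ai ingest --name product-faq --input ./docs/faq/
```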

For further instructions on these and the other commands, see the help printout from `cognigy knowledge-ai --help`.

### How to extract text from various sources

> We implement the [Langchain Document Loaders](https://js.langchain.com/docs/modules/indexes/document_loaders/) and [Text Splitters](https://js.langchain.com/docs/modules/indexes/text_splitters/). For more information, visit the Langchain JS website.

You can use the `extract` command to extract text from various document types. Afterwards, this text can be ingested using the method described above.

The syntax of the command is:

```
cognigy knowledge-ai extract <type> -i path_to_input_file -o path_to_output_file
```

| Option                                 | Description                                                                                                          |
| -------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `-i, --inputFile <string>`             | Input File Path                                                                                                      |
| `-o, --outputFile <string>`            | Output File Path                                                                                                     |
| `-u, --url <string>`                   | Target URL (for cheerio & playwright extraction)                                                                     |
| `-e, --excludeString <string>`         | Excludes paragraphs containing this string                                                                           |
| `-s, --splitter <string>`              | Splitter to use, leave empty for default (see below)                                                                 |
| `-cs, --chunkSize <number>`            | Chunk size, default 2000                                                                                             |
| `-co, --chunkOverlap <number>`         | Chunk overlap, default 200                                                                                           |
| `-ap, --additionalParameters <string>` | Additional parameters for the extractor                                                                              |
| `-fl, --forceLocal`                    | Skips the API call to the extraction service and forces local processing of files. Will not work with type `other`   |

The following types are available:

| Type | Description | Default Splitter |
| ------------ | --------------------------------------------------- | ------------------------------ |
| `text` | Plain text | RecursiveCharacterTextSplitter |
| `pdf` | Portable Document Format | RecursiveCharacterTextSplitter |
| `docx` | Microsoft Word Document | RecursiveCharacterTextSplitter |
| `csv` | Comma Separated Values | RecursiveCharacterTextSplitter |
| `json` | JavaScript Object Notation | RecursiveCharacterTextSplitter |
| `jsonl` | JavaScript Object Notation Lines | RecursiveCharacterTextSplitter |
| `epub` | Electronic Publication | RecursiveCharacterTextSplitter |
| `srt` | SubRip Subtitle Format | RecursiveCharacterTextSplitter |
| `md`         | Markdown                                             | MarkdownTextSplitter           |
| `cheerio` | Simple web-based content extraction | RecursiveCharacterTextSplitter |
| `playwright` | Web-based content extraction via browser simulation | RecursiveCharacterTextSplitter |
| `other` | Any other file type | RecursiveCharacterTextSplitter |

The `other` type can be used for virtually any file type that is not explicitly supported. We will do our best to extract the content.

Each extractor can _optionally_ be called with its own splitter, defined using the `-s <splitter_name>` option.

| Splitter | Description |
| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| CharacterTextSplitter | Splits by new line `\n` characters |
| MarkdownTextSplitter | Splits your content into documents based on the Markdown headers |
| RecursiveCharacterTextSplitter | Split documents recursively by different characters - starting with `\n\n`, then `\n`, then spaces |
| TokenTextSplitter              | Splits a raw text string by first converting the text into BPE tokens, then splitting these tokens into chunks and converting the tokens within a single chunk back into text |

Additional parameters applicable to each type can be checked in the [Langchain Document Loader Documentation](https://js.langchain.com/docs/modules/indexes/document_loaders/).
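As a concrete illustration, a hypothetical invocation that extracts a PDF with an explicit splitter and smaller chunks might look like this; the file names are placeholders:

```bash
# Extract a PDF into chunked text, overriding the default chunk size and overlap;
# manual.pdf and manual.ctxt are placeholder file names
cognigy knowledge-ai extract pdf -i ./manual.pdf -o ./manual.ctxt \
  -s RecursiveCharacterTextSplitter -cs 1500 -co 150
```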

### Checking the token size of a source document

You can check the token size of a given document using the `size` command.

```
cognigy knowledge-ai size -i path_to_input_file
```

### Limitations

Currently, the number of tokens allowed for each paragraph is limited to `2048`. You may change this value by modifying the environment variable `MAX_NUMBER_OF_TOKENS`.
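For example, assuming a Unix shell, you could raise the limit for the current session before ingesting a document with unusually large paragraphs; the value and path below are illustrative:

```bash
# Allow paragraphs of up to 4096 tokens for this shell session
export MAX_NUMBER_OF_TOKENS=4096
cognigy knowledge-ai ingest --name product-faq --input ./manual.ctxt
```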
12 changes: 12 additions & 0 deletions README.md
@@ -267,8 +267,19 @@ The run command outputs the status of the playbook runs and exits:

All details are written to `./playbookRunResults.json`

### Command: knowledge-ai

[Cognigy Knowledge AI Documentation](KNOWLEDGE-AI-README.md)

## FAQ

[Frequently asked questions (FAQ)](FAQ.md)

## Contributing

Make sure you pull from the `develop` branch:

```bash
git pull origin develop
```

### Committing

Commit using the commitizen hook, which prompts you to follow the semantic naming convention.
@@ -287,3 +298,4 @@ Any PRs to develop need to be merged as squash merges.

Create a PR from develop to main and do a merge commit. This will automatically trigger a new release.
To make the release publish a new minor version to the npm registry, the commit message needs to follow the [semantic message format], and at least one of the commits to main since the last release must include a fix.

3 changes: 2 additions & 1 deletion config-test.json → config-dist.json
@@ -2,5 +2,6 @@
   "apiKey": "some-key",
   "agent": "some-agent-id",
   "baseUrl": "http://some-url",
-  "agentDir": "./agent"
+  "agentDir": "./agent",
+  "playbookTimeoutSeconds": 10
 }
4 changes: 3 additions & 1 deletion config.json.dist
@@ -4,5 +4,7 @@
   "baseUrl": "",
   "agentDir": "",
   "openAIApiKey": "",
-  "playbookTimeoutSeconds": 10
+  "playbookTimeoutSeconds": 10,
+  "maxNumberOfTokens": 2048,
+  "knowledgeAIStoreDir": ""
 }