From 6e4402b15efb05c4863d62e49ced81b4edefd0a5 Mon Sep 17 00:00:00 2001
From: Chris Bush
Date: Mon, 21 Aug 2023 11:44:51 -0400
Subject: [PATCH] (DOCSP-31343): Add system diagram and info for ingest (#86)

* Add ingest system diagram

* Move diagram to README and add some info
---
 ingest/README.md | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/ingest/README.md b/ingest/README.md
index 52b41d1d8..310adf252 100644
--- a/ingest/README.md
+++ b/ingest/README.md
@@ -5,6 +5,44 @@ database.
 
 Based on https://github.com/cbush/typescript-cli-template
 
+## System Overview
+
+```mermaid
+flowchart
+    B[Pages command]
+    C[Embed command]
+    B --> D(fetch pages from source)
+    D --> E(store pages in Atlas)
+
+    C --> F(fetch pages from Atlas)
+    F -- for pages marked\n 'created' or 'updated' --> G(make embeddings)
+    G --> H(store embeddings in Atlas)
+    F -- for pages marked 'deleted' --> I(delete embeddings\nfor page)
+```
+
+The ingest tool has two major commands: `pages` and `embed`. These commands
+represent the two stages of ingesting content.
+
+### Stage 1: Pages
+
+The `pages` command fetches pages from data sources and stores them in Atlas
+with a last updated timestamp. A "page" is some text with a URL. A data source
+is an arbitrary collection of pages. You can create a new data source by
+implementing `DataSource`.
+
+For each given data source, the `pages` command compares the pages with those
+already stored in the database and only updates those that are new, have
+changed, or have been deleted. The command does not actually delete documents
+from the database, but instead marks a page as "deleted", so that the next stage
+knows to delete the corresponding embeddings.
+
+### Stage 2: Embed
+
+The `embed` command creates embeddings for pages that have been updated since a
+given date. For pages that have been deleted, the command deletes any
+corresponding embeddings in the database. If a page is new or has been updated,
+the command regenerates the corresponding embeddings for that page.
+
 ## Development
 
 ### Build & Run
@@ -20,4 +58,3 @@ node .
 Add commands to `src/commands/`. The CLI automatically picks up any non-test
 .ts file that default-exports a `yargs.CommandModule`. See existing commands for
 example.
-
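
The two stages described in the README text above can be sketched roughly as follows. This is a minimal illustration, not the ingest tool's actual code: the `Page` and `StoredPage` types, the `diffPages` and `planEmbeddingWork` functions, and the choice to treat a reappearing deleted page as "created" are all assumptions made for the example. Only the "created", "updated", and "deleted" markers come from the diagram.

```typescript
// Hypothetical types for illustration; the real ingest types may differ.
type PageAction = "created" | "updated" | "deleted";

interface Page {
  url: string;
  body: string;
}

interface StoredPage extends Page {
  action: PageAction;
  updated: Date; // last updated timestamp
}

// Stage 1 (sketch): compare fetched pages with stored pages and return
// only the records that need to be written back to the database.
function diffPages(fetched: Page[], stored: StoredPage[], now: Date): StoredPage[] {
  const storedByUrl = new Map<string, StoredPage>();
  for (const page of stored) storedByUrl.set(page.url, page);
  const fetchedUrls = new Set(fetched.map((p) => p.url));
  const changes: StoredPage[] = [];

  for (const page of fetched) {
    const existing = storedByUrl.get(page.url);
    if (existing === undefined || existing.action === "deleted") {
      // Assumption: a page that reappears after deletion counts as new.
      changes.push({ ...page, action: "created", updated: now });
    } else if (existing.body !== page.body) {
      changes.push({ ...page, action: "updated", updated: now });
    }
  }

  // Pages missing from the source are marked, not removed.
  for (const page of stored) {
    if (!fetchedUrls.has(page.url) && page.action !== "deleted") {
      changes.push({ ...page, action: "deleted", updated: now });
    }
  }
  return changes;
}

// Stage 2 (sketch): for pages changed since a given date, decide which
// need embeddings regenerated and which need embeddings deleted.
function planEmbeddingWork(pages: StoredPage[], since: Date) {
  const changed = pages.filter((p) => p.updated.getTime() >= since.getTime());
  return {
    toEmbed: changed.filter((p) => p.action !== "deleted").map((p) => p.url),
    toDelete: changed.filter((p) => p.action === "deleted").map((p) => p.url),
  };
}
```

Marking a page as "deleted" rather than removing it is what lets the second stage find and clean up the embeddings that belong to it.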