Commit 0251c7f

Merge pull request #76 from cloudera/main
* upgrade everything

* small refactor for params, update loading

* add bedrock converse

* fix loading

* Clean up Cohere suggested questions

* Add property-based test for process_response() (#56)

* Add hypothesis

* Add property-based test for process_response()

* Shorten variable

* Formatting

* Add type annotations

* Fix type annotation

* hacking on startup scripts

* hacking on startup scripts, moar

* fix wrong dir

* try having the java side restart itself if it dies

* see output from java startup

* add debug info

* add the executable bit

* change the flags

* Add docstrings for tests

* refactor datasourceId

* update to exclude 405b model and default to 8b

* update readme for new cohere

* fix broken tests monkeypatching

* "wip on creating with models and response chunks"

* wip on modal updates

* commit java updates

* wip on populating the chat setting modal

* set up ui for updating a session

* add update method

* use updated session for chat

* remove query configuration from the chat context

* refactoring fe and fixing bug with empty model

* remove the datasource id from the context and use the active session instead

* Update release version to 1.4.0-beta

* Support multiple embedding models (#59)

* add embedding model to the data source in the java API

* embedding model used from the datasource while indexing

* replace the rest of the embedding model defaults

* "test & fix bugs with embedding variability"

* small refactoring to make embedding & llm caii methods look the same

* fix linting issues

* add a todo for a failing property test case

* remove unused import

---------

Co-authored-by: Elijah Williams <ewilliams@cloudera.com>

* Provide CAII batch embedding for better performance (#35)

* CAII endpoint discovery (#60)

* "wip on endpoint listing"

* "wip on list_endpoints typing"

* "refactoring to endpoint object"

* "wip filtering"

* "endpoints queried!"

* "refactoring"

* "wip on cleaning up types"

* "type cleanup complete"

* "moving files"

* "use a dummy embedding model for deletes"

* fix some bits from merge, get evals working again with CAII, tests passing

* formatting

* clean up ruff stuff

* use the chat llm for evals

* fix mypy for reformatting

* "wip on java reconciler"

* "reconciler don't do no model; start python work"

* "python - updating for summarization model"

* "comment out batch embeddings to get it working again"

* add handling for no summarization in the files table

* finish up ui and python for summarization

* make sure to update the time-updated fields on data sources and chat sessions

* use no-op models when we don't need real ones for summary functionality

* Update release version to dev-testing

* use the summarization llm when summarizing summaries

---------

Co-authored-by: Elijah Williams <ewilliams@cloudera.com>
Co-authored-by: actions-user <actions@github.com>

* Update release version to 1.4.0

* pass the original filename from java -> python so we don't need s3 metadata to store it

* don't read the whole directory when summarizing docs

* "refactor java to use RagFileService"

* remove seaweedfs experiment

* Make mypy happy (#62)

* Refactor summary index to isolate the logic (#63)

* Refactor summary index to isolate the logic

* fix tests

* handle race condition

* handle mypy

* ignore errors if the directory doesn't exist

---------

Co-authored-by: jwatson <jkwatson@gmail.com>

* image

* Update catalog entry to match the official one (#66)

* Update local catalog with official info

* add the git-ref back

* add the html long description (#67)

* Shuffle API for data sources for easier human consumption (#68)

* Shuffle API for data sources for easier human consumption

* make mypy happy

* remove prints

* wip on fs rag file uploader

* "now we're thinking with overtime"

* Revert ""now we're thinking with overtime""

This reverts commit 3c93206.

* get the databases directory from the environment (in local_dev)
python file storage abstraction
python tests currently broken
real AMP startup script needs new env var

* add a todo

* merge from main

* properly override the configuration in pytest configure to point at a temp directory

* get the tests passing with filesystem file handoff

* update project metadata to support new local filesystem storage

* Update release version to dev-testing

* fix java

* cleanup after switching tests to use the local filesystem

* Remove unused settings (#70)

* remove unused dep

* fix circular dep and refactor doc storage

* Update release version to 1.4.0

* Summarize the data store on every document summarization (#69)

* fix bug with s3 path when the prefix is not provided (#72)

* add --reload to the fastapi startup_app script

* Avoid global variables and use ephemeral folder for tests (#71)

* Avoid global variables and use ephemeral folder for tests

* fix with merge to main

* Remove print

* lint

* refetch knowledge base summary on doc summary change

* Bump @eslint/plugin-kit (#16)

Bumps the npm_and_yarn group with 1 update in the /ui directory: [@eslint/plugin-kit](https://github.com/eslint/rewrite).


Updates `@eslint/plugin-kit` from 0.2.2 to 0.2.3
- [Release notes](https://github.com/eslint/rewrite/releases)
- [Changelog](https://github.com/eslint/rewrite/blob/main/release-please-config.json)
- [Commits](eslint/rewrite@plugin-kit-v0.2.2...plugin-kit-v0.2.3)

---
updated-dependencies:
- dependency-name: "@eslint/plugin-kit"
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jwatson <jkwatson@gmail.com>
Co-authored-by: Michael Liu <mliu@cloudera.com>
Co-authored-by: actions-user <actions@github.com>
Co-authored-by: conradocloudera <csilvamiranda@cloudera.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
6 people authored Dec 12, 2024
2 parents afcf7e2 + ec5804a commit 0251c7f
Showing 123 changed files with 3,920 additions and 1,746 deletions.
2 changes: 0 additions & 2 deletions .env.example
@@ -12,8 +12,6 @@ DB_URL=jdbc:h2:../databases/rag

# If using CAII, fill these in:
CAII_DOMAIN=
CAII_INFERENCE_ENDPOINT_NAME=
CAII_EMBEDDING_ENDPOINT_NAME=

# set this to true if you have uv installed on your system, otherwise don't include this
USE_SYSTEM_UV=true
34 changes: 13 additions & 21 deletions .project-metadata.yaml
@@ -8,40 +8,32 @@ prototype_version: 1.0
environment_variables:
AWS_DEFAULT_REGION:
default: "us-west-2"
description: "AWS Region where Bedrock is configured and the S3 bucket is located"
required: true
description: "AWS Region where Bedrock is configured and the S3 bucket is located. Required if Bedrock or S3 is used."
required: false
S3_RAG_DOCUMENT_BUCKET:
default: ""
description: "The S3 bucket where uploaded documents are stored"
required: true
description: "The S3 bucket where uploaded documents are stored. Only set if S3 is used for file storage."
required: false
S3_RAG_BUCKET_PREFIX:
default: "rag-studio"
description: "A prefix added to all S3 paths used by Rag Studio"
required: true
description: "A prefix added to all S3 paths used by Rag Studio. Only needed if S3 is used for file storage."
required: false
AWS_ACCESS_KEY_ID:
default: ""
description: "AWS Access Key ID"
required: true
description: "AWS Access Key ID. Required if Bedrock or S3 is used."
required: false
AWS_SECRET_ACCESS_KEY:
default: ""
description: "AWS Secret Access Key"
required: true
description: "AWS Secret Access Key. Required if Bedrock or S3 is used."
required: false
USE_ENHANCED_PDF_PROCESSING:
default: "false"
description: "Use enhanced PDF processing for better text extraction. This option makes PDF parsing take significantly longer. A GPU is highly recommended to speed up the process."
required: false
CAII_DOMAIN:
default: ""
description: "The domain of the CAII service. Setting this will enable CAII as the sole source for both inference and embedding models."
required: false
CAII_INFERENCE_ENDPOINT_NAME:
default: ""
description: "The name of the inference endpoint for the CAII service. Required if CAII_DOMAIN is set."
required: false
CAII_EMBEDDING_ENDPOINT_NAME:
default: ""
description: "The name of the embedding endpoint for the CAII service. Required if CAII_DOMAIN is set."
required: false
default: ""
description: "The domain of the CAII service. Setting this will enable CAII as the sole source for both inference and embedding models."
required: false
DB_URL:
default: "jdbc:h2:file:~/databases/rag"
description: "Internal DB URL. Do not change."
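Net effect of the metadata change: the AWS variables become optional and are only required when Bedrock or S3 is actually used, and the two CAII endpoint-name variables disappear entirely. Consistent with the "CAII endpoint discovery (#60)" work above, endpoints are now discovered from CAII_DOMAIN alone.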
Binary file added RAG-AMP.jpg
6 changes: 3 additions & 3 deletions README.md
@@ -10,9 +10,9 @@ RAG Studio requires AWS for access to both LLM and embedding models. Please comp

- A S3 bucket to store the documents
- The following models configured and accessible via AWS Bedrock. Any of the models not enabled will not function in the UI.
- Llama3.1 8b Instruct V1 (`meta.llama3-1-8b-instruct-v1:0`) - This model is required for the RAG Studio to function
- Llama3.1 70b Instruct V1 (`meta.llama3-1-70b-instruct-v1:0`)
- Llama3.1 405b Instruct V1 (`meta.llama3-1-405b-instruct-v1:0`)
- Llama3.1 8b Instruct v1 (`meta.llama3-1-8b-instruct-v1:0`) - This model is required for the RAG Studio to function
- Llama3.1 70b Instruct v1 (`meta.llama3-1-70b-instruct-v1:0`)
- Cohere Command R+ v1 (`cohere.command-r-plus-v1:0`)
- For Embedding, you will need to enable the following model in AWS Bedrock:
- Cohere English Embedding v3 (`meta.cohere-english-embedding-v3:0`)

6 changes: 5 additions & 1 deletion backend/src/main/java/com/cloudera/cai/rag/Types.java
@@ -80,6 +80,8 @@ public enum ConnectionType {
public record RagDataSource(
Long id,
String name,
String embeddingModel,
String summarizationModel,
Integer chunkSize,
Integer chunkOverlapPercent,
Instant timeCreated,
@@ -100,5 +102,7 @@ public record Session(
Instant timeUpdated,
String createdById,
String updatedById,
Instant lastInteractionTime) {}
Instant lastInteractionTime,
String inferenceModel,
Integer responseChunks) {}
}
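
The two fields added to Session (inferenceModel and responseChunks) line up with the commit notes above about removing the query configuration from the chat context: the model and response-chunk count are now persisted per session rather than passed along with each chat request.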
@@ -1,4 +1,4 @@
/*******************************************************************************
/*
* CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
* (C) Cloudera, Inc. 2024
* All rights reserved.
@@ -38,6 +38,9 @@

package com.cloudera.cai.rag.configuration;

import com.cloudera.cai.rag.files.FileSystemRagFileUploader;
import com.cloudera.cai.rag.files.RagFileUploader;
import com.cloudera.cai.rag.files.S3RagFileUploader;
import com.cloudera.cai.util.reconcilers.ReconcilerConfig;
import com.cloudera.cai.util.s3.AmazonS3Client;
import com.cloudera.cai.util.s3.S3Config;
@@ -64,20 +67,14 @@ public String s3BucketPrefix(S3Config s3Config) {
return s3Config.getBucketPrefix();
}

@Bean
public AmazonS3Client amazonS3Client(S3Config s3Config) {
return new AmazonS3Client(s3Config);
}

@Bean
public S3Config s3Config() {
return S3Config.builder()
.endpointUrl(System.getenv("AWS_ENDPOINT_URL_S3"))
.accessKey(System.getenv("AWS_ACCESS_KEY_ID"))
.secretKey(System.getenv("AWS_SECRET_ACCESS_KEY"))
.awsRegion(System.getenv("AWS_DEFAULT_REGION"))
.bucketName(
Optional.ofNullable(System.getenv("S3_RAG_DOCUMENT_BUCKET")).orElse("rag-files"))
.bucketName(Optional.ofNullable(System.getenv("S3_RAG_DOCUMENT_BUCKET")).orElse(""))
.bucketPrefix(Optional.ofNullable(System.getenv("S3_RAG_BUCKET_PREFIX")).orElse(""))
.build();
}
@@ -105,6 +102,15 @@ public HttpClient httpClient(OpenTelemetry openTelemetry) {
.newHttpClient(HttpClient.newHttpClient());
}

@Bean
public RagFileUploader ragFileUploader(S3Config configuration) {
if (configuration.getBucketName().isEmpty()) {
return new FileSystemRagFileUploader();
}
AmazonS3Client s3Client = new AmazonS3Client(configuration);
return new S3RagFileUploader(s3Client, configuration.getBucketName());
}

public static String getRagIndexUrl() {
return Optional.ofNullable(System.getenv("LLM_SERVICE_URL")).orElse("http://rag-backend:8000");
}
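The RagFileUploader interface returned by this bean does not appear in the visible hunks; a minimal sketch of it, inferred from the uploadFile(file, s3Path) call sites elsewhere in this diff (the comment wording is assumed):

package com.cloudera.cai.rag.files;

import org.springframework.web.multipart.MultipartFile;

// Sketch only, inferred from the FileSystemRagFileUploader and
// RagFileService hunks in this commit; not the literal source.
public interface RagFileUploader {
  // Stores an uploaded file under the given key, which acts as an S3
  // object key or as a path relative to the local file-storage root.
  void uploadFile(MultipartFile file, String s3Path);
}

Choosing the implementation in a single @Bean method keeps S3 optional: deployments that leave S3_RAG_DOCUMENT_BUCKET unset (the default is now an empty string, per the S3Config change above) fall back to local file storage with no other code changes.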
@@ -41,6 +41,7 @@
import com.cloudera.cai.rag.Types.RagDataSource;
import com.cloudera.cai.rag.configuration.JdbiConfiguration;
import com.cloudera.cai.util.exceptions.NotFound;
import java.time.Instant;
import java.util.List;
import lombok.extern.slf4j.Slf4j;
import org.jdbi.v3.core.Jdbi;
@@ -58,35 +59,48 @@ public RagDataSourceRepository(Jdbi jdbi) {
}

public Long createRagDataSource(RagDataSource input) {

RagDataSource cleanedInputs = cleanInputs(input);
return jdbi.inTransaction(
handle -> {
var sql =
"""
INSERT INTO rag_data_source (name, chunk_size, chunk_overlap_percent, created_by_id, updated_by_id, connection_type)
VALUES (:name, :chunkSize, :chunkOverlapPercent, :createdById, :updatedById, :connectionType)
INSERT INTO rag_data_source (name, chunk_size, chunk_overlap_percent, created_by_id, updated_by_id, connection_type, embedding_model, summarization_model)
VALUES (:name, :chunkSize, :chunkOverlapPercent, :createdById, :updatedById, :connectionType, :embeddingModel, :summarizationModel)
""";
try (var update = handle.createUpdate(sql)) {
update.bindMethods(input);
update.bindMethods(cleanedInputs);
return update.executeAndReturnGeneratedKeys("id").mapTo(Long.class).one();
}
});
}

private static RagDataSource cleanInputs(RagDataSource input) {
if (input.summarizationModel() != null && input.summarizationModel().isEmpty()) {
input = input.withSummarizationModel(null);
}
return input;
}

public void updateRagDataSource(RagDataSource input) {

RagDataSource cleanedInputs = cleanInputs(input);
jdbi.inTransaction(
handle -> {
var sql =
"""
UPDATE rag_data_source
SET name = :name, connection_type = :connectionType, updated_by_id = :updatedById
SET name = :name, connection_type = :connectionType, updated_by_id = :updatedById, summarization_model = :summarizationModel, time_updated = :now
WHERE id = :id AND deleted IS NULL
""";
try (var update = handle.createUpdate(sql)) {
return update
.bind("name", input.name())
.bind("updatedById", input.updatedById())
.bind("connectionType", input.connectionType())
.bind("id", input.id())
.bind("name", cleanedInputs.name())
.bind("updatedById", cleanedInputs.updatedById())
.bind("connectionType", cleanedInputs.connectionType())
.bind("id", cleanedInputs.id())
.bind("summarizationModel", cleanedInputs.summarizationModel())
.bind("now", Instant.now())
.execute();
}
});
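Note that cleanInputs() depends on a withSummarizationModel(...) "wither" on the RagDataSource record, presumably generated by Lombok's @With (Lombok is already in use here via @Slf4j); it returns a copy of the record with the blank summarization model replaced by null, so the column is stored as NULL rather than as an empty string.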
@@ -1,4 +1,4 @@
/*******************************************************************************
/*
* CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
* (C) Cloudera, Inc. 2024
* All rights reserved.
@@ -67,9 +67,11 @@ public void indexFile(
indexUrl
+ "/data_sources/"
+ ragDocument.dataSourceId()
+ "/documents/download-and-index",
+ "/documents/"
+ ragDocument.documentId()
+ "/index",
new IndexRequest(
ragDocument.documentId(), bucketName, ragDocument.s3Path(), configuration));
bucketName, ragDocument.s3Path(), ragDocument.filename(), configuration));
} catch (IOException e) {
throw new RuntimeException(e);
}
@@ -78,8 +80,13 @@ public String createSummary(Types.RagDocument ragDocument, String bucketName) {
public String createSummary(Types.RagDocument ragDocument, String bucketName) {
try {
return client.post(
indexUrl + "/data_sources/" + ragDocument.dataSourceId() + "/summarize-document",
new SummaryRequest(bucketName, ragDocument.s3Path()));
indexUrl
+ "/data_sources/"
+ ragDocument.dataSourceId()
+ "/documents/"
+ ragDocument.documentId()
+ "/summary",
new SummaryRequest(bucketName, ragDocument.s3Path(), ragDocument.filename()));
} catch (IOException e) {
throw new RuntimeException(e);
}
@@ -98,14 +105,15 @@ public void deleteSession(Long sessionId) {
}

record IndexRequest(
@JsonProperty("document_id") String documentId,
@JsonProperty("s3_bucket_name") String s3BucketName,
@JsonProperty("s3_document_key") String s3DocumentKey,
@JsonProperty("original_filename") String originalFilename,
IndexConfiguration configuration) {}

public record SummaryRequest(
@JsonProperty("s3_bucket_name") String s3BucketName,
@JsonProperty("s3_document_key") String s3DocumentKey) {}
@JsonProperty("s3_document_key") String s3DocumentKey,
@JsonProperty("original_filename") String originalFilename) {}

public record IndexConfiguration(
@JsonProperty("chunk_size") int chunkSize,
@@ -150,7 +158,9 @@ public void deleteDataSource(Long dataSourceId) {
@Override
public String createSummary(Types.RagDocument ragDocument, String bucketName) {
String result = super.createSummary(ragDocument, bucketName);
tracker.track(new TrackedRequest<>(new SummaryRequest(bucketName, ragDocument.s3Path())));
tracker.track(
new TrackedRequest<>(
new SummaryRequest(bucketName, ragDocument.s3Path(), ragDocument.filename())));
checkForException();
return result;
}
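The reshaped records above map one-to-one onto the JSON bodies the Python endpoints receive. A hedged sketch of an IndexRequest payload, assuming standard Jackson record serialization; the values are invented, and the second component of IndexConfiguration (truncated out of this hunk) is assumed to be the chunk overlap percent:

import com.fasterxml.jackson.databind.ObjectMapper;

// Illustration only; not part of the diff. Values are made up.
static String exampleIndexPayload() throws Exception {
  return new ObjectMapper()
      .writeValueAsString(
          new IndexRequest(
              "rag-files",             // s3_bucket_name
              "rag-studio/7/abc123",   // s3_document_key
              "quarterly-report.pdf",  // original_filename
              new IndexConfiguration(512, 10)));
  // -> {"s3_bucket_name":"rag-files","s3_document_key":"rag-studio/7/abc123",
  //     "original_filename":"quarterly-report.pdf",
  //     "configuration":{"chunk_size":512,"chunk_overlap_percent":10}}
}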
@@ -0,0 +1,71 @@
/*
* CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
* (C) Cloudera, Inc. 2024
* All rights reserved.
*
* Applicable Open Source License: Apache 2.0
*
* NOTE: Cloudera open source products are modular software products
* made up of hundreds of individual components, each of which was
* individually copyrighted. Each Cloudera open source product is a
* collective work under U.S. Copyright Law. Your license to use the
* collective work is as provided in your written agreement with
* Cloudera. Used apart from the collective work, this file is
* licensed for your use pursuant to the open source license
* identified above.
*
* This code is provided to you pursuant a written agreement with
* (i) Cloudera, Inc. or (ii) a third-party authorized to distribute
* this code. If you do not have a written agreement with Cloudera nor
* with an authorized and properly licensed third party, you do not
* have any rights to access nor to use this code.
*
* Absent a written agreement with Cloudera, Inc. (“Cloudera”) to the
* contrary, A) CLOUDERA PROVIDES THIS CODE TO YOU WITHOUT WARRANTIES OF ANY
* KIND; (B) CLOUDERA DISCLAIMS ANY AND ALL EXPRESS AND IMPLIED
* WARRANTIES WITH RESPECT TO THIS CODE, INCLUDING BUT NOT LIMITED TO
* IMPLIED WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE; (C) CLOUDERA IS NOT LIABLE TO YOU,
* AND WILL NOT DEFEND, INDEMNIFY, NOR HOLD YOU HARMLESS FOR ANY CLAIMS
* ARISING FROM OR RELATED TO THE CODE; AND (D)WITH RESPECT TO YOUR EXERCISE
* OF ANY RIGHTS GRANTED TO YOU FOR THE CODE, CLOUDERA IS NOT LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, PUNITIVE OR
* CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, DAMAGES
* RELATED TO LOST REVENUE, LOST PROFITS, LOSS OF INCOME, LOSS OF
* BUSINESS ADVANTAGE OR UNAVAILABILITY, OR LOSS OR CORRUPTION OF
* DATA.
******************************************************************************/

package com.cloudera.cai.rag.files;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import org.springframework.web.multipart.MultipartFile;

@Slf4j
@Component
public class FileSystemRagFileUploader implements RagFileUploader {

private static final String FILE_STORAGE_ROOT = fileStoragePath();

@Override
public void uploadFile(MultipartFile file, String s3Path) {
log.info("Uploading file to FS: {}", s3Path);
try {
Path filePath = Path.of(FILE_STORAGE_ROOT, s3Path);
Files.createDirectories(filePath.getParent());
Files.write(filePath, file.getBytes());
} catch (IOException e) {
throw new RuntimeException(e);
}
}

private static String fileStoragePath() {
var fileStoragePath = System.getenv("RAG_DATABASES_DIR") + "/file_storage";
log.info("configured with fileStoragePath = {}", fileStoragePath);
return fileStoragePath;
}
}
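
With this uploader in place, the Java side writes each upload under $RAG_DATABASES_DIR/file_storage using the same key it would otherwise have used as an S3 object key; per the "filesystem file handoff" commits above, the Python service then reads the document from that shared directory instead of downloading it from S3.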
@@ -83,7 +83,7 @@ public RagDocumentMetadata saveRagFile(MultipartFile file, Long dataSourceId, St
String documentId = idGenerator.generateId();
var s3Path = buildS3Path(dataSourceId, documentId);

ragFileUploader.uploadFile(file, s3Path, removeDirectories(file.getOriginalFilename()));
ragFileUploader.uploadFile(file, s3Path);
var ragDocument = createUnsavedDocument(file, documentId, s3Path, dataSourceId, actorCrn);
Long id = ragFileRepository.saveDocumentMetadata(ragDocument);
log.info("Saved document with id: {}", id);
Expand All @@ -95,7 +95,11 @@ public RagDocumentMetadata saveRagFile(MultipartFile file, Long dataSourceId, St
}

private String buildS3Path(Long dataSourceId, String documentId) {
return s3PathPrefix + "/" + dataSourceId + "/" + documentId;
var dataSourceDocumentPart = dataSourceId + "/" + documentId;
if (s3PathPrefix.isEmpty()) {
return dataSourceDocumentPart;
}
return s3PathPrefix + "/" + dataSourceDocumentPart;
}

private String extractFileExtension(String originalFilename) {
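This hunk is the change behind "fix bug with s3 path when the prefix is not provided (#72)" in the commit list: with an empty S3_RAG_BUCKET_PREFIX, the old one-line concatenation produced a key with a leading slash. A quick before/after sketch (IDs invented):

// Before, with s3PathPrefix = "":
//   s3PathPrefix + "/" + dataSourceId + "/" + documentId  ->  "/17/doc-abc"
// After, with the same empty prefix:
//   buildS3Path(17L, "doc-abc")                           ->  "17/doc-abc"
// After, with the default prefix "rag-studio":
//   buildS3Path(17L, "doc-abc")                           ->  "rag-studio/17/doc-abc"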
