Commit 0251c7f

Merge pull request #76 from cloudera/main
* upgrade everything

* small refactor for params, update loading

* add bedrock converse

* fix loading

* Clean up Cohere suggested questions

* Add property-based test for process_response() (#56)

* Add hypothesis

* Add property-based test for process_response()

* Shorten variable

* Formatting

* Add type annotations

* Fix type annotation

* hacking on startup scripts

* hacking on startup scripts, moar

* fix wrong dir

* try having the java side restart itself if it dies

* see output from java startup

* add debug info

* add the executable bit

* change the flags

* Add docstrings for tests

* refactor datasourceId

* update to exclude 405b model and default to 8b

* update readme for new cohere

* fix broken tests monkeypatching

* "wip on creating with models and response chunks"

* wip on modal updates

* commit java updates

* wip on populating the chat setting modal

* set up ui for updating a session

* add update method

* use updated session for chat

* remove query configuration from the chat context

* refactoring fe and fixing bug with empty model

* remove the datasource id from the context and use the active session instead

* Update release version to 1.4.0-beta

* Support multiple embedding models (#59)

* add embedding model to the data source in the java API

* embedding model used from the datasource while indexing

* replace the rest of the embedding model defaults

* "test & fix bugs with embedding variability"

* small refactoring to make embedding & llm caii methods look the same

* fix linting issues

* add a todo for a failing property test case

* remove unused import

---------

Co-authored-by: Elijah Williams <ewilliams@cloudera.com>

* Provide CAII batch embedding for better performance (#35)

* CAII endpoint discovery (#60)

* "wip on endpoint listing"

* "wip on list_endpoints typing"

* "refactoring to endpoint object"

* "wip filtering"

* "endpoints queried!"

* "refactoring"

* "wip on cleaning up types"

* "type cleanup complete"

* "moving files"

* "use a dummy embedding model for deletes"

* fix some bits from merge, get evals working again with CAII, tests passing

* formatting

* clean up ruff stuff

* use the chat llm for evals

* fix mypy for reformatting

* "wip on java reconciler"

* "reconciler don't do no model; start python work"

* "python - updating for summarization model"

* "comment out batch embeddings to get it working again"

* add handling for no summarization in the files table

* finish up ui and python for summarization

* make sure to update the time-updated fields on data sources and chat sessions

* use no-op models when we don't need real ones for summary functionality

* Update release version to dev-testing

* use the summarization llm when summarizing summaries

---------

Co-authored-by: Elijah Williams <ewilliams@cloudera.com>
Co-authored-by: actions-user <actions@github.com>

* Update release version to 1.4.0

* pass the original filename from java -> python so we don't need s3 metadata to store it

* don't read the whole directory when summarizing docs

* "refactor java to use RagFileService"

* remove seaweedfs experiment

* Make mypy happy (#62)

* Refactor summary index to isolate the logic (#63)

* Refactor summary index to isolate the logic

* fix tests

* handle race condition

* handle mypy

* ignore errors if the directory doesn't exist

---------

Co-authored-by: jwatson <jkwatson@gmail.com>

* image

* Update catalog entry to match the official one (#66)

* Update local catalog with official info

* add the git-ref back

* add the html long description (#67)

* Shuffle API for data sources for easier human consumption (#68)

* Shuffle API for data sources for easier human consumption

* make mypy happy

* remove prints

* wip on fs rag file uploader

* "now we're thinking with overtime"

* Revert ""now we're thinking with overtime""

This reverts commit 3c93206.

* get the databases directory from the environment (in local_dev)
python file storage abstraction
python tests currently broken
real AMP startup script needs new env var

* add a todo

* merge from main

* properly override the configuration in pytest configure to point at a temp directory

* get the tests passing with filesystem file handoff

* update project metadata to support new local filesystem storage

* Update release version to dev-testing

* fix java

* cleanup after switching tests to use the local filesystem

* Remove unused settings (#70)

* remove unused dep

* fix circular dep and refactor doc storage

* Update release version to 1.4.0

* Summarize the data store on every document summarization (#69)

* fix bug with s3 path when the prefix is not provided (#72)

* add --reload to the fastapi startup_app script

* Avoid global variables and use ephemeral folder for tests (#71)

* Avoid global variables and use ephemeral folder for tests

* fix with merge to main

* Remove print

* lint

* refetch knowledge base summary on doc summary change

* Bump @eslint/plugin-kit (#16)

Bumps the npm_and_yarn group with 1 update in the /ui directory: [@eslint/plugin-kit](https://github.com/eslint/rewrite).


Updates `@eslint/plugin-kit` from 0.2.2 to 0.2.3
- [Release notes](https://github.com/eslint/rewrite/releases)
- [Changelog](https://github.com/eslint/rewrite/blob/main/release-please-config.json)
- [Commits](eslint/rewrite@plugin-kit-v0.2.2...plugin-kit-v0.2.3)

---
updated-dependencies:
- dependency-name: "@eslint/plugin-kit"
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jwatson <jkwatson@gmail.com>
Co-authored-by: Michael Liu <mliu@cloudera.com>
Co-authored-by: actions-user <actions@github.com>
Co-authored-by: conradocloudera <csilvamiranda@cloudera.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
6 people authored Dec 12, 2024
2 parents afcf7e2 + ec5804a commit 0251c7f
Showing 123 changed files with 3,920 additions and 1,746 deletions.
2 changes: 0 additions & 2 deletions .env.example
@@ -12,8 +12,6 @@ DB_URL=jdbc:h2:../databases/rag

# If using CAII, fill these in:
CAII_DOMAIN=
CAII_INFERENCE_ENDPOINT_NAME=
CAII_EMBEDDING_ENDPOINT_NAME=

# set this to true if you have uv installed on your system, otherwise don't include this
USE_SYSTEM_UV=true
34 changes: 13 additions & 21 deletions .project-metadata.yaml
@@ -8,40 +8,32 @@ prototype_version: 1.0
environment_variables:
AWS_DEFAULT_REGION:
default: "us-west-2"
description: "AWS Region where Bedrock is configured and the S3 bucket is located"
required: true
description: "AWS Region where Bedrock is configured and the S3 bucket is located. Required if Bedrock or S3 is used."
required: false
S3_RAG_DOCUMENT_BUCKET:
default: ""
description: "The S3 bucket where uploaded documents are stored"
required: true
description: "The S3 bucket where uploaded documents are stored. Only set if S3 is used for file storage."
required: false
S3_RAG_BUCKET_PREFIX:
default: "rag-studio"
description: "A prefix added to all S3 paths used by Rag Studio"
required: true
description: "A prefix added to all S3 paths used by Rag Studio. Only needed if S3 is used for file storage."
required: false
AWS_ACCESS_KEY_ID:
default: ""
description: "AWS Access Key ID"
required: true
description: "AWS Access Key ID. Required if Bedrock or S3 is used."
required: false
AWS_SECRET_ACCESS_KEY:
default: ""
description: "AWS Secret Access Key"
required: true
description: "AWS Secret Access Key. Required if Bedrock or S3 is used."
required: false
USE_ENHANCED_PDF_PROCESSING:
default: "false"
description: "Use enhanced PDF processing for better text extraction. This option makes PDF parsing take significantly longer. A GPU is highly recommended to speed up the process."
required: false
CAII_DOMAIN:
default: ""
description: "The domain of the CAII service. Setting this will enable CAII as the sole source for both inference and embedding models."
required: false
CAII_INFERENCE_ENDPOINT_NAME:
default: ""
description: "The name of the inference endpoint for the CAII service. Required if CAII_DOMAIN is set."
required: false
CAII_EMBEDDING_ENDPOINT_NAME:
default: ""
description: "The name of the embedding endpoint for the CAII service. Required if CAII_DOMAIN is set."
required: false
default: ""
description: "The domain of the CAII service. Setting this will enable CAII as the sole source for both inference and embedding models."
required: false
DB_URL:
default: "jdbc:h2:file:~/databases/rag"
description: "Internal DB URL. Do not change."
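Net effect of the metadata change: the AWS variables become optional and are only required when Bedrock or S3 is actually used, and the two CAII endpoint-name variables disappear entirely. Consistent with the "CAII endpoint discovery (#60)" work above, endpoints are now discovered from CAII_DOMAIN alone.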
Binary file added RAG-AMP.jpg
6 changes: 3 additions & 3 deletions README.md
@@ -10,9 +10,9 @@ RAG Studio requires AWS for access to both LLM and embedding models. Please comp

- A S3 bucket to store the documents
- The following models configured and accessible via AWS Bedrock. Any of the models not enabled will not function in the UI.
- Llama3.1 8b Instruct V1 (`meta.llama3-1-8b-instruct-v1:0`) - This model is required for the RAG Studio to function
- Llama3.1 70b Instruct V1 (`meta.llama3-1-70b-instruct-v1:0`)
- Llama3.1 405b Instruct V1 (`meta.llama3-1-405b-instruct-v1:0`)
- Llama3.1 8b Instruct v1 (`meta.llama3-1-8b-instruct-v1:0`) - This model is required for the RAG Studio to function
- Llama3.1 70b Instruct v1 (`meta.llama3-1-70b-instruct-v1:0`)
- Cohere Command R+ v1 (`cohere.command-r-plus-v1:0`)
- For Embedding, you will need to enable the following model in AWS Bedrock:
- Cohere English Embedding v3 (`meta.cohere-english-embedding-v3:0`)

6 changes: 5 additions & 1 deletion backend/src/main/java/com/cloudera/cai/rag/Types.java
@@ -80,6 +80,8 @@ public enum ConnectionType {
public record RagDataSource(
Long id,
String name,
String embeddingModel,
String summarizationModel,
Integer chunkSize,
Integer chunkOverlapPercent,
Instant timeCreated,
@@ -100,5 +102,7 @@ public record Session(
Instant timeUpdated,
String createdById,
String updatedById,
Instant lastInteractionTime) {}
Instant lastInteractionTime,
String inferenceModel,
Integer responseChunks) {}
}
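
The two fields added to Session (inferenceModel and responseChunks) line up with the commit notes above about removing the query configuration from the chat context: the model and response-chunk count are now persisted per session rather than passed along with each chat request.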
@@ -1,4 +1,4 @@
/*******************************************************************************
/*
* CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
* (C) Cloudera, Inc. 2024
* All rights reserved.
@@ -38,6 +38,9 @@

package com.cloudera.cai.rag.configuration;

import com.cloudera.cai.rag.files.FileSystemRagFileUploader;
import com.cloudera.cai.rag.files.RagFileUploader;
import com.cloudera.cai.rag.files.S3RagFileUploader;
import com.cloudera.cai.util.reconcilers.ReconcilerConfig;
import com.cloudera.cai.util.s3.AmazonS3Client;
import com.cloudera.cai.util.s3.S3Config;
@@ -64,20 +67,14 @@ public String s3BucketPrefix(S3Config s3Config) {
return s3Config.getBucketPrefix();
}

@Bean
public AmazonS3Client amazonS3Client(S3Config s3Config) {
return new AmazonS3Client(s3Config);
}

@Bean
public S3Config s3Config() {
return S3Config.builder()
.endpointUrl(System.getenv("AWS_ENDPOINT_URL_S3"))
.accessKey(System.getenv("AWS_ACCESS_KEY_ID"))
.secretKey(System.getenv("AWS_SECRET_ACCESS_KEY"))
.awsRegion(System.getenv("AWS_DEFAULT_REGION"))
.bucketName(
Optional.ofNullable(System.getenv("S3_RAG_DOCUMENT_BUCKET")).orElse("rag-files"))
.bucketName(Optional.ofNullable(System.getenv("S3_RAG_DOCUMENT_BUCKET")).orElse(""))
.bucketPrefix(Optional.ofNullable(System.getenv("S3_RAG_BUCKET_PREFIX")).orElse(""))
.build();
}
@@ -105,6 +102,15 @@ public HttpClient httpClient(OpenTelemetry openTelemetry) {
.newHttpClient(HttpClient.newHttpClient());
}

@Bean
public RagFileUploader ragFileUploader(S3Config configuration) {
if (configuration.getBucketName().isEmpty()) {
return new FileSystemRagFileUploader();
}
AmazonS3Client s3Client = new AmazonS3Client(configuration);
return new S3RagFileUploader(s3Client, configuration.getBucketName());
}

public static String getRagIndexUrl() {
return Optional.ofNullable(System.getenv("LLM_SERVICE_URL")).orElse("http://rag-backend:8000");
}
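The RagFileUploader interface returned by this bean does not appear in the visible hunks; a minimal sketch of it, inferred from the uploadFile(file, s3Path) call sites elsewhere in this diff (the comment wording is assumed):

package com.cloudera.cai.rag.files;

import org.springframework.web.multipart.MultipartFile;

// Sketch only, inferred from the FileSystemRagFileUploader and
// RagFileService hunks in this commit; not the literal source.
public interface RagFileUploader {
  // Stores an uploaded file under the given key, which acts as an S3
  // object key or as a path relative to the local file-storage root.
  void uploadFile(MultipartFile file, String s3Path);
}

Choosing the implementation in a single @Bean method keeps S3 optional: deployments that leave S3_RAG_DOCUMENT_BUCKET unset (the default is now an empty string, per the S3Config change above) fall back to local file storage with no other code changes.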
@@ -41,6 +41,7 @@
import com.cloudera.cai.rag.Types.RagDataSource;
import com.cloudera.cai.rag.configuration.JdbiConfiguration;
import com.cloudera.cai.util.exceptions.NotFound;
import java.time.Instant;
import java.util.List;
import lombok.extern.slf4j.Slf4j;
import org.jdbi.v3.core.Jdbi;
@@ -58,35 +59,48 @@ public RagDataSourceRepository(Jdbi jdbi) {
}

public Long createRagDataSource(RagDataSource input) {

RagDataSource cleanedInputs = cleanInputs(input);
return jdbi.inTransaction(
handle -> {
var sql =
"""
INSERT INTO rag_data_source (name, chunk_size, chunk_overlap_percent, created_by_id, updated_by_id, connection_type)
VALUES (:name, :chunkSize, :chunkOverlapPercent, :createdById, :updatedById, :connectionType)
INSERT INTO rag_data_source (name, chunk_size, chunk_overlap_percent, created_by_id, updated_by_id, connection_type, embedding_model, summarization_model)
VALUES (:name, :chunkSize, :chunkOverlapPercent, :createdById, :updatedById, :connectionType, :embeddingModel, :summarizationModel)
""";
try (var update = handle.createUpdate(sql)) {
update.bindMethods(input);
update.bindMethods(cleanedInputs);
return update.executeAndReturnGeneratedKeys("id").mapTo(Long.class).one();
}
});
}

private static RagDataSource cleanInputs(RagDataSource input) {
if (input.summarizationModel() != null && input.summarizationModel().isEmpty()) {
input = input.withSummarizationModel(null);
}
return input;
}

public void updateRagDataSource(RagDataSource input) {

RagDataSource cleanedInputs = cleanInputs(input);
jdbi.inTransaction(
handle -> {
var sql =
"""
UPDATE rag_data_source
SET name = :name, connection_type = :connectionType, updated_by_id = :updatedById
SET name = :name, connection_type = :connectionType, updated_by_id = :updatedById, summarization_model = :summarizationModel, time_updated = :now
WHERE id = :id AND deleted IS NULL
""";
try (var update = handle.createUpdate(sql)) {
return update
.bind("name", input.name())
.bind("updatedById", input.updatedById())
.bind("connectionType", input.connectionType())
.bind("id", input.id())
.bind("name", cleanedInputs.name())
.bind("updatedById", cleanedInputs.updatedById())
.bind("connectionType", cleanedInputs.connectionType())
.bind("id", cleanedInputs.id())
.bind("summarizationModel", cleanedInputs.summarizationModel())
.bind("now", Instant.now())
.execute();
}
});
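Note that cleanInputs() depends on a withSummarizationModel(...) "wither" on the RagDataSource record, presumably generated by Lombok's @With (Lombok is already in use here via @Slf4j); it returns a copy of the record with the blank summarization model replaced by null, so the column is stored as NULL rather than as an empty string.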
@@ -1,4 +1,4 @@
/*******************************************************************************
/*
* CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
* (C) Cloudera, Inc. 2024
* All rights reserved.
@@ -67,9 +67,11 @@ public void indexFile(
indexUrl
+ "/data_sources/"
+ ragDocument.dataSourceId()
+ "/documents/download-and-index",
+ "/documents/"
+ ragDocument.documentId()
+ "/index",
new IndexRequest(
ragDocument.documentId(), bucketName, ragDocument.s3Path(), configuration));
bucketName, ragDocument.s3Path(), ragDocument.filename(), configuration));
} catch (IOException e) {
throw new RuntimeException(e);
}
@@ -78,8 +80,13 @@ public String createSummary(Types.RagDocument ragDocument, String bucketName) {
public String createSummary(Types.RagDocument ragDocument, String bucketName) {
try {
return client.post(
indexUrl + "/data_sources/" + ragDocument.dataSourceId() + "/summarize-document",
new SummaryRequest(bucketName, ragDocument.s3Path()));
indexUrl
+ "/data_sources/"
+ ragDocument.dataSourceId()
+ "/documents/"
+ ragDocument.documentId()
+ "/summary",
new SummaryRequest(bucketName, ragDocument.s3Path(), ragDocument.filename()));
} catch (IOException e) {
throw new RuntimeException(e);
}
@@ -98,14 +105,15 @@ public void deleteSession(Long sessionId) {
}

record IndexRequest(
@JsonProperty("document_id") String documentId,
@JsonProperty("s3_bucket_name") String s3BucketName,
@JsonProperty("s3_document_key") String s3DocumentKey,
@JsonProperty("original_filename") String originalFilename,
IndexConfiguration configuration) {}

public record SummaryRequest(
@JsonProperty("s3_bucket_name") String s3BucketName,
@JsonProperty("s3_document_key") String s3DocumentKey) {}
@JsonProperty("s3_document_key") String s3DocumentKey,
@JsonProperty("original_filename") String originalFilename) {}

public record IndexConfiguration(
@JsonProperty("chunk_size") int chunkSize,
@@ -150,7 +158,9 @@ public void deleteDataSource(Long dataSourceId) {
@Override
public String createSummary(Types.RagDocument ragDocument, String bucketName) {
String result = super.createSummary(ragDocument, bucketName);
tracker.track(new TrackedRequest<>(new SummaryRequest(bucketName, ragDocument.s3Path())));
tracker.track(
new TrackedRequest<>(
new SummaryRequest(bucketName, ragDocument.s3Path(), ragDocument.filename())));
checkForException();
return result;
}
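The reshaped records above map one-to-one onto the JSON bodies the Python endpoints receive. A hedged sketch of an IndexRequest payload, assuming standard Jackson record serialization; the values are invented, and the second component of IndexConfiguration (truncated out of this hunk) is assumed to be the chunk overlap percent:

import com.fasterxml.jackson.databind.ObjectMapper;

// Illustration only; not part of the diff. Values are made up.
static String exampleIndexPayload() throws Exception {
  return new ObjectMapper()
      .writeValueAsString(
          new IndexRequest(
              "rag-files",             // s3_bucket_name
              "rag-studio/7/abc123",   // s3_document_key
              "quarterly-report.pdf",  // original_filename
              new IndexConfiguration(512, 10)));
  // -> {"s3_bucket_name":"rag-files","s3_document_key":"rag-studio/7/abc123",
  //     "original_filename":"quarterly-report.pdf",
  //     "configuration":{"chunk_size":512,"chunk_overlap_percent":10}}
}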
@@ -0,0 +1,71 @@
/*
* CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
* (C) Cloudera, Inc. 2024
* All rights reserved.
*
* Applicable Open Source License: Apache 2.0
*
* NOTE: Cloudera open source products are modular software products
* made up of hundreds of individual components, each of which was
* individually copyrighted. Each Cloudera open source product is a
* collective work under U.S. Copyright Law. Your license to use the
* collective work is as provided in your written agreement with
* Cloudera. Used apart from the collective work, this file is
* licensed for your use pursuant to the open source license
* identified above.
*
* This code is provided to you pursuant a written agreement with
* (i) Cloudera, Inc. or (ii) a third-party authorized to distribute
* this code. If you do not have a written agreement with Cloudera nor
* with an authorized and properly licensed third party, you do not
* have any rights to access nor to use this code.
*
* Absent a written agreement with Cloudera, Inc. (“Cloudera”) to the
* contrary, A) CLOUDERA PROVIDES THIS CODE TO YOU WITHOUT WARRANTIES OF ANY
* KIND; (B) CLOUDERA DISCLAIMS ANY AND ALL EXPRESS AND IMPLIED
* WARRANTIES WITH RESPECT TO THIS CODE, INCLUDING BUT NOT LIMITED TO
* IMPLIED WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY AND
* FITNESS FOR A PARTICULAR PURPOSE; (C) CLOUDERA IS NOT LIABLE TO YOU,
* AND WILL NOT DEFEND, INDEMNIFY, NOR HOLD YOU HARMLESS FOR ANY CLAIMS
* ARISING FROM OR RELATED TO THE CODE; AND (D)WITH RESPECT TO YOUR EXERCISE
* OF ANY RIGHTS GRANTED TO YOU FOR THE CODE, CLOUDERA IS NOT LIABLE FOR ANY
* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, PUNITIVE OR
* CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, DAMAGES
* RELATED TO LOST REVENUE, LOST PROFITS, LOSS OF INCOME, LOSS OF
* BUSINESS ADVANTAGE OR UNAVAILABILITY, OR LOSS OR CORRUPTION OF
* DATA.
******************************************************************************/

package com.cloudera.cai.rag.files;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import org.springframework.web.multipart.MultipartFile;

@Slf4j
@Component
public class FileSystemRagFileUploader implements RagFileUploader {

private static final String FILE_STORAGE_ROOT = fileStoragePath();

@Override
public void uploadFile(MultipartFile file, String s3Path) {
log.info("Uploading file to FS: {}", s3Path);
try {
Path filePath = Path.of(FILE_STORAGE_ROOT, s3Path);
Files.createDirectories(filePath.getParent());
Files.write(filePath, file.getBytes());
} catch (IOException e) {
throw new RuntimeException(e);
}
}

private static String fileStoragePath() {
var fileStoragePath = System.getenv("RAG_DATABASES_DIR") + "/file_storage";
log.info("configured with fileStoragePath = {}", fileStoragePath);
return fileStoragePath;
}
}
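
With this uploader in place, the Java side writes each upload under $RAG_DATABASES_DIR/file_storage using the same key it would otherwise have used as an S3 object key; per the "filesystem file handoff" commits above, the Python service then reads the document from that shared directory instead of downloading it from S3.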
@@ -83,7 +83,7 @@ public RagDocumentMetadata saveRagFile(MultipartFile file, Long dataSourceId, St
String documentId = idGenerator.generateId();
var s3Path = buildS3Path(dataSourceId, documentId);

ragFileUploader.uploadFile(file, s3Path, removeDirectories(file.getOriginalFilename()));
ragFileUploader.uploadFile(file, s3Path);
var ragDocument = createUnsavedDocument(file, documentId, s3Path, dataSourceId, actorCrn);
Long id = ragFileRepository.saveDocumentMetadata(ragDocument);
log.info("Saved document with id: {}", id);
Expand All @@ -95,7 +95,11 @@ public RagDocumentMetadata saveRagFile(MultipartFile file, Long dataSourceId, St
}

private String buildS3Path(Long dataSourceId, String documentId) {
return s3PathPrefix + "/" + dataSourceId + "/" + documentId;
var dataSourceDocumentPart = dataSourceId + "/" + documentId;
if (s3PathPrefix.isEmpty()) {
return dataSourceDocumentPart;
}
return s3PathPrefix + "/" + dataSourceDocumentPart;
}

private String extractFileExtension(String originalFilename) {
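This hunk is the change behind "fix bug with s3 path when the prefix is not provided (#72)" in the commit list: with an empty S3_RAG_BUCKET_PREFIX, the old one-line concatenation produced a key with a leading slash. A quick before/after sketch (IDs invented):

// Before, with s3PathPrefix = "":
//   s3PathPrefix + "/" + dataSourceId + "/" + documentId  ->  "/17/doc-abc"
// After, with the same empty prefix:
//   buildS3Path(17L, "doc-abc")                           ->  "17/doc-abc"
// After, with the default prefix "rag-studio":
//   buildS3Path(17L, "doc-abc")                           ->  "rag-studio/17/doc-abc"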
