Project Bluebear is a conversational generative AI solution designed to analyze datasets ranging from megabytes (MB) to petabytes (PB) using Amazon Bedrock Agents and Apache Spark. The framework provides two execution options:
- Spark on AWS Lambda (SoAL) – A lightweight, real-time processing engine for datasets up to 500 MB that runs Spark on a single node.
- Amazon EMR Serverless – A scalable option for larger datasets that handles the heavy lifting of extensive data analysis.

The workflow is as follows:
- Conversational Interface – Business users submit natural language queries.
- AI-Powered Code Generation – Amazon Bedrock dynamically generates Spark code based on the user’s prompt.
- Intelligent Execution – A dropdown in the interface lets users choose where the Spark code runs, giving them a cost-conscious way to execute their queries:
  - SoAL (Spark on AWS Lambda) for quick, real-time analysis of smaller datasets.
  - Amazon EMR Serverless for processing larger datasets, including petabytes of data, with robust computational power.

Natural language should be the new way of interacting with data, eliminating the need to spend months on ETL frameworks and deployment. Project Bluebear enables business users to perform analytics effortlessly through natural language queries, providing actionable insights in real time or at scale.
This project provides a conversational interface using the Bedrock Claude Chatbot. Amazon Bedrock generates the Spark code based on the user prompt. The Spark code is then run on the lightweight Spark on AWS Lambda (SoAL) framework to provide analysis results to the user. If the input data file is small (<= 500 MB), SoAL is used for data processing. If the input data file is larger, Amazon EMR Serverless is used instead. SoAL processes data quickly and can return results in real time. With Amazon EMR Serverless, users receive results once processing finishes, which depends on the size of the input dataset.
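
The 500 MB rule described above can be sketched as a small helper. This is only an illustration, not the project's actual code: the names `Engine`, `SOAL_MAX_BYTES`, and `choose_engine` are hypothetical, and treating "MB" as 10^6 bytes is an assumption.

```python
from enum import Enum

# Threshold from the text above: files up to 500 MB go to SoAL.
# Assumption: "MB" means 10**6 bytes; adjust if the project uses MiB.
SOAL_MAX_BYTES = 500 * 1000 * 1000


class Engine(Enum):
    """Execution engines named in this README."""

    SOAL = "Spark on AWS Lambda"
    EMR_SERVERLESS = "Amazon EMR Serverless"


def choose_engine(input_size_bytes: int) -> Engine:
    """Return the engine suggested by the 500 MB rule (hypothetical helper)."""
    if input_size_bytes < 0:
        raise ValueError("input_size_bytes must be non-negative")
    if input_size_bytes <= SOAL_MAX_BYTES:
        return Engine.SOAL
    return Engine.EMR_SERVERLESS
```

A caller would pass the size of the uploaded file in bytes, for example the object size reported by Amazon S3, and use the result to preselect an engine.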
- Amazon Bedrock Anthropic Claude Model Access
- S3 bucket to store uploaded documents and Textract output.
- Amazon Elastic Container Registry to store custom Docker images.
- Optional:
  - Create an Amazon DynamoDB table to store chat history (run the notebook BedrockChatUI to create a DynamoDB table). This is optional because there is a local disk storage option; however, I recommend using Amazon DynamoDB.
  - Amazon Textract. This is optional because the Python libraries `pypdf2` and `pytesseract` can be used for PDF and image processing. However, I recommend Amazon Textract for higher-quality PDF and image processing. You will experience latency when using `pytesseract`.
To use the Advanced Analytics Feature, this additional step is required (the ChatBot can still be used without enabling the Advanced Analytics Feature):
- AWS Lambda function with a custom Python image to execute Python code for analytics.
- Create a private ECR repository by following the link in step 3.
- On your local machine or any related AWS service (including AWS CloudShell, Amazon Elastic Compute Cloud, Amazon SageMaker Studio, etc.), run the following CLI commands:
  - Install git and clone this git repo: `git clone [github_link]`
  - Navigate into the Docker directory: `cd Docker`
  - If using a local machine, authenticate with your AWS credentials.
  - Install AWS Command Line Interface (AWS CLI) version 2 if not already installed.
  - Follow the steps in the "Deploying the image" section under "Using an AWS base image for Python" in this documentation guide. Replace the placeholders with the appropriate values. You can skip step `2` if you already created an ECR repository.
    - In step 6, in addition to the `AWSLambdaBasicExecutionRole` policy, grant ONLY least-privilege read and write Amazon S3 policies to the execution role. Scope the policy down to the S3 bucket and S3 directory prefix where uploaded files will be stored and read from, as configured in the `config.json` file below.
    - In step 7, I recommend creating the Lambda function in an Amazon Virtual Private Cloud (VPC) without internet access and attaching Amazon S3 and Amazon CloudWatch gateway and interface endpoints accordingly. The step 7 command can be modified to include VPC parameters:

          aws lambda create-function \
            --function-name YourFunctionName \
            --package-type Image \
            --code ImageUri=your-account-id.dkr.ecr.your-region.amazonaws.com/your-repo:tag \
            --role arn:aws:iam::your-account-id:role/YourLambdaExecutionRole \
            --vpc-config SubnetIds=subnet-xxxxxxxx,subnet-yyyyyyyy,SecurityGroupIds=sg-zzzzzzzz \
            --memory-size 512 \
            --timeout 300 \
            --region your-region

      Modify the placeholders as appropriate. I recommend keeping the `timeout` and `memory-size` parameters conservative because they affect cost. A good starting point for memory is `512` MB.
    - Ignore step 8.
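
For readers who script the deployment in Python rather than the AWS CLI, the same step 7 settings can be expressed as keyword arguments for boto3's `create_function`. This is a minimal sketch: the helper name `build_create_function_kwargs` is hypothetical, and the defaults simply mirror the CLI example (512 MB, 300 seconds).

```python
from typing import Any, Dict, List


def build_create_function_kwargs(
    function_name: str,
    image_uri: str,
    role_arn: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
    memory_mb: int = 512,
    timeout_s: int = 300,
) -> Dict[str, Any]:
    """Build keyword arguments for a container-image Lambda inside a VPC."""
    if not subnet_ids or not security_group_ids:
        # Without both, the function would not be attached to a VPC.
        raise ValueError("at least one subnet and one security group are required")
    return {
        "FunctionName": function_name,
        "PackageType": "Image",
        "Code": {"ImageUri": image_uri},
        "Role": role_arn,
        "VpcConfig": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
        "MemorySize": memory_mb,
        "Timeout": timeout_s,
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("lambda").create_function(**kwargs)`. Keeping the builder separate from the API call makes the configuration easy to review before anything is created.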
⚠ IMPORTANT SECURITY NOTE:
Enabling the Advanced Analytics Feature allows the LLM to generate Python code to analyze your dataset. That code is executed automatically in a Lambda function environment. To mitigate potential risks:
- VPC Configuration:
  - It is recommended to place the Lambda function in an internet-free VPC.
  - Use Amazon S3 and CloudWatch gateway/interface endpoints for necessary access.
- IAM Permissions:
  - Scope down the Lambda execution role to only Amazon S3 and the required S3 resources. This is in addition to the `AWSLambdaBasicExecutionRole` policy.
- Library Restrictions:
  - Only libraries specified in `Docker/requirements.txt` will be available at runtime.
  - Modify this list carefully based on your needs.
- Resource Allocation:
  - Adjust Lambda `timeout` and `memory-size` based on data size and analysis complexity.
- Production Considerations:
  - This application is designed for POC use.
  - Implement additional security measures before deploying to production.
The goal is to limit the potential impact of generated code execution.
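
As an illustration of the IAM scoping recommended above, the sketch below builds a policy document limited to one bucket and one prefix. The helper name `build_scoped_s3_policy` and the statement IDs are hypothetical, and including `s3:ListBucket` is an assumption; drop that statement if the function never lists objects.

```python
import json
from typing import Any, Dict


def build_scoped_s3_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """Return an IAM policy document limited to one bucket and one prefix."""
    prefix = prefix.strip("/")
    if not bucket or not prefix:
        # An empty prefix would widen access to the whole bucket.
        raise ValueError("bucket and prefix must both be non-empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadWriteObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Assumption: listing is needed; it is restricted to the prefix.
                "Sid": "ListBucketUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder bucket and prefix; use the values from your own config.json.
    print(json.dumps(build_scoped_s3_policy("your-bucket", "your/prefix"), indent=2))
```

The printed JSON can be attached to the Lambda execution role as an inline policy alongside `AWSLambdaBasicExecutionRole`.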
## Configuration

To customize the behavior for the conversational chatbot follow these instructions.
➡️ AWS Guide: Create an EC2 Instance
Since Streamlit runs on TCP port 8501, you must allow inbound traffic on that port.
- In the AWS EC2 Console, navigate to Security Groups.
- Select the Security Group attached to your EC2 instance.
- Click Edit inbound rules and add a Custom TCP rule for port 8501.
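
The inbound rule for port 8501 can also be described in code. The sketch below builds the rule in the shape that boto3's `authorize_security_group_ingress` expects in its `IpPermissions` list. The helper name and the rule description are hypothetical, and rejecting a rule that is open to every address is a design choice of this sketch, not a requirement stated in this README.

```python
import ipaddress
from typing import Any, Dict

STREAMLIT_PORT = 8501  # Port named in the text above.


def build_streamlit_ingress_rule(source_cidr: str) -> Dict[str, Any]:
    """Build one inbound TCP rule for the Streamlit port from an IPv4 CIDR."""
    network = ipaddress.ip_network(source_cidr, strict=False)
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4 CIDR blocks")
    if network.prefixlen == 0:
        # Design choice of this sketch: do not expose the app to all addresses.
        raise ValueError("refusing to open the port to every address")
    return {
        "IpProtocol": "tcp",
        "FromPort": STREAMLIT_PORT,
        "ToPort": STREAMLIT_PORT,
        "IpRanges": [{"CidrIp": str(network), "Description": "Streamlit UI"}],
    }
```

With boto3 installed, the rule could be applied with `boto3.client("ec2").authorize_security_group_ingress(GroupId=..., IpPermissions=[rule])`, where the group ID is the security group attached to your instance.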
Ensure the EC2 instance profile role has the required IAM permissions to access AWS services used in this application.
➡️ AWS Guide: Assign an Instance Profile Role
➡️ AWS Guide: Connect to Your EC2 Instance
Run the following command to connect via SSH:
ssh -i your-key.pem ubuntu@your-ec2-public-ip
- Run the appropriate commands to update the EC2 instance.
sudo apt update
sudo apt upgrade
git clone [github_link]
If Python 3 and pip are not already installed, run the following command:
sudo apt install python3 python3-pip -y
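If you decide to use Python libraries for PDF and image processing, you need to install Tesseract-OCR. Run the appropriate command for your operating system. On Ubuntu, install the `tesseract-ocr` package with apt instead. On RPM-based distributions that use the EPEL 7 repository, run: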
sudo rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum -y update
sudo yum install -y tesseract
Run the following command to install the required dependencies:
sudo pip install -r req.txt --upgrade
- tmux allows your Streamlit app to keep running even after you disconnect from the SSH session, ensuring uninterrupted execution.
- Run the command `tmux new -s mysession` to create a new session.
- Then, in the new session, `cd` into the ChatBot dir and start the Streamlit app with `streamlit run` followed by the app's script name. The app keeps running in the background even if you disconnect from the terminal session.
- Copy the External URL link generated and paste it in a new browser tab.
- ⚠ NOTE: The generated link is not secure! To detach from the `tmux` session, press `Ctrl+b`, then `d` in your EC2 terminal. To kill the session, run `tmux kill-session -t mysession`.
The following items are on the future roadmap:
- For larger datasets, use a subset of the dataset to provide real-time results back to the user.
- Automatically decide whether to use SoAL or EMR Serverless based on the size of the dataset.
Terminate the EC2 instance