This repository contains an example that demonstrates how to feed the (small) output of a Spark analysis (a pandas DataFrame) produced by Dataproc Serverless Spark into the PaLM 2 generative AI model, powered by Vertex AI on Google Cloud, for analysis.
The example PySpark code contains three different analyses using the BigQuery NYC Citi Bike Trips public dataset, in particular:
- What are the most popular Citi Bike stations?
- What are the most popular routes by subscriber type?
- What are the top routes by gender?
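
As a minimal sketch of the idea (not necessarily how the code in this repository does it), a small result set can be serialized as CSV, wrapped in a prompt, and sent to a PaLM 2 text model through the Vertex AI Python SDK. The function names `rows_to_prompt` and `ask_palm`, the prompt wording, the `text-bison` model name, the generation parameters, and the sample rows are all assumptions made for illustration. The rows are shaped like the output of `pandas_df.to_dict("records")`.

```python
import csv
import io


def rows_to_prompt(question, rows):
    """Serialize a small result set as CSV and wrap it in a text prompt."""
    if not rows:
        raise ValueError("rows must contain at least one record")
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return (
        "You are a data analyst. Using only the table below, answer the question.\n"
        f"Question: {question}\n\n"
        f"Table (CSV):\n{buffer.getvalue()}"
    )


def ask_palm(prompt, project_id, region, model_name="text-bison"):
    """Send the prompt to a PaLM 2 text model on Vertex AI.

    Requires the Vertex AI Python SDK and valid Google Cloud credentials.
    The model name and generation parameters are assumptions.
    """
    # Imported lazily so the prompt builder above works without the SDK.
    import vertexai
    from vertexai.language_models import TextGenerationModel

    vertexai.init(project=project_id, location=region)
    model = TextGenerationModel.from_pretrained(model_name)
    response = model.predict(prompt, temperature=0.2, max_output_tokens=256)
    return response.text


# Made-up placeholder rows, not real Citi Bike figures.
sample_rows = [
    {"station": "Station A", "trips": 120},
    {"station": "Station B", "trips": 95},
]
prompt = rows_to_prompt("What are the most popular Citi Bike stations?", sample_rows)
```

Keeping the prompt builder separate from the model call makes it possible to inspect or test the prompt locally before running anything on Google Cloud.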
- Select or create a Google Cloud project, and enable the required APIs.
- Create a service account with enough permissions to interact with the different services (BigQuery, Dataproc Spark, Vertex AI).
- Open Cloud Shell and clone this repository.
- Edit the build_dataproc_image.sh file and specify:
PROJECT_ID="TO_DO_DEVELOPER"
GCP_REGION="TO_DO_DEVELOPER"
- Build a custom Dataproc image that contains the Vertex AI Python SDK:
build_dataproc_image.sh
- Edit the launch_job.sh file and specify:
PROJECT_ID="TO_DO_DEVELOPER"
GCP_REGION="TO_DO_DEVELOPER"
SUBNET="TO_DO_DEVELOPER"
UMSA_FQN="TO_DO_DEVELOPER"
DEPS_BUCKET="TO_DO_DEVELOPER"
- Launch the Dataproc Serverless job, indicating the analysis number:
launch_job.sh <ANALYSIS_NUMBER>
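
Because both scripts ship with TO_DO_DEVELOPER placeholders, it can help to confirm that every variable has been filled in before building the image or launching the job. The following is a hypothetical helper, not part of this repository, that scans the text of a shell script for assignments still set to the placeholder. The function name `unset_variables` and the sample value `my-project` are assumptions for illustration.

```python
import re

PLACEHOLDER = "TO_DO_DEVELOPER"
# Matches simple shell assignments such as NAME="value", NAME='value' or NAME=value.
ASSIGNMENT = re.compile(r'^\s*([A-Z_][A-Z0-9_]*)=(["\']?)(.*?)\2\s*$')


def unset_variables(script_text):
    """Return the names of variables whose value is still the placeholder."""
    missing = []
    for line in script_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and match.group(3) == PLACEHOLDER:
            missing.append(match.group(1))
    return missing


# Example input: one variable filled in with a made-up value, two left untouched.
example = '''PROJECT_ID="my-project"
GCP_REGION="TO_DO_DEVELOPER"
SUBNET="TO_DO_DEVELOPER"
'''
print(unset_variables(example))  # prints ['GCP_REGION', 'SUBNET']
```

To check a real file, pass its contents instead, for example `unset_variables(pathlib.Path("launch_job.sh").read_text())` after importing `pathlib`. An empty list means no placeholders remain.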