SPARK dataframe analysis on Vertex Generative AI

This repository contains an example that demonstrates how to send a (small) Spark analysis output (a pandas dataframe), produced by a Dataproc Serverless Spark job, to the PaLM 2 generative AI model, powered by Vertex AI on Google Cloud, for further analysis.
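
The core pattern is: run the Spark analysis on Dataproc Serverless, collect the (small) result into a pandas dataframe, render it as text, and send it to the model with the Vertex AI Python SDK. The snippet below is a minimal sketch of that last step; the project ID, region, prompt wording, and sample values are placeholders, not taken from this repository's code.

import pandas as pd
import vertexai
from vertexai.language_models import TextGenerationModel

# Placeholder project and region -- replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# A small analysis result, as it would come out of df.toPandas().
# Illustrative values only.
result_df = pd.DataFrame(
    {"start_station_name": ["Station A", "Station B"], "num_trips": [1200, 950]}
)

# Render the dataframe as text and ask the PaLM 2 text model to interpret it.
model = TextGenerationModel.from_pretrained("text-bison")
prompt = (
    "You are a data analyst. Summarize the key insights from this "
    "Citi Bike analysis result:\n\n" + result_df.to_string(index=False)
)
response = model.predict(prompt, temperature=0.2, max_output_tokens=256)
print(response.text)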

Architecture

(Architecture diagram)

Use case

The example PySpark code contains three different analyses of the BigQuery NYC Citi Bike Trips public dataset (a sketch of the first one follows the list), in particular:

  • What are the most popular Citi Bike stations?
  • What are the most popular routes by subscriber type?
  • What are the top routes by gender?
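
As a rough illustration of the first analysis, the sketch below reads the public trips table through the Spark BigQuery connector and counts trips per start station; the exact column names, aggregations, and output handling in the repository's code may differ.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("citibike-popular-stations").getOrCreate()

# Read the public Citi Bike trips table via the Spark BigQuery connector.
trips = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.new_york_citibike.citibike_trips")
    .load()
)

# Count trips per start station and keep the top 10 (column name assumed).
popular_stations = (
    trips.groupBy("start_station_name")
    .agg(F.count("*").alias("num_trips"))
    .orderBy(F.desc("num_trips"))
    .limit(10)
)

# The result is small, so it can safely be collected to pandas
# before being sent to the generative model.
popular_pdf = popular_stations.toPandas()
print(popular_pdf)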

Sample outputs

(Sample output screenshots: Sample output 1, Sample output 2)

Running the code

  1. Select or create a Google Cloud project, and enable the required APIs.

  2. Create a service account with enough permissions to interact with the services involved (BigQuery, Dataproc Serverless Spark, Vertex AI).

  3. Open Cloud Shell and clone this repository.

  4. Edit the build_dataproc_image.sh file and specify:

PROJECT_ID="TO_DO_DEVELOPER"
GCP_REGION="TO_DO_DEVELOPER"
  5. Build a custom Dataproc container image that contains the Vertex AI Python SDK:
build_dataproc_image.sh
  6. Edit the launch_job.sh file and specify:
PROJECT_ID="TO_DO_DEVELOPER"
GCP_REGION="TO_DO_DEVELOPER"
SUBNET="TO_DO_DEVELOPER"
UMSA_FQN="TO_DO_DEVELOPER"
DEPS_BUCKET="TO_DO_DEVELOPER"
  7. Launch the Dataproc Serverless job, indicating the analysis number (a dispatch sketch follows the command):
launch_job.sh <ANALYSIS_NUMBER>
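
The analysis number passed to launch_job.sh ends up as a command-line argument of the PySpark driver. A minimal sketch of how such a driver might dispatch on it is shown below; the function names are hypothetical and only illustrate the pattern.

import sys

from pyspark.sql import SparkSession

# Hypothetical analysis functions; the repository's code may be organized differently.
def popular_stations(spark):
    ...

def popular_routes_by_subscriber(spark):
    ...

def top_routes_by_gender(spark):
    ...

ANALYSES = {"1": popular_stations, "2": popular_routes_by_subscriber, "3": top_routes_by_gender}

if __name__ == "__main__":
    analysis_number = sys.argv[1]  # e.g. launch_job.sh 2 passes "2"
    spark = SparkSession.builder.appName("citibike-analysis").getOrCreate()
    result_df = ANALYSES[analysis_number](spark)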