BigQuery Remote Inference Pipeline

BigQuery supports remote models, such as Vertex AI LLMs, to perform remote inference operations on both structured and unstructured data. When using remote inference, the user needs to pay attention to quotas and limits, which can result in retryable error in a subset of rows and require reprocessing.

The BQML dataform library assists users to create BQML pipelines that are resilient to transient failures by automatic reprocessing and incrementally updating the output table.

Quick Start Guide

Installation

Add the bqml package to your package.json file in your Dataform project.

{
  "dependencies": {
    "bqml": "https://github.com/dataform-co/dataform-bqml/archive/[RELEASE_VERSION].tar.gz"
  }
}

You can find the most up to date package version on the releases page.

Usage

The following example shows how to generate text from images using the Vertex AI multimodel.

// Import the module
const bqml = require("bqml");

// Name of the multimodel that has `gemini-pro-vision` as the endpoint
let model = "multi-llm";
// Name of the object table that points to a set of images
let source_table = "product_image";
// Name of the table for storing the result
let output_table = "product_image_description";

// Optionally declare the model and source table as dataform datasources 
// if it is not defined in other actions.
declare({name: model});
declare({name: source_table});

// Execute the pipeline
bqml.vision_generate_text(
    source_table, output_table, model, 
    "Describe the image in 20 words", {
        flatten_json_output: true
    }
);

This is another example showing how to generate text using a source query to form the prompt.

// Import the module
const bqml = require("bqml");

// Name of the model
let model = "llm";
// Name of the table that forms the source query
let source_table = "product_comment";
// Name of the table for storing the result
let output_table = "sentiment";
// Columns to uniquely identify a row in the source table
let keys = ["comment_id"];

// Optionally declare the model and source table as dataform datasources 
// if it is not defined in other actions.
declare({name: model});
declare({name: source_table});

// Execute the pipeline
bqml.generate_text(
    output_table,
    keys,
    model,
    (ctx) => `SELECT *, "Classify the text into neutral, negative, or positive. Text: " || comment AS prompt FROM ${ctx.ref(source_table)}`,
    {flatten_json_output: true});

Function Reference

Function generate_text

Signature

function generate_text(
    output_table, unique_keys,
    ml_model, source_query, ml_configs, options)

Description

Performs the ML.GENERATE_TEXT function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text

Param	Type	Description
output_table	`String`	the name of the table to store the final result
unique_keys	`String` \| `Array`	column name(s) for identifying an unique row in the source table
ml_model	`Resolvable`	the remote model to use for the ML operation that uses one of the Vertex AI LLM endpoints
source_query	`String` \| `function`	either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields
ml_configs	`Object`	configurations for the ML operation
options	`Object`	the configuration object for the table_ml function

Function vision_generate_text

Signature

function vision_generate_text(
    source_table, output_table, model, prompt, llm_config, options)

Description

Performs the ML.GENERATE_TEXT function on visual content in the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text#gemini-pro-vision

Param	Type	Description
source_table	`Resolvable`	represents the source object table
output_table	`String`	name of the output table
model	`Resolvable`	name the remote model with the `gemini-pro-vision` endpoint
prompt	`String`	the prompt text for the LLM
llm_config	`Object`	extra configurations to the LLM
options	`Object`	the configuration object for the obj_table_ml function

Function generate_embedding

Signature

function generate_embedding(
    output_table, unique_keys,
    ml_model, source_query, ml_configs, options)

Description

Performs the ML.GENERATE_EMBEDDING function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding

Param	Type	Description
output_table	`String`	the name of the table to store the final result
unique_keys	`String` \| `Array`	column name(s) for identifying an unique row in the source table
ml_model	`Resolvable`	the remote model to use for the ML operation that uses one of the `textembedding-gecko*` Vertex AI LLMs as endpoint
source_query	`String` \| `function`	either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields
ml_configs	`Object`	configurations for the ML operation
options	`Object`	the configuration object for the table_ml function

Function understand_text

Signature

function understand_text(
    output_table, unique_keys,
    ml_model, source_query, ml_configs, options)

Description

Performs the ML.UNDERSTAND_TEXT function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-understand-text

Param	Type	Description
output_table	`String`	the name of the table to store the final result
unique_keys	`String` \| `Array`	column name(s) for identifying an unique row in the source table
ml_model	`Resolvable`	the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_NATURAL_LANGUAGE_V1
source_query	`String` \| `function`	either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields
ml_configs	`Object`	configurations for the ML operation
options	`Object`	the configuration object for the table_ml function

Function translate

Signature

function translate(
    output_table, unique_keys,
    ml_model, source_query, ml_configs, options)

Description

Performs the ML.TRANSLATE function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-translate

Param	Type	Description
output_table	`String`	the name of the table to store the final result
unique_keys	`String` \| `Array`	column name(s) for identifying an unique row in the source table
ml_model	`Resolvable`	the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_TRANSLATE_V3
source_query	`String` \| `function`	either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields
ml_configs	`Object`	configurations for the ML operation
options	`Object`	the configuration object for the table_ml function

Function annotate_image

Signature

function annotate_image(
    source_table, output_table, model, features, options)

Description

Performs the ML.ANNOTATE_IMAGE function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-annotate-image

Param	Type	Description
source_table	`Resolvable`	represents the source object table
output_table	`String`	name of the output table
model	`Resolvable`	the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_VISION_V1
features	`Array`	specifies one or more feature names of supported Vision API features
options	`Object`	the configuration object for the obj_table_ml function

Function transcribe

Signature

function transcribe(
    source_table, output_table, model, recognition_config, options)

Description

Performs the ML.TRANSCRIBE function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-transcribe

Param	Type	Description
source_table	`Resolvable`	represents the source object table
output_table	`String`	name of the output table
model	`Resolvable`	the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_SPEECH_TO_TEXT_V2
recognition_config	`Object`	the recognition configuration to override the default configuration of the specified recognizer
options	`Object`	the configuration object for the obj_table_ml function

Function process_document

Signature

function process_document(
    source_table, output_table, model, options)

Description

Performs the ML.PROCESS_DOCUMENT function on the given source table.

See: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document

Param	Type	Description
source_table	`Resolvable`	represents the source object table
output_table	`String`	name of the output table
model	`Resolvable`	the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_DOCUMENT_V1
options	`Object`	the configuration object for the obj_table_ml function

Function table_ml

Signature

function table_ml(
    output_table, unique_keys, ml_function, ml_model,
    source_query, accept_filter, ml_configs = {}, {
      batch_size = 10000,
      batch_duration_secs = 22 * 60 * 60
} = {})

Description

A generic structured table ML pipeline. It incrementally performs an ML operation on rows from the source table and merges to the output table until all rows are processed or runs longer than the specific duration.

Param	Type	Description
output_table	`String`	name of the output table
unique_keys	`String` \| `Array`	column name(s) for identifying an unique row in the source table
ml_function	`String`	the name of the BQML function to call
ml_model	`Resolvable`	the remote model to use for the ML operation
source_query	`String` \| `function`	either a query string or a Contextable function to produce the query on the source data for the ML operation and it must have the unique key columns selected in addition to other fields
accept_filter	`String`	a SQL boolean expression for accepting a row to the output table after the ML operation
ml_configs	`Object`	configurations for the ML operation
batch_size	`Number`	number of rows to process in each SQL job. Rows in the object table will be processed in batches according to the batch size. Default batch size is 10000
batch_duration_secs	`Number`	the number of seconds to pass before breaking the batching loop if it hasn't been finished before within this duration. Default value is 22 hours

Function obj_table_ml

Signature

function obj_table_ml(
    source_table, source, output_table, accept_filter, {
      batch_size = 500,
      unique_key = "uri",
      updated_column = "updated",
      batch_duration_secs = 22 * 60 * 60,
} = {})

Description

A generic object table ML pipeline. It incrementally performs an ML operation on new rows from the source table and merges to the output table until no new row is detected or runs longer than the specific duration. A row from the source table is considered as new if the unique_key (default to "uri") of a row is absent in the output table, or if the updated_column (default to "updated") column is newer than the largest value in the output table.

Param	Type	Description
source_table	`Resolvable`	represents the source object table
source	`String` \| `function`	either a query string or a Contextable function to produce the query on the source data
output_table	`String`	the name of the table to store the final result
accept_filter	`String`	a SQL expression for finding rows that contains retryable error
batch_size	`Number`	number of rows to process in each SQL job. Rows in the object table will be processed in batches according to the batch size. Default batch size is 500
unique_key	`String`	the primary key in the output table for incremental update. Default value is "uri".
updated_column	`String`	the column that carries the last updated timestamp of an object in the object table. Default value is "updated"
batch_duration_secs	`Number`	the number of seconds to pass before breaking the batching loop if it hasn't been finished before within this duration. Default value is 22 hours

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
modules		modules
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
index.js		index.js
package-lock.json		package-lock.json
package.json		package.json

License

dataform-co/dataform-bqml

Folders and files

Latest commit

History

Repository files navigation

BigQuery Remote Inference Pipeline

Quick Start Guide

Installation

Usage

Function Reference

Function generate_text

Signature

Description

Function vision_generate_text

Signature

Description

Function generate_embedding

Signature

Description

Function understand_text

Signature

Description

Function translate

Signature

Description

Function annotate_image

Signature

Description

Function transcribe

Signature

Description

Function process_document

Signature

Description

Function table_ml

Signature

Description

Function obj_table_ml

Signature

Description

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages