bigquery: add documentation for out of process pagination #8173

derekperkins · 2023-06-24T03:06:28Z

It's pretty standard to want to paginate through results, and while it is simple to iterate through a full query, there is minimal documentation about how to hydrate a RowIterator with a page token across separate RPCs. All the examples I can find are in process only. I was expecting there to be a convenience method like RowIteratorFromPageToken(), but there isn't, and there doesn't seem to be a way to populate an iterator manually. Best I can tell, you're required to keep track of job and manually kick things off. Best I can tell, this is how you have to paginate. Is this correct? If so, would you accept a PR for an example?

func paginateQuery(ctx context.Context, bqClient *bigquery.Client, jobID, pageToken string, pageSize int) ([]int, string, string, error) {
	var (
		iter *bigquery.RowIterator
		job  *bigquery.Job
		err  error
	)

	// if we have a page token, we need to get a job handle from the page token,
	// otherwise execute the query to initiate the job
	if pageToken == "" {
		q := bqClient.Query("SELECT num FROM UNNEST(GENERATE_ARRAY(1, 25)) num")
		iter, err = q.Read(ctx)
		if err != nil {
			return nil, "", "", err
		}

		job = iter.SourceJob()
	} else {
		// get a job handle from the job ID
		job, err = bqClient.JobFromID(ctx, jobID)
		if err != nil {
			return nil, "", "", err
		}

		iter, err = job.Read(ctx)
		if err != nil {
			return nil, "", "", err
		}
	}

	// set the page token and size
	iter.PageInfo().Token = pageToken
	iter.PageInfo().MaxSize = pageSize

	var numbers []int
	for {
		var num int
		err = iter.Next(&num)
		
		// if we're done, return the numbers, job ID, and empty page token
		if err == iterator.Done {
			return numbers, job.ID(), "", nil
		}
		if err != nil {
			return nil, "", "", err
		}

		numbers = append(numbers, num)

		// if we have a full page, return the job ID and page token
		if len(numbers) == pageSize {
			// if we're only partially through a pageToken, how do we know we're starting back in the right place?
			return numbers, job.ID(), iter.PageInfo().Token, nil
		}
	}
}

There were a few other references in issues here that were similarly confused:

The text was updated successfully, but these errors were encountered:

shollyman · 2023-06-27T20:17:35Z

I'd love to understand the pattern a bit better, as it's not clear what the goal is. Is this some kind of out-of-process resumption where retaining the iterator reference is insufficient? Or are you fanning out page processing in some way?

Intermingling row and page iteration is ambiguous, so it seems like page-based iteration is what you want. Generally this is best handled with https://pkg.go.dev/google.golang.org/api/iterator#Pager rather than manipulating the iterator's page info directly.

We don't really have a carveout at this point for more advanced patterns, but we could consider including one. Depending on how common this pattern is it's not clear if this should live in the main package docs, or if we should keep more advanced examples elsewhere like a readme or as a tested sample.

derekperkins · 2023-06-27T20:25:38Z

This is for communicating with a dashboard, like a table, and letting the user click forwards / backwards without shipping all the rows at once. Seems like it should be pretty common, like the actual BigQuery UI result viewer. Maybe I'm missing a better way to do this.

shollyman · 2023-06-27T20:41:32Z

BQ iterators are forward-only by design, so to reverse you'd need to end up needing to keep track of the token list, and there's the issues with page size alignment.

I wonder if the bigquerystorage API might be a more natural fit here. The ReadRowsRequest supports specifying a row offset when calling the ReadRows rpc, so the pagination maps a bit more naturally and is far less stateful than the opaque page tokens. That said, bootstrapping is slightly harder in that you need to resolve the destination table for a given query, then create a read session. You still have the problem of limited control of page size on this route, however.

CC @alvarowolfx for thoughts given their work on storage integrations

derekperkins · 2023-06-27T20:55:14Z

BQ iterators are forward-only by design, so to reverse you'd need to end up needing to keep track of the token list, and there's the issues with page size alignment.

We had also considered not using the token at all and just using StartIndex on the PageIterator, letting the client do the math of page sizes, to eliminate the token tracking overhead

The issues I linked had similar questions, and I think figuring out the right pattern would be great for docs. It may be worthy of a first class abstraction via https://pkg.go.dev/google.golang.org/api/iterator, given that it's not even clear for you what the "right" implementation should be

derekperkins · 2023-06-30T21:06:38Z

I just noticed that EnableStorageReadClient landed a few versions ago, and if enabled will use the storage api under the hood. At a glance, it would require roughly the same logic I posted above to hook back into the job, just without the page token.

…n with Storage API (#8419) Fixes #7285, fixes #8373, fixes #8173

derekperkins · 2023-08-18T22:51:29Z

I don't feel like this issue was resolved, as #8419 only clarifies that out of process pagination is completely unsupported via the storage api

derekperkins added the triage me I really want to be triaged. label Jun 24, 2023

product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Jun 24, 2023

blunderbuss-gcf bot assigned shollyman Jun 24, 2023

noahdietz added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Jul 5, 2023

codyoss added type: docs Improvement to the documentation for an API. and removed type: question Request for information or clarification. Not an issue. labels Jul 18, 2023

derekperkins mentioned this issue Aug 1, 2023

bigquery: StorageReadClient doesn't use page token #8373

Closed

alvarowolfx mentioned this issue Aug 15, 2023

docs(bigquery): improve RowIterator docs and out of process pagination with Storage API #8419

Merged

gcf-merge-on-green bot closed this as completed in #8419 Aug 15, 2023

gcf-merge-on-green bot pushed a commit that referenced this issue Aug 15, 2023

docs(bigquery): improve RowIterator docs and out of process paginatio…

40f8f80

…n with Storage API (#8419) Fixes #7285, fixes #8373, fixes #8173

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bigquery: add documentation for out of process pagination #8173

bigquery: add documentation for out of process pagination #8173

derekperkins commented Jun 24, 2023 •

edited

Loading

shollyman commented Jun 27, 2023

derekperkins commented Jun 27, 2023

shollyman commented Jun 27, 2023

derekperkins commented Jun 27, 2023 •

edited

Loading

derekperkins commented Jun 30, 2023

derekperkins commented Aug 18, 2023

bigquery: add documentation for out of process pagination #8173

bigquery: add documentation for out of process pagination #8173

Comments

derekperkins commented Jun 24, 2023 • edited Loading

shollyman commented Jun 27, 2023

derekperkins commented Jun 27, 2023

shollyman commented Jun 27, 2023

derekperkins commented Jun 27, 2023 • edited Loading

derekperkins commented Jun 30, 2023

derekperkins commented Aug 18, 2023

derekperkins commented Jun 24, 2023 •

edited

Loading

derekperkins commented Jun 27, 2023 •

edited

Loading