Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bigquery: add documentation for out of process pagination #8173

Closed
derekperkins opened this issue Jun 24, 2023 · 6 comments · Fixed by #8419
Closed

bigquery: add documentation for out of process pagination #8173

derekperkins opened this issue Jun 24, 2023 · 6 comments · Fixed by #8419
Assignees
Labels
api: bigquery Issues related to the BigQuery API. type: docs Improvement to the documentation for an API.

Comments

@derekperkins
Copy link
Contributor

derekperkins commented Jun 24, 2023

It's pretty standard to want to paginate through results, and while it is simple to iterate through a full query, there is minimal documentation about how to hydrate a RowIterator with a page token across separate RPCs. All the examples I can find are in process only. I was expecting there to be a convenience method like RowIteratorFromPageToken(), but there isn't, and there doesn't seem to be a way to populate an iterator manually. Best I can tell, you're required to keep track of job and manually kick things off. Best I can tell, this is how you have to paginate. Is this correct? If so, would you accept a PR for an example?

func paginateQuery(ctx context.Context, bqClient *bigquery.Client, jobID, pageToken string, pageSize int) ([]int, string, string, error) {
	var (
		iter *bigquery.RowIterator
		job  *bigquery.Job
		err  error
	)

	// if we have a page token, we need to get a job handle from the page token,
	// otherwise execute the query to initiate the job
	if pageToken == "" {
		q := bqClient.Query("SELECT num FROM UNNEST(GENERATE_ARRAY(1, 25)) num")
		iter, err = q.Read(ctx)
		if err != nil {
			return nil, "", "", err
		}

		job = iter.SourceJob()
	} else {
		// get a job handle from the job ID
		job, err = bqClient.JobFromID(ctx, jobID)
		if err != nil {
			return nil, "", "", err
		}

		iter, err = job.Read(ctx)
		if err != nil {
			return nil, "", "", err
		}
	}

	// set the page token and size
	iter.PageInfo().Token = pageToken
	iter.PageInfo().MaxSize = pageSize

	var numbers []int
	for {
		var num int
		err = iter.Next(&num)
		
		// if we're done, return the numbers, job ID, and empty page token
		if err == iterator.Done {
			return numbers, job.ID(), "", nil
		}
		if err != nil {
			return nil, "", "", err
		}

		numbers = append(numbers, num)

		// if we have a full page, return the job ID and page token
		if len(numbers) == pageSize {
			// if we're only partially through a pageToken, how do we know we're starting back in the right place?
			return numbers, job.ID(), iter.PageInfo().Token, nil
		}
	}
}

There were a few other references in issues here that were similarly confused:

@derekperkins derekperkins added the triage me I really want to be triaged. label Jun 24, 2023
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the BigQuery API. label Jun 24, 2023
@shollyman
Copy link
Contributor

I'd love to understand the pattern a bit better, as it's not clear what the goal is. Is this some kind of out-of-process resumption where retaining the iterator reference is insufficient? Or are you fanning out page processing in some way?

Intermingling row and page iteration is ambiguous, so it seems like page-based iteration is what you want. Generally this is best handled with https://pkg.go.dev/google.golang.org/api/iterator#Pager rather than manipulating the iterator's page info directly.

We don't really have a carveout at this point for more advanced patterns, but we could consider including one. Depending on how common this pattern is it's not clear if this should live in the main package docs, or if we should keep more advanced examples elsewhere like a readme or as a tested sample.

@derekperkins
Copy link
Contributor Author

This is for communicating with a dashboard, like a table, and letting the user click forwards / backwards without shipping all the rows at once. Seems like it should be pretty common, like the actual BigQuery UI result viewer. Maybe I'm missing a better way to do this.

@shollyman
Copy link
Contributor

BQ iterators are forward-only by design, so to reverse you'd need to end up needing to keep track of the token list, and there's the issues with page size alignment.

I wonder if the bigquerystorage API might be a more natural fit here. The ReadRowsRequest supports specifying a row offset when calling the ReadRows rpc, so the pagination maps a bit more naturally and is far less stateful than the opaque page tokens. That said, bootstrapping is slightly harder in that you need to resolve the destination table for a given query, then create a read session. You still have the problem of limited control of page size on this route, however.

CC @alvarowolfx for thoughts given their work on storage integrations

@derekperkins
Copy link
Contributor Author

derekperkins commented Jun 27, 2023

BQ iterators are forward-only by design, so to reverse you'd need to end up needing to keep track of the token list, and there's the issues with page size alignment.

We had also considered not using the token at all and just using StartIndex on the PageIterator, letting the client do the math of page sizes, to eliminate the token tracking overhead

The issues I linked had similar questions, and I think figuring out the right pattern would be great for docs. It may be worthy of a first class abstraction via https://pkg.go.dev/google.golang.org/api/iterator, given that it's not even clear for you what the "right" implementation should be

@derekperkins
Copy link
Contributor Author

I just noticed that EnableStorageReadClient landed a few versions ago, and if enabled will use the storage api under the hood. At a glance, it would require roughly the same logic I posted above to hook back into the job, just without the page token.

@noahdietz noahdietz added type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Jul 5, 2023
@codyoss codyoss added type: docs Improvement to the documentation for an API. and removed type: question Request for information or clarification. Not an issue. labels Jul 18, 2023
gcf-merge-on-green bot pushed a commit that referenced this issue Aug 15, 2023
@derekperkins
Copy link
Contributor Author

I don't feel like this issue was resolved, as #8419 only clarifies that out of process pagination is completely unsupported via the storage api

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API. type: docs Improvement to the documentation for an API.
Projects
None yet
4 participants