A large number of databases in Spider2 are stored in BigQuery, including industrial-scale databases that reach the TB level. To evaluate on our benchmark, you need to:
- Register a BigQuery account and enable the BigQuery Sandbox function;
🌹 You can access about 70% of the SQLs in our benchmark, which query open-source databases in the public Google Cloud project `bigquery-public-data`.
If you are new to BigQuery, please follow the guidelines below. We will walk you through enabling the BigQuery Sandbox and exploring the public databases, as well as setting up service account credentials for programmatic access.
First of all, you need to enable the BigQuery Sandbox function for your Google account. This part is summarized from the official tutorial.
- In the Google Cloud console, go to the [BigQuery page](https://console.cloud.google.com/bigquery). The Google Cloud console is the graphical interface that you use to create and manage BigQuery resources and to run SQL queries.
- Authenticate with your Google Account, or create a new one.
- Click the `Select a project` button on the top menu bar to select a Google Cloud Project (GCP).
- [Optional] If you do not currently have a GCP, create a new one. On the New Project page, do the following:
  - For Project name, enter a name for your project.
  - For Organization, keep the default `No organization` if you are not part of one.
  - Click `Create`. You are redirected back to the BigQuery page in the Google Cloud console.
- Now you have successfully created a new GCP and enabled the BigQuery Sandbox function.
About 70% of the SQLs in Spider2 query BigQuery public projects, which contain a large number of freely accessible databases, e.g., `bigquery-public-data.new_york`. To view the schema/content of these open-source databases and run trial SQLs on them, you can star the project `bigquery-public-data` in the left panel (based on the official tutorial). Concretely,
- In the Explorer pane, click the `➕ ADD` button.
- Next, in the dialog that pops up on the right, click the row `🔍 Star a project by name`.
- Type `bigquery-public-data` into the small pop-up dialog.
- Then, you are able to view the database schemas and run sample SQLs against the databases in the project `bigquery-public-data`, e.g., the trial query shown below.
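For example, you can paste the following query (the same sample query used in the Python snippet later in this tutorial) into the console's query editor; it returns ten names from the public `usa_names` dataset:

```sql
SELECT name
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = "TX"
LIMIT 10
```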
If you want to access `bigquery-public-data` in a more flexible way, i.e., programmatically through the API or the command-line interface (CLI), you need to set up a BigQuery credential. Typically, there are three different credential types, and we suggest using a Service Account. Here is a step-by-step tutorial on how to obtain service account credentials, based on the official documentation:
- Go to the [Service Accounts page](https://console.cloud.google.com/iam-admin/serviceaccounts) in the Google Cloud console (select the previously created GCP that you want to access through the credentials).
- Click the `➕ CREATE SERVICE ACCOUNT` button on the top menu bar.
- Fill in the service account details, then click the `CREATE AND CONTINUE` button.

  🔥 You can type any service account name; the globally unique service account ID will be generated automatically.
- Click the `CONTINUE` button, skipping this step and keeping the default role `Owner`.
- Click the `DONE` button, leaving the two input boxes empty.
- Next, you will return to the service account page. Click the three vertical dots icon `⋮` in the row of your newly created service account and choose the option `Manage keys`.
- On the new page, click the `ADD KEY` button, keep the default `JSON` format, and click `CREATE`. The key file (named `{your_service_account}-xxxx.json`) will be downloaded automatically.
- Now, you can access your GCP using the downloaded service account credentials (the `.json` file), for example with the Python client libraries.

First, install the necessary libraries: `pip install --upgrade google-cloud-bigquery google-api-core`.
```python
from google.oauth2 import service_account
from google.cloud import bigquery

credential_path = 'service_account.json'  # path/to/your/keyfile.json
credentials = service_account.Credentials.from_service_account_file(credential_path)
client = bigquery.Client(credentials=credentials)

# Alternatively, you can set the credential path via an environment variable:
# import os
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/path/to/keyfile.json"
# client = bigquery.Client()

# Perform a sample query.
sql_query = 'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` WHERE state = "TX" LIMIT 10'
query_job = client.query(sql_query)  # API request
rows = query_job.result()  # waits for the query to finish
for row in rows:
    print(row.name)
```
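As a quick sanity check (a minimal sketch, assuming the `client` created above), you can list a few datasets in the public project to confirm that your credential works:

```python
# List a few datasets in the public project to verify that the credential works.
for dataset in client.list_datasets(project="bigquery-public-data", max_results=5):
    print(dataset.dataset_id)
```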