
bigquery.Table: Increase default maxResults 1024 -> 100000 #220

Closed
wants to merge 1 commit

Conversation

jdanbrown
Contributor

- Setting maxResults too low is slow (too many HTTP requests)
- Setting it too high is safe (the API enforces a maximum)
- The docs indicate that 100000 is the maximum allowed:
  - https://cloud.google.com/bigquery/docs/data
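
The "too many HTTP requests" point is easy to quantify: result pages are fetched one request at a time, so the request count grows as ceil(rows / maxResults). A back-of-the-envelope sketch (plain Python, no BigQuery access needed):

```py
def request_count(total_rows, page_size):
    """Approximate number of paged data requests needed for a result set."""
    return (total_rows + page_size - 1) // page_size  # ceiling division

# For the 100,000-row query in the test plan below:
print(request_count(100000, 1024))    # 98 requests with the old default
print(request_count(100000, 100000))  # 1 request with the new default
```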

Test plan:
- Run a query that downloads a nontrivial number of results
```
len(datalab.bigquery.Query('''
    select * from `bigquery-public-data`.noaa_gsod.gsod2016 limit 100000
''').to_dataframe(dialect='standard', use_cache=False))
```
- Before:
  - 86s
  - Log of http reqs: https://gist.github.com/jdanbrown/d979fa3b088b98a7070e0babac95eeae
- After:
  - 47s
  - Log of http reqs: https://gist.github.com/jdanbrown/a9a1b8373dc8dc3131d776eda2094b27
Contributor

@yebrahim yebrahim left a comment

It's not clear whether this is a good idea for other scenarios, where all that's needed is to preview a page or two of the table. I think we need to think a little more about this. We might want two different defaults: one for iterating through the table (using cached pages), and one for fetching the entire table contents, such as when converting it to a dataframe.
@nikhilk @qimingj any thoughts here?
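
For illustration, the two-defaults idea might look something like the sketch below; the constant names and helper are hypothetical, not from the datalab codebase:

```py
# Hypothetical constants -- names are illustrative only.
ITERATION_PAGE_SIZE = 1024    # small pages keep previews and iteration snappy
DOWNLOAD_PAGE_SIZE = 100000   # large pages minimize round trips for full fetches

def default_page_size(fetch_all):
    """Pick a page size based on whether the caller wants the whole table."""
    return DOWNLOAD_PAGE_SIZE if fetch_all else ITERATION_PAGE_SIZE
```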

jdanbrown added a commit to jdanbrown/pydatalab that referenced this pull request Mar 9, 2017
- Critical for debugging issues like googledatalab#195 and googledatalab#220 (I've been using this locally)
- Silent by default, to avoid bothering users
- To enable, use standard logging idioms like `logging.basicConfig`, `logging.config.fileConfig`, etc.

Example with no logging (default):
```py
>>> import datalab.bigquery as bq
>>> bq.Query("select 3").to_dataframe()

Your active configuration is: [foo]

   f0_
0    3
```

Example with logging (opt in):
```py
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> import datalab.bigquery as bq
>>> print(bq.Query("select 3").to_dataframe())
Your active configuration is: [foo]

DEBUG:datalab.utils._http:request: method[POST], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/], body[{"kind": "bigquery#job", "configuration": {"query": {"query": "select 3", "useQueryCache": true, "allowLargeResults": false, "useLegacySql": true, "userDefinedFunctionResources": []}, "dryRun": false, "priority": "INTERACTIVE"}}]
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/queries/job_u67WYzV6RCbO4F-C5JB7hRocdxA?maxResults=0&timeoutMs=30000&startIndex=0], body[None]
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/job_u67WYzV6RCbO4F-C5JB7hRocdxA], body[None]
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anonda2cd79fe2c683f6e17ec63437a72c0e2144c829], body[None]
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anonda2cd79fe2c683f6e17ec63437a72c0e2144c829/data?maxResults=1024], body[None]
   f0_
0    3
```
jdanbrown added a commit to jdanbrown/pydatalab that referenced this pull request Mar 9, 2017
- Critical for debugging issues like googledatalab#195 and googledatalab#220 (I've been using this locally)
- Silent by default, to avoid bothering users
- To enable, use standard logging idioms like `logging.basicConfig`, `logging.config.fileConfig`, etc.

Test plan:

- No logging, old api:
```py
>>> import datalab.bigquery as bq
>>> bq.Query("select 3").to_dataframe()
Your active configuration is: [foo]

   f0_
0    3
```

- No logging, new api:
```py
>>> import google.datalab.bigquery as bq
>>> bq.Query("select 3").execute().result().to_dataframe()
Your active configuration is: [foo]

   f0_
0    3
```

- With logging, old api:
```py
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> import datalab.bigquery as bq
>>> bq.Query("select 3").to_dataframe()
Your active configuration is: [foo]

DEBUG:datalab.utils._http:request: method[POST], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/], body[{"kind": "bigquery#job", "configuration": {"query": {"query": "select 3", "useQueryCache": true, "allowLargeResults": false, "useLegacySql": true, "userDefinedFunctionResources": []}, "dryRun": false, "priority": "INTERACTIVE"}}]
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/queries/job_u67WYzV6RCbO4F-C5JB7hRocdxA?maxResults=0&timeoutMs=30000&startIndex=0], body[None]
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/job_u67WYzV6RCbO4F-C5JB7hRocdxA], body[None]
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anonda2cd79fe2c683f6e17ec63437a72c0e2144c829], body[None]
DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anonda2cd79fe2c683f6e17ec63437a72c0e2144c829/data?maxResults=1024], body[None]
   f0_
0    3
```

- With logging, new api:
```py
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> import google.datalab.bigquery as bq
>>> bq.Query("select 3").execute().result().to_dataframe()
Your active configuration is: [foo]

DEBUG:google.datalab.utils._http:request: method[POST], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/], body[{"kind": "bigquery#job", "configuration": {"priority": "INTERACTIVE", "query": {"query": "select 3", "allowLargeResults": false, "useLegacySql": false, "useQueryCache": true}, "dryRun": false}}]
INFO:oauth2client.client:Attempting refresh to obtain initial access_token
INFO:oauth2client.client:Refreshing access_token
DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/queries/job_5OZWn-K-SHCwFxi8B-55quDr254?timeoutMs=30000&startIndex=0&maxResults=0], body[None]
DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/job_5OZWn-K-SHCwFxi8B-55quDr254], body[None]
DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anon921947a4e6645dc2b34411c365f9a45e0895d5a4], body[None]
DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anon921947a4e6645dc2b34411c365f9a45e0895d5a4/data?maxResults=1024], body[None]
   f0_
0    3
```
chmeyers pushed a commit that referenced this pull request Mar 15, 2017
@yebrahim
Contributor

yebrahim commented Apr 4, 2017

I think we should revisit this change in the context of googledatalab/datalab#1283. A larger page size helps when the entire table is being brought down, for example into a dataframe. But for a table iterator (calling `__getitem__`), or when only a range of rows is requested, a smaller page size makes sense: copying a page of size 1024 is significantly faster than copying a 100,000-row page.

@jdanbrown, would you be willing to work on this change to handle these two scenarios differently?
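
To make the two access patterns concrete, here is a hedged sketch of a lazy row iterator versus a bulk fetch; `table.fetch_rows` and both constants are hypothetical stand-ins for the paged data call, not the datalab implementation:

```py
SMALL_PAGE = 1024     # good first-page latency for previews and __getitem__-style access
LARGE_PAGE = 100000   # fewest round trips when materializing the whole table

def iter_rows(table, page_size=SMALL_PAGE):
    """Lazily yield rows page by page; small pages return the first rows quickly."""
    start = 0
    while True:
        page = table.fetch_rows(start_index=start, max_results=page_size)  # hypothetical call
        if not page:
            return
        for row in page:
            yield row
        start += len(page)

def all_rows(table):
    """Materialize everything; a large page size minimizes HTTP round trips."""
    return list(iter_rows(table, page_size=LARGE_PAGE))
```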

@jdanbrown
Contributor Author

@yebrahim I'd be happy to dig in and make the change, but I don't anticipate having time in the next ~4 weeks, so don't hesitate to jump in and do it first. Our team is using this PR as-is for the time being.

yebrahim added a commit that referenced this pull request Apr 5, 2017
Use large page size for downloading entire tables
Concatenate paged dataframes instead of appending
Follow-up from the discussion on #329.
Replaces #220
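
The second line of that commit message points at a standard pandas fix: collect the pages and concatenate once, rather than appending page by page, which re-copies all previously accumulated rows on each append. A minimal sketch, assuming a hypothetical `fetch_pages()` that yields one DataFrame per result page:

```py
import pandas as pd

def pages_to_dataframe(fetch_pages):
    # Concatenate once at the end; repeated DataFrame.append is quadratic
    # because each append copies every previously accumulated row.
    pages = list(fetch_pages())
    return pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()
```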
@yebrahim
Contributor

yebrahim commented Apr 5, 2017

Fixed in #339, closing this one.

@yebrahim yebrahim closed this Apr 5, 2017