bigquery.Table: Increase default maxResults 1024 -> 100000 #220
Conversation
- Setting it too low is slow (too many HTTP requests)
- Setting it too high is safe (the API enforces a max)
- Docs indicate that 100000 is the max allowed:
  - https://cloud.google.com/bigquery/docs/data

Test plan:
- Run a query that downloads a nontrivial amount of results:
  ```
  len(datalab.bigquery.Query('''
    select * from `bigquery-public-data`.noaa_gsod.gsod2016 limit 100000
  ''').to_dataframe(dialect='standard', use_cache=False))
  ```
- Before: 86s
  - Log of HTTP requests: https://gist.github.com/jdanbrown/d979fa3b088b98a7070e0babac95eeae
- After: 47s
  - Log of HTTP requests: https://gist.github.com/jdanbrown/a9a1b8373dc8dc3131d776eda2094b27
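For intuition about why the page size dominates wall time: the number of paged data requests needed to drain a result set is roughly `ceil(rows / page_size)`. A minimal sketch of that arithmetic (plain Python, no datalab dependency):

```py
def request_count(total_rows, page_size):
    # Each page of results costs one HTTP round trip (ceiling division).
    return (total_rows + page_size - 1) // page_size

# For the 100000-row test query above:
print(request_count(100000, 1024))    # 98 requests with the old default of 1024
print(request_count(100000, 100000))  # 1 request with the new default of 100000
```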
It's not clear whether this is a good idea for other scenarios, when all that's needed is to preview a page or two of the table. I think we need to think a little more about this. We might want to have two different defaults: one for iterating through the table (using cached pages), and one for getting the entire table contents, such as converting it to a dataframe.
@nikhilk @qimingj any thoughts here?
- Critical for debugging issues like googledatalab#195 and googledatalab#220 (I've been using this locally)
- Silent by default, to avoid bothering users
- To enable, use standard logging idioms like `logging.basicConfig`, `logging.config.fileConfig`, etc.

Test plan:
- No logging, old api:
  ```py
  >>> import datalab.bigquery as bq
  >>> bq.Query("select 3").to_dataframe()
  Your active configuration is: [foo]
     f0_
  0    3
  ```
- No logging, new api:
  ```py
  >>> import google.datalab.bigquery as bq
  >>> bq.Query("select 3").execute().result().to_dataframe()
  Your active configuration is: [foo]
     f0_
  0    3
  ```
- With logging, old api:
  ```py
  >>> import logging
  >>> logging.basicConfig(level=logging.DEBUG)
  >>> import datalab.bigquery as bq
  >>> bq.Query("select 3").to_dataframe()
  Your active configuration is: [foo]
  DEBUG:datalab.utils._http:request: method[POST], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/], body[{"kind": "bigquery#job", "configuration": {"query": {"query": "select 3", "useQueryCache": true, "allowLargeResults": false, "useLegacySql": true, "userDefinedFunctionResources": []}, "dryRun": false, "priority": "INTERACTIVE"}}]
  INFO:oauth2client.client:Attempting refresh to obtain initial access_token
  INFO:oauth2client.client:Refreshing access_token
  DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/queries/job_u67WYzV6RCbO4F-C5JB7hRocdxA?maxResults=0&timeoutMs=30000&startIndex=0], body[None]
  DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/job_u67WYzV6RCbO4F-C5JB7hRocdxA], body[None]
  DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anonda2cd79fe2c683f6e17ec63437a72c0e2144c829], body[None]
  DEBUG:datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anonda2cd79fe2c683f6e17ec63437a72c0e2144c829/data?maxResults=1024], body[None]
     f0_
  0    3
  ```
- With logging, new api:
  ```py
  >>> import logging
  >>> logging.basicConfig(level=logging.DEBUG)
  >>> import google.datalab.bigquery as bq
  >>> bq.Query("select 3").execute().result().to_dataframe()
  Your active configuration is: [foo]
  DEBUG:google.datalab.utils._http:request: method[POST], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/], body[{"kind": "bigquery#job", "configuration": {"priority": "INTERACTIVE", "query": {"query": "select 3", "allowLargeResults": false, "useLegacySql": false, "useQueryCache": true}, "dryRun": false}}]
  INFO:oauth2client.client:Attempting refresh to obtain initial access_token
  INFO:oauth2client.client:Refreshing access_token
  DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/queries/job_5OZWn-K-SHCwFxi8B-55quDr254?timeoutMs=30000&startIndex=0&maxResults=0], body[None]
  DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/jobs/job_5OZWn-K-SHCwFxi8B-55quDr254], body[None]
  DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anon921947a4e6645dc2b34411c365f9a45e0895d5a4], body[None]
  DEBUG:google.datalab.utils._http:request: method[GET], url[https://www.googleapis.com/bigquery/v2/projects/dwh-v2/datasets/_2f96775300d8858559d2bd23c05bad0392345e30/tables/anon921947a4e6645dc2b34411c365f9a45e0895d5a4/data?maxResults=1024], body[None]
     f0_
  0    3
  ```
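As a usage note, the opt-in can also be scoped so that only datalab's HTTP requests are logged, without the oauth2client noise seen above. A minimal sketch using only stdlib logging; the logger name `datalab.utils._http` is taken from the log output above:

```py
import logging

# Route DEBUG output from datalab's HTTP layer to stderr, leaving every
# other library (e.g. oauth2client) at its default log level.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
http_logger = logging.getLogger('datalab.utils._http')
http_logger.addHandler(handler)
http_logger.setLevel(logging.DEBUG)
```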
I think we should revisit this change in the context of googledatalab/datalab#1283. A larger page size would help in the case when the entire table is being brought down, for example into a dataframe. But in the case of a table iterator, where only a page or two may ever be fetched, a large page size just adds overhead. @jdanbrown, would you be willing to work on this change to handle these two scenarios differently?
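For concreteness, one possible shape for the two-defaults idea; this is a hypothetical sketch, and the names `PREVIEW_PAGE_SIZE`, `BULK_PAGE_SIZE`, and `choose_page_size` are illustrative rather than part of the actual pydatalab API:

```py
# Hypothetical per-scenario defaults (illustrative, not the real pydatalab API).
PREVIEW_PAGE_SIZE = 1024    # iterating / previewing a page or two
BULK_PAGE_SIZE = 100000     # draining the whole table, e.g. into a dataframe

def choose_page_size(max_rows=None):
    # Bounded previews stay cheap; unbounded reads minimize HTTP round trips.
    if max_rows is not None and max_rows <= PREVIEW_PAGE_SIZE:
        return PREVIEW_PAGE_SIZE
    return BULK_PAGE_SIZE

print(choose_page_size(max_rows=10))  # 1024: a small preview transfers little
print(choose_page_size())             # 100000: full download uses few requests
```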
@yebrahim I'd be happy to dig in and make the change, but I don't anticipate having time to do it in the next ~4 weeks, so don't hesitate to jump in and do it first. Our team is using this PR as-is for the time being.
Fixed in #339, closing this one.