
Table.to_dataframe optimization fixes #339

Merged. 3 commits merged into googledatalab:master on Apr 5, 2017

Conversation

@yebrahim (Contributor) commented Apr 5, 2017

  • Use large page size for downloading entire tables
  • Concatenate paged dataframes instead of appending

Follow up from the discussion on #329.
Replaces #220
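
In outline, the change amounts to something like the following (a minimal sketch, not the actual Datalab code; fetch_page is a hypothetical stand-in for the paged-fetch helper, and _MAX_PAGE_SIZE mirrors the constant added in this PR):

import pandas

_MAX_PAGE_SIZE = 100000  # large page size used when fetching an entire table

def table_to_dataframe(fetch_page, page_size=_MAX_PAGE_SIZE):
  # fetch_page is assumed to return (rows, next_page_token).
  df_list = []
  token = None
  while True:
    page_rows, token = fetch_page(page_size=page_size, page_token=token)
    if len(page_rows):
      df_list.append(pandas.DataFrame.from_records(page_rows))
    if not token:
      break
  # A single concat at the end avoids DataFrame.append, which copies the
  # accumulated frame on every page.
  return pandas.concat(df_list, ignore_index=True) if df_list else pandas.DataFrame()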

@craigcitro (Contributor) left a comment

Changes look good, but general question: do you want to add any sort of tests? In particular, I could imagine two forms of test I'd like to see:

  1. A simple check that to_dataframe is using a large page size by default, to protect against this disappearing in a future refactor.

  2. An actual benchmark -- setting up a general framework here would be onerous, but I bet even a simple %timeit call in a test with a sanity check on the final result could be made un-noisy.
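
For (1), a test along these lines could work against the sketch in the description (hypothetical; a real test would patch the Table class's actual fetch method rather than passing a mock callable):

import unittest
from unittest import mock

class PageSizeTest(unittest.TestCase):

  def test_to_dataframe_defaults_to_large_page_size(self):
    # No rows and no next-page token, so the fetch loop runs exactly once.
    fetch = mock.Mock(return_value=([], None))
    table_to_dataframe(fetch)
    # Guards against a future refactor silently reverting to the small default.
    fetch.assert_called_once_with(page_size=_MAX_PAGE_SIZE, page_token=None)

if __name__ == '__main__':
  unittest.main()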

_DEFAULT_PAGE_SIZE = 1024
# When fetching the entire table, use the maximum number of rows. The BigQuery service
# is likely to return less rows than this if their encoded JSON size is larger than 10MB
Contributor

grammar nit: less rows -> fewer rows

Also, I think it's not just likely -- BigQuery will return fewer rows than this.

Contributor Author

Not necessarily: if the rows are very small, then even including the header you can fit more than 100,000 rows in 10MB.
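(For scale: 10MB spread over 100,000 rows is roughly 100 bytes per row, so a table whose encoded rows average less than that hits the row cap before the 10MB cap.)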

Contributor

OK, I'm nitpicking, but the sentence here is "BQ is likely to return fewer rows than this if their encoded JSON size is less than 10MB." If that second part is true, then it's not just likely, it's guaranteed, right? 😉

Contributor Author

Oh I see, you're right. Fixed. :)

-          df = pandas.DataFrame.from_records(page_rows)
-        else:
-          df = df.append(page_rows, ignore_index=True)
+        df_list.append(pandas.DataFrame.from_records(page_rows))
Contributor

As one more potential speed comparison: did we consider just collating the list of rows (not slices of the final DataFrame), and then only creating a single DataFrame at the end? That is, something like

rows = []
while True:
  ...
  if len(page_rows):
    rows.extend(page_rows)
  ...
df = pd.DataFrame.from_records(rows)

Contributor Author

I experimented with this a little, but I'm not seeing any speedup. It looks like creating a dataframe out of a list is a cheap operation, and extending large lists can result in copying, which might offset any benefit from using one big list.
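
For anyone who wants to reproduce the comparison, a rough micro-benchmark of the two strategies could look like this (hypothetical data shape; results will vary with row width and page count):

import timeit
import pandas

# Ten pages of 10,000 small records each, standing in for paged fetch results.
pages = [[{'a': i, 'b': str(i)} for i in range(10000)] for _ in range(10)]

def concat_per_page():
  # Strategy in this PR: one DataFrame per page, a single concat at the end.
  dfs = [pandas.DataFrame.from_records(p) for p in pages]
  return pandas.concat(dfs, ignore_index=True)

def one_big_list():
  # Suggested alternative: collate all rows, then build one DataFrame.
  rows = []
  for p in pages:
    rows.extend(p)
  return pandas.DataFrame.from_records(rows)

print('concat per page:', timeit.timeit(concat_per_page, number=10))
print('one big list:   ', timeit.timeit(one_big_list, number=10))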

Contributor

sgtm.

_DEFAULT_PAGE_SIZE = 1024
# When fetching the entire table, use the maximum number of rows. The BigQuery service
# is likely to return less rows than this if their encoded JSON size is larger than 10MB
_MAX_PAGE_SIZE = 100000
Contributor

Any reason to make this a constant instead of just an optional arg to to_dataframe?

Contributor Author

It is an optional arg to to_dataframe; this is its default value, and it's better kept here as a clearly named constant than as a hardcoded number.
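
That is, the pattern under discussion is roughly this (hypothetical signature; the real method takes more parameters):

_DEFAULT_PAGE_SIZE = 1024   # small pages for ranged fetches and iteration
_MAX_PAGE_SIZE = 100000     # default when materializing a whole table

class Table(object):

  def to_dataframe(self, page_size=_MAX_PAGE_SIZE):
    # The named constant documents why the default is what it is, rather
    # than burying a magic number in the signature.
    ...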

@@ -103,8 +103,11 @@ class Table(object):
# Allowed characters in a BigQuery table column name
_VALID_COLUMN_NAME_CHARACTERS = '_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

-  # When fetching table contents, the max number of rows to fetch per HTTP request
+  # When fetching table contents for a range or iteration, use a small page size per request
Contributor

Not actually related to this CL: I have to admit, I feel a little confused by 1024 here.

If the goal really is low latency for operations like "let me see a sample of my dataframe", then 100 rows is already more than enough, and should be faster yet than 1024.

If the goal is to minimize traffic, I feel like we should just make this 100k -- we already know that BQ is going to cap us at ~10MB of data.

You could totally shut me up on questions like this with some nice tables about speeds for smallish numbers of rows. 😁

Contributor Author

Sure, here are some numbers: :)

Page size    Fetch time
50           ~160ms
100          ~180ms
1024         ~700ms
10,000       ~5s

So we do need to minimize latency: going the 100k route would have users waiting ~5 seconds for their first page (even though they probably won't wait for another fetch). The same applies if you want to use the table iterator to get rows 1000 to 1020, for example.

Contributor

So based on that data, should this constant be 100 instead of 1024?

(Again, probably worth splitting off into a different issue, since it's largely orthogonal.)

@jimmc (Contributor) left a comment

I agree with Craig that it would be nice to have some timing tests that could catch cases where a code change is still functionally correct but significantly slower. We could open another issue for that.

@yebrahim (Contributor Author) commented Apr 5, 2017

I agree about tests, will work on them in this PR.

@craigcitro (Contributor)

I'm happy with all the existing bits modulo tests. 😀

@yebrahim (Contributor Author) commented Apr 5, 2017

I added a simple unit test to validate the page size. I don't think we should have a test that times calls to the service as part of the unit tests; that seems like it belongs in a benchmark suite that measures several aspects of the APIs. Thoughts are welcome.

@craigcitro (Contributor)

I agree on the separate benchmark -- maybe worth filing an issue?

@yebrahim (Contributor Author) commented Apr 5, 2017

Sure, opened #342.

@yebrahim yebrahim merged commit 6a2d179 into googledatalab:master Apr 5, 2017