Table.to_dataframe optimization fixes #339
Conversation
…stead of appending
Changes look good, but general question: do you want to add any sort of tests? In particular, I could imagine two forms of test I'd like to see:
- A simple check that `to_dataframe` is using a large page size by default, protecting yourself from this disappearing in a future refactor.
- An actual benchmark -- setting up a general framework here would be onerous, but I bet even a simple `%timeit` call in a test with a sanity check on the final result could be made un-noisy.
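[Editor's sketch] A minimal sketch of what the first kind of test could look like. The `FakeTable`, `_fetch_rows`, and `page_size` names here are hypothetical, not taken from this PR; the idea is just to patch the row-fetching call and assert that `to_dataframe` passes the large page size by default.

```python
import unittest
from unittest import mock

_MAX_PAGE_SIZE = 100000  # mirrors the constant discussed in this PR


class FakeTable(object):
  """Hypothetical stand-in for Table; only the shape of the calls matters."""

  def _fetch_rows(self, page_size):
    # Pretend the service returned no rows and no next-page token.
    return [], None

  def to_dataframe(self, page_size=_MAX_PAGE_SIZE):
    rows, _ = self._fetch_rows(page_size=page_size)
    return rows


class ToDataframePageSizeTest(unittest.TestCase):

  def test_uses_max_page_size_by_default(self):
    table = FakeTable()
    with mock.patch.object(table, '_fetch_rows',
                           return_value=([], None)) as fetch:
      table.to_dataframe()
    fetch.assert_called_once_with(page_size=_MAX_PAGE_SIZE)


if __name__ == '__main__':
  unittest.main()
```

Patching the fetch call keeps the test free of network traffic, so it stays fast and only breaks if a refactor changes the default page size.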
google/datalab/bigquery/_table.py (Outdated)
```python
_DEFAULT_PAGE_SIZE = 1024
# When fetching the entire table, use the maximum number of rows. The BigQuery service
# is likely to return less rows than this if their encoded JSON size is larger than 10MB
```
grammar nit: `less rows` -> `fewer rows`.
Also, I think it's not just likely -- BigQuery will return fewer rows than this.
Not necessarily; if the rows are very small, you can fit more than 100,000 rows in 10MB, including the header.
OK, I'm nitpicking, but the sentence here is "BQ is likely to return fewer rows than this if their encoded JSON size is less than 10MB." If that second part is true, then it's not just likely, it's guaranteed, right? 😉
Oh I see, you're right. Fixed. :)
```diff
-      df = pandas.DataFrame.from_records(page_rows)
-    else:
-      df = df.append(page_rows, ignore_index=True)
+    df_list.append(pandas.DataFrame.from_records(page_rows))
```
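[Editor's sketch] For context, a self-contained sketch of the before/after pattern in that diff. The paging generator and data here are illustrative, not the actual `_table.py` code: appending to a DataFrame re-copies all previously fetched rows on every page, while collecting per-page frames and concatenating once at the end copies the data a single time.

```python
import pandas

def fetch_pages():
  # Illustrative stand-in for the paging loop against the BigQuery service:
  # yields one list of row dicts per HTTP response.
  for start in range(0, 10, 2):
    yield [{'a': i, 'b': i * i} for i in range(start, start + 2)]

# Old approach: DataFrame.append copies every row fetched so far on each
# page, so total work grows quadratically with the number of pages.
# (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.)
df = None
for page_rows in fetch_pages():
  if df is None:
    df = pandas.DataFrame.from_records(page_rows)
  else:
    df = df.append(page_rows, ignore_index=True)

# New approach: build one small frame per page, then concatenate once.
df_list = [pandas.DataFrame.from_records(page) for page in fetch_pages()]
df = pandas.concat(df_list, ignore_index=True)
```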
As one more potential speed comparison: did we consider just collating the list of rows (not slices of the final `DataFrame`), and then only creating a single `DataFrame` at the end? That is, something like:
```python
rows = []
while True:
  ...
  if len(page_rows):
    rows.extend(page_rows)
  ...
df = pd.DataFrame.from_records(rows)
```
I experimented with this a little, but I'm not seeing any speedup. It looks like creating a dataframe out of a list is a cheap operation, and extending large lists can sometimes trigger copying, which might offset any benefit from using one big list.
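[Editor's sketch] A rough way to reproduce that comparison with synthetic data (purely illustrative, not the experiment actually run here): time the per-page-frames-plus-`concat` route against a single `from_records` over one extended list.

```python
import timeit

import pandas

# 100 synthetic "pages" of 1,000 row dicts each.
pages = [[{'a': i, 'b': i * i} for i in range(p * 1000, (p + 1) * 1000)]
         for p in range(100)]

def via_concat():
  # One small frame per page, single concat at the end (what this PR does).
  frames = [pandas.DataFrame.from_records(page) for page in pages]
  return pandas.concat(frames, ignore_index=True)

def via_one_list():
  # One big list of rows, single from_records at the end.
  rows = []
  for page in pages:
    rows.extend(page)
  return pandas.DataFrame.from_records(rows)

# Both must produce identical results; the timing gap is the question.
assert via_concat().equals(via_one_list())
print('concat:  ', timeit.timeit(via_concat, number=10))
print('one list:', timeit.timeit(via_one_list, number=10))
```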
sgtm.
google/datalab/bigquery/_table.py (Outdated)
```python
_DEFAULT_PAGE_SIZE = 1024
# When fetching the entire table, use the maximum number of rows. The BigQuery service
# is likely to return less rows than this if their encoded JSON size is larger than 10MB
_MAX_PAGE_SIZE = 100000
```
Any reason to make this a constant instead of just an optional arg to `to_dataframe`?
It is an optional arg to `to_dataframe`; this is its default value, which is better kept here as a clearly named constant than as a hardcoded number.
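[Editor's sketch] A small illustration of the pattern being described (hypothetical, simplified signature, not the actual `Table.to_dataframe` definition): the named constant documents intent and gives callers a single place to reference the default.

```python
class Table(object):
  # When fetching an entire table, ask for as many rows per request as the
  # service will return; BigQuery caps each response at roughly 10MB of JSON.
  _MAX_PAGE_SIZE = 100000

  def to_dataframe(self, page_size=_MAX_PAGE_SIZE):
    # Sketch only: the real method pages through results and builds a frame.
    ...
```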
```diff
@@ -103,8 +103,11 @@ class Table(object):
   # Allowed characters in a BigQuery table column name
   _VALID_COLUMN_NAME_CHARACTERS = '_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

-  # When fetching table contents, the max number of rows to fetch per HTTP request
+  # When fetching table contents for a range or iteration, use a small page size per request
```
Not actually related to this CL: I have to admit, I feel a little confused by `1024` here.
If the goal really is low latency for operations like "let me see a sample of my dataframe", then 100 rows is already more than enough, and should be faster yet than 1024.
If the goal is to minimize traffic, I feel like we should just make this 100k -- we already know that BQ is going to cap us at ~10MB of data.
You could totally shut me up on questions like this with some nice tables about speeds for smallish numbers of rows. 😁
Sure, here are some numbers. :)

| Page size | Fetch time |
| --- | --- |
| 50 | ~160ms |
| 100 | ~180ms |
| 1024 | ~700ms |
| 10,000 | ~5s |
So we do need to minimize latency here: going the 100k route would have users waiting ~5 seconds to get their first page (even though they probably won't need to wait for another fetch). This also applies if you want to use the table iterator to get, say, rows 1000 to 1020.
So based on that data, should this constant be `100` instead of `1024`?
(Again, probably worth splitting off into a different issue, since it's largely orthogonal.)
I agree with Craig that it would be nice to have some timing tests that could catch cases where a code change is still functionally correct but significantly slower. We could open another issue for that.
I agree about tests, will work on them in this PR.
I'm happy with all the existing bits modulo tests. 😀
I added a simple unit test to validate the page size. I don't think we should have a test that times calls to the service as part of the unit tests; this seems like it should be part of a benchmark suite that measures several aspects of the APIs. Thoughts are welcome.
I agree on the separate benchmark -- maybe worth filing an issue?
Sure, opened #342.
Follow-up from the discussion on #329.
Replaces #220