
Table.to_dataframe optimization fixes #339

Merged. 3 commits merged into googledatalab:master on Apr 5, 2017

Conversation

@yebrahim (Contributor) commented Apr 5, 2017

  • Use large page size for downloading entire tables
  • Concatenate paged dataframes instead of appending

Follow up from the discussion on #329.
Replaces #220
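
In outline, the change amounts to something like the following (a minimal sketch, not the actual Datalab code; fetch_page is a hypothetical stand-in for the paged-fetch helper, and _MAX_PAGE_SIZE mirrors the constant added in this PR):

import pandas

_MAX_PAGE_SIZE = 100000  # large page size used when fetching an entire table

def table_to_dataframe(fetch_page, page_size=_MAX_PAGE_SIZE):
  # fetch_page is assumed to return (rows, next_page_token).
  df_list = []
  token = None
  while True:
    page_rows, token = fetch_page(page_size=page_size, page_token=token)
    if len(page_rows):
      df_list.append(pandas.DataFrame.from_records(page_rows))
    if not token:
      break
  # A single concat at the end avoids DataFrame.append, which copies the
  # accumulated frame on every page.
  return pandas.concat(df_list, ignore_index=True) if df_list else pandas.DataFrame()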

@craigcitro (Contributor) left a comment

Changes look good, but general question: do you want to add any sort of tests? In particular, I could imagine two forms of test I'd like to see:

  1. A simple check that to_dataframe is using a large page size by default, to protect against this disappearing in a future refactor.

  2. An actual benchmark -- setting up a general framework here would be onerous, but I bet even a simple %timeit call in a test with a sanity check on the final result could be made un-noisy.
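
For (1), a test along these lines could work against the sketch in the description (hypothetical; a real test would patch the Table class's actual fetch method rather than passing a mock callable):

import unittest
from unittest import mock

class PageSizeTest(unittest.TestCase):

  def test_to_dataframe_defaults_to_large_page_size(self):
    # No rows and no next-page token, so the fetch loop runs exactly once.
    fetch = mock.Mock(return_value=([], None))
    table_to_dataframe(fetch)
    # Guards against a future refactor silently reverting to the small default.
    fetch.assert_called_once_with(page_size=_MAX_PAGE_SIZE, page_token=None)

if __name__ == '__main__':
  unittest.main()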

_DEFAULT_PAGE_SIZE = 1024
# When fetching the entire table, use the maximum number of rows. The BigQuery service
# is likely to return less rows than this if their encoded JSON size is larger than 10MB
Contributor

grammar nit: less rows -> fewer rows

Also, I think it's not just likely -- BigQuery will return fewer rows than this.

Contributor Author

Not necessarily: if the rows are very small, then even including the header you can fit more than 100,000 rows in 10MB.
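(For scale: 10MB spread over 100,000 rows is roughly 100 bytes per row, so a table whose encoded rows average less than that hits the row cap before the 10MB cap.)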

Contributor

OK, I'm nitpicking, but the sentence here is "BQ is likely to return fewer rows than this if their encoded JSON size is less than 10MB." If that second part is true, then it's not just likely, it's guaranteed, right? 😉

Contributor Author

Oh I see, you're right. Fixed. :)

-          df = pandas.DataFrame.from_records(page_rows)
-        else:
-          df = df.append(page_rows, ignore_index=True)
+        df_list.append(pandas.DataFrame.from_records(page_rows))
Contributor

As one more potential speed comparison: did we consider just collating the list of rows (not slices of the final DataFrame), and then only creating a single DataFrame at the end? That is, something like

rows = []
while True:
  ...
  if len(page_rows):
    rows.extend(page_rows)
  ...
df = pd.DataFrame.from_records(rows)

Contributor Author

I experimented with this a little, but I'm not seeing any speedup. It looks like creating a dataframe out of a list is a cheap operation, and extending large lists can result in copying, which might offset any benefit from using one big list.
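
For anyone who wants to reproduce the comparison, a rough micro-benchmark of the two strategies could look like this (hypothetical data shape; results will vary with row width and page count):

import timeit
import pandas

# Ten pages of 10,000 small records each, standing in for paged fetch results.
pages = [[{'a': i, 'b': str(i)} for i in range(10000)] for _ in range(10)]

def concat_per_page():
  # Strategy in this PR: one DataFrame per page, a single concat at the end.
  dfs = [pandas.DataFrame.from_records(p) for p in pages]
  return pandas.concat(dfs, ignore_index=True)

def one_big_list():
  # Suggested alternative: collate all rows, then build one DataFrame.
  rows = []
  for p in pages:
    rows.extend(p)
  return pandas.DataFrame.from_records(rows)

print('concat per page:', timeit.timeit(concat_per_page, number=10))
print('one big list:   ', timeit.timeit(one_big_list, number=10))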

Contributor

sgtm.

_DEFAULT_PAGE_SIZE = 1024
# When fetching the entire table, use the maximum number of rows. The BigQuery service
# is likely to return less rows than this if their encoded JSON size is larger than 10MB
_MAX_PAGE_SIZE = 100000
Contributor

Any reason to make this a constant instead of just an optional arg to to_dataframe?

Contributor Author

It is an optional arg to to_dataframe; this is its default value, and it's better kept here as a clearly named constant than as a hardcoded number.
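
That is, the pattern under discussion is roughly this (hypothetical signature; the real method takes more parameters):

_DEFAULT_PAGE_SIZE = 1024   # small pages for ranged fetches and iteration
_MAX_PAGE_SIZE = 100000     # default when materializing a whole table

class Table(object):

  def to_dataframe(self, page_size=_MAX_PAGE_SIZE):
    # The named constant documents why the default is what it is, rather
    # than burying a magic number in the signature.
    ...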

@@ -103,8 +103,11 @@ class Table(object):
# Allowed characters in a BigQuery table column name
_VALID_COLUMN_NAME_CHARACTERS = '_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

-  # When fetching table contents, the max number of rows to fetch per HTTP request
+  # When fetching table contents for a range or iteration, use a small page size per request
Contributor

Not actually related to this CL: I have to admit, I feel a little confused by 1024 here.

If the goal really is low latency for operations like "let me see a sample of my dataframe", then 100 rows is already more than enough, and should be faster yet than 1024.

If the goal is to minimize traffic, I feel like we should just make this 100k -- we already know that BQ is going to cap us at ~10MB of data.

You could totally shut me up on questions like this with some nice tables about speeds for smallish numbers of rows. 😁

Contributor Author

Sure, here are some numbers: :)

Page size    Fetch time
50           ~160ms
100          ~180ms
1024         ~700ms
10,000       ~5s

So we do need to minimize latency: going the 100k route would have users waiting ~5 seconds for their first page (even though they probably won't wait for another fetch). The same applies if you want to use the table iterator to get rows 1000 to 1020, for example.

Contributor

So based on that data, should this constant be 100 instead of 1024?

(Again, probably worth splitting off into a different issue, since it's largely orthogonal.)

@jimmc (Contributor) left a comment

I agree with Craig that it would be nice to have some timing tests that could catch cases where a code change is still functionally correct but significantly slower. We could open another issue for that.

@yebrahim (Contributor Author) commented Apr 5, 2017

I agree about tests, will work on them in this PR.

@craigcitro (Contributor)

I'm happy with all the existing bits modulo tests. 😀

@yebrahim (Contributor Author) commented Apr 5, 2017

I added a simple unit test to validate the page size. I don't think we should have a test that times calls to the service as part of the unit tests; that seems like it belongs in a benchmark suite that measures several aspects of the APIs. Thoughts are welcome.

@craigcitro (Contributor)

I agree on the separate benchmark -- maybe worth filing an issue?

@yebrahim (Contributor Author) commented Apr 5, 2017

Sure, opened #342.

@yebrahim yebrahim merged commit 6a2d179 into googledatalab:master Apr 5, 2017