Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery: Get query results as pandas DataFrame #4354

Merged
merged 21 commits into from
Dec 4, 2017

Conversation

alixhami
Copy link
Contributor

@alixhami alixhami commented Nov 7, 2017

This is a basic implementation of a method converting query results to a pandas DataFrame. I am hoping for comments on how best to implement this. Here is an example of how to use the current implementation:
df = client.query(QUERY).to_dataframe()

The intent is that pandas would be an optional dependency, and would not be required unless the DataFrame functionality is used. This is not yet handled in my current implementation, which is why CircleCI will fail. I ran the tests locally with pandas installed and they pass.

The base of this PR is #4350

@alixhami alixhami added api: bigquery Issues related to the BigQuery API. do not merge Indicates a pull request not ready for merge, due to either quality or timing. labels Nov 7, 2017
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Nov 7, 2017
@jba
Copy link
Contributor

jba commented Nov 7, 2017

I don't personally like the idea of an optional dependency. I guess I try to think of Python as a typed language as much as possible (and hopefully someday it will be).

Maybe instead of actually returning a pandas DataFrame, you could (a) expose whatever currently hidden information is needed to efficiently create one, and (b) add a different module that depends on this one and pandas, and creates a DataFrame.

@dhermes
Copy link
Contributor

dhermes commented Nov 7, 2017

I don't personally like the idea of an optional dependency.

I don't particularly like it too much, but in this case, the pandas dependency is a bit too heavy to burden all users with. We could have a more helpful error message if the import fails. We also should provide an "extra" (e.g.) that allows installing via:

pip install google-cloud-bigquery[pandas-extra]

(you can name the "extra" anything you like, I just chose pandas-extra out of a hat).

@@ -1949,6 +1949,18 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
return self._client.list_rows(dest_table, selected_fields=schema,
retry=retry)

def to_dataframe(self):

This comment was marked as spam.

This comment was marked as spam.

return pd.DataFrame(rows, columns=column_headers)

def __iter__(self):
return iter(self.result())

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@@ -1235,6 +1235,30 @@ def test_query_future(self):
row_tuples = [r.values() for r in iterator]
self.assertEqual(row_tuples, [(1,)])

def test_query_iter(self):

This comment was marked as spam.

self.assertEqual(row_tuples, [(1,)])

def test_query_to_dataframe(self):
import pandas as pd

This comment was marked as spam.

This comment was marked as spam.

def test_to_dataframe(self):
import pandas as pd

begun_resource = self._makeResource()

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.


self.assertIsInstance(df, pd.DataFrame)
self.assertEqual(len(df), 4)
self.assertEqual(list(df), ['name', 'age'])

This comment was marked as spam.

This comment was marked as spam.

self.assertEqual(len(df), 4)
self.assertEqual(list(df), ['name', 'age'])

def test_to_dataframe_w_empty_results(self):

This comment was marked as spam.

This comment was marked as spam.

@@ -1949,6 +1949,15 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
return self._client.list_rows(dest_table, selected_fields=schema,
retry=retry)

def to_dataframe(self):
import pandas as pd

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@@ -1235,6 +1235,23 @@ def test_query_future(self):
row_tuples = [r.values() for r in iterator]
self.assertEqual(row_tuples, [(1,)])

def test_query_to_dataframe(self):

This comment was marked as spam.

This comment was marked as spam.

@@ -2720,6 +2720,70 @@ def test_reload_w_alternate_client(self):
self.assertEqual(req['path'], PATH)
self._verifyResourceProperties(job, RESOURCE)

def test_to_dataframe(self):

This comment was marked as spam.

This comment was marked as spam.

@theacodes
Copy link
Contributor

+1 to what @dhermes said. We should provide a helpful message if they use a method that requires pandas (I also doubt the folks who want to use this will install bigquery before pandas), and we should provide the extra for pip.

@@ -1949,6 +1949,15 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
return self._client.list_rows(dest_table, selected_fields=schema,
retry=retry)

def to_dataframe(self):
import pandas as pd

This comment was marked as spam.

def to_dataframe(self):
import pandas as pd

iterator = self.result()

This comment was marked as spam.

This comment was marked as spam.

Copy link
Contributor

@dhermes dhermes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see, nox -s cover will fail unless you:

  • Explicitly add a "no cover" pragma around these tests and functions (not my preference)
  • Install pandas into the text environment (session.install('-e', '.[pandas]'))

def to_dataframe(self):
"""Create a pandas DataFrame from the query results.

:rtype: ``pandas.DataFrame``

This comment was marked as spam.

@@ -58,6 +58,10 @@
'requests >= 2.18.0',
]

EXTRAS_REQUIREMENTS = {
'pandas': ['pandas >= 0.3.0'],

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@dhermes
Copy link
Contributor

dhermes commented Nov 10, 2017

So @alixhami, after discussion with @jonparrott we have decided it is "preferred" if all new docstrings are in the Google style (vs. traditional Sphinx).

I'm happy to help out / chat over hangouts / send you some code snippets.

Can you change the newly added docstring to Google style?

@dhermes
Copy link
Contributor

dhermes commented Nov 10, 2017

LGTM, but you need to resolve the coverage issues

@@ -1965,7 +1965,7 @@ def to_dataframe(self):
ValueError: If the `pandas` library cannot be imported.

"""
if pandas is None:
if pandas is None: # pragma: NO COVER

This comment was marked as spam.

@dhermes
Copy link
Contributor

dhermes commented Nov 13, 2017

@alixhami This is good to merge once CI goes green, right?

@alixhami alixhami removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Nov 13, 2017
@alixhami
Copy link
Contributor Author

@dhermes it should be good to merge since all the comments have now been addressed. @tswast what do you think?

@dhermes
Copy link
Contributor

dhermes commented Nov 13, 2017

lint failed 😀

Copy link
Contributor

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I think it could be a little more general. I'd love if this was a method of the iterator we return from QueryJob.results() and Client.list_rows().

@@ -1949,6 +1953,29 @@ def result(self, timeout=None, retry=DEFAULT_RETRY):
return self._client.list_rows(dest_table, selected_fields=schema,
retry=retry)

def to_dataframe(self):
"""Create a pandas DataFrame from the query results.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

column_headers = [field.name for field in query_results.schema]
rows = [row.values() for row in query_results]

return pandas.DataFrame(rows, columns=column_headers)

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

@alixhami alixhami force-pushed the pandas-integration branch 2 times, most recently from a00c9da to 1458da6 Compare November 15, 2017 02:29
@alixhami alixhami changed the title BigQuery: Adds QueryJob.to_dataframe() BigQuery: Get query results as pandas DataFrame Nov 27, 2017
@theacodes
Copy link
Contributor

This LGTM, @tseaver @dhermes did y'all have any reservations before merging this?

@theacodes theacodes merged commit 3511e87 into googleapis:master Dec 4, 2017
@alixhami alixhami deleted the pandas-integration branch December 4, 2017 22:45
@dhermes
Copy link
Contributor

dhermes commented Dec 4, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the BigQuery API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants