pandas.io.gbq Version 2 #6937

Merged · 2 commits · Jun 30, 2014

1 change: 0 additions & 1 deletion ci/requirements-2.6.txt
@@ -4,7 +4,6 @@ python-dateutil==1.5
pytz==2013b
http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
html5lib==1.0b2
bigquery==2.0.17
numexpr==1.4.2
Contributor

don't you still need the bigquery package so that bq is installed? (or is that in the google-api-python-client?) What a horrible package name, google!

Contributor

Also, if requirements are changing, please update install.rst as well.

Contributor Author

bigquery is only required for the to_gbq() test suite, which can't be run in CI anyway due to the lack of a valid project id. Will update install.rst soon.
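
For illustration, here is a minimal sketch of how such a CI guard might look. The ``GBQ_PROJECT_ID`` environment variable and the ``unittest``-based skip are assumptions for this example, not necessarily how the actual test suite is arranged:

.. code-block:: python

    import os
    import unittest

    # Hypothetical environment variable; the real test suite may obtain
    # the project id differently.
    PROJECT_ID = os.environ.get('GBQ_PROJECT_ID')

    class TestToGBQ(unittest.TestCase):
        def setUp(self):
            # Skip the to_gbq() tests when no valid BigQuery project id is
            # available, e.g. on CI where credentials cannot be supplied.
            if not PROJECT_ID:
                raise unittest.SkipTest("no BigQuery project id configured")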

sqlalchemy==0.7.1
pymysql==0.6.0
4 changes: 3 additions & 1 deletion ci/requirements-2.7.txt
@@ -19,5 +19,7 @@ lxml==3.2.1
scipy==0.13.3
beautifulsoup4==4.2.1
statsmodels==0.5.0
bigquery==2.0.17
boto==2.26.1
httplib2==0.8
python-gflags==2.0
google-api-python-client==1.2
4 changes: 3 additions & 1 deletion doc/source/install.rst
@@ -112,7 +112,9 @@ Optional Dependencies
:func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
distributions will have xclip and/or xsel immediately available for
installation.
* `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__
* Google's `python-gflags` and `google-api-python-client`
Contributor

add httplib2 here as well

* Needed for :mod:`~pandas.io.gbq`
* `httplib2`
* Needed for :mod:`~pandas.io.gbq`
* One of the following combinations of libraries is needed to use the
top-level :func:`~pandas.io.html.read_html` function:
98 changes: 47 additions & 51 deletions doc/source/io.rst
@@ -3373,83 +3373,79 @@ Google BigQuery (Experimental)
The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
DataFrame with a shape derived from the source table. Additionally,
DataFrames can be uploaded into BigQuery datasets as tables
if the source datatypes are compatible with BigQuery ones.
DataFrame with a shape and data types derived from the source table.
Additionally, DataFrames can be appended to existing BigQuery tables if
the destination table is the same shape as the DataFrame.

For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__

As an example, suppose you want to load all data from an existing table
: `test_dataset.test_table`
into BigQuery and pull it into a DataFrame.
As an example, suppose you want to load all data from an existing BigQuery
table : `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
Member

pandas.io.read_gbq -> pandas.read_gbq

function.

.. code-block:: python

from pandas.io import gbq

# Insert your BigQuery Project ID Here
# Can be found in the web console, or
# using the command line tool `bq ls`
# Can be found in the Google web console
projectid = "xxxxxxxx"

data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)

The user will then be authenticated by the `bq` command line client -
this usually involves the default browser opening to a login page,
though the process can be done entirely from command line if necessary.
Datasets and additional parameters can be either configured with `bq`,
passed in as options to `read_gbq`, or set using Google's gflags (this
is not officially supported by this module, though care was taken
to ensure that they should be followed regardless of how you call the
method).
You will then be authenticated to the specified BigQuery account
via Google's OAuth2 mechanism. In general, this is as simple as following the
prompts in a browser window which will be opened for you. Should the browser not
be available, or fail to launch, a code will be provided to complete the process
manually. Additional information on the authentication mechanism can be found
`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__
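
If cached credentials become stale, or you need to switch between Google accounts, the authentication flow can be forced to run again. A minimal sketch, assuming the ``reauth`` keyword documented for ``to_gbq`` in this PR is also accepted by ``read_gbq`` (``projectid`` as in the example above):

.. code-block:: python

    # Force the OAuth2 prompts to appear again, e.g. when switching accounts.
    data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                             project_id=projectid, reauth=True)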

Additionally, you can define which column to use as an index as well as a preferred column order as follows:
You can define which column from BigQuery to use as an index in the
destination DataFrame as well as a preferred column order as follows:

.. code-block:: python

data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
index_col='index_column_name',
col_order='[col1, col2, col3,...]', project_id = projectid)

Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
col_order=['col1', 'col2', 'col3'], project_id = projectid)

Finally, you can append data to a BigQuery table from a pandas DataFrame
using the :func:`~pandas.io.to_gbq` function. This function uses the
Google streaming API which requires that your destination table exists in
BigQuery. Given the BigQuery table already exists, your DataFrame should
match the destination table in column order, structure, and data types.
Member

and is it then appended? (not fully clear to me, previously you had fail/replace/append, now only one default action?)

Contributor Author

The other actions were a benefit of relying on bq.py in the past. While possible to do strictly with the API, it's a lot of code for very little benefit. The data is strictly appended, which was, at least in our experience, the most common use case.
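
To make the change concrete, a minimal sketch contrasting the old and new calls (``df`` and ``projectid`` as in the io.rst examples; the schema shown is illustrative):

.. code-block:: python

    # Old interface (removed): behaviour was controlled via if_exists/schema,
    # with extra keywords passed through to the bq.py client.
    gbq.to_gbq(df, 'my_dataset.my_table', schema=['STRING', 'INTEGER'],
               if_exists='append', project_id=projectid)

    # New interface: rows are always appended to an existing table.
    df.to_gbq('my_dataset.my_table', project_id=projectid)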

Member

Ah, I missed the rather obvious "you can append data using to_gbq()" part.

So OK, no problem here. But maybe add it more explicitly in the docstring of to_gbq as well?

DataFrame indexes are not supported. By default, rows are streamed to
BigQuery in chunks of 10,000 rows, but you can pass other chunk values
via the ``chunksize`` argument. You can also see the progress of your
post via the ``verbose`` flag which defaults to ``True``. The http
response code of Google BigQuery can be successful (200) even if the
append failed. For this reason, if there is a failure to append to the
table, the complete error response from BigQuery is returned which
can be quite long given it provides a status for each row. You may want
to start with smaller chunks to test that the size and types of your
dataframe match your destination table to make debugging simpler.

.. code-block:: python

df = pandas.DataFrame({'string_col_name' : ['hello'],
'integer_col_name' : [1],
'boolean_col_name' : [True]})
schema = ['STRING', 'INTEGER', 'BOOLEAN']
data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
if_exists='fail', schema = schema, project_id = projectid)

To add more rows to this, simply:

.. code-block:: python

df2 = pandas.DataFrame({'string_col_name' : ['hello2'],
'integer_col_name' : [2],
'boolean_col_name' : [False]})
data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id = projectid)
df.to_gbq('my_dataset.my_table', project_id = projectid)
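
For larger or failing uploads, the chunk size and progress reporting described above can be tuned. A minimal sketch (the chunk size shown is illustrative; ``df`` and ``projectid`` as above):

.. code-block:: python

    # Stream in smaller chunks while debugging schema/type mismatches,
    # and keep progress reporting enabled (verbose defaults to True).
    df.to_gbq('my_dataset.my_table', project_id=projectid,
              chunksize=500, verbose=True)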

.. note::
The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__

A default project id can be set using the command line:
`bq init`.
While BigQuery uses SQL-like syntax, it has some important differences
from traditional databases in functionality, in API limitations (size and
quantity of queries or uploads), and in how Google charges for use of the service.
You should refer to Google documentation often as the service seems to
be changing and evolving. BigQuery is best for analyzing large sets of
data quickly, but it is not a direct replacement for a transactional database.

There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
see `here <https://developers.google.com/bigquery/query-reference>`__

You can access the management console to determine project id's by:
<https://code.google.com/apis/console/b/0/?noredirect>
You can access the management console to determine project id's by:
<https://code.google.com/apis/console/b/0/?noredirect>

.. warning::

To use this module, you will need a BigQuery account. See
<https://cloud.google.com/products/big-query> for details.

As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
but any client changes will not make it into 0.13.1. See:
http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
To use this module, you will need a valid BigQuery account. See
<https://cloud.google.com/products/big-query> for details on the
service.

.. _io.stata:

13 changes: 5 additions & 8 deletions doc/source/v0.14.1.txt
@@ -154,14 +154,11 @@ Performance
Experimental
~~~~~~~~~~~~

``pandas.io.data.Options`` has gained a ``get_all_data method``, and now consistently returns a multi-indexed ``DataFrame`` (:issue:`5602`). See :ref:`the docs<remote_data.yahoo_options>`

.. ipython:: python

from pandas.io.data import Options
aapl = Options('aapl', 'yahoo')
data = aapl.get_all_data()
data.iloc[0:5, 0:5]
- ``io.gbq.read_gbq`` and ``io.gbq.to_gbq`` were refactored to remove the
dependency on the Google ``bq.py`` command line client. This submodule
now uses ``httplib2`` and the Google ``apiclient`` and ``oauth2client`` API client
libraries which should be more stable and, therefore, reliable than
``bq.py`` (:issue:`6937`).
Member

Can you maybe add a more elaborate description of what actually changed in the API? So from the point of view of someone who was already using these functions: what do they have to adapt in their code? (maybe an example of a function call now)

Contributor Author

Would example code be appropriate in this file? If so, @azbones and I can come up with something. The comment about the API client was more of a reference to the back-end implementation, though as you noted above, there are a few minor changes that pandas users will face.

Member

Yes, you can certainly put some example code in the whatsnew file. And maybe also summarize the interface changes (some keywords removed, ..)
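
A minimal sketch of what such whatsnew example code might look like, based on the calls shown in the updated io.rst (``projectid`` as defined there):

.. code-block:: python

    # Authentication now happens via OAuth2 rather than the bq.py client.
    df = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                     project_id=projectid)

    # to_gbq now only appends to an existing table; the schema, col_order
    # and if_exists keywords from the previous interface are gone.
    df.to_gbq('test_dataset.test_table', project_id=projectid)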


.. _whatsnew_0141.bug_fixes:

62 changes: 29 additions & 33 deletions pandas/core/frame.py
@@ -669,47 +669,43 @@ def to_dict(self, outtype='dict'):
else: # pragma: no cover
raise ValueError("outtype %s not understood" % outtype)

def to_gbq(self, destination_table, schema=None, col_order=None,
if_exists='fail', **kwargs):
def to_gbq(self, destination_table, project_id=None, chunksize=10000,
verbose=True, reauth=False):
"""Write a DataFrame to a Google BigQuery table.

If the table exists, the DataFrame will be appended. If not, a new
table will be created, in which case the schema will have to be
specified. By default, rows will be written in the order they appear
in the DataFrame, though the user may specify an alternative order.
THIS IS AN EXPERIMENTAL LIBRARY

If the table exists, the dataframe will be written to the table using
the defined table schema and column types. For simplicity, this method
uses the Google BigQuery streaming API. The to_gbq method chunks data
into a default chunk size of 10,000. Failures return the complete error
response which can be quite long depending on the size of the insert.
There are several important limitations of the Google streaming API
which are detailed at:
https://developers.google.com/bigquery/streaming-data-into-bigquery.

Parameters
---------------
----------
dataframe : DataFrame
DataFrame to be written
destination_table : string
name of table to be written, in the form 'dataset.tablename'
schema : sequence (optional)
list of column types in order for data to be inserted, e.g.
['INTEGER', 'TIMESTAMP', 'BOOLEAN']
col_order : sequence (optional)
order which columns are to be inserted, e.g. ['primary_key',
'birthday', 'username']
if_exists : {'fail', 'replace', 'append'} (optional)
- fail: If table exists, do nothing.
- replace: If table exists, drop it, recreate it, and insert data.
- append: If table exists, insert data. Create if does not exist.
kwargs are passed to the Client constructor

Raises
------
SchemaMissing :
Raised if the 'if_exists' parameter is set to 'replace', but no
schema is specified
TableExists :
Raised if the specified 'destination_table' exists but the
'if_exists' parameter is set to 'fail' (the default)
InvalidSchema :
Raised if the 'schema' parameter does not match the provided
DataFrame
Name of table to be written, in the form 'dataset.tablename'
project_id : str
Google BigQuery Account project ID.
chunksize : int (default 10000)
Number of rows to be inserted in each chunk from the dataframe.
verbose : boolean (default True)
Show percentage complete
reauth : boolean (default False)
Force Google BigQuery to reauthenticate the user. This is useful
if multiple accounts are used.

"""

from pandas.io import gbq
return gbq.to_gbq(self, destination_table, schema=None, col_order=None,
if_exists='fail', **kwargs)
return gbq.to_gbq(self, destination_table, project_id=project_id,
chunksize=chunksize, verbose=verbose,
reauth=reauth)

@classmethod
def from_records(cls, data, index=None, exclude=None, columns=None,