
CLN: remove pandas/io/gbq.py and tests and replace with pandas-gbq #15484

Closed
wants to merge 2 commits into from
5 changes: 1 addition & 4 deletions ci/requirements-2.7.pip
@@ -1,8 +1,5 @@
 blosc
-httplib2
-google-api-python-client==1.2
-python-gflags==2.0
-oauth2client==1.5.0
+pandas-gbq
 pathlib
 backports.lzma
 py
3 changes: 0 additions & 3 deletions ci/requirements-3.4.pip
@@ -1,5 +1,2 @@
 python-dateutil==2.2
 blosc
-httplib2
-google-api-python-client
-oauth2client
3 changes: 0 additions & 3 deletions ci/requirements-3.4_SLOW.pip

This file was deleted.

1 change: 1 addition & 0 deletions ci/requirements-3.5.pip
@@ -1 +1,2 @@
 xarray==0.9.1
+pandas-gbq
289 changes: 7 additions & 282 deletions doc/source/io.rst
@@ -4652,293 +4652,18 @@ And then issue the following queries:
Google BigQuery
---------------

.. versionadded:: 0.13.0

The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
DataFrame with a shape and data types derived from the source table.
Additionally, DataFrames can be inserted into new BigQuery tables or appended
to existing tables.

.. warning::

To use this module, you will need a valid BigQuery account. Refer to the
`BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
for details on the service itself.

The key functions are:

.. currentmodule:: pandas.io.gbq

.. autosummary::
:toctree: generated/

read_gbq
to_gbq

.. currentmodule:: pandas


Supported Data Types
''''''''''''''''''''

Pandas supports all these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
``TIMESTAMP`` (microsecond precision). The data types ``BYTES`` and ``RECORD``
are not supported.

Integer and boolean ``NA`` handling
'''''''''''''''''''''''''''''''''''

.. versionadded:: 0.20

Since all columns in BigQuery queries are nullable, and NumPy lacks ``NA``
support for integer and boolean types, this module will store ``INTEGER`` or
``BOOLEAN`` columns with at least one ``NULL`` value as ``dtype=object``.
Otherwise those columns will be stored as ``dtype=int64`` or ``dtype=bool``
respectively.

This is the opposite of the default pandas behaviour, which promotes integer
types to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
for a detailed explanation.

While this trade-off works well for most cases, it breaks down for values
greater than 2**53. Such values in BigQuery can represent identifiers, and
unnoticed precision loss for identifiers is what we want to avoid.
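
For illustration, the default pandas promotion referred to above can be seen with plain
pandas, independently of BigQuery:

.. code-block:: python

   import pandas as pd

   s = pd.Series([1, 2, 3])
   s.dtype                        # int64
   s.reindex([0, 1, 2, 3]).dtype  # float64 -- the introduced NaN forces promotion

   # The gbq module instead returns nullable INTEGER/BOOLEAN columns as dtype=object,
   # e.g. pd.Series([1, None, 3], dtype=object), so values above 2**53 keep full precision.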

.. _io.bigquery_deps:

Dependencies
''''''''''''

This module requires the following additional dependencies:

- `httplib2 <https://github.com/httplib2/httplib2>`__: HTTP client
- `google-api-python-client <http://github.com/google/google-api-python-client>`__: Google's API client
- `oauth2client <https://github.com/google/oauth2client>`__: authentication and authorization for Google's API

.. _io.bigquery_authentication:

Authentication
''''''''''''''

.. versionadded:: 0.18.0

Authentication to the Google ``BigQuery`` service is via ``OAuth 2.0``.
It is possible to authenticate with either user account credentials or service account credentials.

Authenticating with user account credentials is as simple as following the prompts in a browser window
which will be automatically opened for you. You will be authenticated to the specified
``BigQuery`` account using the product name ``pandas GBQ``. This is only possible on a local host.
Remote authentication using user account credentials is not currently supported in pandas.
Additional information on the authentication mechanism can be found
`here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.

Authentication with service account credentials is possible via the `'private_key'` parameter. This method
is particularly useful when working on remote servers (e.g. a Jupyter notebook on a remote host).
Additional information on service accounts can be found
`here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.

Authentication via ``application default credentials`` is also possible. This is only valid
if the parameter ``private_key`` is not provided. This method also requires that
the credentials can be fetched from the environment the code is running in.
Otherwise, the OAuth2 client-side authentication is used.
Additional information can be found in the documentation on
`application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__.

.. versionadded:: 0.19.0

.. note::

The `'private_key'` parameter can be set to either the file path of the service account key
in JSON format, or key contents of the service account key in JSON format.

.. note::

A private key can be obtained from the Google developers console by clicking
`here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use JSON key type.
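
For example, a minimal sketch of service-account authentication (the project id and key
path below are placeholders):

.. code-block:: python

   # Authenticate with a service account key file instead of the browser flow
   df = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                    project_id='my-project-id',
                    private_key='/path/to/service_account_key.json')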

.. _io.bigquery_reader:

Querying
''''''''

Suppose you want to load all data from an existing BigQuery table: `test_dataset.test_table`
into a DataFrame using the :func:`~pandas.io.gbq.read_gbq` function.

.. code-block:: python

   # Insert your BigQuery Project ID Here
   # Can be found in the Google web console
   projectid = "xxxxxxxx"

   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)


You can define which column from BigQuery to use as an index in the
destination DataFrame as well as a preferred column order as follows:

.. code-block:: python

   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                            projectid,
                            index_col='index_column_name',
                            col_order=['col1', 'col2', 'col3'])


Starting with 0.20.0, you can specify the query config as a parameter to use additional options of your job.
For more information about query configuration parameters see
`here <https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.query>`__.

.. code-block:: python

   configuration = {
       'query': {
           "useQueryCache": False
       }
   }
   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                            projectid,
                            configuration=configuration)


.. note::

You can find your project id in the `Google developers console <https://console.developers.google.com>`__.


.. note::

You can toggle the verbose output via the ``verbose`` flag which defaults to ``True``.

.. note::

The ``dialect`` argument can be used to indicate whether to use BigQuery's ``'legacy'`` SQL
or BigQuery's ``'standard'`` SQL (beta). The default value is ``'legacy'``. For more information
on BigQuery's standard SQL, see `BigQuery SQL Reference
<https://cloud.google.com/bigquery/sql-reference/>`__
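
For example, a short sketch of running the same query with standard SQL (standard SQL
quotes table names with backticks; the project id is the same placeholder used above):

.. code-block:: python

   data_frame = pd.read_gbq('SELECT * FROM `test_dataset.test_table`',
                            projectid, dialect='standard')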

.. _io.bigquery_writer:

Writing DataFrames
''''''''''''''''''

Assume we want to write a DataFrame ``df`` into a BigQuery table using :func:`~pandas.DataFrame.to_gbq`.

.. ipython:: python

   df = pd.DataFrame({'my_string': list('abc'),
                      'my_int64': list(range(1, 4)),
                      'my_float64': np.arange(4.0, 7.0),
                      'my_bool1': [True, False, True],
                      'my_bool2': [False, True, False],
                      'my_dates': pd.date_range('now', periods=3)})

   df
   df.dtypes

.. code-block:: python

   df.to_gbq('my_dataset.my_table', projectid)

.. note::

The destination table and destination dataset will automatically be created if they do not already exist.

The ``if_exists`` argument can be used to dictate whether to ``'fail'``, ``'replace'``
or ``'append'`` if the destination table already exists. The default value is ``'fail'``.

For example, assume that ``if_exists`` is set to ``'fail'``. The following snippet will raise
a ``TableCreationError`` if the destination table already exists.

.. code-block:: python

   df.to_gbq('my_dataset.my_table', projectid, if_exists='fail')

.. note::

If the ``if_exists`` argument is set to ``'append'``, the destination dataframe will
be written to the table using the defined table schema and column types. The
dataframe must match the destination table in structure and data types.
If the ``if_exists`` argument is set to ``'replace'``, and the existing table has a
different schema, a delay of 2 minutes will be forced to ensure that the new schema
has propagated in the Google environment. See
`Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.
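
For instance, appending new rows to an existing table (with the same placeholder project id as above):

.. code-block:: python

   df.to_gbq('my_dataset.my_table', projectid, if_exists='append')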

Writing large DataFrames can result in errors due to size limitations being exceeded.
This can be avoided by setting the ``chunksize`` argument when calling :func:`~pandas.DataFrame.to_gbq`.
For example, the following writes ``df`` to a BigQuery table in batches of 10000 rows at a time:

.. code-block:: python

   df.to_gbq('my_dataset.my_table', projectid, chunksize=10000)

You can also see the progress of your upload via the ``verbose`` flag, which defaults to ``True``.
For example:

.. code-block:: ipython

   In [8]: df.to_gbq('my_dataset.my_table', projectid, chunksize=10000, verbose=True)

   Streaming Insert is 10% Complete
   Streaming Insert is 20% Complete
   Streaming Insert is 30% Complete
   Streaming Insert is 40% Complete
   Streaming Insert is 50% Complete
   Streaming Insert is 60% Complete
   Streaming Insert is 70% Complete
   Streaming Insert is 80% Complete
   Streaming Insert is 90% Complete
   Streaming Insert is 100% Complete

.. note::

If an error occurs while streaming data to BigQuery, see
`Troubleshooting BigQuery Errors <https://cloud.google.com/bigquery/troubleshooting-errors>`__.

.. note::

The BigQuery SQL query language has some oddities, see the
`BigQuery Query Reference Documentation <https://cloud.google.com/bigquery/query-reference>`__.

.. note::

While BigQuery uses SQL-like syntax, it has some important differences from traditional
databases in functionality, in API limitations (size and quantity of queries or uploads),
and in how Google charges for use of the service. You should refer to the `Google BigQuery documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
often, as the service is continually changing and evolving. BigQuery is best for analyzing large
sets of data quickly, but it is not a direct replacement for a transactional database.

.. _io.bigquery_create_tables:

Creating BigQuery Tables
''''''''''''''''''''''''

.. warning::

As of 0.17, the function :func:`~pandas.io.gbq.generate_bq_schema` has been deprecated and will be
removed in a future version.

As of 0.15.2, the gbq module has a function :func:`~pandas.io.gbq.generate_bq_schema` which will
produce the dictionary representation schema of the specified pandas DataFrame.

.. code-block:: ipython

   In [10]: gbq.generate_bq_schema(df, default_type='STRING')

   Out[10]: {'fields': [{'name': 'my_bool1', 'type': 'BOOLEAN'},
            {'name': 'my_bool2', 'type': 'BOOLEAN'},
            {'name': 'my_dates', 'type': 'TIMESTAMP'},
            {'name': 'my_float64', 'type': 'FLOAT'},
            {'name': 'my_int64', 'type': 'INTEGER'},
            {'name': 'my_string', 'type': 'STRING'}]}

.. note::

If you delete and re-create a BigQuery table with the same name, but different table schema,
you must wait 2 minutes before streaming data into the table. As a workaround, consider creating
the new table with a different name. Refer to
`Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

The PR removes the documentation above and replaces it with the following short section:

Starting in 0.20.0, pandas has split off Google BigQuery support into the
separate package ``pandas-gbq``. You can ``pip install pandas-gbq`` to get it.

.. note::

The ``pandas-gbq`` package provides functionality to read/write from Google BigQuery.

pandas integrates with this external package. If ``pandas-gbq`` is installed, you can
use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the
respective functions from ``pandas-gbq``.

Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__.

Member: Can you keep this first paragraph as intro, or part of it? (but just replace
'the pandas.io.gbq module' with 'the pandas-gbq package')

.. _io.stata:

9 changes: 9 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -360,6 +360,15 @@ New Behavior:
In [5]: df['a']['2011-12-31 23:59:59']
Out[5]: 1

.. _whatsnew_0200.api_breaking.gbq:

Pandas Google BigQuery support has moved
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

pandas has split off Google BigQuery support into a separate package ``pandas-gbq``. You can ``pip install pandas-gbq`` to get it.
The functionality of ``pd.read_gbq()`` and ``.to_gbq()`` remains the same with the currently released version of ``pandas-gbq=0.1.2``. (:issue:`15347`)
Documentation is now hosted `here <https://pandas-gbq.readthedocs.io/>`__
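
For illustration, usage stays the same after installing the new package (the project id
below is a placeholder):

.. code-block:: python

   # pip install pandas-gbq
   import pandas as pd

   df = pd.read_gbq('SELECT * FROM test_dataset.test_table', 'my-project-id')
   df.to_gbq('test_dataset.test_table', 'my-project-id', if_exists='append')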

.. _whatsnew_0200.api_breaking.memory_usage:

Memory Usage for Index is more Accurate
8 changes: 7 additions & 1 deletion pandas/core/frame.py
@@ -77,7 +77,8 @@
     OrderedDict, raise_with_traceback)
 from pandas import compat
 from pandas.compat.numpy import function as nv
-from pandas.util.decorators import deprecate_kwarg, Appender, Substitution
+from pandas.util.decorators import (deprecate_kwarg, Appender,
+                                    Substitution, docstring_wrapper)
 from pandas.util.validators import validate_bool_kwarg

 from pandas.tseries.period import PeriodIndex
@@ -941,6 +942,11 @@ def to_gbq(self, destination_table, project_id, chunksize=10000,
                    chunksize=chunksize, verbose=verbose, reauth=reauth,
                    if_exists=if_exists, private_key=private_key)

+    def _f():
+        from pandas.io.gbq import _try_import
+        return _try_import().to_gbq.__doc__
+    to_gbq = docstring_wrapper(to_gbq, _f)
+
     @classmethod
     def from_records(cls, data, index=None, exclude=None, columns=None,
                      coerce_float=False, nrows=None):
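
The ``docstring_wrapper`` call above appears to attach the docstring of the external
``pandas-gbq`` ``to_gbq`` function to ``DataFrame.to_gbq`` lazily, so the external package
is only imported when the docstring is actually requested. A minimal sketch of that idea
(an illustration, not the actual ``pandas.util.decorators`` implementation) could look like:

.. code-block:: python

   class docstring_wrapper(object):
       """Call through to ``func``; compute ``__doc__`` lazily on first access."""

       def __init__(self, func, doc_factory):
           self._func = func
           self._doc_factory = doc_factory

       def __get__(self, instance, owner=None):
           # Support use as a method: bind the wrapped function as a normal method would be.
           if instance is None:
               return self
           return docstring_wrapper(self._func.__get__(instance, owner), self._doc_factory)

       def __call__(self, *args, **kwargs):
           return self._func(*args, **kwargs)

       @property
       def __doc__(self):
           # The (possibly slow) import of the external package happens only here,
           # e.g. when ``df.to_gbq.__doc__`` is accessed.
           return self._doc_factory()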