SQLAlchemy: Add `insert_bulk` fast-path INSERT method for pandas
This method supports efficient batch inserts using CrateDB's bulk operations endpoint. https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
Showing 9 changed files with 218 additions and 34 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
.. _sqlalchemy-pandas: | ||
.. _sqlalchemy-dataframe: | ||
|
||
================================ | ||
SQLAlchemy: DataFrame operations | ||
================================ | ||
|
||
About | ||
===== | ||
|
||
This section of the documentation demonstrates support for efficient batch | ||
``INSERT`` operations with `pandas`_, using the CrateDB SQLAlchemy dialect. | ||
|
||
|
||
Introduction | ||
============ | ||
|
||
The :ref:`pandas DataFrame <pandas:api.dataframe>` is a structure that contains | ||
two-dimensional data and its corresponding labels. DataFrames are widely used | ||
in data science, machine learning, scientific computing, and many other | ||
data-intensive fields. | ||
|
||
DataFrames are similar to SQL tables or the spreadsheets that you work with in | ||
Excel or Calc. In many cases, DataFrames are faster, easier to use, and more | ||
powerful than tables or spreadsheets because they are an integral part of the | ||
`Python`_ and `NumPy`_ ecosystems. | ||
|
||
The :ref:`pandas I/O subsystem <pandas:api.io>` for `relational databases`_ | ||
using `SQL`_ is based on `SQLAlchemy`_. | ||
|
||
|
||
.. rubric:: Table of Contents | ||
|
||
.. contents:: | ||
:local: | ||
|
||
|
||
Efficient ``INSERT`` operations with pandas | ||
=========================================== | ||
|
||
The package provides an ``insert_bulk`` function for use as the ``method`` | ||
argument of :meth:`pandas:pandas.DataFrame.to_sql`, based on the `CrateDB | ||
bulk operations`_ endpoint. pandas splits the insert workload into batches of | ||
``chunksize`` records, and ``insert_bulk`` submits each batch in a single bulk request. | ||
|
||
>>> import sqlalchemy as sa | ||
>>> from pandas._testing import makeTimeDataFrame | ||
>>> from crate.client.sqlalchemy.support import insert_bulk | ||
... | ||
>>> # Define number of records, and chunk size. | ||
>>> INSERT_RECORDS = 42 | ||
>>> CHUNK_SIZE = 8 | ||
... | ||
>>> # Connect to CrateDB, and create a pandas DataFrame. | ||
>>> df = makeTimeDataFrame(nper=INSERT_RECORDS, freq="S") | ||
>>> engine = sa.create_engine(f"crate://{crate_host}") | ||
... | ||
>>> # Insert records in batches. 42 / 8 = 5.25, so six batches are needed. | ||
>>> df.to_sql( | ||
... name="test-testdrive", | ||
... con=engine, | ||
... if_exists="replace", | ||
... index=False, | ||
... chunksize=CHUNK_SIZE, | ||
... method=insert_bulk, | ||
... ) | ||
|
||
.. TIP:: | ||
|
||
The optimal chunk size depends heavily on the shape of your data, specifically | ||
the width of each record, i.e. the number of columns and their individual sizes. | ||
You will need to determine a good chunk size by running experiments with your | ||
own data. For that purpose, you can use the `insert_pandas.py`_ program as a | ||
blueprint. | ||
|
||
It is a good idea to start your explorations with a chunk size of 5000, and | ||
then see if performance improves when you increase or decrease that figure. | ||
Chunk sizes of 20000 may also be applicable, but make sure to take the limits | ||
of your HTTP infrastructure into consideration. | ||
|
||
In order to learn more about what wide- vs. long-form (tidy, stacked, narrow) | ||
data means in the context of `DataFrame computing`_, let us refer you to `a | ||
general introduction <wide-narrow-general_>`_, the corresponding section in | ||
the `Data Computing book <wide-narrow-data-computing_>`_, and a `pandas | ||
tutorial <wide-narrow-pandas-tutorial_>`_ about the same topic. | ||
|
||
|
||
.. hidden: Disconnect from database | ||
>>> engine.dispose() | ||
.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations | ||
.. _DataFrame computing: https://realpython.com/pandas-dataframe/ | ||
.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py | ||
.. _NumPy: https://en.wikipedia.org/wiki/NumPy | ||
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software) | ||
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html | ||
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language) | ||
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database | ||
.. _SQL: https://en.wikipedia.org/wiki/SQL | ||
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html | ||
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data | ||
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow | ||
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data |
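The batch count quoted in the documentation example above (42 records with a chunk size of 8, giving six batches) can be checked with a minimal sketch. It does not call pandas or CrateDB; `split_into_chunks` is a hypothetical helper that only mimics the slicing idea behind the `chunksize` argument, and the integer rows stand in for real records.

```python
import math


def split_into_chunks(rows, chunksize):
    """Yield consecutive slices of at most `chunksize` rows (hypothetical helper)."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


# Same figures as in the documentation example.
INSERT_RECORDS = 42
CHUNK_SIZE = 8

# Placeholder records; real rows would be tuples of column values.
rows = list(range(INSERT_RECORDS))

batches = list(split_into_chunks(rows, CHUNK_SIZE))
batch_sizes = [len(batch) for batch in batches]

# The number of batches is the record count divided by the chunk size, rounded up.
expected_batches = math.ceil(INSERT_RECORDS / CHUNK_SIZE)

print(batch_sizes)
print(expected_batches)
```

Five batches are full and the last one carries the remaining two records, so each call to the `method` callable may receive fewer rows than the configured chunk size.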
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# -*- coding: utf-8; -*- | ||
# | ||
# Licensed to CRATE Technology GmbH ("Crate") under one or more contributor | ||
# license agreements. See the NOTICE file distributed with this work for | ||
# additional information regarding copyright ownership. Crate licenses | ||
# this file to you under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. You may | ||
# obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT | ||
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the | ||
# License for the specific language governing permissions and limitations | ||
# under the License. | ||
# | ||
# However, if you have executed another commercial license agreement | ||
# with Crate these terms will supersede the license and you may use the | ||
# software solely pursuant to the terms of the relevant commercial agreement. | ||
import logging | ||
|
||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
def insert_bulk(pd_table, conn, keys, data_iter): | ||
""" | ||
Use CrateDB's "bulk operations" endpoint as a fast path for pandas' and Dask's `to_sql()` [1] method. | ||
The idea is to break out of SQLAlchemy, compile the insert statement, and use the raw | ||
DBAPI connection client, in order to invoke a request using `bulk_parameters` [2]:: | ||
cursor.execute(sql=sql, bulk_parameters=data) | ||
The vanilla implementation, used by SQLAlchemy, is:: | ||
data = [dict(zip(keys, row)) for row in data_iter] | ||
conn.execute(pd_table.table.insert(), data) | ||
Batch chunking will happen outside of this function, for example [3] demonstrates | ||
the relevant code in `pandas.io.sql`. | ||
[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html | ||
[2] https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations | ||
[3] https://github.com/pandas-dev/pandas/blob/v2.0.1/pandas/io/sql.py#L1011-L1027 | ||
""" | ||
|
||
# Compile SQL statement and materialize batch. | ||
sql = str(pd_table.table.insert().compile(bind=conn)) | ||
data = list(data_iter) | ||
|
||
# For debugging and tracing the batches running through this method. | ||
# Because it's a performance-optimized code path, the log statements are not active by default. | ||
# logger.info(f"Bulk SQL: {sql}") | ||
# logger.info(f"Bulk records: {len(data)}") | ||
# logger.info(f"Bulk data: {data}") | ||
|
||
# Invoke bulk insert operation. | ||
cursor = conn._dbapi_connection.cursor() | ||
cursor.execute(sql=sql, bulk_parameters=data) | ||
cursor.close() |
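To illustrate what the fast path changes, the following sketch contrasts the two data shapes mentioned in the docstring: one dictionary per row for the vanilla SQLAlchemy path, and a plain list of rows for the bulk path. It also shows the JSON body that the CrateDB HTTP endpoint accepts for bulk operations, with the statement under `stmt` and the rows under `bulk_args`. The helper `build_bulk_request`, the table name, and the sample values are assumptions made for illustration; the sketch does not reproduce how the driver builds its request internally.

```python
import json


def build_bulk_request(sql, data_iter):
    """Build a bulk operations request body (hypothetical helper, for illustration)."""
    # Materialize the batch: one list of positional parameters per record.
    bulk_args = [list(row) for row in data_iter]
    return {"stmt": sql, "bulk_args": bulk_args}


# Sample column names and rows; these values are invented for the example.
keys = ["time", "value"]
rows = [(1, 0.5), (2, 0.7)]
sql = 'INSERT INTO "testdrive" ("time", "value") VALUES (?, ?)'

# Vanilla path: one dictionary per row, keyed by column name.
vanilla = [dict(zip(keys, row)) for row in rows]

# Fast path: the same rows as positional parameter lists in one request body.
payload = build_bulk_request(sql, iter(rows))
body = json.dumps(payload)

print(vanilla)
print(body)
```

The bulk form sends the statement once and avoids building a dictionary per record, which is where the savings of the fast path are expected to come from.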