Add fast-path INSERT method insert_bulk for SQLAlchemy/pandas/Dask #553
Conversation
.. TIP::

    You will observe that the optimal chunk size highly depends on the shape of
    your data, specifically the width of each record, i.e. the number of columns
    and their individual sizes. You will need to determine a good chunk size by
    running corresponding experiments on your own behalf. For that purpose, you
    can use the `insert_pandas.py`_ program as a blueprint.

    It is a good idea to start your explorations with a chunk size of 5000, and
    then see if performance improves when you increase or decrease that figure.
    Chunk sizes of 20000 may also be applicable, but make sure to take the limits
    of your HTTP infrastructure into consideration.

    In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
    data means in the context of `DataFrame computing`_, let us refer you to `a
    general introduction <wide-narrow-general_>`_, the corresponding section in
    the `Data Computing book <wide-narrow-data-computing_>`_, and a `pandas
    tutorial <wide-narrow-pandas-tutorial_>`_ about the same topic.
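To make that experimentation concrete, here is a minimal sketch of the kind of invocation the tip refers to. The import path of insert_bulk, the connection string, and the synthetic data are assumptions for illustration; the chunk size of 5000 follows the recommendation above.

```python
import numpy as np
import pandas as pd
import sqlalchemy as sa

# Assumed import path of the fast-path helper introduced by this patch.
from crate.client.sqlalchemy.support import insert_bulk

# Placeholder connection string to a local CrateDB instance.
engine = sa.create_engine("crate://localhost:4200")

# Synthetic DataFrame: 75,000 records, five float columns wide.
df = pd.DataFrame(np.random.random((75_000, 5)), columns=list("abcde"))

# Start with a chunk size of 5000 and vary it to find the sweet spot
# for your data shape and HTTP infrastructure.
df.to_sql(
    name="testdrive",
    con=engine,
    if_exists="replace",
    index=False,
    chunksize=5_000,
    method=insert_bulk,
)
```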
On this section, about choosing the optimal batch size, @marijaselakovic had a good suggestion:
Looks good, I am just thinking maybe it makes sense to add that when choosing the “optimal” chunksize, one should consider, besides the shape and size of the data, also the memory of the machine. I find here that the chunk should be small enough to fit into memory, but large enough to minimize the overhead of data ingestion. Even though this is from the Dask documentation, it should apply to pandas too. The rest looks good!
Thank you!
@all: Please let me know if you can think of any other details I might have missed about educating users on this topic. Now is the right time to bring them in, ideally in the form of references to other valuable resources. Thank you already!
Hi @marijaselakovic. I've improved the documentation section with 480897a, according to your suggestions. I did not use the link to Dask yet, because this section currently only explains efficient batching for pandas, and I do not want to confuse users too much.
Another section about Dask should probably be added right away, in order to better outline the parallel computing aspects around it, and to support the upcoming tutorial appropriately.
Hi @marijaselakovic,
more details have been added to this patch with 3e7d1ea, and at [1] you can find the corresponding documentation section about inserting data efficiently with Dask. I still did not use the link you shared because, after getting into the details, I found it specifically educates users about chunking in the context of Dask arrays. While the very same principles apply, I discovered the Dask Best Practices resources [2], which also expand the scope to Dask DataFrames, which is what we are working with here, so I used that link instead.
With kind regards,
Andreas.
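To complement that, here is a rough sketch of what such a Dask-based ingest could look like. This is not the content of the documentation section itself; the insert_bulk import path, the connection URI, and the partition and chunk settings are illustrative assumptions.

```python
import dask.dataframe as dd
import numpy as np
import pandas as pd

# Assumed import path of the fast-path helper introduced by this patch.
from crate.client.sqlalchemy.support import insert_bulk

# Synthetic data, split into four Dask partitions.
pdf = pd.DataFrame(np.random.random((100_000, 5)), columns=list("abcde"))
ddf = dd.from_pandas(pdf, npartitions=4)

# Dask's `to_sql` accepts a connection URI string; each partition is
# written in chunks of `chunksize` records using the fast-path method,
# and `parallel=True` lets partitions be inserted concurrently.
ddf.to_sql(
    "testdrive_dask",
    uri="crate://localhost:4200",
    index=False,
    if_exists="replace",
    chunksize=10_000,
    parallel=True,
    method=insert_bulk,
)
```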
You might also consider adding the Dask DataFrames best practices [1].
Good idea. The link has been added.
Regarding the failing software tests: they should not stop you from reviewing this patch. I will exclude the relevant test matrix slots from being executed, and add a corresponding …
# 42 records / 4 partitions means each partition has a size of 10.5 elements.
# Because the chunk size 8 is slightly smaller than 10, the partition will not
# fit into it, so two batches will be emitted to the database for each data
# partition. 4 partitions * 2 batches = 8 insert operations will be emitted.
INSERT_RECORDS = 42
NPARTITIONS = 4
CHUNK_SIZE = 8
OPCOUNT = math.ceil(INSERT_RECORDS / NPARTITIONS / CHUNK_SIZE) * NPARTITIONS
This is a perfect example of non-optimal settings, and has been made so on purpose. We should probably add a corresponding inline comment here, in order not to confuse dear readers of our code.
Can you add the comment?
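One possible wording for such an inline comment, offered here only as a suggestion rather than the final text:

```python
# NOTE: These settings are intentionally non-optimal. The chunk size (8) is
# smaller than the partition size (42 / 4 = 10.5), so every partition is
# split into two batches, which is exactly what this test wants to exercise:
# 4 partitions * 2 batches = 8 insert operations.
```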
Nice! Added some suggestions.
# For debugging and tracing the batches running through this method.
# Because it's a performance-optimized code path, the log statements are not active by default.
# logger.info(f"Bulk SQL: {sql}")
# logger.info(f"Bulk records: {len(data)}")
# logger.info(f"Bulk data: {data}")
Can't this be simplified by changing the level to DEBUG, and surrounding it with something like if logger.level is DEBUG (don't know the exact syntax), to avoid any serialization?
I tried with 4d038ec105fc, but omitted logging the data payload.
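For reference, a minimal sketch of the guard suggested above, using the standard logging.Logger.isEnabledFor check; the helper name and its arguments are illustrative, not the actual code of the patch.

```python
import logging

logger = logging.getLogger(__name__)


def log_bulk_details(sql, data):
    # The guard skips the f-string formatting entirely unless DEBUG logging
    # is enabled, so the performance-optimized code path stays unaffected.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug(f"Bulk SQL:     {sql}")
        logger.debug(f"Bulk records: {len(data)}")
```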
Hi. Thanks for all the excellent review comments.
Initially, I rushed a bit to bring this in in time for the tutorial, but I missed the deadline. Now that we have a little more headroom, I think this patch should only be the foundation, and those aspects should be addressed by corresponding followup patches.
I hope you agree with that strategy. With kind regards. P.S.: Added as backlog items at crate/sqlalchemy-cratedb#74.
Great work @amotl
The documentation on Dask and Pandas fast insert looks good!
running corresponding experiments on your own behalf. For that purpose, you
can use the `insert_pandas.py`_ program as a blueprint.
and their individual sizes, which will in the end determine the total size of
each batch/chunk.
Can we also add that the chunk should be small enough to fit into memory, but large enough to minimize the overhead of data ingestion?
Thank you. It is there, a few more lines below.
...
>>> # Create a pandas DataFrame, and connect to CrateDB.
>>> df = makeTimeDataFrame(nper=INSERT_RECORDS, freq="S")
>>> engine = sa.create_engine(f"crate://{crate_host}")
Can we add here the full SQLAlchemy URI: crate://user:password@crate_host:port?
Thanks for your suggestion. It is technically not possible, because this reStructuredText file is not only documentation, but also part of the test suite, being exercised with Python's doctests in order to make sure the code examples don't break. Apologies.
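For readers who want the general shape of a fully qualified connection string anyway, here is a sketch with placeholder credentials, assuming CrateDB's default HTTP port 4200; it is not part of the doctested file.

```python
import sqlalchemy as sa

# Placeholder credentials and hostname; adjust to your environment.
engine = sa.create_engine("crate://user:password@crate_host:4200")
```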
@@ -68,6 +68,7 @@ def read(path):
    'zope.testrunner>=5,<7',
    'zc.customdoctests>=1.0.1,<2',
    'createcoverage>=1,<2',
    'dask',
I am not sure, but shouldn't we have dask[complete] here, to install all Dask dependencies (e.g., dask.dataframe)?
This list of dependencies is only used as requirements for running the test suite. We do not automatically install pandas or dask together with the library, as runtime dependencies. As such, I think it is fine to use a minimal set of dependencies, which is enough to satisfy the requirements of the test suite.
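Should the test requirements ever need to pull in the DataFrame-related pieces explicitly, Dask's documented extras offer one option. A sketch of how such an entry might look, mirroring the structure of the diff above; the surrounding entries are copied from it and the choice of extra is hypothetical.

```python
# Hypothetical variant of the test requirements, using the "dataframe"
# extra so that pandas and dask.dataframe are installed explicitly.
test_requires = [
    'zope.testrunner>=5,<7',
    'zc.customdoctests>=1.0.1,<2',
    'createcoverage>=1,<2',
    'dask[dataframe]',
]
```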
This method supports efficient batch inserts using CrateDB's bulk operations endpoint. https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations Co-authored-by: hlcianfagna <110453267+hlcianfagna@users.noreply.github.com>
Specifically, outline _two_ concrete considerations for determining the optimal chunk size, and improve wording.
Hi again,
in the same spirit as crate/crate-pdo#143, this patch unlocks CrateDB bulk operations for efficient batch inserts using pandas and Dask. It will accompany a followup to the Data Processing and Analytics with Dask and CrateDB: A Step-by-Step Tutorial.
The corresponding documentation section can be inspected at Preview: SQLAlchemy: DataFrame operations.
With kind regards,
Andreas.
/cc @marijaselakovic, @hammerhead, @hlcianfagna, @proddata, @WalBeh