Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix(c/driver/sqlite): Wrap bulk ingests in a single begin/commit txn (#…
…910) Fixes #466 by having a single begin/commit txn for ingesting tables, instead of committing once per row. # Testing ## R I first installed the [SQLite R driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and measured the time it takes to bulk ingest the `nycflights13` dataset, noting the 336776 rowcount: ```r library(adbcdrivermanager) db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db") con <- adbc_connection_init(db) flights <- nycflights13::flights flights$time_hour <- NULL stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights") adbc_statement_bind(stmt, flights) start_time <- Sys.time() adbc_statement_execute_query(stmt) end_time <- Sys.time() [1] 336776 ``` As well as the time it takes to execute the query: ```r end_time - start_time Time difference of 1.711345 mins ``` As a followup, I noticed it takes significantly longer (~30 minutes) to execute the query on my XPS 15 9520: - Ubuntu 22.04.2 - kernel 5.19.0 - i9-12900HK @ 4.90 GHz - Intel Alder Lake-P - 64 GB RAM Compared to my Macbook Air (1.711345 minutes): - macOS Ventura 13.4.1 - Apple M2 - 16 GB RAM Both on the same version of R 4.3.1 (Beagle Scout). After making my changes, I ran `R CMD INSTALL . --preclean` in `arrow-adbc/r/adbcdrivermanager`. I also installed the following R packages: ```r # install.packages("devtools") # install.packages("pkgbuild") ``` After which I ran the following commands to validate no build / compile issues showing up as R packaged my changes: ```r devtools::build() devtools::check() devtools::install() ``` Noting that the file I made changes to showed up as a vendored file: ```r Vendoring files from arrow-adbc to src/: - ../../adbc.h - ../../c/driver/sqlite/sqlite.c - ../../c/driver/sqlite/statement_reader.c - ../../c/driver/sqlite/statement_reader.h - ../../c/driver/sqlite/types.h - ../../c/driver/common/utils.c - ../../c/driver/common/utils.h - ../../c/vendor/nanoarrow/nanoarrow.h - ../../c/vendor/nanoarrow/nanoarrow.c - ../../c/vendor/sqlite3/sqlite3.h - ../../c/vendor/sqlite3/sqlite3.c All files successfully copied to src/ ``` After packaging and installing my changes, I ran through the same bulk ingest commands for the `nycflights13` dataset and verified that the table contained the same number of rows as the previous run, also noting the speedup from 1.7 minutes to 0.2 seconds: ```r ... > start_time <- Sys.time() adbc_statement_execute_query(stmt) end_time <- Sys.time() [1] 336776 > end_time - start_time Time difference of 0.2236128 secs ```
- Loading branch information