Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(c/driver/sqlite): Wrap bulk ingests in a single begin/commit txn #910

Merged
merged 4 commits into from
Jul 25, 2023

Conversation

ywc88
Copy link
Collaborator

@ywc88 ywc88 commented Jul 17, 2023

Fixes #466 by having a single begin/commit txn for ingesting tables, instead of committing once per row.

Testing

R

I first installed the SQLite R driver and measured the time it takes to bulk ingest the nycflights13 dataset, noting the 336776 rowcount:

library(adbcdrivermanager)
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL

stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)

start_time <- Sys.time()
adbc_statement_execute_query(stmt)
end_time <- Sys.time()
[1] 336776

As well as the time it takes to execute the query:

end_time - start_time
Time difference of 1.711345 mins

As a followup, I noticed it takes significantly longer (~30 minutes) to execute the query on my XPS 15 9520:

  • Ubuntu 22.04.2
  • kernel 5.19.0
  • i9-12900HK @ 4.90 GHz
  • Intel Alder Lake-P
  • 64 GB RAM

Compared to my Macbook Air (1.711345 minutes):

  • macOS Ventura 13.4.1
  • Apple M2
  • 16 GB RAM

Both on the same version of R 4.3.1 (Beagle Scout).

After making my changes, I ran R CMD INSTALL . --preclean in arrow-adbc/r/adbcdrivermanager. I also installed the following R packages:

# install.packages("devtools")
# install.packages("pkgbuild")

After which I ran the following commands to validate no build / compile issues showing up as R packaged my changes:

devtools::build()
devtools::check()
devtools::install()

Noting that the file I made changes to showed up as a vendored file:

Vendoring files from arrow-adbc to src/:
- ../../adbc.h
- ../../c/driver/sqlite/sqlite.c
- ../../c/driver/sqlite/statement_reader.c
- ../../c/driver/sqlite/statement_reader.h
- ../../c/driver/sqlite/types.h
- ../../c/driver/common/utils.c
- ../../c/driver/common/utils.h
- ../../c/vendor/nanoarrow/nanoarrow.h
- ../../c/vendor/nanoarrow/nanoarrow.c
- ../../c/vendor/sqlite3/sqlite3.h
- ../../c/vendor/sqlite3/sqlite3.c
All files successfully copied to src/

After packaging and installing my changes, I ran through the same bulk ingest commands for the nycflights13 dataset and verified that the table contained the same number of rows as the previous run, also noting the speedup from 1.7 minutes to 0.2 seconds:

...
> start_time <- Sys.time()
adbc_statement_execute_query(stmt)
end_time <- Sys.time()
[1] 336776
> end_time - start_time
Time difference of 0.2236128 secs

@ywc88 ywc88 requested a review from lidavidm as a code owner July 17, 2023 21:03
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Is it possible to add a short test? Even in R or Python?

@lidavidm lidavidm added this to the ADBC Libraries 0.6.0 milestone Jul 20, 2023
@lidavidm lidavidm merged commit 4394686 into apache:main Jul 25, 2023
61 of 65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C] SQLite driver is very slow when trying to bulk ingest with autocommit = true
2 participants