
Support to read/write from/to ODBC #849

Merged
merged 15 commits into main from odbc
Mar 5, 2022

Conversation

jorgecarleitao
Owner

@jorgecarleitao commented Feb 17, 2022

This PR adds support for reading from, and writing to, an ODBC driver.

I anticipate this to be one of the most efficient ways of loading data into arrow, as odbc-api offers an API to load data in a columnar format whereby most buffers are copied back-to-back into arrow (even when nulls are present). Variable-length values and validity need a small O(N) deserialization, so it is not as fast as Arrow IPC (but likely much faster than Parquet).
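
For context, a minimal sketch of how a read could start with the odbc-api crate this PR builds on. The connection string and query are hypothetical, and the arrow2-side steps (schema inference and deserialization in the new src/io/odbc/read module) are only described in comments, since the merged signatures are not shown in this thread.

use odbc_api::Environment;

fn main() -> Result<(), odbc_api::Error> {
    let env = Environment::new()?;
    // Hypothetical connection string; any installed ODBC driver/data source works.
    let connection = env.connect_with_connection_string("Driver={SQLite3};Database=example.db;")?;
    if let Some(cursor) = connection.execute("SELECT id, name FROM users", ())? {
        // From here, the PR's read module would:
        // 1. infer an Arrow schema from the cursor's ODBC column descriptions,
        // 2. bind columnar buffers to the cursor and fetch batches,
        // 3. deserialize each batch into Arrow arrays (back-to-back copies for
        //    fixed-size types; an O(N) pass for variable-length values and validity).
        let _ = cursor;
    }
    Ok(())
}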

@jorgecarleitao added the feature label Feb 17, 2022
@codecov

codecov bot commented Feb 17, 2022

Codecov Report

Merging #849 (d971e40) into main (972c6f8) will decrease coverage by 0.94%.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
- Coverage   71.50%   70.56%   -0.95%     
==========================================
  Files         337      343       +6     
  Lines       18445    18693     +248     
==========================================
+ Hits        13190    13191       +1     
- Misses       5255     5502     +247     
Impacted Files Coverage Δ
src/io/odbc/mod.rs 0.00% <0.00%> (ø)
src/io/odbc/read/deserialize.rs 0.00% <0.00%> (ø)
src/io/odbc/read/schema.rs 0.00% <0.00%> (ø)
src/io/odbc/write/mod.rs 0.00% <0.00%> (ø)
src/io/odbc/write/schema.rs 0.00% <0.00%> (ø)
src/io/odbc/write/serialize.rs 0.00% <0.00%> (ø)
src/compute/arithmetics/time.rs 26.60% <0.00%> (+0.91%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 972c6f8...d971e40.

@ghuls
Contributor

ghuls commented Feb 18, 2022

This might also be something to consider in the future when it is more worked out: https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/

@jorgecarleitao changed the title from "WIP: Support to read from ODBC" to "Support to read from ODBC" Feb 18, 2022
@jorgecarleitao
Owner Author

One design question that I have no answer to: arrow uses variable-sized utf8 and binary, but ODBC requires a fixed length.

Any ideas?

@ritchie46
Collaborator

One design question that I have no answer to: arrow uses variable-sized utf8 and binary, but ODBC requires a fixed length.

Any ideas?

How does that work with different sized strings? Does it take the max length? Or pointers to the heap?

@jorgecarleitao
Owner Author

jorgecarleitao commented Feb 18, 2022

It has the following representation: given a maximum length N and a batch size of n we have:

values: vec![u8; N * n]
lengths: vec![isize; n]

where length == -1 for null entries. A (valid) value is retrieved at &values[index * N..index * N + lengths[index]].
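
A minimal sketch in plain Rust (not odbc-api's actual types) of how a single slot would be read out of that representation; get_slot is an illustrative name:

fn get_slot<'a>(values: &'a [u8], lengths: &[isize], n_max: usize, index: usize) -> Option<&'a [u8]> {
    // n_max is the maximum length N; lengths[index] == -1 marks a null slot.
    let len = lengths[index];
    if len < 0 {
        None
    } else {
        let start = index * n_max;
        Some(&values[start..start + len as usize])
    }
}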

@ritchie46
Collaborator

It has the following representation: given a maximum length N and a batch size of n we have:

values: vec![u8; N * n]
lengths: vec![isize; n]

where length == -1 for null entries. A (valid) value is retrieved at &values[index * N..index * N + lengths[index]].

So a slot in values is padded with n = max_len - current_len bytes?

@jorgecarleitao
Owner Author

jorgecarleitao commented Feb 18, 2022

Exactly, some bytes are unused:

["hello world", "hi", None, "bye bye", "comfortably numb"]

is represented as

values
 <-----N=16----->
[hello world     ][hi              ][                ][bye bye         ][comfortably numb]
lengths: [11, 2, -1, 7, 16]

(in rows)
 <-----N=16----->
[hello world     ] 11
[hi              ] 2
[                ] -1
[bye bye         ] 7
[comfortably numb] 16
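
A hedged sketch of the O(N) deserialization mentioned in the description: copying this padded, fixed-width ODBC buffer into Arrow's variable-length layout (concatenated values, offsets, validity). The function is illustrative, not the crate's actual API:

fn to_variable(values: &[u8], lengths: &[isize], n_max: usize) -> (Vec<u8>, Vec<i32>, Vec<bool>) {
    let mut out = Vec::new();               // concatenated values, padding dropped
    let mut offsets = vec![0i32];           // Arrow offsets (lengths.len() + 1 entries)
    let mut validity = Vec::with_capacity(lengths.len());
    for (i, &len) in lengths.iter().enumerate() {
        if len < 0 {
            offsets.push(*offsets.last().unwrap()); // null: repeat the previous offset
            validity.push(false);
        } else {
            let start = i * n_max;
            out.extend_from_slice(&values[start..start + len as usize]);
            offsets.push(out.len() as i32);
            validity.push(true);
        }
    }
    (out, offsets, validity)
}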

@ritchie46
Collaborator

Right... and this is per batch? If so, I would choose a relatively low size so that outliers don't blow up memory.

@Dandandan marked this pull request as ready for review February 18, 2022 19:56
@Dandandan marked this pull request as draft February 18, 2022 19:57
@jorgecarleitao changed the title from "Support to read from ODBC" to "Support to read/write from/to ODBC" Feb 19, 2022
@jorgecarleitao
Owner Author

This is almost done, pending pacman82/odbc-api#174

@ritchie46
Collaborator

ritchie46 commented Mar 3, 2022

Just curious. Are the ODBC drivers compiled via cargo, or is this something that should be set up on the client side? I know that setting up ODBC with TurboDBC was non-trivial.

edit

Note to self, click a bit before asking questions:

To build this library you need to link against the odbc library of your system's ODBC driver manager. It should be automatically detected by the build. On Windows systems it is preinstalled. On Linux and OS-X unix-odbc must be installed. To create a Connection to a data source, its ODBC driver must also be installed.

I do wonder how this would go with precompiled binaries, if we can dynamically link here.

@pacman82
Collaborator

pacman82 commented Mar 3, 2022

I know that setting up ODBC with TurboDBC was non-trivial

Usually the complexity comes from building turbodbc with the right Arrow / Python / C++ ABI. This is because turbodbc depends on the Python ABI, the C++ ABI, and the Arrow version, for calling into C++ code and creating pyarrow arrays.

Following the issue tracker of turbodbc, I feel like well over 90% of the issues are build-chain related. This had been my motivation to create arrow-odbc-py to begin with.

This crate does not link against the Python interpreter directly. It does not rely on ABI compatibility with the Arrow C++ interface. It does not even use C++ (and therefore does not depend on its ABI). My hope is that it should be way easier to set up.

Setting up the ODBC driver manager itself is not required on Windows, and is usually trivial on macOS or Linux, at least if you have sudo privileges. A brew install or apt-get install usually does the trick.

Installing the driver for a specific data source is an additional step. Yet this is also the point of ODBC: we don't want to ship every driver ever, and we could not for the ones which will be written in the future. The effort it takes to install a specific driver varies hugely between vendors. Usually the system package manager works fine, though.

I do wonder how this would go with precompiled binaries, if we can dynamically link here.

a) Redeploying unix_odbc might be a licensing issue.
b) I experimented with shipping unix_odbc dynamic libraries directly in a wheel, yet this makes setting up drivers much harder, as it might not pick up drivers installed in the system right away.

@jorgecarleitao marked this pull request as ready for review March 5, 2022 18:46
@jorgecarleitao merged commit c999595 into main Mar 5, 2022
@jorgecarleitao deleted the odbc branch March 5, 2022 19:43