Add Apache Arrow result set support as turbodbc_arrow #26
Conversation
@MathMagique I've added in the latest commit C-only (i.e. boost-free) routines to translate SQL datetimes/dates to ms/us. Might be interesting for other parts of …

Those routines sure are interesting - another step to get rid of boost :-)
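For orientation, here is a minimal Python sketch of what such routines compute; the PR's real code is C++, and the field semantics assume ODBC's SQL_TIMESTAMP_STRUCT, whose fraction field is in nanoseconds:

```python
from datetime import date, datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamp_to_microseconds(year, month, day, hour, minute, second, fraction):
    # ODBC's SQL_TIMESTAMP_STRUCT stores `fraction` in nanoseconds;
    # dividing by 1000 yields the microsecond resolution used here.
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int((dt - EPOCH).total_seconds()) * 1000000 + fraction // 1000

def date_to_days(year, month, day):
    # Arrow's date32 counts whole days since the UNIX epoch.
    return (date(year, month, day) - date(1970, 1, 1)).days
```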
Force-pushed 75b5ee0 to 994af0e.
@MathMagique one problem I face here is that …

The Python library is not necessarily called "libpython" on each system. Pybind11's cmake scripts will expose a …

Either we go ahead and use …
```cpp
struct tm date = {0};
date.tm_year = sql_date.year - 1900;
date.tm_mon = sql_date.month - 1;
date.tm_mday = sql_date.day;
```
Should this not also have -1, since the definition for std::tm is "since 1900-01-01"?
I understand why this is wrong now.
Just out of curiosity, does this implement a fetchbatch or fetchmany method?
At the moment it only implements a fetchall method, but adding fetchbatch/fetchmany wouldn't be that hard. First I would like to get the branch merged :)
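For context, a hedged sketch of how the fetchall-style Arrow path might look from Python; the method name fetchallarrow, the DSN, and the table are illustrative assumptions, not confirmed by this thread:

```python
import turbodbc

connection = turbodbc.connect(dsn='MyDSN')  # placeholder DSN
cursor = connection.cursor()
cursor.execute("SELECT a, b FROM my_table")

# Fetch the complete result set as one pyarrow.Table; a future
# fetchbatch/fetchmany variant would yield one batch at a time instead.
table = cursor.fetchallarrow()  # method name assumed for illustration
df = table.to_pandas()
```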
Before it diverges too much ;)
wesm left a comment:
This will be awesome to have; let me know if there's anything I can do to help. @Maxris may be able to help with MSVC packaging + conda-forge builds as soon as this is merged
```cpp
// Milliseconds since the epoch
return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000);

// Days since the epoch
return lrint(difftime(mktime(&date), mktime(&epoch)) / 86400);
```
Take note of what was required for MSVC compatibility: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/util/datetime.h#L35
I didn't take a look at the code, but are you able to comment on whether the datatypes coming out will closely correspond with those fed in? E.g. in turbodbc_numpy all ints are coerced to int8s, which is extremely space-inefficient for a series of byteints.
Is byteint different from int8?
My bad, I meant int64
We should use the exact C types for Arrow
@yalwan-scee Currently, the internal API of turbodbc is only capable of reporting 64-bit integers. That is not a fundamental issue and is easily fixed; however, it requires extra work. I'd like to have a first increment of the Arrow support before I extend anything internally.
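To illustrate what "exact C types" would mean here (a hypothetical mapping for illustration, not turbodbc's actual internal table): narrower SQL integer types would map to correspondingly narrow Arrow types instead of all being widened to int64.

```python
import pyarrow as pa

# Hypothetical mapping from SQL integer types to exact-width Arrow types;
# today's internals would report the equivalent of pa.int64() for all of these.
EXACT_INTEGER_TYPES = {
    'TINYINT': pa.int8(),
    'SMALLINT': pa.int16(),
    'INTEGER': pa.int32(),
    'BIGINT': pa.int64(),
}
```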
Of course, it doesn't look like it will be hard work from what I've seen, and scope creep is a killer. Is there anything I can do to help this PR move along? I'd also like to see Arrow support :)
@MathMagique Please proceed to review. It is now at a stage where it is not yet performant, but it works and can be merged. C++ unit tests on OSX are failing due to a missing …
@xhochy Could you elaborate on your comment about "it is not performant"?
@xhochy I'll try to review this over the weekend.
@yaxxie There are some bits, like extra copies, in the code at the moment that can be done in a more efficient way. While this implementation is ready to be included in master, it should not be used for performance comparisons yet.
Codecov Report

```
@@            Coverage Diff            @@
##           master      #26     +/-   ##
=========================================
+ Coverage   97.49%   97.79%    +0.3%
=========================================
  Files         134      138       +4
  Lines        1994     2356     +362
=========================================
+ Hits         1944     2304     +360
- Misses         50       52       +2
```

Continue to review the full report at Codecov.
@yaxxie Also, I would like to make a follow-up PR soon that should be able to return Arrow dictionary arrays, which would then in turn produce pandas Categorical columns. This should show order-of-magnitude performance improvements for string results.
@xhochy if we move the dictionary encoding / hashing code from parquet-cpp into arrow, then we can create a typed dictionary array builder without too much effort
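To make the dictionary-array idea concrete, here is a small pyarrow sketch, written against a modern pyarrow API that may differ from the 0.4-era API under discussion:

```python
import pyarrow as pa

# A dictionary array stores small integer indices into a dictionary of
# distinct values instead of repeating each string, which is what makes
# string-heavy results so much cheaper to materialize.
indices = pa.array([0, 1, 0, 2, 1], type=pa.int32())
dictionary = pa.array(['apple', 'banana', 'cherry'])
fruit = pa.DictionaryArray.from_arrays(indices, dictionary)

table = pa.Table.from_arrays([fruit], names=['fruit'])
df = table.to_pandas()
print(df['fruit'].dtype)  # category
```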
MathMagique left a comment:
Please fix the indentation everywhere (I started moving the codebase to four spaces of indentation).
```diff
 install:
-  - pip install numpy==1.8.0 six twine pytest-cov coveralls
+  - pip install numpy==1.10.4 pyarrow==0.4.0 six twine pytest-cov coveralls
```
Why the move to numpy 1.10.4?
This is the minimal NumPy requirement for Arrow & Python 3 due to some bugs in NumPy macros. We can stay on a lower version for Python 2 if needed. But as e.g. conda packages are nowadays built against 1.11+, it would probably be OK to drop some older versions of NumPy now.
cpp/turbodbc_arrow/CMakeLists.txt (outdated):

```cmake
find_package(Boost REQUIRED COMPONENTS system)
include_directories(SYSTEM ${Boost_INCLUDE_DIRS})

#find_package(PythonLibs 2.7 REQUIRED)
```
If this is not required, please remove it
```cmake
target_link_libraries(turbodbc_arrow_support
    PUBLIC ${Boost_LIBRARIES}
    PUBLIC ${Odbc_LIBRARIES}
```
Indentation looks weird here
```cpp
case turbodbc::type_code::timestamp:
    return std::unique_ptr<TimestampBuilder>(new TimestampBuilder(default_memory_pool(), arrow::timestamp(TimeUnit::MICRO)));
case turbodbc::type_code::date:
    return std::unique_ptr<Date32Builder>(new Date32Builder(default_memory_pool()));
```
weird indentation
```cpp
{
    auto & sql_ts = *reinterpret_cast<SQL_TIMESTAMP_STRUCT const *>(data_pointer);
    long const microseconds = sql_ts.fraction / 1000;
    struct tm datetime = {0};
```
Again, some indentation issues. Recently, I reformatted the files I touched with four spaces of indentation instead of tabs.
```cpp
    ASSERT_EQ(field->nullable(), true);
}

TEST(ArrowResultSetTest, AllTypesSchemaConversion)
```
I'd prefer smaller tests for individual data types. You can use a helper function to keep the implementation simple.
I'll refactor some of the tests, but this one is quite comprehensive already. Splitting it up would probably not make the code any easier to understand.
```cpp
    ASSERT_TRUE(expected_table->Equals(*table));
}

TEST(ArrowResultSetTest, MultipleBatchMultipleColumnResultSetConversion)
```
The other tests are pretty long already. This one approaches 300 lines! Please, split it up into sensible subtests with sensible names. Use helper functions to extract common setup logic where it makes sense.
```python
import pytest

# Skip all parquet tests if we can't import pyarrow.parquet
pa = pytest.importorskip('pyarrow')
```
Is this assignment necessary?
Yes, with this statement the tests are skipped if pyarrow is not installed; this is needed e.g. for the unit tests on Windows.
This should work without the assignment part...
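For reference, pytest.importorskip both skips and returns the imported module, so both readings above are right: the skip works without the assignment, but the assignment is what gives the tests a handle on pyarrow. A minimal sketch:

```python
import pytest

# Skips every test in this module when pyarrow is unavailable; the return
# value is the imported module, so `pa` is usable in the tests below.
pa = pytest.importorskip('pyarrow')

def test_simple_table():
    table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])
    assert table.num_rows == 3
```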
```python
@for_each_database
@pyarrow
def test_arrow_int_column(dsn, configuration):
```
These tests are very similar to each other. Try to extract a common test function.
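One way the suggested extraction could look; this is a sketch only, and the helper name, the raw DDL, and the fetchallarrow call are assumptions rather than the suite's real fixtures:

```python
import pyarrow as pa

def _assert_arrow_roundtrip(cursor, sql_type, values, expected_arrow_type):
    # Shared body for the per-type tests: insert values into a one-column
    # table and compare the Arrow result set against expectations.
    # Table name, DDL, and fetchallarrow are assumed for illustration.
    cursor.execute("CREATE TABLE test_column (value {})".format(sql_type))
    cursor.executemany("INSERT INTO test_column VALUES (?)",
                       [(value,) for value in values])
    cursor.execute("SELECT value FROM test_column ORDER BY value")
    table = cursor.fetchallarrow()
    assert table.column(0).type == expected_arrow_type
    assert table.column(0).to_pylist() == sorted(values)

# Each per-type test then reduces to a single call, e.g.:
#   _assert_arrow_roundtrip(cursor, 'INTEGER', [17, 23, 42], pa.int64())
```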
```python
@for_each_database
@pyarrow
def test_arrow_binary_column_with_null(dsn, configuration):
```
binary could be confused with the BINARY SQL type
@wesm yes, that's what I just started working on in Arrow
@MathMagique I'll fix the indentation so that it looks OK-ish, and after this PR is merged I'll add …
Quite the beast of a pull request! Excellent work @xhochy, and many thanks! I plan a release in the near future with the new Arrow feature listed as being in "alpha" status. This way, we can collect feedback while users know that improvements will follow.
Fixes #24
This is great! It'd be great to get the conda-forge packages updated to make it easier for some users to kick the tires. I think this is the first usage of the pyarrow C++ API.