Add Apache Arrow result set support as turbodbc_arrow #26
Conversation
@MathMagique I've added in the latest commit C-only (i.e. boost-free) routines to translate SQL datetimes/dates to ms/us. Might be interesting for other parts of …

Those routines sure are interesting - another step to get rid of boost :-)
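For orientation, here is a minimal Python sketch of what such routines compute; the PR's real code is C++, and the field semantics assume ODBC's SQL_TIMESTAMP_STRUCT, whose fraction field is in nanoseconds:

```python
from datetime import date, datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def timestamp_to_microseconds(year, month, day, hour, minute, second, fraction):
    # ODBC's SQL_TIMESTAMP_STRUCT stores `fraction` in nanoseconds;
    # dividing by 1000 yields the microsecond resolution used here.
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int((dt - EPOCH).total_seconds()) * 1000000 + fraction // 1000

def date_to_days(year, month, day):
    # Arrow's date32 counts whole days since the UNIX epoch.
    return (date(year, month, day) - date(1970, 1, 1)).days
```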
Force-pushed 75b5ee0 to 994af0e.
@MathMagique one problem I face here is that …

The Python library is not necessarily called "libpython" on each system. Pybind11's cmake scripts will expose a …

Either we go ahead and use …
```cpp
struct tm date = {0};
date.tm_year = sql_date.year - 1900;
date.tm_mon = sql_date.month - 1;
date.tm_mday = sql_date.day;
```
Should this not also have -1, since the definition for std::tm is "since 1900-01-01"?
I understand why this is wrong now.
Just out of curiosity, does this implement a fetchbatch or fetchmany method?
At the moment it only implements a fetchall method, but adding fetchbatch/fetchmany wouldn't be that hard. First I would like to get the branch merged :)
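For context, a hedged sketch of how the fetchall-style Arrow path might look from Python; the method name fetchallarrow, the DSN, and the table are illustrative assumptions, not confirmed by this thread:

```python
import turbodbc

connection = turbodbc.connect(dsn='MyDSN')  # placeholder DSN
cursor = connection.cursor()
cursor.execute("SELECT a, b FROM my_table")

# Fetch the complete result set as one pyarrow.Table; a future
# fetchbatch/fetchmany variant would yield one batch at a time instead.
table = cursor.fetchallarrow()  # method name assumed for illustration
df = table.to_pandas()
```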
Before it diverges too much ;)
wesm left a comment:
This will be awesome to have; let me know if there's anything I can do to help. @Maxris may be able to help with MSVC packaging + conda-forge builds as soon as this is merged
```cpp
// Milliseconds since the epoch
return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000);

// Days since the epoch
return lrint(difftime(mktime(&date), mktime(&epoch)) / 86400);
```
Take note of what was required for MSVC compatibility: https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/util/datetime.h#L35
I didn't take a look at the code, but are you able to comment on whether the datatypes coming out will closely correspond with those fed in? E.g. in turbodbc_numpy all ints are coerced to int8s, which is extremely space-inefficient for a series of byteints.
Is byteint different from int8?
My bad, I meant int64
We should use the exact C types for Arrow
@yalwan-scee Currently, the internal API of turbodbc is only capable of reporting 64-bit integers. That is not a fundamental issue and is easily fixed; however, it requires extra work. I'd like to have a first increment of the Arrow support before I extend anything internally.
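To illustrate what "exact C types" would mean here (a hypothetical mapping for illustration, not turbodbc's actual internal table): narrower SQL integer types would map to correspondingly narrow Arrow types instead of all being widened to int64.

```python
import pyarrow as pa

# Hypothetical mapping from SQL integer types to exact-width Arrow types;
# today's internals would report the equivalent of pa.int64() for all of these.
EXACT_INTEGER_TYPES = {
    'TINYINT': pa.int8(),
    'SMALLINT': pa.int16(),
    'INTEGER': pa.int32(),
    'BIGINT': pa.int64(),
}
```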
Of course, it doesn't look like it will be hard work from what I've seen, and scope creep is a killer. Is there anything I can do to help this PR move along? I'd also like to see Arrow support :)
@MathMagique Please proceed to review. It is now at a stage where it is not yet performant, but it works and can be merged. C++ unit tests on OSX are failing due to a missing …
@xhochy Could you elaborate on your comment about "it is not performant"?
@xhochy I'll try to review this over the weekend.
@yaxxie There are some bits, like extra copies, in the code at the moment that can be done in a more efficient way. While this implementation is ready to be included in master, it should not be used for performance comparisons yet.
Codecov Report

```
@@            Coverage Diff            @@
##           master      #26     +/-   ##
=========================================
+ Coverage   97.49%   97.79%    +0.3%
=========================================
  Files         134      138       +4
  Lines        1994     2356     +362
=========================================
+ Hits         1944     2304     +360
- Misses         50       52       +2
```

Continue to review the full report at Codecov.
@yaxxie Also, I would like to make a follow-up PR soon that should be able to return Arrow dictionary arrays, which would then in turn produce pandas Categorical columns. This should show order-of-magnitude performance improvements for string results.
@xhochy if we move the dictionary encoding / hashing code from parquet-cpp into arrow, then we can create a typed dictionary array builder without too much effort
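To make the dictionary-array idea concrete, here is a small pyarrow sketch, written against a modern pyarrow API that may differ from the 0.4-era API under discussion:

```python
import pyarrow as pa

# A dictionary array stores small integer indices into a dictionary of
# distinct values instead of repeating each string, which is what makes
# string-heavy results so much cheaper to materialize.
indices = pa.array([0, 1, 0, 2, 1], type=pa.int32())
dictionary = pa.array(['apple', 'banana', 'cherry'])
fruit = pa.DictionaryArray.from_arrays(indices, dictionary)

table = pa.Table.from_arrays([fruit], names=['fruit'])
df = table.to_pandas()
print(df['fruit'].dtype)  # category
```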
MathMagique left a comment:
Please fix the indentation everywhere (I started moving the codebase to four spaces of indentation).
```diff
 install:
-  - pip install numpy==1.8.0 six twine pytest-cov coveralls
+  - pip install numpy==1.10.4 pyarrow==0.4.0 six twine pytest-cov coveralls
```
Why the move to numpy 1.10.4?
This is the minimal NumPy requirement for Arrow & Python 3 due to some bugs in NumPy macros. We can stay on a lower version for Python 2 if needed. But as e.g. conda packages are nowadays built against 1.11+, it would probably be OK to drop some older versions of NumPy now.
cpp/turbodbc_arrow/CMakeLists.txt (outdated):

```cmake
find_package(Boost REQUIRED COMPONENTS system)
include_directories(SYSTEM ${Boost_INCLUDE_DIRS})

#find_package(PythonLibs 2.7 REQUIRED)
```
If this is not required, please remove it
```cmake
target_link_libraries(turbodbc_arrow_support
    PUBLIC ${Boost_LIBRARIES}
    PUBLIC ${Odbc_LIBRARIES}
```
Indentation looks weird here
```cpp
case turbodbc::type_code::timestamp:
    return std::unique_ptr<TimestampBuilder>(new TimestampBuilder(default_memory_pool(), arrow::timestamp(TimeUnit::MICRO)));
case turbodbc::type_code::date:
    return std::unique_ptr<Date32Builder>(new Date32Builder(default_memory_pool()));
```
weird indentation
```cpp
{
    auto & sql_ts = *reinterpret_cast<SQL_TIMESTAMP_STRUCT const *>(data_pointer);
    long const microseconds = sql_ts.fraction / 1000;
    struct tm datetime = {0};
```
Again, some indentation issues. Recently, I reformatted the files I touched with four spaces of indentation instead of tabs.
```cpp
    ASSERT_EQ(field->nullable(), true);
}

TEST(ArrowResultSetTest, AllTypesSchemaConversion)
```
I'd prefer smaller tests for individual data types. You can use a helper function to keep the implementation simple.
I'll refactor some of the tests, but this one is quite comprehensive already. Splitting it up would probably not make the code any easier to understand.
```cpp
    ASSERT_TRUE(expected_table->Equals(*table));
}

TEST(ArrowResultSetTest, MultipleBatchMultipleColumnResultSetConversion)
```
The other tests are pretty long already. This one approaches 300 lines! Please, split it up into sensible subtests with sensible names. Use helper functions to extract common setup logic where it makes sense.
```python
import pytest

# Skip all parquet tests if we can't import pyarrow.parquet
pa = pytest.importorskip('pyarrow')
```
Is this assignment necessary?
Yes, with this statement the tests are skipped if pyarrow is not installed; this is needed e.g. for the unit tests on Windows.
This should work without the assignment part...
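For reference, pytest.importorskip both skips and returns the imported module, so both readings above are right: the skip works without the assignment, but the assignment is what gives the tests a handle on pyarrow. A minimal sketch:

```python
import pytest

# Skips every test in this module when pyarrow is unavailable; the return
# value is the imported module, so `pa` is usable in the tests below.
pa = pytest.importorskip('pyarrow')

def test_simple_table():
    table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['x'])
    assert table.num_rows == 3
```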
```python
@for_each_database
@pyarrow
def test_arrow_int_column(dsn, configuration):
```
These tests are very similar to each other. Try to extract a common test function.
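One way the suggested extraction could look; this is a sketch only, and the helper name, the raw DDL, and the fetchallarrow call are assumptions rather than the suite's real fixtures:

```python
import pyarrow as pa

def _assert_arrow_roundtrip(cursor, sql_type, values, expected_arrow_type):
    # Shared body for the per-type tests: insert values into a one-column
    # table and compare the Arrow result set against expectations.
    # Table name, DDL, and fetchallarrow are assumed for illustration.
    cursor.execute("CREATE TABLE test_column (value {})".format(sql_type))
    cursor.executemany("INSERT INTO test_column VALUES (?)",
                       [(value,) for value in values])
    cursor.execute("SELECT value FROM test_column ORDER BY value")
    table = cursor.fetchallarrow()
    assert table.column(0).type == expected_arrow_type
    assert table.column(0).to_pylist() == sorted(values)

# Each per-type test then reduces to a single call, e.g.:
#   _assert_arrow_roundtrip(cursor, 'INTEGER', [17, 23, 42], pa.int64())
```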
```python
@for_each_database
@pyarrow
def test_arrow_binary_column_with_null(dsn, configuration):
```
binary could be confused with the BINARY SQL type
@wesm yes, that's what I just started working on in Arrow
@MathMagique I'll fix the indentation so that it looks OK-ish, and after this PR is merged I'll add …
Quite the beast of a pull request! Excellent work @xhochy, and many thanks! I plan a release in the near future with the new Arrow feature listed as being in "alpha" status. This way, we can collect feedback while users know that improvements will follow.
Fixes #24
This is great! It'd be great to get the conda-forge packages updated to make it easier for some users to kick the tires. I think this is the first usage of the pyarrow C++ API.