Conversation

@xhochy (Collaborator) commented Dec 12, 2016

No description provided.

@xhochy (Collaborator, Author) commented Jan 5, 2017

@MathMagique I've added in the latest commit C-only (i.e. boost-free) routines to translate SQL datetimes/dates to milliseconds/microseconds. Might be interesting for other parts of turbodbc.
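
[Editor's note: a minimal sketch of the boost-free approach, with an illustrative function name; the committed routines may differ in naming and structure.]

#include <ctime>
#include <sql.h>  // SQL_TIMESTAMP_STRUCT from the ODBC headers

// Illustrative sketch: convert an ODBC timestamp to microseconds since the
// Unix epoch without boost.
long long timestamp_to_microseconds(SQL_TIMESTAMP_STRUCT const& sql_ts)
{
    std::tm datetime = {};
    datetime.tm_year = sql_ts.year - 1900;  // tm_year counts years since 1900
    datetime.tm_mon = sql_ts.month - 1;     // tm_mon is 0-based (0 = January)
    datetime.tm_mday = sql_ts.day;          // tm_mday is 1-based, so no offset
    datetime.tm_hour = sql_ts.hour;
    datetime.tm_min = sql_ts.minute;
    datetime.tm_sec = sql_ts.second;

    std::tm epoch = {};
    epoch.tm_year = 70;  // 1970
    epoch.tm_mday = 1;

    // Both values pass through mktime (local time), so the timezone offset
    // cancels out in the difference.
    double const seconds = std::difftime(std::mktime(&datetime), std::mktime(&epoch));
    // SQL_TIMESTAMP_STRUCT::fraction is in nanoseconds.
    return static_cast<long long>(seconds) * 1000000LL + sql_ts.fraction / 1000;
}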

@MathMagique (Member)

Those routines sure are interesting - another step to get rid of boost :-)

@xhochy (Collaborator, Author) commented Feb 17, 2017

@MathMagique one problem I face here is that turbodbc_arrow_unit_test depends on linking against libpython. I suspect that this is not available on all platforms. Any suggestions on how to deal with that?

@MathMagique (Member)

The Python library is not necessarily called "libpython" on every system. Pybind11's CMake scripts expose a ${PYTHON_LIBRARY} variable you should be able to use:
https://github.com/pybind/pybind11/blob/f3de2d5521c3e95e3212b7c5dc26d9d4e23b17fb/tools/FindPythonLibsNew.cmake#L166

@xhochy (Collaborator, Author) commented Feb 18, 2017

Either we go ahead and use conda in the Travis setup to install Arrow, or this PR will be blocked for the moment by https://issues.apache.org/jira/browse/ARROW-566. It is definitely possible to build wheels that satisfy the requirements for a pip-based turbodbc_arrow, but it is not as simple as with conda.

struct tm date = {0};
date.tm_year = sql_date.year - 1900;
date.tm_mon = sql_date.month - 1;
date.tm_mday = sql_date.day;
Contributor:

Should this not also have a -1, since the definition for std::tm is "since 1900-01-01"?

Contributor:

I understand why this is wrong now.

@yalwan-scee

Just out of curiosity, does this implement a fetchbatch or fetchmany method?

@xhochy (Collaborator, Author) commented Mar 23, 2017

At the moment it only implements a fetchall method, but adding fetchbatch/fetchmany wouldn't be that hard. First I would like to get the branch merged :)

@yalwan-scee

Before it diverges too much ;)

@wesm (Contributor) left a review:

This will be awesome to have; let me know if there's anything I can do to help. @Maxris may be able to help with MSVC packaging + conda-forge builds as soon as this is merged

// Milliseconds since the epoch
return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000);
// days since the epoch
return lrint(difftime(mktime(&date), mktime(&epoch)) / 86400);

Member:

@wesm The master branch already supports MSVC packaging (Python 3.5+) and @xhochy has created turbodbc, so I hope there is not too much to do on the packaging side.

@yaxxie (Contributor) commented May 16, 2017

I didn't take a look at the code, but are you able to comment on whether the datatypes coming out will closely correspond with those fed in? E.g. in turbodbc_numpy all ints are coerced to int8s, which is extremely space-inefficient for a series of byteints.

@wesm (Contributor) commented May 16, 2017

Is byteint different from int8?

@yaxxie (Contributor) commented May 16, 2017

My bad, I meant int64

@wesm (Contributor) commented May 16, 2017

We should use the exact C types for Arrow

@MathMagique (Member)

@yalwan-scee Currently, the internal API of turbodbc is only capable of reporting 64-bit integers. That is not a fundamental issue and is easily fixed, but it requires extra work. I'd like to have a first increment of the Arrow support before I extend anything internally.
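
[Editor's note: a hypothetical illustration of the exact-type mapping being discussed, not the PR's code. With per-type builders, a TINYINT column could become an Arrow int8 array instead of the int64 the current internal API forces.]

#include <arrow/array.h>
#include <arrow/builder.h>
#include <memory>

// Sketch: with exact type information, a TINYINT column becomes an Arrow
// int8 array (1 byte per value) instead of int64 (8 bytes per value).
std::shared_ptr<arrow::Array> build_int8_column()
{
    arrow::Int8Builder builder;
    (void) builder.Append(1);
    (void) builder.Append(2);
    std::shared_ptr<arrow::Array> array;
    (void) builder.Finish(&array);
    return array;
}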

@yaxxie (Contributor) commented May 24, 2017

Of course, it doesn't look like it will be hard work from what I've seen, and scope creep is a killer.

Is there anything I can do to help this PR move along? I'd also like to see arrow support :)

@xhochy (Collaborator, Author) commented May 26, 2017

@MathMagique Please proceed to review. It is now at a stage where it is not performant, but it works and can be merged.

C++ unit tests on OSX are failing due to a missing libpython.dylib. We either have the option to always compile a fresh Python from source or to skip these tests for now. As they are covered by the Python unit tests anyway, I would vote for skipping them on OSX.

@yaxxie (Contributor) commented May 26, 2017

@xhochy Could you elaborate on your comment about "it is not performant"?

@MathMagique (Member)

@xhochy I'll try to review this over the weekend.

@xhochy (Collaborator, Author) commented May 27, 2017

@yaxxie There are some bits in the code at the moment, like extra copies, that can be done in a more efficient way. While this implementation is ready to be included in master, it should not be used for performance comparisons yet.

@codecov-io commented May 27, 2017

Codecov Report

Merging #26 into master will increase coverage by 0.3%.
The diff coverage is 99.46%.


@@            Coverage Diff            @@
##           master      #26     +/-   ##
=========================================
+ Coverage   97.49%   97.79%   +0.3%     
=========================================
  Files         134      138      +4     
  Lines        1994     2356    +362     
=========================================
+ Hits         1944     2304    +360     
- Misses         50       52      +2
Impacted Files Coverage Δ
cpp/turbodbc/Library/src/time_helpers.cpp 100% <100%> (ø)
cpp/turbodbc_numpy/Library/src/datetime_column.cpp 100% <100%> (ø) ⬆️
cpp/turbodbc_arrow/Library/src/python_bindings.cpp 100% <100%> (ø)
...urbodbc_arrow/Test/tests/arrow_result_set_test.cpp 100% <100%> (ø)
python/turbodbc/cursor.py 100% <100%> (ø) ⬆️
...pp/turbodbc_arrow/Library/src/arrow_result_set.cpp 98.18% <98.18%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 082d85c...1c1c87f.

@xhochy (Collaborator, Author) commented May 27, 2017

@yaxxie Also, I would like to make a follow-up PR soon that should be able to return Arrow dictionary arrays, which would then in turn produce Pandas Categorical columns. This should yield order-of-magnitude performance improvements for string results.

@xhochy changed the title from "WIP: turbodbc_arrow" to "Add Apache Arrow result set support as turbodbc_arrow" on May 27, 2017
@wesm (Contributor) commented May 27, 2017

@xhochy if we move the dictionary encoding / hashing code from parquet-cpp into arrow, then we can create a typed dictionary array builder without too much effort
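
[Editor's note: for context, a rough sketch of what a typed dictionary array builder looks like. This uses a builder API from Arrow releases later than the 0.4-era library this PR targets, so treat it as illustrative, not as the code under discussion.]

#include <arrow/array.h>
#include <arrow/builder.h>
#include <memory>

// Sketch: repeated string values are hash-encoded once in the dictionary and
// referenced by small integer indices, which is what makes the Categorical
// conversion mentioned above cheap.
std::shared_ptr<arrow::Array> build_dictionary_column()
{
    arrow::StringDictionaryBuilder builder;
    (void) builder.Append("apple");
    (void) builder.Append("banana");
    (void) builder.Append("apple");  // reuses the existing dictionary entry
    std::shared_ptr<arrow::Array> array;
    (void) builder.Finish(&array);
    return array;
}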

@MathMagique (Member) left a review:

Please fix the indentation everywhere (I started moving the codebase to four spaces of indentation).

install:
- pip install numpy==1.8.0 six twine pytest-cov coveralls
- pip install numpy==1.10.4 pyarrow==0.4.0 six twine pytest-cov coveralls
Member:

Why the move to numpy 1.10.4?

@xhochy (Collaborator, Author):

This is the minimal NumPy requirement for Arrow & Python 3 due to some bugs in NumPy macros. We can stay on a lower version for Python 2 if needed. But as conda packages, for example, are nowadays built against 1.11+, it would probably be OK to drop some older versions of NumPy now.

find_package(Boost REQUIRED COMPONENTS system)
include_directories(SYSTEM ${Boost_INCLUDE_DIRS})

#find_package(PythonLibs 2.7 REQUIRED)
Member:

If this is not required, please remove it


target_link_libraries(turbodbc_arrow_support
PUBLIC ${Boost_LIBRARIES}
PUBLIC ${Odbc_LIBRARIES}
Member:

Indentation looks weird here

case turbodbc::type_code::timestamp:
return std::unique_ptr<TimestampBuilder>(new TimestampBuilder(default_memory_pool(), arrow::timestamp(TimeUnit::MICRO)));
case turbodbc::type_code::date:
return std::unique_ptr<Date32Builder>(new Date32Builder(default_memory_pool()));
Member:

weird indentation

{
auto & sql_ts = *reinterpret_cast<SQL_TIMESTAMP_STRUCT const *>(data_pointer);
long const microseconds = sql_ts.fraction / 1000;
struct tm datetime = {0};
Member:

Again, some indentation issues. Recently, I reformatted the files I touched with four spaces of indentation instead of tabs.

ASSERT_EQ(field->nullable(), true);
}

TEST(ArrowResultSetTest, AllTypesSchemaConversion)
Member:

I'd prefer smaller tests for individual data types. You can use a helper function to keep the implementation simple.

@xhochy (Collaborator, Author):

I'll refactor some of the tests, but this one is quite comprehensive already. Splitting it up would probably not produce more understandable code.
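
[Editor's note: a sketch of the helper-function style being discussed. All names here are hypothetical; make_arrow_field_for stands in for whatever conversion function the result set actually exposes, and the turbodbc header path is assumed.]

#include <gtest/gtest.h>
#include <arrow/type.h>
#include <turbodbc/type_code.h>  // assumed header for turbodbc::type_code
#include <memory>

// Hypothetical conversion under test; the PR's actual function may be named
// and scoped differently.
std::shared_ptr<arrow::Field> make_arrow_field_for(turbodbc::type_code code);

// Shared helper keeps each per-type test to a single line.
void expect_field_type(turbodbc::type_code code,
                       std::shared_ptr<arrow::DataType> const& expected)
{
    auto field = make_arrow_field_for(code);
    ASSERT_TRUE(field->type()->Equals(expected)) << "expected " << expected->ToString();
}

TEST(ArrowResultSetTest, TimestampSchemaConversion)
{
    expect_field_type(turbodbc::type_code::timestamp,
                      arrow::timestamp(arrow::TimeUnit::MICRO));
}

TEST(ArrowResultSetTest, DateSchemaConversion)
{
    expect_field_type(turbodbc::type_code::date, arrow::date32());
}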

ASSERT_TRUE(expected_table->Equals(*table));
}

TEST(ArrowResultSetTest, MultipleBatchMultipleColumnResultSetConversion)
Member:

The other tests are pretty long already. This one approaches 300 lines! Please split it up into sensible subtests with sensible names, and use helper functions to extract common setup logic where it makes sense.

import pytest

# Skip all Arrow tests if we can't import pyarrow
pa = pytest.importorskip('pyarrow')
Member:

is this assignment necessary?

@xhochy (Collaborator, Author):

Yes, with this statement the tests are skipped if pyarrow is not installed; this is needed e.g. for the unit tests on Windows.

Member:

This should work without the assignment part...


@for_each_database
@pyarrow
def test_arrow_int_column(dsn, configuration):
Member:

These tests are very similar to each other. Try to extract a common test function.


@for_each_database
@pyarrow
def test_arrow_binary_column_with_null(dsn, configuration):
Member:

binary could be confused with the BINARY SQL type

@xhochy (Collaborator, Author) commented May 29, 2017

@wesm yes, that's what I just started working on in Arrow

@xhochy (Collaborator, Author) commented May 29, 2017

@MathMagique I'll fix the indentation so that it looks OK-ish, and after this PR is merged I'll add clang-format to the build toolchain so that it's easier to keep this in sync in the future.

@MathMagique merged commit b08aa1c into blue-yonder:master on Jun 2, 2017
@MathMagique (Member)

Quite the beast of a pull request! Excellent work @xhochy, and many thanks! I plan a release in the near future with the new Arrow feature listed as being in "alpha" status. This way, we can collect feedback while users know that improvements will follow.

@MathMagique (Member)

Fixes #24

@wesm (Contributor) commented Jun 2, 2017

This is great! It'd be great to get the conda-forge packages updated to make it easier for some users to kick the tires. I think this is the first usage of the pyarrow C++ API

@xhochy deleted the turbodbc_arrow branch on June 2, 2017 17:25