feat(csharp/src/Drivers): introduce drivers for Apache systems built on Thrift #1710

davidhcoe · 2024-04-08T15:11:55Z

This PR introduces new drivers built on the Thrift protocol: Hive, Impala, and Spark. The main focus has been on Spark (including Databricks) for the initial set of tests.


lidavidm · 2024-04-08T22:05:27Z

With regard to Spark, is there a particular reason to use Thrift over SparkConnect's gRPC?

Also -- to be completely honest -- I don't know that we'd noticed this API before :D. It doesn't look like a great fit for ADBC because it doesn't support either SQL or Substrait. Instead, it has its own Protobuf-based query plans.

I think one of the nodes in the plan is just an arbitrary SQL statement. #323 (comment)

lidavidm · 2024-04-08T22:06:14Z

I'd be interested in the relative performance. But given this works, I'd also like to figure out if we can package a standalone/AOT compiled version of this and ship a new Python package...

CurtHagenlocher · 2024-04-09T14:43:03Z

> I'd be interested in the relative performance. But given this works, I'd also like to figure out if we can package a standalone/AOT compiled version of this and ship a new Python package...

It looks like the current code can indeed be AOT-compiled, but the currently-checked-in "export C API" code in the base library is still experimental quality at best. I haven't had much incentive to go back and clean it up because the BigQuery dependencies don't currently work with AOT compilation so this will be the first C# implementation where AOT actually makes sense.

davidhcoe · 2024-04-10T17:06:13Z

well, that was a fun exercise to get the checks to pass. should be ready now.


CurtHagenlocher

Thanks, this is a good start! I think the single biggest piece of feedback I have is that it's hard to tell what's done/implemented and what's not yet done. This is a problem we also have for the initial C API export code, and I very much regret not having done a better job of documenting that when I implemented it rather than rediscovering it later. I think the best way to address this is with a README.md in the Apache directory which says more explicitly what still needs to be implemented.

It would also be good to mention explicitly that this only supports little-endian platforms, though I think that's actually true of the C# Arrow implementation in general.

I know the Impala functionality isn't tested yet, but has it been shown to work at all? If not, this too should be mentioned.

Once the code is checked in, we can turn the gaps into follow-on issues -- perhaps with a single omnibus issue listing the details -- and track their completion that way.

csharp/test/Drivers/Apache/Apache - Backup.Arrow.Adbc.Tests.Drivers.Apache.csproj

CurtHagenlocher · 2024-04-15T14:38:39Z

csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TBinaryColumn.cs

@@ -0,0 +1,237 @@
+/**


Consider adding a README.md to this directory indicating which files have been hand-edited and perhaps how the other files were generated. (If these were the files I originally generated, then I guess I'll have to remember... :/.)

Added information in the README about the manually maintained files.

CurtHagenlocher · 2024-04-15T14:45:39Z

csharp/src/Drivers/Apache/Thrift/SchemaParser.cs

+            for (int i = 0; i < thriftSchema.Columns.Count; i++)
+            {
+                TColumnDesc column = thriftSchema.Columns[i];
+                fields[i] = new Field(column.ColumnName, GetArrowType(column.TypeDesc.Types[0]), nullable: true /* ??? */);


The Thrift type doesn't include a nullable indicator?

There is no nullable metadata returned in the Thrift API call to GetResultSetMetadata. Will add a comment to clarify.

Added a code comment and documentation in README about possible inaccurate nullable indicator.
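
For readers following this thread, here is a minimal sketch of the "always nullable" fallback discussed above. `ColumnDesc` and `FieldSketch` are hypothetical stand-ins for the Thrift `TColumnDesc` and the Arrow `Field` used in the diff; they are not the driver's actual types, and the real mapping in `SchemaParser.cs` may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins so the sketch compiles on its own; the PR itself
// works with TColumnDesc (Thrift) and Field (Apache Arrow).
public sealed record ColumnDesc(string ColumnName, string TypeName);
public sealed record FieldSketch(string Name, string TypeName, bool Nullable);

public static class SchemaSketch
{
    // Converts column descriptors into field descriptions.
    public static FieldSketch[] ToFields(IReadOnlyList<ColumnDesc> columns)
    {
        var fields = new FieldSketch[columns.Count];
        for (int i = 0; i < columns.Count; i++)
        {
            // The result-set metadata carries no nullability indicator, so
            // every field is conservatively reported as nullable. This may be
            // inaccurate for columns that are declared NOT NULL on the server.
            fields[i] = new FieldSketch(columns[i].ColumnName, columns[i].TypeName, Nullable: true);
        }
        return fields;
    }
}
```

Reporting `true` is the safer default here: a consumer that expects nulls and never sees one still behaves correctly, while the opposite assumption could break on the first null value.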

CurtHagenlocher · 2024-04-15T14:47:25Z

csharp/src/Drivers/Apache/Thrift/SchemaParser.cs

+            {
+                return GetArrowType(thriftType.PrimitiveEntry);
+            }
+            throw new InvalidOperationException();


For gaps like "doesn't support structured types", please make sure there are issues filed in GitHub. For major gaps (this may be one) it would also be good to call them out in a README.

Added documentation in README on how structured interval data types are handled.

csharp/src/Drivers/Apache/Spark/SparkStatement.cs

CurtHagenlocher · 2024-04-15T14:56:29Z

csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs

+        protected abstract TProtocol CreateProtocol();
+        protected abstract TOpenSessionReq CreateSessionRequest();
+
+        public override IArrowArrayStream GetObjects(GetObjectsDepth depth, string catalogPattern, string dbSchemaPattern, string tableNamePattern, List<string> tableTypes, string columnNamePattern)


This appears to be quite incomplete. Consider removing it from this PR and submitting separately once it's complete and tested.

We can't remove it. HiveServer2 is the base class that Spark and Impala build on, but I will add details to the README.

I mean the body, or at least the parts of the body that aren't implemented. It could always throw a NotImplementedException for when e.g. depth != GetObjectsDepth.All.

Actually, removed the implementation here. It was not working and not useful as base functionality.
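
The guard suggested in this thread could look roughly like the sketch below. `DepthSketch` is a hypothetical mirror of the `GetObjectsDepth` enum, declared locally so the example is self-contained, and the string return value is only a placeholder for the real `IArrowArrayStream`. This is not the code that was merged; the PR ended up removing the base implementation instead.

```csharp
using System;

// Hypothetical mirror of GetObjectsDepth, declared here only so the sketch
// compiles without the ADBC library.
public enum DepthSketch { All, Catalogs, DbSchemas, Tables }

public static class GetObjectsSketch
{
    // Fails loudly for depths that are not implemented instead of returning
    // partial or misleading results.
    public static string GetObjects(DepthSketch depth)
    {
        if (depth != DepthSketch.All)
        {
            throw new NotImplementedException(
                $"GetObjects is not implemented for depth {depth}.");
        }

        // Placeholder for the real result stream.
        return "all-objects";
    }
}
```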

CurtHagenlocher · 2024-04-15T14:57:40Z

csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs

+            TGetOperationStatusResp statusResponse = null;
+            do
+            {
+                if (statusResponse != null) { Thread.Sleep(500); }


This is a good reminder for me that we really need a more async-friendly API, ideally a cross-process one.

Do you mean at the ADBC API definition level? CC @zeroshade who has been thinking about this. It's something I would like to tackle, but there are questions about compatibility and what happens to the sync API afterwards and if we also want to try to 'fix' other things at the same time.

I put some thoughts into #811 which is the only existing issue we have that sort of tracks async.

@CurtHagenlocher - I have some private WIP to use async as much as possible in this C# implementation. It covers end-to-end async support for ExecuteQueryAsync/ExecuteUpdateAsync. I've put it aside to work on the GetObjects implementation.
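
As an illustration of what a more async-friendly version of the polling loop in the diff might look like, here is a minimal sketch that swaps `Thread.Sleep` for `Task.Delay`. The `getStatus` delegate and `OperationStateSketch` enum are hypothetical stand-ins for the Thrift status call and its response; this is not the private WIP mentioned above, and it does not reflect any agreed ADBC async API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical operation states; the real driver inspects TGetOperationStatusResp.
public enum OperationStateSketch { Running, Finished }

public static class PollingSketch
{
    // Polls until the operation finishes and returns the number of status
    // calls made. Task.Delay yields the thread between polls, whereas
    // Thread.Sleep would block it for the whole interval.
    public static async Task<int> WaitForCompletionAsync(
        Func<CancellationToken, Task<OperationStateSketch>> getStatus,
        TimeSpan pollInterval,
        CancellationToken cancellationToken = default)
    {
        int polls = 0;
        while (true)
        {
            OperationStateSketch state = await getStatus(cancellationToken).ConfigureAwait(false);
            polls++;
            if (state == OperationStateSketch.Finished)
            {
                return polls;
            }

            // Throws OperationCanceledException if cancellation is requested
            // while waiting, so callers can abandon a long-running query.
            await Task.Delay(pollInterval, cancellationToken).ConfigureAwait(false);
        }
    }
}
```

Passing a `CancellationToken` through both the status call and the delay is one way to let a caller abandon a long-running query, which the blocking loop cannot do.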


birschick-bq · 2024-04-16T20:13:04Z

> Thanks, this is a good start! I think the single biggest piece of feedback I have is that it's hard to tell what's done/implemented and what's not yet done. This is a problem we also have for the initial C API export code, and I very much regret not having done a better job of documenting that when I implemented it rather than rediscovering it later. I think the best way to address this is with a README.md in the Apache directory which says more explicitly what still needs to be implemented.
>
> It would also be good to mention explicitly that this only supports little-endian platforms, though I think that's actually true of the C# Arrow implementation in general.
>
> I know the Impala functionality isn't tested yet, but has it been shown to work at all? If not, this too should be mentioned.
>
> Once the code is checked in, we can turn the gaps into follow-on issues -- perhaps with a single omnibus issue listing the details -- and track their completion that way.

@CurtHagenlocher - I believe we've responded to all your comments, so far. Let us know if there is anything else you think we should address in this PR. Thanks!


CurtHagenlocher

Okay, I think that we'll check this in as-is and then I'll start filing followup issues.

…on Thrift (apache#1710)

This PR introduces new drivers built on the Thrift protocol: Hive, Impala, and Spark. The main focus has been on Spark (including Databricks) for the initial set of tests.

---------

Co-authored-by: David Coe <coedavid@umich.edu>
Co-authored-by: vikrantpuppala <vikrant.puppala@databricks.com>
Co-authored-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Vikrant Puppala <vikrantpuppala@gmail.com>
Co-authored-by: Gopal Lal <135012033+gopalldb@users.noreply.github.com>
Co-authored-by: Jade Wang <111902719+jadewang-db@users.noreply.github.com>
Co-authored-by: yunbodeng-db <104732431+yunbodeng-db@users.noreply.github.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Bruce Irschick <bruce.irschick@improving.com>

David Coe and others added 30 commits August 22, 2023 11:49

add Apache drivers; re-organize Flight SQL

b479cd0

working on getobjects

0116e21

merge

e14559f

Merge ssh://github.com/davidhcoe/arrow-adbc into dev/introdrivers

f4899bf

Merge branch 'apache:main' into dev/introdrivers

0cc6ee6

Merge branch 'dev/introdrivers' of ssh://github.com/davidhcoe/arrow-a…

2757e5c

…dbc into dev/introdrivers

include Apache drivers

e5af9f6

Merge ssh://github.com/davidhcoe/arrow-adbc into dev/introdrivers

ac9187f

update after latest merge

fb7035a

Merge ssh://github.com/davidhcoe/arrow-adbc into dev/introdrivers

45fd0c7

update to latest

9084a13

updating tests

666d012

Added implementation for table schema

6652330

use columns for data

df4fdec

Adding getObjects impl

56ac604

Merge pull request #1 from gopalldb/vp-hack

964d279

Added implementation for table schema in SparkConnection

Merge conflicts

96448f7

merge conflicts

d2fb852

Merge pull request #2 from gopalldb/list-objects

3a3a224

Adding getObjects impl

Adding back vikrant's changes

b927c31

Merge pull request #3 from gopalldb/merge2

20e81e5

Adding back vikrant's changes

Fixing compile error

8f0803f

Merge pull request #4 from gopalldb/merge2

d0e8158

Fixing compile error

make unit test working

4313cc2

Add databricks.md for testing from Yunbo

5701c04

Merge branch 'dev/apache-drivers' of github.com:gopalldb/arrow-adbc i…

ea46162

…nto dev/apache-drivers

Merge branch 'apache:main' into dev/apache-drivers

b34ff8a

Revert "Merge branch 'dev/apache-drivers' of github.com:gopalldb/arro…

f7ca693

…w-adbc into dev/apache-drivers" This reverts commit ea46162, reversing changes made to 5701c04.

Revert "Revert "Merge branch 'dev/apache-drivers' of github.com:gopal…

1949057

…ldb/arrow-adbc into dev/apache-drivers"" This reverts commit f7ca693.

initial skeleton for cloud fetch

84b39b6

davidhcoe changed the title ~~feat(csharp/src/Drivers): introduce drivers for Apache systems built on Thrift~~ feat(csharp/src/Drivers): introduce drivers for Apache systems built on Thrift Apr 8, 2024

Merge ssh://github.com/davidhcoe/arrow-adbc into dev/apache-drivers

60de1c0

David Coe added 2 commits April 9, 2024 23:34

attempting to fix PR check in issues

cc985a3

fix line endings on notice, license

82e5dab

birschick-bq and others added 9 commits April 10, 2024 15:02

WIP: Complex types are not working as expected, yet.

19e8c8f

Add tests for * Binary/Boolean * Date/Timestamp/Interval * Complex Types (ARRAY/MAP/STRUCT)

Merge branch 'dev/apache-drivers' into dev/birschick-bq/timestamp-val…

cbc524d

…ue-tests

Corrected line endings.

f2cdd8e

Corrected line endings.

15b7dae

Added tests for string/character values

1618429

* Corrected handling of null and double/float values

8eda11f

* Removed tests for complex types. * Added tests for long and null.

corrected line endings.

c6c5407

Set option to return string for complex types

f3d97f6

Added tests for complex types

Merge pull request #10 from davidhcoe/dev/birschick-bq/timestamp-valu…

f39a597

…e-tests test(csharp/test/Drivers/Apache/Spark): more coverage for value tests

CurtHagenlocher requested changes Apr 15, 2024


David Coe and others added 6 commits April 16, 2024 09:56

Merge ssh://github.com/davidhcoe/arrow-adbc into dev/apache-drivers

2eb1142

PR feedback

a118de1

add more details to readme

7fa0a16

feat(csharp/src/Drivers/Apache): code review improvements (#12)

0e7aa50

* Code Review Improvements * Fixed line ending.

Document unsupported Impala driver.

58d2d23

Added comment for supporting only little-endian platforms.

8373c96

Remove implementation of HiveServer2Connection.GetObjects.

226343a

Update README

CurtHagenlocher approved these changes Apr 17, 2024


CurtHagenlocher merged commit f589719 into apache:main Apr 17, 2024
5 checks passed
