
feat(csharp/src/Drivers): introduce drivers for Apache systems built on Thrift #1710

Merged
merged 98 commits into from
Apr 17, 2024

Conversation

davidhcoe
Contributor

This PR introduces new drivers built on the Thrift protocol: Hive, Impala, and Spark. The main focus has been on Spark (including Databricks) for the initial set of tests.

David Coe and others added 30 commits August 22, 2023 11:49
Added implementation for table schema in SparkConnection
Adding back vikrant's changes
…w-adbc into dev/apache-drivers"

This reverts commit ea46162, reversing
changes made to 5701c04.
…ldb/arrow-adbc into dev/apache-drivers""

This reverts commit f7ca693.
@davidhcoe changed the title to feat(csharp/src/Drivers): introduce drivers for Apache systems built on Thrift Apr 8, 2024
@lidavidm
Member

lidavidm commented Apr 8, 2024

Regarding Spark, is there a particular reason to use Thrift over Spark Connect's gRPC?

Also -- to be completely honest -- I don't know that we'd noticed this API before :D. It doesn't look like a great fit for ADBC because it doesn't support either SQL or Substrait; instead, it has its own Protobuf-based query plans.

I think one of the nodes in the plan is just an arbitrary SQL statement. #323 (comment)

@lidavidm
Member

lidavidm commented Apr 8, 2024

I'd be interested in the relative performance. But given this works, I'd also like to figure out if we can package a standalone/AOT compiled version of this and ship a new Python package...

@CurtHagenlocher
Contributor

I'd be interested in the relative performance. But given this works, I'd also like to figure out if we can package a standalone/AOT compiled version of this and ship a new Python package...

It looks like the current code can indeed be AOT-compiled, but the currently-checked-in "export C API" code in the base library is still experimental quality at best. I haven't had much incentive to go back and clean it up because the BigQuery dependencies don't currently work with AOT compilation so this will be the first C# implementation where AOT actually makes sense.

@davidhcoe
Contributor Author

Well, that was a fun exercise to get the checks to pass. Should be ready now.

birschick-bq and others added 9 commits April 10, 2024 15:02
Add tests for
* Binary/Boolean
* Date/Timestamp/Interval
* Complex Types (ARRAY/MAP/STRUCT)
* Removed tests for complex types.
* Added tests for long and null.
Added tests for complex types
…e-tests

test(csharp/test/Drivers/Apache/Spark): more coverage for value tests
Contributor

@CurtHagenlocher CurtHagenlocher left a comment


Thanks, this is a good start! I think the single biggest piece of feedback I have is that it's hard to tell what's done/implemented and what's not yet done. This is a problem we also have for the initial C API export code, and I very much regret not having done a better job of documenting that when I implemented it rather than rediscovering it later. I think the best way to address this is with a README.md in the Apache directory which says more explicitly what still needs to be implemented.

It would also be good to mention explicitly that this only supports little-endian platforms, though I think that's actually true of the C# Arrow implementation in general.

I know the Impala functionality isn't tested yet, but has it been shown to work at all? If not, this too should be mentioned.

Once the code is checked in, we can turn the gaps into follow-on issues -- perhaps with a single omnibus issue listing the details -- and track their completion that way.

@@ -0,0 +1,237 @@
/**
Contributor


Consider adding a README.md to this directory indicating which files have been hand-edited and perhaps how the other files were generated. (If these were the files I originally generated, then I guess I'll have to remember... :/.)

Contributor


Added information in the README about the manually maintained files.

for (int i = 0; i < thriftSchema.Columns.Count; i++)
{
TColumnDesc column = thriftSchema.Columns[i];
fields[i] = new Field(column.ColumnName, GetArrowType(column.TypeDesc.Types[0]), nullable: true /* ??? */);
Contributor


The Thrift type doesn't include a nullable indicator?

Contributor

@birschick-bq birschick-bq Apr 16, 2024


There is no nullable metadata returned in the Thrift API call to GetResultSetMetadata. Will add a comment to clarify.

Contributor


Added a code comment and documentation in README about possible inaccurate nullable indicator.

{
return GetArrowType(thriftType.PrimitiveEntry);
}
throw new InvalidOperationException();
Contributor


For gaps like "doesn't support structured types", please make sure there are issues filed in GitHub. For major gaps (this may be one) it would also be good to call them out in a README.

Contributor


Added documentation in README on how structured interval data types are handled.

protected abstract TProtocol CreateProtocol();
protected abstract TOpenSessionReq CreateSessionRequest();

public override IArrowArrayStream GetObjects(GetObjectsDepth depth, string catalogPattern, string dbSchemaPattern, string tableNamePattern, List<string> tableTypes, string columnNamePattern)
Contributor


This appears to be quite incomplete. Consider removing it from this PR and submitting separately once it's complete and tested.

Contributor Author


We can't remove it. HiveServer2 is the base class that Spark and Impala build on, but I will add details to the README.

Contributor


I mean the body, or at least the parts of the body that aren't implemented. It could always throw a NotImplementedException for when e.g. depth != GetObjectsDepth.All.
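To make the suggestion concrete: a minimal sketch of guarding the unimplemented paths, so callers get a clear error instead of silently incomplete results. The method signature follows the one quoted above; the guard body and message are illustrative, not the PR's actual code.

```csharp
// Hypothetical sketch (not the PR's implementation): explicitly reject the
// depths that GetObjects does not yet handle, per the review suggestion.
public override IArrowArrayStream GetObjects(
    GetObjectsDepth depth, string catalogPattern, string dbSchemaPattern,
    string tableNamePattern, List<string> tableTypes, string columnNamePattern)
{
    if (depth != GetObjectsDepth.All)
    {
        // Fail loudly rather than returning partial metadata.
        throw new NotImplementedException(
            $"GetObjects is only implemented for GetObjectsDepth.All (got {depth}).");
    }
    // ... implemented path for GetObjectsDepth.All ...
    throw new NotImplementedException();
}
```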


Contributor

@birschick-bq birschick-bq Apr 16, 2024


Actually, removed the implementation here. It was not working and not useful as base functionality.

TGetOperationStatusResp statusResponse = null;
do
{
if (statusResponse != null) { Thread.Sleep(500); }
Contributor


This is a good reminder for me that we really need a more async-friendly API, ideally a cross-process one.
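For context, the sleep-based loop quoted above could in principle be rewritten on async/await so the calling thread is not blocked while polling. A hypothetical sketch follows; the client type and method names (`TCLIService.Client`, `GetOperationStatus`) are assumptions inferred from the Thrift types in the snippet, not necessarily the generated API surface.

```csharp
// Hypothetical async variant of the polling loop above: Task.Delay replaces
// Thread.Sleep, and a CancellationToken lets callers abandon the wait.
private static async Task<TGetOperationStatusResp> PollForCompletionAsync(
    TCLIService.Client client, TGetOperationStatusReq request,
    CancellationToken cancellationToken = default)
{
    TGetOperationStatusResp statusResponse = null;
    do
    {
        // Wait between polls, but only after the first status check.
        if (statusResponse != null)
        {
            await Task.Delay(TimeSpan.FromMilliseconds(500), cancellationToken);
        }
        statusResponse = await client.GetOperationStatus(request, cancellationToken);
    } while (statusResponse.OperationState == TOperationState.PENDING_STATE
          || statusResponse.OperationState == TOperationState.RUNNING_STATE);
    return statusResponse;
}
```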

Member


Do you mean at the ADBC API definition level? CC @zeroshade who has been thinking about this. It's something I would like to tackle, but there are questions about compatibility and what happens to the sync API afterwards and if we also want to try to 'fix' other things at the same time.

Contributor


I put some thoughts into #811 which is the only existing issue we have that sort of tracks async.

Contributor

@birschick-bq birschick-bq Apr 16, 2024


@CurtHagenlocher - I have some private WIP to use async as much as possible in this C# implementation. It covers end-to-end async support for ExecuteQueryAsync/ExecuteUpdateAsync. I've set it aside to work on the GetObjects implementation.

@birschick-bq
Contributor

Thanks, this is a good start! I think the single biggest piece of feedback I have is that it's hard to tell what's done/implemented and what's not yet done. This is a problem we also have for the initial C API export code, and I very much regret not having done a better job of documenting that when I implemented it rather than rediscovering it later. I think the best way to address this is with a README.md in the Apache directory which says more explicitly what still needs to be implemented.

It would also be good to mention explicitly that this only supports little-endian platforms, though I think that's actually true of the C# Arrow implementation in general.

I know the Impala functionality isn't tested yet, but has it been shown to work at all? If not, this too should be mentioned.

Once the code is checked in, we can turn the gaps into follow-on issues -- perhaps with a single omnibus issue listing the details -- and track their completion that way.

@CurtHagenlocher - I believe we've responded to all your comments, so far. Let us know if there is anything else you think we should address in this PR. Thanks!

Contributor

@CurtHagenlocher CurtHagenlocher left a comment


Okay, I think that we'll check this in as-is and then I'll start filing followup issues.

@CurtHagenlocher CurtHagenlocher merged commit f589719 into apache:main Apr 17, 2024
5 checks passed
cocoa-xu pushed a commit to cocoa-xu/arrow-adbc that referenced this pull request Apr 24, 2024
…on Thrift (apache#1710)

This PR introduces new drivers built on the Thrift protocol: Hive,
Impala, and Spark. The main focus has been on Spark (including
Databricks) for the initial set of tests.

---------

Co-authored-by: David Coe <coedavid@umich.edu>
Co-authored-by: vikrantpuppala <vikrant.puppala@databricks.com>
Co-authored-by: Gopal Lal <gopal.lal@databricks.com>
Co-authored-by: Vikrant Puppala <vikrantpuppala@gmail.com>
Co-authored-by: Gopal Lal <135012033+gopalldb@users.noreply.github.com>
Co-authored-by: Jade Wang <111902719+jadewang-db@users.noreply.github.com>
Co-authored-by: yunbodeng-db <104732431+yunbodeng-db@users.noreply.github.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Bruce Irschick <bruce.irschick@improving.com>
cocoa-xu pushed a commit to meowcraft-dev/arrow-adbc that referenced this pull request May 8, 2024