docs(csharp/src/Drivers/Apache/Spark): document connection properties (apache#2019)

Add documentation for connection properties

* updates existing documentation for Apache/Thrift-based drivers
birschick-bq authored Jul 18, 2024
1 parent db0852c commit 2435619
Showing 2 changed files with 90 additions and 33 deletions.
84 changes: 84 additions & 0 deletions csharp/src/Drivers/Apache/Spark/README.md
@@ -0,0 +1,84 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Spark Driver

## Database and Connection Properties

Properties should be passed in the call to `SparkDriver.Open`,
but can also be passed in the call to `AdbcDatabase.Connect`. A minimal usage sketch follows the property table below.

| Property | Description | Default |
| :--- | :--- | :--- |
| `adbc.spark.host` | Host name for the data source. Do not include the scheme or port number. Example: `sparkserver.region.cloudapp.azure.com` | |
| `adbc.spark.port` | The port number the data source listens on for new connections. | `443` |
| `adbc.spark.path` | The URI path on the data source server. Example: `sql/protocolv1/o/0123456789123456/01234-0123456-source` | |
| `adbc.spark.token` | For token-based authentication, the token used to authenticate with the data source. Example: `abcdef0123456789` | |
<!-- Add these properties when basic authentication is available.
| `adbc.spark.scheme` | The HTTP or HTTPS scheme to use. Allowed values: `http`, `https`. | `https` when the port is 443 or empty; otherwise `http`. |
| `auth_type` | An indicator of the intended type of authentication. Allowed values: `basic`, `token`. This property is optional; the authentication type can be inferred from `token`, `username`, and `password`. If a `token` value is provided, token authentication is used. Otherwise, if both `username` and `password` values are provided, basic authentication is used. | |
| `username` | The user name used for basic authentication. | |
| `password` | The password for the user name used for basic authentication. | |
-->
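
The sketch below shows how these properties might be passed. It assumes the usual ADBC C# pattern in which `SparkDriver` derives from `AdbcDriver`, `Open` accepts a dictionary of properties and returns an `AdbcDatabase`, and `Connect` accepts an optional dictionary of connection properties; the namespace and all property values are placeholders for illustration.

```csharp
using System.Collections.Generic;
using Apache.Arrow.Adbc;
using Apache.Arrow.Adbc.Drivers.Apache.Spark; // assumed namespace for the Spark driver

// Placeholder property values; replace with values for your environment.
var properties = new Dictionary<string, string>
{
    ["adbc.spark.host"] = "sparkserver.region.cloudapp.azure.com",
    ["adbc.spark.port"] = "443",
    ["adbc.spark.path"] = "sql/protocolv1/o/0123456789123456/01234-0123456-source",
    ["adbc.spark.token"] = "abcdef0123456789",
};

AdbcDriver driver = new SparkDriver();
AdbcDatabase database = driver.Open(properties);

// Connection-level properties may also be supplied here instead of, or in
// addition to, the database-level properties above.
AdbcConnection connection = database.Connect(new Dictionary<string, string>());
```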

## Spark Types

The following table depicts how the Spark ADBC driver converts a Spark type to an Arrow type and a .NET type:

| Spark Type | Arrow Type | C# Type |
| :--- | :---: | :---: |
| ARRAY* | String | string |
| BIGINT | Int64 | long |
| BINARY | Binary | byte[] |
| BOOLEAN | Boolean | bool |
| CHAR | String | string |
| DATE | Date32 | DateTime |
| DECIMAL | Decimal128 | SqlDecimal |
| DOUBLE | Double | double |
| FLOAT | Float | float |
| INT | Int32 | int |
| INTERVAL_DAY_TIME+ | String | string |
| INTERVAL_YEAR_MONTH+ | String | string |
| MAP* | String | string |
| NULL | Null | null |
| SMALLINT | Int16 | short |
| STRING | String | string |
| STRUCT* | String | string |
| TIMESTAMP | Timestamp | DateTimeOffset |
| TINYINT | Int8 | sbyte |
| UNION | String | string |
| USER_DEFINED | String | string |
| VARCHAR | String | string |

\* Complex types are returned as strings<br>
\+ Interval types are returned as strings
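
As an illustration of this mapping, the sketch below reads the first row of an Arrow `RecordBatch` produced by the driver, using accessors from the Apache.Arrow C# library. The query, column names, and column positions are hypothetical.

```csharp
using System;
using Apache.Arrow;

internal static class SparkTypeMappingExample
{
    // Assumes the batch comes from a query such as
    //   SELECT id, amount, created_at FROM some_table
    // where id is INT, amount is DECIMAL, and created_at is TIMESTAMP.
    public static void PrintFirstRow(RecordBatch batch)
    {
        // INT -> Int32Array -> int
        int? id = ((Int32Array)batch.Column(0)).GetValue(0);

        // DECIMAL -> Decimal128Array -> System.Data.SqlTypes.SqlDecimal
        var amount = ((Decimal128Array)batch.Column(1)).GetSqlDecimal(0);

        // TIMESTAMP -> TimestampArray -> DateTimeOffset
        var createdAt = ((TimestampArray)batch.Column(2)).GetTimestamp(0);

        Console.WriteLine($"id={id}, amount={amount}, created_at={createdAt}");
    }
}
```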

## Supported Variants

### Spark on Databricks

Support for Spark on Databricks is the most mature.

The Spark ADBC driver supports token-based authentication using a
[Databricks personal access token](https://docs.databricks.com/en/dev-tools/auth/pat.html).
Basic (username and password) authentication is not supported at this time.

### Native Apache Spark

This is currently unsupported.
39 changes: 6 additions & 33 deletions csharp/src/Drivers/Apache/readme.md
@@ -18,6 +18,7 @@
-->

# Thrift-based Apache connectors

This library contains code for ADBC drivers built on top of the Thrift protocol with Arrow support:

- Hive
@@ -27,6 +28,7 @@ This library contains code for ADBC drivers built on top of the Thrift protocol
Each driver is at a different state of implementation.

## Custom generation

Typically, [Thrift](https://thrift.apache.org/) code is generated by the Thrift compiler, and that is mostly true here as well. However, some files were further edited to include Arrow support. These contain the phrase `BUT THIS FILE HAS BEEN HAND EDITED TO SUPPORT ARROW SO REGENERATE AT YOUR OWN RISK` at the top. Some of these files include:

```
@@ -41,55 +43,26 @@
arrow-adbc/csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TStringColumn.cs
```

# Hive

The Hive classes serve as the base classes for Spark and Impala, since both of those platforms implement Hive capabilities.

Core functionality of the Hive classes beyond the base library implementation is under development, is limited, and may produce errors.

# Impala

The Impala classes are under development, have limited functionality, and may produce errors.

# Spark
The Spark classes are intended for use against native Spark and Spark on Databricks.

For more details, see [Spark Driver](Spark/README.md).

## Known Limitations

1. The API `SparkConnection.GetObjects` is not fully tested at this time.
   1. It may not return all catalogs and schemas on the server.
   1. It may throw an exception when returning object metadata from multiple catalogs and schemas.
1. The API `Connection.GetTableSchema` does not return the correct precision and scale for `NUMERIC`/`DECIMAL` types.
1. When a `NULL` value is returned for a `BINARY` type, it is returned as an empty array instead of the expected `null`.
1. Result set metadata does not provide information about the nullability of each column; columns are marked as `nullable` by default, which may not be accurate.
1. The **Impala** driver is untested and is currently unsupported.