feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Response Schema #3140
Conversation
}

protected override void SetStatementProperties(TExecuteStatementReq statement)
{
Moving this to the appropriate layer in DatabricksStatement; not directly related to this PR.
So (at least as of today) the OSS Spark implementation can't return results in Arrow format, only as Thrift?
I think OSS Spark via Hive-Thriftserver cannot. Those fields are also Databricks-specific
namespace Apache.Arrow.Adbc.Drivers.Databricks
{
    internal class DatabricksSchemaParser : SchemaParser
A bit new to Arrow, so I'm not 100% sure this is the correct way to handle consuming ArrowSchema (but it seems to work from manual testing). In particular, the runtime populates ArrowSchema with
MessageSerializer.serialize(writeChannel, getArrowSchema());
which, from what I understand, can be correctly consumed using
using var stream = new MemoryStream(schemaBytes);
using var reader = new ArrowStreamReader(stream);
@CurtHagenlocher @jadewang-db, maybe you know better?
It's become clear that we need to expose lower-level functions from the C# Arrow library to allow both schemas and data to be loaded independently of each other. This is probably the best option given the current APIs, but this approach and the ChunkStream approach used by DatabricksReader (as well as the ReadRowsStream used in the BigQuery driver) are less efficient than they could be if the lower-level functionality existed.
Tl;dr: this is correct.
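For reference, a minimal sketch of the parsing path described above; the helper class and method names are illustrative, not the driver's actual code:

```csharp
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal static class ArrowSchemaHelper
{
    // The serialized bytes form an Arrow IPC stream whose first message is the schema,
    // so an ArrowStreamReader over the bytes can surface the schema without any record batches.
    public static Schema ParseArrowSchema(byte[] schemaBytes)
    {
        using var stream = new MemoryStream(schemaBytes);
        using var reader = new ArrowStreamReader(stream);
        return reader.Schema;
    }
}
```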
TTypeId.FLOAT_TYPE => FloatType.Default,
TTypeId.INT_TYPE => Int32Type.Default,
TTypeId.NULL_TYPE => NullType.Default,
TTypeId.SMALLINT_TYPE => Int16Type.Default,
I have a feeling the Timestamp type will also need to be String before advanced Arrow type support; investigating.
Looks like this needs to stay as TIMESTAMP_TYPE; String actually does not work for that.
Runtime:
```scala
case TType.TIMESTAMP_TYPE if SQLConf.get.arrowThriftTimestampToString =>
  ArrowType.Utf8.INSTANCE
case TType.TIMESTAMP_TYPE =>
  new ArrowType.Timestamp(TimeUnit.MICROSECOND, timeZoneId)
```
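For reference, a rough C# counterpart of that mapping, as a hedged sketch: the class/method name and the `timestampAsString`/`timeZoneId` parameters are assumptions for illustration, not the driver's actual code.

```csharp
using Apache.Arrow.Types;

internal static class TimestampMapping
{
    // Mirrors the Scala snippet above: timestamps come back as UTF-8 strings only when
    // arrowThriftTimestampToString is in effect; otherwise they remain microsecond timestamps.
    public static IArrowType Map(bool timestampAsString, string timeZoneId) =>
        timestampAsString
            ? (IArrowType)StringType.Default
            : new TimestampType(TimeUnit.Microsecond, timeZoneId);
}
```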
Seems like SQLConf.get.arrowThriftTimestampToString can be configured on OpenSessionReq, but we aren't doing that. Leaving a comment here
Actually, it looks like for DBR < 7.0 we need Timestamp to be handled as a String, too. This is a bit different from what I expected just from looking at the runtime code...
This PR actually doesn't solve that issue, since DBR 7.0 - 10.0 does not actually get ArrowSchema.
Ah, just realized it's because of { "spark.thriftserver.arrowBasedRowSet.timestampAsString", "false" }.
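For context, a minimal sketch of the kind of default being referred to; the configuration key is copied from this thread, but the class/field names and where the driver actually applies the override are assumptions.

```csharp
using System.Collections.Generic;

internal static class SessionDefaults
{
    // Hypothetical holder for session-configuration overrides sent at session open;
    // this entry is what keeps TIMESTAMP columns as native Arrow timestamps rather than strings.
    public static readonly IReadOnlyDictionary<string, string> Conf =
        new Dictionary<string, string>
        {
            { "spark.thriftserver.arrowBasedRowSet.timestampAsString", "false" },
        };
}
```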
ComplexTypesAsArrow = false,
IntervalTypesAsArrow = false,
};
I have a feeling we can set these to true if we are using ArrowSchema now. Has there been any discussion about this?
What is the current ODBC behavior? Is it string? If we make it Arrow, can Power BI handle it?
Thanks! This looks fine insofar as it's no less broken than the currently-checked-in code ;). I did have some questions as well as comments about a more desirable end state. Let me know whether you'd like this checked-in as-is.
Do you have a rough estimate for the percentage of compute instances whose versions would cause them to be impacted by these changes?
OutputHelper?.WriteLine($"Decimal value: {sqlDecimal.Value} (precision: {decimalType.Precision}, scale: {decimalType.Scale})");
}
else if (field.DataType is StringType)
I think it's a problem for a decimal table column to be returned as an Arrow string. We'll need to apply a conversion after reading the data. It's possible that when testing this scenario specifically with Power BI that things will work because the ADBC connector itself will (at least sometimes) convert the data to match the expected schema. However, this is a little inefficient and is clearly not the right behavior for the driver in non-Power BI contexts.
Broadly speaking, I think the right behavior here is for the driver to look at both the Thrift schema and the Arrow schema and to come up with a "final" schema as well as a collection of transformers to be applied to the individual columns. So if the Thrift schema says "decimal" and the Arrow schema says "string" then the final schema should say "decimal" and there should be a function to convert the string array into a decimal array.
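A minimal sketch of what one such column transformer could look like; the helper name, the precision/scale plumbing, and the use of System.Decimal parsing are illustrative assumptions rather than the driver's actual design:

```csharp
using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

internal static class DecimalColumnTransform
{
    // Converts a column that arrived as UTF-8 strings into the Decimal128 array that the
    // "final" (Thrift-derived) schema promises. Values wider than System.Decimal's ~28-29
    // digits would need a different parsing path than the one sketched here.
    public static Decimal128Array StringsToDecimal128(StringArray source, int precision, int scale)
    {
        var builder = new Decimal128Array.Builder(new Decimal128Type(precision, scale));
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                builder.Append(decimal.Parse(source.GetString(i), CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}
```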
Agree with @CurtHagenlocher, and this makes me think: in GetSchemaFromMetadata, should we really just return the Arrow schema, or should we actually keep the Thrift schema? We might want to confirm with the runtime folks; if the Arrow schema is the underlying data schema, we probably want to return the Thrift schema as the result schema.
Do you mean we want to do some conversion to the Thrift type?
@CurtHagenlocher what would the risk be if we returned an Arrow String here instead of an Arrow Decimal? You said the connector may do its own conversion? If that didn't happen, what would the risk be?
> ADBC connector itself will (at least sometimes) convert the data to match the expected schema
It would certainly be low; the vast majority of traffic is 10.0+ (and per Databricks documentation, DBR < 10 is end-of-support). However, we do have non-trivial usage on lower versions like 9.1 and 7.3 that we wouldn't want to break. Seems like the concern here is that decimals-as-strings require some inefficient/hacky handling and should ideally be converted in the driver?
Looks like we need some follow-up with the runtime folks.
I would say that there are three different ways of thinking about this, depending on who the user is.

For a new user who wants to consume Databricks data into their new C# application, I imagine they would want the loaded data to represent Spark's view of the data as closely as possible. This would include preserving type information and values with as much fidelity as the Arrow format allows.

For a user who is currently consuming data into a .NET application* via ODBC, I suspect they would -- at least initially -- want the loaded data to be as similar as possible to what the ODBC driver is returning. (This doesn't line up perfectly for reasons I'll get into.)

Finally, for use specifically inside Power BI, the user would want to get the same results whether they're using the connector with ODBC or with ADBC.

The latter two are (at least at first blush) pretty well-aligned**, because in both cases there's some client code that's switching from ODBC to ADBC for which we want to minimize the transition costs.

I'm not currently in a position to test the older server version via ODBC, as I've had to decommission the Databricks instance I was using to test because it wasn't compliant with internal security restrictions, but I would be extremely surprised if it was returning decimal data as a string. And at a minimum, it would need to report the type of the result column as being decimal in order to let the client application know what type it is.

But the difference is that ODBC is able to report the type as decimal while still retaining an internal representation of the data as a string. That's because fetching the data with ODBC specifically requires that you say what format you want it returned as. So even if the internal buffer contains a string, the client application would see that the column type is SQL_DECIMAL and it would say "I want this data formatted as SQL_C_DECIMAL", and the driver would need to perform any necessary conversion. This possibility doesn't exist with ADBC because there is no similar distinction. The declared type has to be consistent with the type of the returned data buffer.

Earlier I had mentioned that the Power BI connector is doing some data/type translation. The context for this is that we ordinarily compute the expected type of the result set and then, if the actual type doesn't match, the ADBC code in Power BI will inject a transformation. However, this only works when the user references tables in the catalog. In the scenario where the user supplies a native SQL query and we run it to discover the schema output, returning a decimal as string will mean that the original type is lost. This will make the data harder to work with in Power BI and would break backwards-compatibility.

Tl;dr: I'm afraid the driver will need to translate the data.

*Note that we still intend to make this driver work for non-.NET consumers by adding AOT compilation support. The main gap is some missing functionality in the core Arrow C# library for which there's a prototype implementation.

**That said, I think we'd love to be able to represent nested lists, records or tables in Power BI as their native types, because the conversion of all structured data into JSON is both lossy and limiting in terms of the kinds of querying we can do against the data source.
// For GetColumns, we need to enhance the result with BASE_TYPE_NAME
if (Connection.AreResultsAvailableDirectly && resp.DirectResults?.ResultSet?.Results != null)
{
    // Get data from direct results
Replaced all instances of consuming the Thrift schema directly so that they attempt to consume ArrowSchema first.
@CurtHagenlocher I think we're considering going ahead and merging this, since we expect that this is still a correctness improvement and ArrowSchema should be the same for other types (those not covered by useArrowNativeTypes). We're still discussing internally how much effort we want to put into supporting DBR < 10.4.
Motivation
In older Databricks Runtime versions (DBR < 10.0, before support was added for UseArrowNativeTypes and DecimalAsArrow), the runtime returns decimal values as strings rather than native decimal types. This caused ArithmeticOverflow errors when the ADBC driver attempted to read decimal values using the schema type information from MetadataResp.Schema.
Problem
MetadataResp.Schema may contain type DECIMAL_TYPE, but the actual data in Arrow format can be either Decimal128 or String, depending on the Databricks Runtime version and whether DecimalAsArrow is enabled.
Since there is no way to determine from the Thrift schema alone whether decimal data will be serialized as strings or native decimals, using MetadataResp.Schema leads to type mismatches and runtime errors.
Solution
Use MetadataResp.ArrowSchema when available, as it contains the actual runtime type representation that matches the serialized Arrow data. The Arrow schema correctly shows:
- utf8 type when decimals are serialized as strings (DecimalAsArrow=false)
- decimal128 type when decimals are serialized as native decimals (DecimalAsArrow=true)
The implementation prefers ArrowSchema when available and falls back to the traditional Thrift schema parsing for backward compatibility. Decimal types are now treated as Strings by default.
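A minimal sketch of that preference logic; the class/method names and the delegate-based fallback wiring are illustrative only, since the real parsing helpers are driver-internal:

```csharp
using System;
using Apache.Arrow;

internal static class ResponseSchemaSelection
{
    // Prefer the IPC-serialized Arrow schema when the server returned one;
    // otherwise fall back to the legacy Thrift-schema conversion.
    public static Schema Choose(
        byte[] serializedArrowSchema,          // MetadataResp.ArrowSchema bytes, possibly absent
        Func<byte[], Schema> parseArrowSchema, // e.g. ArrowStreamReader-based parsing
        Func<Schema> parseThriftFallback)      // legacy MetadataResp.Schema conversion
    {
        if (serializedArrowSchema != null && serializedArrowSchema.Length > 0)
        {
            return parseArrowSchema(serializedArrowSchema);
        }
        return parseThriftFallback();
    }
}
```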
Testing
This change maintains backward compatibility while providing accurate type information when available, resolving decimal reading issues across different Databricks Runtime versions.