
Conversation


@toddmeng-db toddmeng-db commented Jul 12, 2025

Motivation

In older Databricks Runtime versions (DBR < 10.0, before support was added for UseArrowNativeTypes and DecimalAsArrow), the runtime returns decimal values as strings rather than native decimal types. This caused ArithmeticOverflow errors when the ADBC driver attempted to read decimal values using the schema type information from MetadataResp.Schema.

Problem

MetadataResp.Schema may contain type DECIMAL_TYPE, but the actual data in Arrow format can be either Decimal128 or String, depending on:

  • The DecimalAsArrow configuration value
  • Whether the runtime version supports DecimalAsArrow (DBR >= 10.0)

Since there is no way to determine from the Thrift schema alone whether decimal data will be serialized as strings or native decimals, using MetadataResp.Schema leads to type mismatches and runtime errors.

Solution

Use MetadataResp.ArrowSchema when available, as it contains the actual runtime type representation that matches the serialized Arrow data. The Arrow schema correctly shows:

  • utf8 type when decimals are serialized as strings (DecimalAsArrow=false)
  • decimal128 type when decimals are serialized as native decimals (DecimalAsArrow=true)

The implementation prefers ArrowSchema when available and falls back to traditional Thrift schema parsing for backward compatibility; a minimal sketch of that selection is shown below. Decimal types are now treated as Strings by default.
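
A minimal sketch of that selection logic, assuming the ArrowSchema bytes are an Arrow IPC stream; ResolveSchema and the parseThriftSchema callback are illustrative names standing in for the driver's real helpers, and it assumes ArrowStreamReader.Schema lazily reads the schema message:

using System;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal static class SchemaSelection
{
    // Prefer the serialized Arrow schema when the server returns one; otherwise
    // fall back to deriving the schema from the Thrift metadata.
    public static Schema ResolveSchema(byte[] arrowSchemaBytes, Func<Schema> parseThriftSchema)
    {
        if (arrowSchemaBytes == null || arrowSchemaBytes.Length == 0)
        {
            return parseThriftSchema();
        }

        // The runtime serializes the schema as an Arrow IPC message, so it can be
        // read back with an ArrowStreamReader over the raw bytes.
        using var stream = new MemoryStream(arrowSchemaBytes);
        using var reader = new ArrowStreamReader(stream);
        return reader.Schema;
    }
}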

Testing

  • Added end-to-end tests covering both decimal-as-string and decimal-as-native scenarios
  • Verified that tests fail without the fix
  • Manually verified that Power BI integration functions correctly with the changes

This change maintains backward compatibility while providing accurate type information when available, resolving decimal reading issues across different Databricks Runtime versions.

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch 3 times, most recently from d661507 to ddd1bd1 on July 12, 2025 01:08
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from ddd1bd1 to cbf4867 on July 12, 2025 06:52
@toddmeng-db toddmeng-db changed the title from "Parse decimal as Strings if not UseArrowNativeType" to "Parse decimal as Strings for Older DBR" on Jul 12, 2025
@toddmeng-db toddmeng-db changed the title from "Parse decimal as Strings for Older DBR" to "feat(csharp/src/Drivers/Databricks): Parse decimal as Strings for Older DBR" on Jul 12, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from 3378c5b to be6a5ef on July 12, 2025 07:46
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from be6a5ef to 31cd4a6 on July 12, 2025 17:18
@toddmeng-db toddmeng-db changed the title from "feat(csharp/src/Drivers/Databricks): Parse decimal as Strings for Older DBR" to "feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Schema" on Jul 14, 2025
}

protected override void SetStatementProperties(TExecuteStatementReq statement)
{
Contributor Author

moving this to the appropriate layer in DatabricksStatement, not directly related to this PR

Contributor

So (at least as of today) the OSS Spark implementation can't return results in Arrow format, only as Thrift?

@toddmeng-db toddmeng-db Jul 15, 2025

I think OSS Spark via Hive-Thriftserver cannot. Those fields are also Databricks-specific


namespace Apache.Arrow.Adbc.Drivers.Databricks
{
internal class DatabricksSchemaParser : SchemaParser
@toddmeng-db toddmeng-db Jul 14, 2025

A bit new to Arrow, not 100% sure if this is the correct way to handle consuming ArrowSchema (but this seems to work from manual testing). In particular, runtime populates ArrowSchema with

MessageSerializer.serialize(writeChannel, getArrowSchema());

Which from what I understand can be correctly consumed using

using var stream = new MemoryStream(schemaBytes);
using var reader = new ArrowStreamReader(stream);

Maybe @CurtHagenlocher @jadewang-db you may know better?

Contributor

It's become clear that we need to expose lower-level functions from the C# Arrow library to allow both schemas and data to be loaded independently of each other. This is probably the best option given the current APIs, but this approach and the ChunkStream approach used by DatabricksReader (as well as the ReadRowsStream used in the BigQuery driver) are less efficient than they could be if the lower-level functionality existed.

Tl;dr: this is correct.

TTypeId.FLOAT_TYPE => FloatType.Default,
TTypeId.INT_TYPE => Int32Type.Default,
TTypeId.NULL_TYPE => NullType.Default,
TTypeId.SMALLINT_TYPE => Int16Type.Default,
@toddmeng-db toddmeng-db Jul 14, 2025

I have a feeling the Timestamp type will also need to be String before advanced Arrow type support; investigating.

@toddmeng-db toddmeng-db Jul 14, 2025

Looks like this needs to stay as TIMESTAMP_TYPE; String actually does not work for that.

Runtime:

    case TType.TIMESTAMP_TYPE if SQLConf.get.arrowThriftTimestampToString =>
      ArrowType.Utf8.INSTANCE
    case TType.TIMESTAMP_TYPE =>
      new ArrowType.Timestamp(TimeUnit.MICROSECOND, timeZoneId)
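
For reference, a hedged C# sketch of how a driver-side Thrift-to-Arrow mapping could mirror that runtime branch; the timestampAsString parameter stands in for spark.thriftserver.arrowBasedRowSet.timestampAsString and is illustrative, not an existing driver option:

using Apache.Arrow.Types;

internal static class TimestampMapping
{
    // Mirrors the Scala case above: utf8 when the conf forces strings, otherwise a
    // microsecond-precision timestamp carrying the session time zone.
    public static IArrowType MapTimestampType(bool timestampAsString, string timeZoneId) =>
        timestampAsString
            ? (IArrowType)StringType.Default
            : new TimestampType(TimeUnit.Microsecond, timeZoneId);
}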

@toddmeng-db toddmeng-db Jul 14, 2025

Seems like SQLConf.get.arrowThriftTimestampToString can be configured on OpenSessionReq, but we aren't doing that. Leaving a comment here

@toddmeng-db toddmeng-db Jul 16, 2025

Actually, it looks like for DBR < 7.0 we need Timestamp to be handled as a String, too. This is a bit different from what I expected just from looking at the runtime code...

This PR actually doesn't solve that issue, since DBR 7.0 - 10.0 does not actually get ArrowSchema.

Contributor Author

Ah, just realized it's because { "spark.thriftserver.arrowBasedRowSet.timestampAsString", "false" }

ComplexTypesAsArrow = false,
IntervalTypesAsArrow = false,
};

Contributor Author

I have a feeling we can set these to true if we are using ArrowSchema now. Has there been any discussion about this?

cc @eric-wang-1990 @jadewang-db

Contributor

What is the current ODBC behavior? Is it string? If we make it Arrow, can Power BI handle it?


@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from 0a571e1 to aa0767d on July 14, 2025 23:42
@toddmeng-db toddmeng-db marked this pull request as ready for review July 15, 2025 00:15
@github-actions github-actions bot added this to the ADBC Libraries 20 milestone Jul 15, 2025
@toddmeng-db toddmeng-db changed the title from "feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Schema" to "feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Response Schema" on Jul 15, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from 26a43d7 to 9b0b931 on July 15, 2025 00:22
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from 9b0b931 to a84b5be on July 15, 2025 00:33
@CurtHagenlocher CurtHagenlocher left a comment

Thanks! This looks fine insofar as it's no less broken than the currently-checked-in code ;). I did have some questions as well as comments about a more desirable end state. Let me know whether you'd like this checked-in as-is.

Do you have a rough estimate for the percentage of compute instances whose versions would cause them to be impacted by these changes?


OutputHelper?.WriteLine($"Decimal value: {sqlDecimal.Value} (precision: {decimalType.Precision}, scale: {decimalType.Scale})");
}
else if (field.DataType is StringType)
Contributor

I think it's a problem for a decimal table column to be returned as an Arrow string. We'll need to apply a conversion after reading the data. It's possible that, when testing this scenario specifically with Power BI, things will work because the ADBC connector itself will (at least sometimes) convert the data to match the expected schema. However, this is a little inefficient and is clearly not the right behavior for the driver in non-Power BI contexts.

Broadly speaking, I think the right behavior here is for the driver to look at both the Thrift schema and the Arrow schema and to come up with a "final" schema as well as a collection of transformers to be applied to the individual columns. So if the Thrift schema says "decimal" and the Arrow schema says "string" then the final schema should say "decimal" and there should be a function to convert the string array into a decimal array.
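
A rough sketch of such a column transformer, assuming plain decimal literals in the string column and the Decimal128Array.Builder API; this is illustrative, not the merged implementation:

using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

internal static class DecimalStringTransformer
{
    // Converts a utf8 column produced by older DBR back into the decimal128 column
    // implied by the Thrift schema, so the "final" schema can keep the DECIMAL type.
    public static Decimal128Array ToDecimal128(StringArray source, Decimal128Type targetType)
    {
        var builder = new Decimal128Array.Builder(targetType);
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                // System.Decimal covers precision up to 28-29 digits; wider decimals
                // would need a different conversion path.
                builder.Append(decimal.Parse(source.GetString(i), CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}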

Contributor

Agree with @CurtHagenlocher, and this makes me think: in GetSchemaFromMetadata should we really just return the Arrow schema, or should we actually keep the Thrift schema? We might want to confirm with runtime folks; if the Arrow schema is the underlying data schema, we probably want to return the Thrift schema as the result schema.

Contributor Author

Do you mean we want to do some conversion to thrift-type?

@toddmeng-db toddmeng-db Jul 18, 2025

@CurtHagenlocher what would the risk be if we returned an Arrow String here, instead of an Arrow Decimal? You said the connector may do its own conversion; if that didn't happen, what would the risk be?

ADBC connector itself will (at least sometimes) convert the data to match the expected schema



@toddmeng-db
Contributor Author

Do you have a rough estimate for the percentage of compute instances whose versions would cause them to be impacted by these changes?

It would certainly be low; the vast majority of traffic is 10.0+ (and in Databricks documentation, DBR < 10 is end-of-support). However, we do have non-trivial usage for lower versions like 9.1 and 7.3 that we wouldn't want to break. Seems like the concern here is that decimals-as-strings require some inefficient/hacky handling, and should ideally be converted in the driver?

@jadewang-db jadewang-db left a comment

Looks like we need some follow-up with runtime folks.

@CurtHagenlocher
Contributor

I would say that there are three different ways of thinking about this, depending on who the user is. For a new user who wants to consume Databricks data into their new C# application, I imagine they would want the loaded data to represent Spark's view of the data as closely as possible. This would include preserving type information and values with as much fidelity as the Arrow format allows. For a user who is currently consuming data into a .NET application* via ODBC, I suspect they would -- at least initially -- want the loaded data to be as similar as possible to what the ODBC driver is returning. (This doesn't line up perfectly for reasons I'll get into.) Finally, for use specifically inside Power BI the user would want to get the same results whether they're using the connector with ODBC or with ADBC. The latter two are (at least at first blush) pretty well-aligned**, because in both cases there's some client code that's switching from ODBC to ADBC for which we want to minimize the transition costs.

I'm not currently in a position to test the older server version via ODBC, as I've had to decommission the Databricks instance I was using for testing (it wasn't compliant with internal security restrictions), but I would be extremely surprised if it was returning decimal data as a string. And at a minimum, it would need to report the type of the result column as being decimal in order to let the client application know what type it is. But the difference is that ODBC is able to report the type as decimal while still retaining an internal representation of the data as a string. That's because fetching the data with ODBC specifically requires that you say what format you want it returned as. So even if the internal buffer contains a string, the client application would see that the column type is SQL_DECIMAL and it would say "I want this data formatted as SQL_C_DECIMAL" and the driver would need to perform any necessary conversion. This possibility doesn't exist with ADBC because there is no similar distinction. The declared type has to be consistent with the type of the returned data buffer.

Earlier I had mentioned that the Power BI connector is doing some data/type translation. The context for this is that we ordinarily compute the expected type of the result set and then if the actual type doesn't match, the ADBC code in Power BI will inject a transformation. However, this only works when the user references tables in the catalog. In the scenario where the user supplies a native SQL query and we run it to discover the schema output, returning a decimal as string will mean that the original type is lost. This will make the data harder to work with in Power BI and would break backwards-compatibility.

Tl;dr: I'm afraid the driver will need to translate the data.

*Note that we still intend to make this driver work for non-.NET consumers by adding AOT compilation support. The main gap is some missing functionality in the core Arrow C# library for which there's a prototype implementation.

**That said, I think we'd love to be able to represent nested lists, records or tables in Power BI as their native types, because the conversion of all structured data into JSON is both lossy and limiting in terms of the kinds of querying we can do against the data source.

@toddmeng-db toddmeng-db requested a review from jadewang-db July 23, 2025 22:07
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/decimal-fix branch from 015a787 to cf8e863 on July 24, 2025 00:35
// For GetColumns, we need to enhance the result with BASE_TYPE_NAME
if (Connection.AreResultsAvailableDirectly && resp.DirectResults?.ResultSet?.Results != null)
{
// Get data from direct results
Contributor Author

Replaced all instances of consuming the Thrift schema directly with attempts to consume ArrowSchema first.

@toddmeng-db
Contributor Author

@CurtHagenlocher I think we're considering going ahead and merging this, since we expect that this is still a correctness improvement and ArrowSchema should be the same for other types (those not covered by useArrowNativeTypes). We're still discussing internally how much effort we want to put into supporting DBR < 10.4.

@CurtHagenlocher CurtHagenlocher merged commit 04dbef5 into apache:main Jul 24, 2025
6 checks passed