feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Response Schema #3140
Conversation
}

protected override void SetStatementProperties(TExecuteStatementReq statement)
{
Moving this to the appropriate layer in DatabricksStatement; not directly related to this PR.
So (at least as of today) the OSS Spark implementation can't return results in Arrow format, only as Thrift?
I think OSS Spark via Hive-Thriftserver cannot. Those fields are also Databricks-specific
namespace Apache.Arrow.Adbc.Drivers.Databricks
{
    internal class DatabricksSchemaParser : SchemaParser
A bit new to Arrow, so I'm not 100% sure this is the correct way to handle consuming ArrowSchema (but it seems to work from manual testing). In particular, the runtime populates ArrowSchema with
MessageSerializer.serialize(writeChannel, getArrowSchema());
which, from what I understand, can be correctly consumed using
using var stream = new MemoryStream(schemaBytes);
using var reader = new ArrowStreamReader(stream);
@CurtHagenlocher @jadewang-db, maybe you know better?
It's become clear that we need to expose lower-level functions from the C# Arrow library to allow both schemas and data to be loaded independently of each other. This is probably the best option given the current APIs, but this approach and the ChunkStream approach used by DatabricksReader (as well as the ReadRowsStream used in the BigQuery driver) are less efficient than they could be if the lower-level functionality existed.
Tl;dr: this is correct.
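For reference, a minimal sketch of the parsing path described above; the helper class and method names are illustrative, not the driver's actual code:

```csharp
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal static class ArrowSchemaHelper
{
    // The serialized bytes form an Arrow IPC stream whose first message is the schema,
    // so an ArrowStreamReader over the bytes can surface the schema without any record batches.
    public static Schema ParseArrowSchema(byte[] schemaBytes)
    {
        using var stream = new MemoryStream(schemaBytes);
        using var reader = new ArrowStreamReader(stream);
        return reader.Schema;
    }
}
```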
TTypeId.FLOAT_TYPE => FloatType.Default,
TTypeId.INT_TYPE => Int32Type.Default,
TTypeId.NULL_TYPE => NullType.Default,
TTypeId.SMALLINT_TYPE => Int16Type.Default,
I have a feeling the Timestamp type will also need to be String before advanced Arrow type support; investigating.
Looks like this needs to stay as TIMESTAMP_TYPE; String actually does not work for that.
Runtime:
```scala
case TType.TIMESTAMP_TYPE if SQLConf.get.arrowThriftTimestampToString =>
  ArrowType.Utf8.INSTANCE
case TType.TIMESTAMP_TYPE =>
  new ArrowType.Timestamp(TimeUnit.MICROSECOND, timeZoneId)
```
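For reference, a rough C# counterpart of that mapping, as a hedged sketch: the class/method name and the `timestampAsString`/`timeZoneId` parameters are assumptions for illustration, not the driver's actual code.

```csharp
using Apache.Arrow.Types;

internal static class TimestampMapping
{
    // Mirrors the Scala snippet above: timestamps come back as UTF-8 strings only when
    // arrowThriftTimestampToString is in effect; otherwise they remain microsecond timestamps.
    public static IArrowType Map(bool timestampAsString, string timeZoneId) =>
        timestampAsString
            ? (IArrowType)StringType.Default
            : new TimestampType(TimeUnit.Microsecond, timeZoneId);
}
```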
Seems like SQLConf.get.arrowThriftTimestampToString can be configured on OpenSessionReq, but we aren't doing that. Leaving a comment here
Actually, it looks like for DBR < 7.0 we need Timestamp to be handled as a String, too. This is a bit different from what I expected just from looking at the runtime code...
This PR actually doesn't solve that issue, since DBR 7.0 - 10.0 does not actually get ArrowSchema.
Ah, just realized it's because of { "spark.thriftserver.arrowBasedRowSet.timestampAsString", "false" }.
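For context, a minimal sketch of the kind of default being referred to; the configuration key is copied from this thread, but the class/field names and where the driver actually applies the override are assumptions.

```csharp
using System.Collections.Generic;

internal static class SessionDefaults
{
    // Hypothetical holder for session-configuration overrides sent at session open;
    // this entry is what keeps TIMESTAMP columns as native Arrow timestamps rather than strings.
    public static readonly IReadOnlyDictionary<string, string> Conf =
        new Dictionary<string, string>
        {
            { "spark.thriftserver.arrowBasedRowSet.timestampAsString", "false" },
        };
}
```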
ComplexTypesAsArrow = false,
IntervalTypesAsArrow = false,
};
I have a feeling we can set these to true if we are using ArrowSchema now. Has there been any discussion about this?
What is the current ODBC behavior? Is it string? If we make it Arrow, can Power BI handle it?
Thanks! This looks fine insofar as it's no less broken than the currently-checked-in code ;). I did have some questions as well as comments about a more desirable end state. Let me know whether you'd like this checked-in as-is.
Do you have a rough estimate for the percentage of compute instances whose versions would cause them to be impacted by these changes?
OutputHelper?.WriteLine($"Decimal value: {sqlDecimal.Value} (precision: {decimalType.Precision}, scale: {decimalType.Scale})");
}
else if (field.DataType is StringType)
I think it's a problem for a decimal table column to be returned as an Arrow string. We'll need to apply a conversion after reading the data. It's possible that when testing this scenario specifically with Power BI that things will work because the ADBC connector itself will (at least sometimes) convert the data to match the expected schema. However, this is a little inefficient and is clearly not the right behavior for the driver in non-Power BI contexts.
Broadly speaking, I think the right behavior here is for the driver to look at both the Thrift schema and the Arrow schema and to come up with a "final" schema as well as a collection of transformers to be applied to the individual columns. So if the Thrift schema says "decimal" and the Arrow schema says "string" then the final schema should say "decimal" and there should be a function to convert the string array into a decimal array.
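A minimal sketch of what one such column transformer could look like; the helper name, the precision/scale plumbing, and the use of System.Decimal parsing are illustrative assumptions rather than the driver's actual design:

```csharp
using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

internal static class DecimalColumnTransform
{
    // Converts a column that arrived as UTF-8 strings into the Decimal128 array that the
    // "final" (Thrift-derived) schema promises. Values wider than System.Decimal's ~28-29
    // digits would need a different parsing path than the one sketched here.
    public static Decimal128Array StringsToDecimal128(StringArray source, int precision, int scale)
    {
        var builder = new Decimal128Array.Builder(new Decimal128Type(precision, scale));
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                builder.Append(decimal.Parse(source.GetString(i), CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}
```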
Agree with @CurtHagenlocher, and this makes me think: in GetSchemaFromMetadata, should we really just return the Arrow schema, or should we actually keep the Thrift schema? We might want to confirm with the runtime folks; if the Arrow schema is the underlying data schema, we probably want to return the Thrift schema as the result schema.
Do you mean we want to do some conversion to the Thrift type?
@CurtHagenlocher what would the risk be if we returned an Arrow String here instead of an Arrow Decimal? You said the connector may do its own conversion? If that didn't happen, what would the risk be?
> ADBC connector itself will (at least sometimes) convert the data to match the expected schema
It would certainly be low; the vast majority of traffic is 10.0+ (and per Databricks documentation, DBR < 10 is end-of-support). However, we do have non-trivial usage on lower versions like 9.1 and 7.3 that we wouldn't want to break. Seems like the concern here is that decimals-as-strings require some inefficient/hacky handling and should ideally be converted in the driver?
Looks like we need some follow-up with the runtime folks.
I would say that there are three different ways of thinking about this, depending on who the user is.

For a new user who wants to consume Databricks data into their new C# application, I imagine they would want the loaded data to represent Spark's view of the data as closely as possible. This would include preserving type information and values with as much fidelity as the Arrow format allows.

For a user who is currently consuming data into a .NET application* via ODBC, I suspect they would -- at least initially -- want the loaded data to be as similar as possible to what the ODBC driver is returning. (This doesn't line up perfectly for reasons I'll get into.)

Finally, for use specifically inside Power BI, the user would want to get the same results whether they're using the connector with ODBC or with ADBC.

The latter two are (at least at first blush) pretty well-aligned**, because in both cases there's some client code that's switching from ODBC to ADBC for which we want to minimize the transition costs.

I'm not currently in a position to test the older server version via ODBC, as I've had to decommission the Databricks instance I was using to test because it wasn't compliant with internal security restrictions, but I would be extremely surprised if it was returning decimal data as a string. And at a minimum, it would need to report the type of the result column as being decimal in order to let the client application know what type it is.

But the difference is that ODBC is able to report the type as decimal while still retaining an internal representation of the data as a string. That's because fetching the data with ODBC specifically requires that you say what format you want it returned as. So even if the internal buffer contains a string, the client application would see that the column type is SQL_DECIMAL and it would say "I want this data formatted as SQL_C_DECIMAL", and the driver would need to perform any necessary conversion. This possibility doesn't exist with ADBC because there is no similar distinction. The declared type has to be consistent with the type of the returned data buffer.

Earlier I had mentioned that the Power BI connector is doing some data/type translation. The context for this is that we ordinarily compute the expected type of the result set and then, if the actual type doesn't match, the ADBC code in Power BI will inject a transformation. However, this only works when the user references tables in the catalog. In the scenario where the user supplies a native SQL query and we run it to discover the schema output, returning a decimal as string will mean that the original type is lost. This will make the data harder to work with in Power BI and would break backwards-compatibility.

Tl;dr: I'm afraid the driver will need to translate the data.

*Note that we still intend to make this driver work for non-.NET consumers by adding AOT compilation support. The main gap is some missing functionality in the core Arrow C# library for which there's a prototype implementation.

**That said, I think we'd love to be able to represent nested lists, records or tables in Power BI as their native types, because the conversion of all structured data into JSON is both lossy and limiting in terms of the kinds of querying we can do against the data source.
// For GetColumns, we need to enhance the result with BASE_TYPE_NAME
if (Connection.AreResultsAvailableDirectly && resp.DirectResults?.ResultSet?.Results != null)
{
    // Get data from direct results
Replaced all instances of consuming the Thrift schema directly so that they attempt to consume ArrowSchema first.
@CurtHagenlocher I think we're considering going ahead and merging this, since we expect that this is still a correctness improvement and ArrowSchema should be the same for other types (those not covered by useArrowNativeTypes). We're still discussing internally how much effort we want to put into supporting DBR < 10.4.
Motivation
In older Databricks Runtime versions (DBR < 10.0, before support was added for UseArrowNativeTypes and DecimalAsArrow), the runtime returns decimal values as strings rather than native decimal types. This caused ArithmeticOverflow errors when the ADBC driver attempted to read decimal values using the schema type information from MetadataResp.Schema.
Problem
MetadataResp.Schema may contain type DECIMAL_TYPE, but the actual data in Arrow format can be either Decimal128 or String, depending on the Databricks Runtime version and whether DecimalAsArrow is enabled.
Since there is no way to determine from the Thrift schema alone whether decimal data will be serialized as strings or native decimals, using MetadataResp.Schema leads to type mismatches and runtime errors.
Solution
Use MetadataResp.ArrowSchema when available, as it contains the actual runtime type representation that matches the serialized Arrow data. The Arrow schema correctly shows:
- utf8 type when decimals are serialized as strings (DecimalAsArrow=false)
- decimal128 type when decimals are serialized as native decimals (DecimalAsArrow=true)
The implementation prefers ArrowSchema when available and falls back to the traditional Thrift schema parsing for backward compatibility. Decimal types are now treated as Strings by default.
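A minimal sketch of that preference logic; the class/method names and the delegate-based fallback wiring are illustrative only, since the real parsing helpers are driver-internal:

```csharp
using System;
using Apache.Arrow;

internal static class ResponseSchemaSelection
{
    // Prefer the IPC-serialized Arrow schema when the server returned one;
    // otherwise fall back to the legacy Thrift-schema conversion.
    public static Schema Choose(
        byte[] serializedArrowSchema,          // MetadataResp.ArrowSchema bytes, possibly absent
        Func<byte[], Schema> parseArrowSchema, // e.g. ArrowStreamReader-based parsing
        Func<Schema> parseThriftFallback)      // legacy MetadataResp.Schema conversion
    {
        if (serializedArrowSchema != null && serializedArrowSchema.Length > 0)
        {
            return parseArrowSchema(serializedArrowSchema);
        }
        return parseThriftFallback();
    }
}
```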
Testing
This change maintains backward compatibility while providing accurate type information when available, resolving decimal reading issues across different Databricks Runtime versions.