-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] [POC] Flight SQL #9368
[WIP] [POC] Flight SQL #9368
Conversation
Add extensions in the Apache Arrow project’s Arrow Flight modules to provide a standard way for clients and servers to communicate with SQL-like semantics. Do not pull to master. A message to the mailing list will accompany this and another proposal in the coming days for discussion.
Minor updates to fix field numbering and clarify usage in comments.
Update with initial formatting/naming feedback and add ResultsOrder enum.
Alter GetSQLCapabilities to model after Hive Thrift's TCLIService.
Update new message to snake_case.
Update comments to reflect current field names.
- Address code review comments from ryannicholson#2
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
See also: |
Small style nit: Generally camel casing even for acronyms is preferred (i.e. FlightSql.proto, there is probably inconsistency within the code base but I think most new code uses this convention) |
format/FlightSQL.proto
Outdated
/* | ||
* Wrapper for values returned in ActionGetSQLInfoResult. | ||
*/ | ||
message GetSQLInfoValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: preference for Sql camel case.
format/FlightSQL.proto
Outdated
oneof value { | ||
string string_value = 1; | ||
int32 integer_value = 2; | ||
int32 integer_bitmask = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this also modelled after ODBC? is long_bitmask not preferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if these types aren't used initially maybe leave them out for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A long is much more suitable for a bit mask given there are 8 bits available then.
I suggest we revisit what types can be used after we've identified a 'standardized' set of GetSqlInfo properties.
I also think it'd be better to use integral type codes instead of string values for keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per comments elsewhere. I don't think we should be introduce a new type system here. Lets please use the Arrow type system for returning metadata.
format/FlightSQL.proto
Outdated
*/ | ||
message ActionGetTablesRequest { | ||
/* | ||
* Specifies the order of result values with prescendence: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Specifies the order of result values with prescendence: | |
* Specifies the order of result values with precendence: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
precedence (third time's the charm :)
format/FlightSQL.proto
Outdated
*/ | ||
message ActionGetTableTypesResult { | ||
/* | ||
* Indicates the type of the table. E.g. table (regular data table) , view, system table etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Indicates the type of the table. E.g. table (regular data table) , view, system table etc. | |
* Indicates the type of the table. E.g. table (regular data table), view, system table etc. |
@@ -0,0 +1,26 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: be consistent with file naming. Uppercase Flight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally shouldn't Protobuf file names be in snake_case? See https://developers.google.com/protocol-buffers/docs/style#file_structure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess Flight.proto itself is already an exception...we should be consistent with what we have, though, yes (so FlightSQLExample.proto
or preferably FlightSql.proto
and FlightSqlExample.proto
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I this is why I didn't comment on file naming
|
||
import org.apache.arrow.util.AutoCloseables; | ||
|
||
class PreparedStatementContext implements AutoCloseable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of placing this in tests, we should have a separate top level package for a JDBC reference implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this. Let's create a separate module FlightSQLForJDBC or something similar.
* @param scale Scale of the type. | ||
* @return The Arrow equivalent type. | ||
*/ | ||
public static ArrowType getArrowTypeFromJDBCType(int jdbcDataType, int precision, int scale) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this duplicates code in the jdbc to arrow package i believe, can we consolidate it? I also thing there might be the need for custom mappings (we needed it JDBC conversion)
Thank you for the PR. I think since there isn't a fully implementation example yet for the all the features it would make this easier to review if we cut down the FlightSql proto to what is implemented and focus on each use-case in a separate PR. I think it makes sense to handle read, write and metadata operations separately?
IMO, I think we should be be trying to specify more about how client and server interact with each other in terms of error handling and retries, these are important aspects and without it I think some of the utility of standardization will be diminished. |
format/FlightSQL.proto
Outdated
* FLIGHT_SQL_SERVER_VERSION | ||
* | ||
* The Arrow version of the Flight SQL Server. | ||
* FLIGHT_SQL_SERVER_ARROW_VERSION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Library version? Or format version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, RPC protocol version is what matters here from a 'capabilities-discovery' perspective, though the Arrow library release (including the commit hash) would be good as a separate field for debugging purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By 'RPC protocol' do you mean the version of Flight and of FlightSQL? That does make sense.
format/FlightSQL.proto
Outdated
* FLIGHT_SQL_SERVER_READ_ONLY | ||
* | ||
* Indicates whether the Flight SQL Server supports both read and write. | ||
* FLIGHT_SQL_SERVER_READ_WRITE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two seem redundant.
format/FlightSQL.proto
Outdated
int64 record_count = 1; | ||
} | ||
|
||
message ResultsOrder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this message have a field?
Alternatively, if you wanted a top-level enum, you could just have
enum ResultsOrder {
RESULTS_ORDER_UNKNOWN = 0;
// ...
}
I would also suggest prefixing enum variants with the enum name since Protobuf enums aren't namespaced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this concept should be removed entirely - under what circumstances would you want to specify an order? It seems cleaner to specify an order and then allow filtering to specifics.
format/FlightSQL.proto
Outdated
string table_name_filter_pattern = 4; | ||
|
||
// Specifies a filter of table types which must match. | ||
repeated string table_types = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an AND or an OR? Are these filters matched by substring or with the pattern syntax above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above for another proposal, I think this becomes more elegantly expressed with pseudotables, so we aren't trying to create arbitrary SQL semantics through protobuf. @lidavidm thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do worry it may be a bit of an implementation burden, but maybe that's the kind of thing that the standard adapter defined here can handle. And limiting the available SQL features as you do below helps. Many databases do already expose this kind of metadata in special tables, and the client can just do the necessary conversions from Arrow data to structured data.
I guess - how do we expect these metadata values to be used? Are they in turn going to back a JDBC implementation or a SQLAlchemy connector? That might help inform what we need to expose and how. I'd also be in favor of paring this to the bare minimum and exposing additional metadata as the need becomes clear, instead of trying to map all of JDBC or ODBC in one shot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table_types should be OR'd together (same for ODBC/JDBC). Pseudo-tables that require full SQL support may be too difficult for vendors to adopt, but if we permit implementations to support just the filters that are proposed here it reduces the burden a fair bit.
I think just submitting metadata requests via SQL has other challenges from an implementation perspective. It necessitates either having the backend validate parsing of the pseudo table (essentially making it a real table/view) or having the FlightProducer parse the query, identify the pseudo table and its filters, then translating the request to a backend call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just submitting metadata requests via SQL has other challenges from an implementation perspective. It necessitates either having the backend validate parsing of the pseudo table (essentially making it a real table/view) or having the FlightProducer parse the query, identify the pseudo table and its filters, then translating the request to a backend call.
I agree it can be complicated, I think in many cases vendors already support this, and JDBC drivers end up doing this under the hood. I haven't looked deeply but I was hoping this type of parsing would not be hard with something like Calcite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it'd be doable with Calcite. The gotchas to me are:
- Having the FlightSQLProducer parse to find pseudo-tables must be optional.
- We can provide a partial implementation of this that invokes Calcite's parser, but it's the implementer's responsibility to configure Calcite in a way that matches their SQL backend's syntax (eg identifier quote characters, LIKE escape sequences).
- We fallback silently if the table used is a non-Flight table. There should be nothing indicating a possible error since this is the primary use case.
- If the implementer uses the pre-built pseudo-table parser, its filtering and table transformation capabilities are limited to what similar ODBC/JDBC catalog functions can do (eg can't re-order columns, invoke scalar/aggregate functions or arithmetic expressions, pattern value filtering on only certain columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should expand the set of meta-tables to include more than just table discovery. For parity with ODBC/JDBC applications we should support tables that report primary key constraints, foreign key relations, statistics, and table privileges. I would consider this beyond an MVP though.
@@ -0,0 +1,26 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally shouldn't Protobuf file names be in snake_case? See https://developers.google.com/protocol-buffers/docs/style#file_structure
@@ -0,0 +1,26 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess Flight.proto itself is already an exception...we should be consistent with what we have, though, yes (so FlightSQLExample.proto
or preferably FlightSql.proto
and FlightSqlExample.proto
)
format/FlightSQL.proto
Outdated
/* | ||
* Represents an instance of executing a prepared statement. Used in the | ||
* command member of FlightDescriptor for the following RPC calls: | ||
* TODO: Is the idea that a Put with parameter values would execute multiple bound versions of the prepared statement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought: you could use DoExchange to combine the steps of providing parameter values and retrieving the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jduo and I were discussing, and realized that there's no allowance for output or I/O parameters. Do you happen to have any ideas about how we could integrate that into the flow without breaking anything too much? I'd hate to have to use separate calls to get the parameter data back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a stored procedure or something? DoExchange would let you both send and receive data on the same call. You could encode parameters into application metadata. That has the unfortunate requirement that you do separate serialization.
Or, you could have the input/output datatypes be unions of struct types, and encode both the data and the parameters into the union. This has overhead in generating the union indices, though at least you wouldn't be copying the actual data around.
Or, with the Deephaven people on the mailing list, it's been proposed that we allow a stream to update its schema midway. This could be used to send the data, then switch the schema and send the parameters.
I agree we should mostly avoid a separate call, however, this design already assumes statefulness on the part of the server, so that ship has sailed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that if we could change schema we could have this streamed back normally, but I'm not sure how far along that is.
I think, given the current design, I'd propose adding another command that's issued after the GetFlightInfo but before the DoGet to retrieve the parameter data, if any. I think this is the last bit to work out, if we agree on this I'll update the PR with the changes to reflect the current state and then we can update the examples around that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It 'just' needs someone to write it up and propose it. I'll have to check in on the existing thread. I would very much prefer not to require more statefulness if we don't have to, since it'll complicate implementations. (Also, if it's before DoGet, that means the server might have to execute the same query in response to either of two distinct RPC calls?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - although we could also say it's after the DoGet as well if that simplified things. I'll work on updating this exempting the output parameters then so we can get this moving meaningfully again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. There's been some more progress on the schema evolution proposal so please feel free to chime in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the initial approach should be to exclude output parameters and then rev the FlightSQL version afterwards to include them once the schema change proposal has been finalized, so we can get this moving forward. As mentioned in the weekly status call, there is a JDBC implementation being built to take advantage of this so we want to ensure everything moves forward together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry if I wasn't clear - yes, that sounds good, we can add output parameters hopefully soon afterward.
*/ | ||
@Override | ||
public FlightInfo getFlightInfo(CallContext context, FlightDescriptor descriptor) { | ||
final Any command = FlightSQLUtils.parseOrThrow(descriptor.getCommand()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of Any.pack above, what if we define an explicit top-level message that has the possible messages in a oneof?
private static final boolean IS_SIGNED_FALSE = false; | ||
private static final boolean IS_SIGNED_TRUE = true; | ||
|
||
public static final ActionType FLIGHT_SQL_GETSQLINFO = new ActionType("GetSQLINFO", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: GetSqlInfo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should drop the SQL portion here and just have GetInfo
.
Thinking about this a bit more I think an alternative to a lot of the metadata querying is to define a minimal sql table schema and querying capabilities for it. I'll expand more in a little bit |
format/FlightSQL.proto
Outdated
/* | ||
* Wrapper for the result of a "GetSQLInfo" action. | ||
*/ | ||
message ActionGetSQLInfoResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
message ActionGetSQLInfoResult { | |
message ActionGetSqlInfoResult { |
format/FlightSQL.proto
Outdated
* 1. Server Information: Provides basic information about the Flight SQL Server. | ||
* | ||
* The name of the Flight SQL Server. | ||
* FLIGHT_SQL_SERVER_NAME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume these are intended to be keys in the maps below? For some of them should they be promoted to first class concepts?
Per my comment above. I think I would have a preference for moving as much as possible to sql queries. so if we defined a special table like
__APACHE_FLIGHT_SQL_INFO__
with a schema
<
flight_sql_server_name: utf8
flight_sql_server_version: utf8
flight_sql_server_arrow_version: utf8 (or maybe modeled as a struct)
flight_sql_info: map<utf8, dense_union<string_value: string, int_value: int32, bigint_value: int64, int32_bitmask: int32 >>
>
where the int32 bitmask would also have a defined extension metadata for a custom type of bitmask.
The minimal table requirement is that it supports projection:
i.e. only select * from __APACHE_FLIGHT_SQL_INFO__
or select FLIGHT_SQL_VERSION from __APACHE_FLIGHT_SQL_INFO__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using pseudo tables is an interesting concept, with some interesting pros/cons worth discussing to decide on an approach.
Using pseudo tables allows for:
- a simple proto definition
- more expressiveness in terms of what can be done with the table, since it is then available for general SQL
- allowing later additions without needing to modify any proto
Drawbacks are mostly around two cases: those where metadata is not retrieved via a query mechanism and/or where it's more efficient not to use a query mechanism, and where you need to layer an additional parser in front of the actual database parser to support the pseudo. If we add a partial parser implementation to make it easier, we are having to implement this in all the languages that Flight SQL supports.
If we use the existing proposed proto definitions, then I see the following:
- less expressive in terms of what can be done
- less flexibility in terms of adding new metadata calls
- much less effort in that we don't need to provide a partial parser implementation to help
- It's easy to go from a message to a SQL query if that's what's used for metadata, it's hard to go the other way around
I think it would be easier all around to use what's defined here (with tweaks) vs. going the SQL table route, even though the SQL table route is more elegant from a protocol perspective. The effort to provide the hand-up for those data sources which don't have any support for querying metadata would be large, and may be a hinderance to adoption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll chime in and agree that while the pseudo tables are more elegant, it would make it drastically harder to support 'wrapping' existing databases without having to either modify the database implementation or pull in a SQL parser. Also, it sidesteps any concern or potential incompatibility about what subset of SQL is exactly supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of standardizing on information_schema proposed elsewhere, it seems a lot of DBs already support this, IIUC it is part of the SQL standard. For those that don't support, I agree it might be a high barrier. I'm a little concerned on cargo culting on one API standard or another which seems one level removed from the data we are trying to get at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this particular feature, I'm not sure if information schema would supports it let me take a look more closely at this API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that, I'd agree with @lidavidm that we should probably go with the initially proposed approach vs. pseudo tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sounds good. separately if I can find some time (probably unlikely), I'd like to see how hard it would be to make pseudo-tables work with Calcite and maybe ZetaSQL for c++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to return data as an Arrow record batch (I agree this makes sense) we should probably make all this data fetched through DoGet or DoExchange so that we're not specifying a separate encoding of Arrow data into Flight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also specify the expected schema inline here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use the existing DoGet mechanism.
format/FlightSQL.proto
Outdated
* Flight SQL enabled backend. | ||
* Requests a list of catalogs available in the server. | ||
*/ | ||
message ActionGetCatalogsRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the SQL approach define a table like:
__APACHE_ARROW_FLIGHT_CATALOGS
minimal schema:
<catalog_name: utf8>
Supports project and optionally sorting on catalog_name
format/FlightSQL.proto
Outdated
* - "%" means to match any substring with 0 or more characters. | ||
* - "_" means to match any one character. | ||
*/ | ||
string schema_filter_pattern = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL first approach:
__APACHE_ARROW_FLIGHT_SCHEMAS
:
catalog: utf8
schema: utf8
Minimal SQL supported. Projection, Ordering (might be limited to required catalog and schema) and LIKE or equality clauses on schema field.
i.e.
select catalog, schema from __APACHE_ARROW_FLIGHT_SCHEMAS
order by catalog, schema descending
where schema LIKE 'A%'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ODBC and JDBC define orderings for result sets like these, and I think it would likely be better to copy that rather than allow arbitrary ordering. I was trying to determine a use-case for when you'd want arbitrary ordering and wasn't able to think of a reasonable one for the additional effort required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield - rather than define new pseudo tables, would it be better to align to what's already in the INFORMATION_SCHEMA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, if we can just use information_schema that seems reasonable to me.
format/FlightSQL.proto
Outdated
* Flight SQL enabled backend. | ||
* Requests a list of tables available in the server. | ||
*/ | ||
message ActionGetTablesRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL first approach:
__APACHE_ARROW_FLIGHT_TABLES
<
catalog: utf8
schema: utf8
table_type: utf8 # Implementations are recommended to dictionary encode this field
table: utf8
serialized_flatbuffer_table_schema: bytes
>
Table supports equality and like predicates on schema and table. Queries must specify catalog with an equality predicate. Support projection and ordering. Ordering might be limited to requiring field catalog, schema, table_type, table in that order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many cases retrieval of the schema of a table is non-trivial, but BI tools often list many tables before getting detailed metadata about one or two of the tables. This is the reason for the breakdown in ODBC/JDBC of getTables vs. getColumns. I think we should break the schemas for a table out to a separate pseudo table for efficiency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would using information_schema combine the two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - it has separate tables: TABLES and COLUMNS IIRC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be CommandGetTables
now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - oversight.
I did a rough sketch on how the metadata layer could be modeled as pseudo-tables as comments on the proto. I think this addresses some uneasiner I had with the existing model:
This comes at slightly more complex for server implementations. And potential confusion on limitations for specific tables. This might be limited to some extend if we had a different GetMetadata action so implementations know to expect a query to specific pseudo-table. I expect we would provide library wrappers over JDBC to also ease implementation. |
We don't have a GetMetadata action, but we could encode it in GetFlightInfo by sending a different command payload. That would let us cleanly namespace the special tables and the actual tables. |
@kylepbit yes, agreed. Thanks for the updates! I'll try and remember to approve/kick off the CI when updated but also feel free to ping me to do so. I'll note that Micah left one comment way at the top about naming the file |
Specify a prefix to identify the custom metadata.
@lidavidm - thanks for the reminder, I'll do that since I've already fixed the casing everywhere else. |
format/FlightSQL.proto
Outdated
* The returned schema will be: | ||
* < | ||
* info_name: utf8, | ||
* value: dense_union<string_value: string, int_value: int32, bigint_value: int64, int32_bitmask: int32> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidavidm - how do you actually create a FieldType for a dense union like this? I have the following:
new ArrowType.Union(UnionMode.Dense, new int[2] {ArrowTypeID.Utf8, ArrowTypeID.Int}), /*dictionary=*/null)
, although I could not find an ArrowTypeID for a bigint, and it seems repetitive to have the int_value and int32_bitmask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly revised, but I think still incorrect: new ArrowType.Union(UnionMode.Dense, new int[] {Type.Utf8, Type.Int})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The typeIds aren't actually Flatbuffer type IDs - they're user-chosen sentinel values used to distinguish the cases on the wire; they're the tag of a tagged union. So they can be whatever. The actual types of the children are defined by the child fields, like so:
arrow/java/vector/src/test/java/org/apache/arrow/vector/TestDenseUnionVector.java
Lines 328 to 343 in 41c4143
@Test | |
public void testGetFieldTypeInfo() throws Exception { | |
Map<String, String> metadata = new HashMap<>(); | |
metadata.put("key1", "value1"); | |
int[] typeIds = new int[2]; | |
typeIds[0] = 0; | |
typeIds[1] = 1; | |
List<Field> children = new ArrayList<>(); | |
children.add(new Field("int", FieldType.nullable(MinorType.INT.getType()), null)); | |
children.add(new Field("varchar", FieldType.nullable(MinorType.VARCHAR.getType()), null)); | |
final FieldType fieldType = new FieldType(false, new ArrowType.Union(UnionMode.Dense, typeIds), | |
/*dictionary=*/null, metadata); | |
final Field field = new Field("union", fieldType, children); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to grok unions initially. The docs have a page explaining the layout: https://arrow.apache.org/docs/format/Columnar.html#dense-union
It might be clearer to think of type id as type discriminant in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I think I have it now.
Rename files for SQL -> Sql. Correct compilation errors in client code, but design needs to be updated. Tests do not yet compile.
Correct some additional SQL -> Sql file renames. Reduce the test compilation problems (still more to do).
I think it's OK to have one PR with the format changes and a working client/server and test (that's how Flight itself was originally added, looking at it: db0ef22). Then we can put it up for a vote. Having the implementation will be useful so people voting can evaluate it as well. |
*/ | ||
public FlightInfo executeQuery() throws IOException { | ||
if (isClosed) { | ||
throw new IOException("Prepared statement has already been closed on the server."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This isn't an IO error and is more of an IllegalStateException.
} else if (command.is(CommandGetSqlInfo.class)) { | ||
getStreamSqlInfo(FlightSqlUtils.unpackOrThrow(command, CommandGetSqlInfo.class), context, ticket, listener); | ||
} else { | ||
throw Status.INVALID_ARGUMENT.asRuntimeException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to get set on the listener (the server can log this error though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All calls to unpackOrThrow need to be in a try/catch that writes the result to the listener, the client will hang.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the caller, and this is set on the listener as the caller wraps this already. I think it would be redundant to do so again.
context, flightStream, ackStream); | ||
} | ||
|
||
throw Status.INVALID_ARGUMENT.asRuntimeException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The listener needs the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the caller, and this is set on the listener as the caller wraps this already. I think it would be redundant to do so again.
*/ | ||
public final class FlightSqlUtils { | ||
|
||
private static final int BIT_WIDTH8 = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be BIT_WIDTH_8
* @param scale Scale of the type. | ||
* @return The Arrow equivalent type. | ||
*/ | ||
public static ArrowType getArrowTypeFromJDBCType(int jdbcDataType, int precision, int scale) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method leaks JDBC into Arrow. I believe an earlier comment mentioned there's a JdbcToArrow converter component that had a similar method that we can use instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it's against old code - this is now within the example code only.
But, I see the JdbcToArrowConfig
which looks like it does what we want if we can leverage it.
case Types.DATE: | ||
return new ArrowType.Date(DateUnit.DAY); | ||
case Types.TIME: | ||
return new ArrowType.Time(TimeUnit.MILLISECOND, BIT_WIDTH_32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like these should be based on the number of fractional seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably to be ignored if we're using the arrow-jdbc
package.
case Types.OTHER: | ||
case Types.JAVA_OBJECT: | ||
case Types.DISTINCT: | ||
case Types.STRUCT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be an Arrow struct.
case Types.ROWID: | ||
case Types.SQLXML: | ||
case Types.REF_CURSOR: | ||
case Types.TIME_WITH_TIMEZONE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be an Arrow TimetampTz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem to exist?
* @param source the raw bytes source value. | ||
* @return the materialized protobuf object. | ||
*/ | ||
public static Any parseOrThrow(byte[] source) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these helpers should take in optional ServerStreamListeners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, taking into account the other comments about the callers already handling this.
final FlightInfo info = client.getInfo(descriptor); | ||
assertEquals(SCHEMA_INT_TABLE, info.getSchema()); | ||
|
||
final FlightStream stream = client.getStream(info.getEndpoints().get(0).getTicket()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try-with-resources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix throughout.
assertEquals(Arrays.asList("one", "zero", "negative one"), actualStringResults); | ||
assertEquals(Arrays.asList(1, 0, -1), actualIntResults); | ||
|
||
AutoCloseables.close(preparedStatement); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try-with-resources instead of explicit close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was done intentionally to allow for testing.
listener.putNext(); | ||
} | ||
} | ||
} catch (ExecutionException | SQLException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to catch RuntimeExceptions too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to be Throwable.
Note - FlightSqlExample is not functional and needs to be updated.
Just FYI: most CI pipelines will skip so long as you have the WIP in the PR title, feel free to remove that (keeping it in Draft if desired) when you're ready for CI to run. |
@lidavidm - just as an update, work is progressing here on a separate branch that'll be merged in. I think we'll ultimately end up with a separate PR as Tiffany has left, so we can't alter this. Will put a link to the new PR once created here for tracing. |
@kylepbit Sounds good, thanks for the update. |
Closing in favor of #10906 |
Introduction
This proof-of-concept in progress illustrates how Flight SQL can be implemented. This follows designs proposed in the Flight SQL proposal socialized last year. You can find the original proposal here. Since this is still a proof-of-concept, design feedback and suggestions are very much appreciated.
An Overview of this PR
A Walkthrough of the New Module
As a proof-of-concept, this isn't complete yet. There are a few TODO items: