Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] [POC] Flight SQL #9368

Closed
wants to merge 16 commits into from
Closed

Conversation

tifflhl
Copy link
Contributor

@tifflhl tifflhl commented Jan 30, 2021

Introduction
This proof-of-concept in progress illustrates how Flight SQL can be implemented. This follows designs proposed in the Flight SQL proposal socialized last year. You can find the original proposal here. Since this is still a proof-of-concept, design feedback and suggestions are very much appreciated.

An Overview of this PR

  • This PR adds a new module within Flight called flight-sql.
  • flight-sql is a new Flight API that provides a standard way for clients and servers to communicate with SQL-like semantics.
  • Like other Flight APIs, flight-sql does not provide implementation details that dictate how a client and server communicates with each other, it simply provides the SQL semantics and apply them onto the Flight API.

A Walkthrough of the New Module

  • FlightSQL.proto introduces new SQL protobuf objects.
  • FlightSQLProducer introduces a new FlightProducer API that adapts classic Flight requests into SQL operations.
  • FlighSQLExample is a sample FlightSQL server implementation.

As a proof-of-concept, this isn't complete yet. There are a few TODO items:

  • FlightSQLExample is not completed yet.
  • As of now, all flight-sql client related operations are I FlightSQLClientUtils. I would like to break this utility class apart and offer an API for a FlightSQLClient add-on.

Ryan Nicholson and others added 9 commits January 29, 2021 16:59
Add extensions in the Apache Arrow project’s Arrow Flight modules
to provide a standard way for clients and servers to communicate
with SQL-like semantics.

Do not pull to master. A message to the mailing list will accompany this
and another proposal in the coming days for discussion.
Minor updates to fix field numbering and clarify usage in comments.
Update with initial formatting/naming feedback and
add ResultsOrder enum.
Alter GetSQLCapabilities to model after Hive Thrift's
TCLIService.
Update new message to snake_case.
Update comments to reflect current field names.
- Address code review comments from ryannicholson#2
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@tifflhl tifflhl marked this pull request as draft January 30, 2021 02:19
@emkornfield
Copy link
Contributor

emkornfield commented Jan 30, 2021

Small style nit: Generally camel casing even for acronyms is preferred (i.e. FlightSql.proto, there is probably inconsistency within the code base but I think most new code uses this convention)

/*
* Wrapper for values returned in ActionGetSQLInfoResult.
*/
message GetSQLInfoValue {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: preference for Sql camel case.

oneof value {
string string_value = 1;
int32 integer_value = 2;
int32 integer_bitmask = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also modelled after ODBC? is long_bitmask not preferred?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if these types aren't used initially maybe leave them out for now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A long is much more suitable for a bit mask given there are 8 bits available then.
I suggest we revisit what types can be used after we've identified a 'standardized' set of GetSqlInfo properties.

I also think it'd be better to use integral type codes instead of string values for keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per comments elsewhere. I don't think we should be introduce a new type system here. Lets please use the Arrow type system for returning metadata.

*/
message ActionGetTablesRequest {
/*
* Specifies the order of result values with prescendence:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Specifies the order of result values with prescendence:
* Specifies the order of result values with precendence:

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precedence (third time's the charm :)

*/
message ActionGetTableTypesResult {
/*
* Indicates the type of the table. E.g. table (regular data table) , view, system table etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Indicates the type of the table. E.g. table (regular data table) , view, system table etc.
* Indicates the type of the table. E.g. table (regular data table), view, system table etc.

@@ -0,0 +1,26 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: be consistent with file naming. Uppercase Flight

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally shouldn't Protobuf file names be in snake_case? See https://developers.google.com/protocol-buffers/docs/style#file_structure

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess Flight.proto itself is already an exception...we should be consistent with what we have, though, yes (so FlightSQLExample.proto or preferably FlightSql.proto and FlightSqlExample.proto)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I this is why I didn't comment on file naming


import org.apache.arrow.util.AutoCloseables;

class PreparedStatementContext implements AutoCloseable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc please.

Copy link
Contributor

@emkornfield emkornfield Jan 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of placing this in tests, we should have a separate top level package for a JDBC reference implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. Let's create a separate module FlightSQLForJDBC or something similar.

* @param scale Scale of the type.
* @return The Arrow equivalent type.
*/
public static ArrowType getArrowTypeFromJDBCType(int jdbcDataType, int precision, int scale) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this duplicates code in the jdbc to arrow package i believe, can we consolidate it? I also thing there might be the need for custom mappings (we needed it JDBC conversion)

@emkornfield
Copy link
Contributor

Thank you for the PR. I think since there isn't a fully implementation example yet for the all the features it would make this easier to review if we cut down the FlightSql proto to what is implemented and focus on each use-case in a separate PR. I think it makes sense to handle read, write and metadata operations separately?

Like other Flight APIs, flight-sql does not provide implementation details that dictate how a client and server communicates with each other, it simply provides the SQL semantics and apply them onto the Flight API.

IMO, I think we should be be trying to specify more about how client and server interact with each other in terms of error handling and retries, these are important aspects and without it I think some of the utility of standardization will be diminished.

* FLIGHT_SQL_SERVER_VERSION
*
* The Arrow version of the Flight SQL Server.
* FLIGHT_SQL_SERVER_ARROW_VERSION
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Library version? Or format version?

Copy link
Member

@jduo jduo May 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, RPC protocol version is what matters here from a 'capabilities-discovery' perspective, though the Arrow library release (including the commit hash) would be good as a separate field for debugging purposes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By 'RPC protocol' do you mean the version of Flight and of FlightSQL? That does make sense.

* FLIGHT_SQL_SERVER_READ_ONLY
*
* Indicates whether the Flight SQL Server supports both read and write.
* FLIGHT_SQL_SERVER_READ_WRITE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two seem redundant.

int64 record_count = 1;
}

message ResultsOrder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this message have a field?

Alternatively, if you wanted a top-level enum, you could just have

enum ResultsOrder {
    RESULTS_ORDER_UNKNOWN = 0;
    // ...
}

I would also suggest prefixing enum variants with the enum name since Protobuf enums aren't namespaced.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this concept should be removed entirely - under what circumstances would you want to specify an order? It seems cleaner to specify an order and then allow filtering to specifics.

string table_name_filter_pattern = 4;

// Specifies a filter of table types which must match.
repeated string table_types = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an AND or an OR? Are these filters matched by substring or with the pattern syntax above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above for another proposal, I think this becomes more elegantly expressed with pseudotables, so we aren't trying to create arbitrary SQL semantics through protobuf. @lidavidm thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do worry it may be a bit of an implementation burden, but maybe that's the kind of thing that the standard adapter defined here can handle. And limiting the available SQL features as you do below helps. Many databases do already expose this kind of metadata in special tables, and the client can just do the necessary conversions from Arrow data to structured data.

I guess - how do we expect these metadata values to be used? Are they in turn going to back a JDBC implementation or a SQLAlchemy connector? That might help inform what we need to expose and how. I'd also be in favor of paring this to the bare minimum and exposing additional metadata as the need becomes clear, instead of trying to map all of JDBC or ODBC in one shot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_types should be OR'd together (same for ODBC/JDBC). Pseudo-tables that require full SQL support may be too difficult for vendors to adopt, but if we permit implementations to support just the filters that are proposed here it reduces the burden a fair bit.

I think just submitting metadata requests via SQL has other challenges from an implementation perspective. It necessitates either having the backend validate parsing of the pseudo table (essentially making it a real table/view) or having the FlightProducer parse the query, identify the pseudo table and its filters, then translating the request to a backend call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just submitting metadata requests via SQL has other challenges from an implementation perspective. It necessitates either having the backend validate parsing of the pseudo table (essentially making it a real table/view) or having the FlightProducer parse the query, identify the pseudo table and its filters, then translating the request to a backend call.

I agree it can be complicated, I think in many cases vendors already support this, and JDBC drivers end up doing this under the hood. I haven't looked deeply but I was hoping this type of parsing would not be hard with something like Calcite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd be doable with Calcite. The gotchas to me are:

  • Having the FlightSQLProducer parse to find pseudo-tables must be optional.
  • We can provide a partial implementation of this that invokes Calcite's parser, but it's the implementer's responsibility to configure Calcite in a way that matches their SQL backend's syntax (eg identifier quote characters, LIKE escape sequences).
  • We fallback silently if the table used is a non-Flight table. There should be nothing indicating a possible error since this is the primary use case.
  • If the implementer uses the pre-built pseudo-table parser, its filtering and table transformation capabilities are limited to what similar ODBC/JDBC catalog functions can do (eg can't re-order columns, invoke scalar/aggregate functions or arithmetic expressions, pattern value filtering on only certain columns)

Copy link
Member

@jduo jduo May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should expand the set of meta-tables to include more than just table discovery. For parity with ODBC/JDBC applications we should support tables that report primary key constraints, foreign key relations, statistics, and table privileges. I would consider this beyond an MVP though.

@@ -0,0 +1,26 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally shouldn't Protobuf file names be in snake_case? See https://developers.google.com/protocol-buffers/docs/style#file_structure

@@ -0,0 +1,26 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess Flight.proto itself is already an exception...we should be consistent with what we have, though, yes (so FlightSQLExample.proto or preferably FlightSql.proto and FlightSqlExample.proto)

/*
* Represents an instance of executing a prepared statement. Used in the
* command member of FlightDescriptor for the following RPC calls:
* TODO: Is the idea that a Put with parameter values would execute multiple bound versions of the prepared statement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought: you could use DoExchange to combine the steps of providing parameter values and retrieving the results.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jduo and I were discussing, and realized that there's no allowance for output or I/O parameters. Do you happen to have any ideas about how we could integrate that into the flow without breaking anything too much? I'd hate to have to use separate calls to get the parameter data back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a stored procedure or something? DoExchange would let you both send and receive data on the same call. You could encode parameters into application metadata. That has the unfortunate requirement that you do separate serialization.

Or, you could have the input/output datatypes be unions of struct types, and encode both the data and the parameters into the union. This has overhead in generating the union indices, though at least you wouldn't be copying the actual data around.

Or, with the Deephaven people on the mailing list, it's been proposed that we allow a stream to update its schema midway. This could be used to send the data, then switch the schema and send the parameters.

I agree we should mostly avoid a separate call, however, this design already assumes statefulness on the part of the server, so that ship has sailed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that if we could change schema we could have this streamed back normally, but I'm not sure how far along that is.

I think, given the current design, I'd propose adding another command that's issued after the GetFlightInfo but before the DoGet to retrieve the parameter data, if any. I think this is the last bit to work out, if we agree on this I'll update the PR with the changes to reflect the current state and then we can update the examples around that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It 'just' needs someone to write it up and propose it. I'll have to check in on the existing thread. I would very much prefer not to require more statefulness if we don't have to, since it'll complicate implementations. (Also, if it's before DoGet, that means the server might have to execute the same query in response to either of two distinct RPC calls?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - although we could also say it's after the DoGet as well if that simplified things. I'll work on updating this exempting the output parameters then so we can get this moving meaningfully again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. There's been some more progress on the schema evolution proposal so please feel free to chime in there.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the initial approach should be to exclude output parameters and then rev the FlightSQL version afterwards to include them once the schema change proposal has been finalized, so we can get this moving forward. As mentioned in the weekly status call, there is a JDBC implementation being built to take advantage of this so we want to ensure everything moves forward together.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry if I wasn't clear - yes, that sounds good, we can add output parameters hopefully soon afterward.

*/
@Override
public FlightInfo getFlightInfo(CallContext context, FlightDescriptor descriptor) {
final Any command = FlightSQLUtils.parseOrThrow(descriptor.getCommand());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of Any.pack above, what if we define an explicit top-level message that has the possible messages in a oneof?

private static final boolean IS_SIGNED_FALSE = false;
private static final boolean IS_SIGNED_TRUE = true;

public static final ActionType FLIGHT_SQL_GETSQLINFO = new ActionType("GetSQLINFO",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: GetSqlInfo?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should drop the SQL portion here and just have GetInfo.

@emkornfield
Copy link
Contributor

Thinking about this a bit more I think an alternative to a lot of the metadata querying is to define a minimal sql table schema and querying capabilities for it. I'll expand more in a little bit

/*
* Wrapper for the result of a "GetSQLInfo" action.
*/
message ActionGetSQLInfoResult {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
message ActionGetSQLInfoResult {
message ActionGetSqlInfoResult {

* 1. Server Information: Provides basic information about the Flight SQL Server.
*
* The name of the Flight SQL Server.
* FLIGHT_SQL_SERVER_NAME
Copy link
Contributor

@emkornfield emkornfield Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume these are intended to be keys in the maps below? For some of them should they be promoted to first class concepts?

Per my comment above. I think I would have a preference for moving as much as possible to sql queries. so if we defined a special table like

__APACHE_FLIGHT_SQL_INFO__ with a schema

<
   flight_sql_server_name: utf8
   flight_sql_server_version: utf8 
   flight_sql_server_arrow_version: utf8 (or maybe modeled as a struct)
   flight_sql_info: map<utf8, dense_union<string_value: string, int_value: int32, bigint_value: int64, int32_bitmask: int32 >>
>

where the int32 bitmask would also have a defined extension metadata for a custom type of bitmask.

The minimal table requirement is that it supports projection:
i.e. only select * from __APACHE_FLIGHT_SQL_INFO__ or select FLIGHT_SQL_VERSION from __APACHE_FLIGHT_SQL_INFO__

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using pseudo tables is an interesting concept, with some interesting pros/cons worth discussing to decide on an approach.

Using pseudo tables allows for:

  • a simple proto definition
  • more expressiveness in terms of what can be done with the table, since it is then available for general SQL
  • allowing later additions without needing to modify any proto

Drawbacks are mostly around two cases: those where metadata is not retrieved via a query mechanism and/or where it's more efficient not to use a query mechanism, and where you need to layer an additional parser in front of the actual database parser to support the pseudo. If we add a partial parser implementation to make it easier, we are having to implement this in all the languages that Flight SQL supports.

If we use the existing proposed proto definitions, then I see the following:

  • less expressive in terms of what can be done
  • less flexibility in terms of adding new metadata calls
  • much less effort in that we don't need to provide a partial parser implementation to help
  • It's easy to go from a message to a SQL query if that's what's used for metadata, it's hard to go the other way around

I think it would be easier all around to use what's defined here (with tweaks) vs. going the SQL table route, even though the SQL table route is more elegant from a protocol perspective. The effort to provide the hand-up for those data sources which don't have any support for querying metadata would be large, and may be a hinderance to adoption.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll chime in and agree that while the pseudo tables are more elegant, it would make it drastically harder to support 'wrapping' existing databases without having to either modify the database implementation or pull in a SQL parser. Also, it sidesteps any concern or potential incompatibility about what subset of SQL is exactly supported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of standardizing on information_schema proposed elsewhere, it seems a lot of DBs already support this, IIUC it is part of the SQL standard. For those that don't support, I agree it might be a high barrier. I'm a little concerned on cargo culting on one API standard or another which seems one level removed from the data we are trying to get at.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this particular feature, I'm not sure if information schema would supports it let me take a look more closely at this API.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that, I'd agree with @lidavidm that we should probably go with the initially proposed approach vs. pseudo tables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good. separately if I can find some time (probably unlikely), I'd like to see how hard it would be to make pseudo-tables work with Calcite and maybe ZetaSQL for c++.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to return data as an Arrow record batch (I agree this makes sense) we should probably make all this data fetched through DoGet or DoExchange so that we're not specifying a separate encoding of Arrow data into Flight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also specify the expected schema inline here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use the existing DoGet mechanism.

* Flight SQL enabled backend.
* Requests a list of catalogs available in the server.
*/
message ActionGetCatalogsRequest {
Copy link
Contributor

@emkornfield emkornfield Feb 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the SQL approach define a table like:
__APACHE_ARROW_FLIGHT_CATALOGS minimal schema:

<catalog_name: utf8>

Supports project and optionally sorting on catalog_name

* - "%" means to match any substring with 0 or more characters.
* - "_" means to match any one character.
*/
string schema_filter_pattern = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL first approach:
__APACHE_ARROW_FLIGHT_SCHEMAS:

catalog: utf8
schema: utf8

Minimal SQL supported. Projection, Ordering (might be limited to required catalog and schema) and LIKE or equality clauses on schema field.
i.e.

select catalog, schema from __APACHE_ARROW_FLIGHT_SCHEMAS
order by catalog, schema descending
where schema LIKE 'A%'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ODBC and JDBC define orderings for result sets like these, and I think it would likely be better to copy that rather than allow arbitrary ordering. I was trying to determine a use-case for when you'd want arbitrary ordering and wasn't able to think of a reasonable one for the additional effort required.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield - rather than define new pseudo tables, would it be better to align to what's already in the INFORMATION_SCHEMA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if we can just use information_schema that seems reasonable to me.

* Flight SQL enabled backend.
* Requests a list of tables available in the server.
*/
message ActionGetTablesRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL first approach:
__APACHE_ARROW_FLIGHT_TABLES

<
   catalog: utf8
   schema: utf8
   table_type: utf8 # Implementations are recommended to dictionary encode this field
   table: utf8
   serialized_flatbuffer_table_schema: bytes
>

Table supports equality and like predicates on schema and table. Queries must specify catalog with an equality predicate. Support projection and ordering. Ordering might be limited to requiring field catalog, schema, table_type, table in that order.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many cases retrieval of the schema of a table is non-trivial, but BI tools often list many tables before getting detailed metadata about one or two of the tables. This is the reason for the breakdown in ODBC/JDBC of getTables vs. getColumns. I think we should break the schemas for a table out to a separate pseudo table for efficiency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would using information_schema combine the two?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - it has separate tables: TABLES and COLUMNS IIRC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be CommandGetTables now?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - oversight.

@emkornfield
Copy link
Contributor

emkornfield commented Feb 9, 2021

Thinking about this a bit more I think an alternative to a lot of the metadata querying is to define a minimal sql table schema and querying capabilities for it. I'll expand more in a little bit

I did a rough sketch on how the metadata layer could be modeled as pseudo-tables as comments on the proto. I think this addresses some uneasiner I had with the existing model:

  1. It uses one consistent data model for tabular data (the arrow model).
  2. It makes query semantics clearer instead of trying to model them in an ad-hoc manner through protobufs

This comes at slightly more complex for server implementations. And potential confusion on limitations for specific tables. This might be limited to some extend if we had a different GetMetadata action so implementations know to expect a query to specific pseudo-table. I expect we would provide library wrappers over JDBC to also ease implementation.

@lidavidm
Copy link
Member

lidavidm commented Feb 9, 2021

This comes at slightly more complex for server implementations. And potential confusion on limitations for specific tables. This might be limited to some extend if we had a different GetMetadata action so implementations know to expect a query to specific pseudo-table. I expect we would provide library wrappers over JDBC to also ease implementation.

We don't have a GetMetadata action, but we could encode it in GetFlightInfo by sending a different command payload. That would let us cleanly namespace the special tables and the actual tables.

@lidavidm
Copy link
Member

@kylepbit yes, agreed. Thanks for the updates!

I'll try and remember to approve/kick off the CI when updated but also feel free to ping me to do so.

I'll note that Micah left one comment way at the top about naming the file FlightSql.proto instead of FlightSQL.proto for consistency (I think doing that rename might lose the comments so now is probably an OK time to do it).

Specify a prefix to identify the custom metadata.
@kylepbit
Copy link

@lidavidm - thanks for the reminder, I'll do that since I've already fixed the casing everywhere else.

* The returned schema will be:
* <
* info_name: utf8,
* value: dense_union<string_value: string, int_value: int32, bigint_value: int64, int32_bitmask: int32>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm - how do you actually create a FieldType for a dense union like this? I have the following:

new ArrowType.Union(UnionMode.Dense, new int[2] {ArrowTypeID.Utf8, ArrowTypeID.Int}), /*dictionary=*/null), although I could not find an ArrowTypeID for a bigint, and it seems repetitive to have the int_value and int32_bitmask.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly revised, but I think still incorrect: new ArrowType.Union(UnionMode.Dense, new int[] {Type.Utf8, Type.Int})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The typeIds aren't actually Flatbuffer type IDs - they're user-chosen sentinel values used to distinguish the cases on the wire; they're the tag of a tagged union. So they can be whatever. The actual types of the children are defined by the child fields, like so:

@Test
public void testGetFieldTypeInfo() throws Exception {
Map<String, String> metadata = new HashMap<>();
metadata.put("key1", "value1");
int[] typeIds = new int[2];
typeIds[0] = 0;
typeIds[1] = 1;
List<Field> children = new ArrayList<>();
children.add(new Field("int", FieldType.nullable(MinorType.INT.getType()), null));
children.add(new Field("varchar", FieldType.nullable(MinorType.VARCHAR.getType()), null));
final FieldType fieldType = new FieldType(false, new ArrowType.Union(UnionMode.Dense, typeIds),
/*dictionary=*/null, metadata);
final Field field = new Field("union", fieldType, children);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to grok unions initially. The docs have a page explaining the layout: https://arrow.apache.org/docs/format/Columnar.html#dense-union

It might be clearer to think of type id as type discriminant in this case.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I think I have it now.

Rename files for SQL -> Sql.
Correct compilation errors in client code, but design needs to be updated.

Tests do not yet compile.
Correct some additional SQL -> Sql file renames.
Reduce the test compilation problems (still more to do).
@kylepbit
Copy link

kylepbit commented Jul 6, 2021

@lidavidm @wesm - how should we try to move this towards getting merged - do you want one PR (this one) where the Java client/server are there, and the test has a working example, or should we try and split it up?

@lidavidm
Copy link
Member

lidavidm commented Jul 6, 2021

I think it's OK to have one PR with the format changes and a working client/server and test (that's how Flight itself was originally added, looking at it: db0ef22). Then we can put it up for a vote. Having the implementation will be useful so people voting can evaluate it as well.

*/
public FlightInfo executeQuery() throws IOException {
if (isClosed) {
throw new IOException("Prepared statement has already been closed on the server.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This isn't an IO error and is more of an IllegalStateException.

} else if (command.is(CommandGetSqlInfo.class)) {
getStreamSqlInfo(FlightSqlUtils.unpackOrThrow(command, CommandGetSqlInfo.class), context, ticket, listener);
} else {
throw Status.INVALID_ARGUMENT.asRuntimeException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to get set on the listener (the server can log this error though).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All calls to unpackOrThrow need to be in a try/catch that writes the result to the listener, the client will hang.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the caller, and this is set on the listener as the caller wraps this already. I think it would be redundant to do so again.

context, flightStream, ackStream);
}

throw Status.INVALID_ARGUMENT.asRuntimeException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The listener needs the error.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the caller, and this is set on the listener as the caller wraps this already. I think it would be redundant to do so again.

*/
public final class FlightSqlUtils {

private static final int BIT_WIDTH8 = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be BIT_WIDTH_8

* @param scale Scale of the type.
* @return The Arrow equivalent type.
*/
public static ArrowType getArrowTypeFromJDBCType(int jdbcDataType, int precision, int scale) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method leaks JDBC into Arrow. I believe an earlier comment mentioned there's a JdbcToArrow converter component that had a similar method that we can use instead.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's against old code - this is now within the example code only.
But, I see the JdbcToArrowConfig which looks like it does what we want if we can leverage it.

case Types.DATE:
return new ArrowType.Date(DateUnit.DAY);
case Types.TIME:
return new ArrowType.Time(TimeUnit.MILLISECOND, BIT_WIDTH_32);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like these should be based on the number of fractional seconds.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably to be ignored if we're using the arrow-jdbc package.

case Types.OTHER:
case Types.JAVA_OBJECT:
case Types.DISTINCT:
case Types.STRUCT:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an Arrow struct.

case Types.ROWID:
case Types.SQLXML:
case Types.REF_CURSOR:
case Types.TIME_WITH_TIMEZONE:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an Arrow TimetampTz

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem to exist?

* @param source the raw bytes source value.
* @return the materialized protobuf object.
*/
public static Any parseOrThrow(byte[] source) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these helpers should take in optional ServerStreamListeners.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, taking into account the other comments about the callers already handling this.

final FlightInfo info = client.getInfo(descriptor);
assertEquals(SCHEMA_INT_TABLE, info.getSchema());

final FlightStream stream = client.getStream(info.getEndpoints().get(0).getTicket());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resources

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix throughout.

assertEquals(Arrays.asList("one", "zero", "negative one"), actualStringResults);
assertEquals(Arrays.asList(1, 0, -1), actualIntResults);

AutoCloseables.close(preparedStatement);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try-with-resources instead of explicit close.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was done intentionally to allow for testing.

listener.putNext();
}
}
} catch (ExecutionException | SQLException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to catch RuntimeExceptions too.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to be Throwable.

Note - FlightSqlExample is not functional and needs to be updated.
@lidavidm
Copy link
Member

lidavidm commented Jul 7, 2021

Just FYI: most CI pipelines will skip so long as you have the WIP in the PR title, feel free to remove that (keeping it in Draft if desired) when you're ready for CI to run.

@kylepbit
Copy link

@lidavidm - just as an update, work is progressing here on a separate branch that'll be merged in. I think we'll ultimately end up with a separate PR as Tiffany has left, so we can't alter this. Will put a link to the new PR once created here for tracing.

@lidavidm
Copy link
Member

@kylepbit Sounds good, thanks for the update.

@emkornfield
Copy link
Contributor

Closing in favor of #10906

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants