
Expose Druid functions via INFORMATION_SCHEMA.ROUTINES table. #14346

Closed
abhishekrb19 opened this issue May 29, 2023 · 12 comments · Fixed by #14378

Comments

@abhishekrb19
Contributor

abhishekrb19 commented May 29, 2023

Description

Expose Druid functions and operators programmatically via the Druid SQL interface.

Motivation:

  • As a Druid user, I want to be able to programmatically retrieve metadata about registered functions so that I can inspect the system and write automation/tooling that deals with various functions.
  • As a Druid client, I want to be able to retrieve the full list of available functions from a Druid cluster (even from extensions) so I don't have to write code to deal with new functions, types, etc. For example, the Druid web console relies on the documented functions to parse them at build time to provide suggestions at runtime. However, this information can be incomplete as new functions can be loaded through extensions at runtime.

Design:

Add a new table INFORMATION_SCHEMA.ROUTINES that exposes SQL functions and operators. The INFORMATION_SCHEMA.ROUTINES table will include the following columns:

  1. ROUTINE_CATALOG: Name of the database. Always set to druid.
  2. ROUTINE_SCHEMA: Name of the schema. Always set to INFORMATION_SCHEMA.
  3. ROUTINE_NAME: Name of the function or operator. E.g., APPROX_COUNT_DISTINCT_DS_THETA
  4. ROUTINE_TYPE: Type of routine. Always set to FUNCTION.
  5. IS_AGGREGATOR: Whether the routine is an aggregator. Returns YES for aggregator functions; NO for scalar functions.
  6. SIGNATURES: Possible signatures for the routine as a string.

Note that the Druid-specific columns IS_AGGREGATOR and SIGNATURES are included in addition to the standard set (columns 1 - 4).

Example usage:
To see information about all the aggregator functions, including ones loaded from extensions, run the following SQL query:

SELECT "ROUTINE_NAME", "IS_DETERMINISTIC", "SIGNATURES"
FROM "INFORMATION_SCHEMA"."ROUTINES"
WHERE "ROUTINE_TYPE" = 'FUNCTION' AND "IS_AGGREGATOR" = 'YES'

Other alternatives: Custom sys tables were considered. However, INFORMATION_SCHEMA.ROUTINES is the SQL standard to expose stored procedures, routines, and built-in types. The proposal is based on the SQL-1999 specification - see relevant sections 20.45, 4.24, 20.69.

Implementation sketch:

Calcite knows about the registered operators, including the ones registered in extensions, so extracting this information from Calcite would be the best approach. We can use the DruidOperatorTable, which implements Calcite's operator table interface. The DruidOperatorTable is aware of all the registered operators at runtime, so we can wire it up to the new INFORMATION_SCHEMA table.
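To make the shape of that wiring concrete, here is a minimal self-contained sketch. Operator is a stub standing in for Calcite's SqlOperator, and buildRoutineRows stands in for the row-building logic that would live behind the new table; none of these names are the actual Druid or Calcite API.

```java
import java.util.*;

// Sketch only: Operator is a stand-in for Calcite's SqlOperator, and
// buildRoutineRows models how the operator table could be projected into
// INFORMATION_SCHEMA.ROUTINES rows. All names here are hypothetical.
class RoutinesSketch {
    // Stand-in for a registered operator: name, aggregator flag, signatures.
    record Operator(String name, boolean isAggregator, List<String> signatures) {}

    // Builds one ROUTINES row per registered operator, using the column
    // values from the proposal (catalog/schema/type are constants).
    static List<Map<String, String>> buildRoutineRows(List<Operator> operators) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Operator op : operators) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("ROUTINE_CATALOG", "druid");
            row.put("ROUTINE_SCHEMA", "INFORMATION_SCHEMA");
            row.put("ROUTINE_NAME", op.name());
            row.put("ROUTINE_TYPE", "FUNCTION");
            row.put("IS_AGGREGATOR", op.isAggregator() ? "YES" : "NO");
            row.put("SIGNATURES", String.join("\n", op.signatures()));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Operator> ops = List.of(
            new Operator("SUM", true, List.of("SUM(expr)")),
            new Operator("LOWER", false, List.of("LOWER(expr)"))
        );
        System.out.println(buildRoutineRows(ops));
    }
}
```

In the real implementation, the operator list would come from the DruidOperatorTable at runtime, so extension-registered functions show up automatically.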

Future work:

In addition to the above columns, I think we could also add more columns over time, including:

  1. Description of the routine, if available. An outline for this is mentioned below.
  2. Return type
  3. Extension/module name

Getting this information may involve adding more functionality to DruidOperatorTable.

Note that this proposal is only for the Druid functions and operators. Exposing data types would be a separate proposal on its own.

@abhishekagarwal87
Contributor

Thank you for the proposal, Abhishek. What does a deterministic routine mean? How is the IS_DETERMINISTIC field supposed to be used?

@abhishekrb19
Contributor Author

abhishekrb19 commented May 30, 2023

@abhishekagarwal87, IS_DETERMINISTIC in this table denotes whether a function will always produce the same output for the same set of inputs. It's a standard column from the SQL spec that a lot of vendors expose. For Druid, this information can be determined from Calcite's SqlOperator here.

AFAIK, in Druid, all the functions documented here are deterministic by default. Found an old PR that attempted to add RAND(), which would have been non-deterministic, I think, but it wasn't merged. Perhaps someone has implemented a non-deterministic function in the wild in an extension?

One use case for IS_DETERMINISTIC: clients can decide whether the output of routines/functions can be cached. Generally, I think it'd be OK to cache the results of deterministic routines. It'd address one of the concerns for the RAND() function implementation in the linked PR above.

@vogievetsky
Contributor

Could you link the documentation that you are basing this on? I tried googling for "INFORMATION_SCHEMA.ROUTINES" and I keep finding docs from different databases that have more or fewer columns in the report. Which docs are you following?

@abhishekrb19
Contributor Author

abhishekrb19 commented May 31, 2023

@vogievetsky, the proposal is based on the SQL 1999 specification - http://web.cecs.pdx.edu/~len/sql1999.pdf. Specifically for INFORMATION_SCHEMA.ROUTINES, please see the relevant sections:

  • Section 20.45 - ROUTINES view
  • A mention in section 4.24 and section 20.69

The column definitions are sort of scattered throughout the spec. Then I was also looking at a few vendor implementations.

As a side note, the SQL 1999 spec notes several columns that may not apply to Druid. For example, CREATED and LAST_ALTERED information only applies to stored procedures. So a few follow-up questions:

  1. Should we add most/all columns from the spec to the table and mark them as "unused" (similar to what's documented for the INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.SCHEMATA tables)? Worth noting that the "unused" columns we document are an incomplete list from the spec, so I wonder if they're just there for legacy reasons.
  2. Should we selectively include only the columns that apply to Druid?

I think 2 makes sense, but let me know what your thoughts are.

@vogievetsky
Contributor

Thank you for the link

@abhishekrb19
Contributor Author

abhishekrb19 commented Jun 1, 2023

As a follow-up to the proposed design implementation, we can add a new column, ROUTINE_COMMENT, to the INFORMATION_SCHEMA.ROUTINES table. The new column allows developers to provide additional information or context about a routine/function. A few vendor implementations do something similar.

ROUTINE_COMMENT is optional; some functions may not have a description if a developer hasn't defined one - for example, functions that live in proprietary extensions rather than in Apache Druid.

So with those goals in mind, here's an implementation that I think will work:

  /**
   * @return an optional description of the SQL aggregator.
   */
  @Nullable
  default String functionDescription() {
    return null;
  }
  • We can override the default implementation for all aggregator functions already in Druid (including the ones defined in open-source extensions). We can reuse the descriptions from the Druid docs.
  • The default implementation will return null for functions in the wild (from custom extensions and the like) that are unaware of functionDescription(). It's entirely up to the developer to override the method for functions defined in extension code that's not part of Apache Druid.
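The opt-in behavior above can be sketched end to end. Note this is a toy: SqlAggregator is a stub standing in for Druid's real interface, the concrete class names are invented, and the description wording is illustrative rather than copied from the docs.

```java
// Minimal self-contained sketch of the proposed default method and an
// override. SqlAggregator here is a stub, not Druid's real interface, and
// the description text is illustrative.
interface SqlAggregator {
    /**
     * @return an optional description of the SQL aggregator, or null if
     * the implementation does not provide one.
     */
    default String functionDescription() {
        return null;
    }
}

// A built-in or open-source-extension aggregator that opts in by
// overriding the default with a human-readable description.
class ThetaSketchSqlAggregator implements SqlAggregator {
    @Override
    public String functionDescription() {
        return "Counts distinct values of an expression using a Theta sketch.";
    }
}

// A proprietary-extension aggregator that is unaware of the new method
// still compiles unchanged; the ROUTINES column would surface NULL for it.
class LegacyCustomSqlAggregator implements SqlAggregator {
}
```

Because the method has a default body, adding it is backward compatible: existing extension code keeps working without recompilation against the new interface.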

Applications:

A few high-level use cases that can use the description (the details are not fully fleshed out):

  • Generate documentation based on INFORMATION_SCHEMA.ROUTINES as the source of truth, such as the SQL functions page.
  • The Druid web-console or a custom web query editor can provide auto-completion support for functions that are available to provide suggestions.
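For instance, documentation tooling could pull the descriptions with a query along these lines (a sketch, assuming the ROUTINE_COMMENT column lands as proposed):

```sql
SELECT "ROUTINE_NAME", "ROUTINE_COMMENT"
FROM "INFORMATION_SCHEMA"."ROUTINES"
WHERE "ROUTINE_TYPE" = 'FUNCTION' AND "ROUTINE_COMMENT" IS NOT NULL
```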

Comments or feedback is appreciated!

@abhishekagarwal87
Contributor

@abhishekrb19 - the use case behind the column IS_DETERMINISTIC is not very clear to me. We can leave out this column and add it when it's actually necessary. The decision to cache the results is a bit more involved since the underlying data is constantly changing (there is real-time data).

The functionDescription idea is interesting. But then we either commit to it fully and fill in all the implementations of existing functions or we don't do it at all. Because, if we don't fully commit to it, a tool cannot rely on this API to get the description of the function. Are you going to use this function in your client? If you are going to do the work of adding documentation to all the scalar and aggregator functions, then I am +1 for the idea. Otherwise, this enhancement can be done later.

@abhishekrb19
Contributor Author

abhishekrb19 commented Jun 5, 2023

@abhishekagarwal87, thank you for the comments. Yes, removing IS_DETERMINISTIC sounds good. As a side note, we currently don't override Calcite's isDeterministic method, so it defaults to true for all functions.

As far as functionDescription is concerned, my thought is we can fully update the description for all functions in Apache Druid. The fallback null implementation proposed is only for proprietary ones in the wild, since we can't do anything about those but gracefully return a null value for the optional column. Also, I updated the comment with a few applications. Specifically, our client won't need the function description right off the bat; it's primarily for the Druid web console and documentation purposes, so I will follow up on that after the main work is complete.

@FrankChen021
Member

This is very useful for client-side tools. And the web console can also benefit from this change to implement simpler autocompletion suggestions.

I'm wondering for ROUTINE_TYPE, what does the FUNCTION_STAR or POSTFIX mean?

@FrankChen021
Member

As a follow-up to the proposed design implementation, we can add a new column, ROUTINE_COMMENT, to INFORMATION_SCHEMA.ROUTINES table.

I like this proposal.

But there's one thing: what's the relationship between the description in the code and the function description in the current markdown files? I don't think developers want to write these descriptions in two different places.

So maybe one way is that, during the mvn source generation phase, we can extract the descriptions from the markdown files and generate string constants that can be referenced from the default implementation of functionDescription().
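As a sketch of that idea, the source-generation step could emit a constants class like the one below. The class name, constant names, and description wording are all hypothetical; the real text would be extracted from the existing SQL docs markdown.

```java
// Hypothetical output of a Maven source-generation step that extracts
// function descriptions from the SQL docs markdown. Class name, constant
// names, and wording are assumed for illustration only.
final class GeneratedFunctionDescriptions {
    private GeneratedFunctionDescriptions() {
        // Not instantiable: generated constants holder.
    }

    static final String SUM = "Returns the sum of a set of values.";
    static final String COUNT = "Counts the number of rows.";
}
```

The default functionDescription() implementations could then return these constants, so the docs remain the single source of truth.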

@abhishekrb19
Contributor Author

I'm wondering for ROUTINE_TYPE, what does the FUNCTION_STAR or POSTFIX mean?

@FrankChen021, please see the updated description -- the ROUTINE_TYPE column will always be FUNCTION. Earlier, I thought we could return FUNCTION_STAR, FUNCTION_ID, SPECIAL, etc. -- information extracted from Calcite's SqlSyntax for an operator. However, those Calcite enums are not very useful, IMO. So we filter for the function-like Calcite syntaxes and always return FUNCTION for the ROUTINE_TYPE column. Let me know if that makes sense.

@abhishekrb19
Contributor Author

But there's one thing: what's the relationship between the description in the code and the function description in the current markdown files? I don't think developers want to write these descriptions in two different places.
So maybe one way is that, during the mvn source generation phase, we can extract the descriptions from the markdown files and generate string constants that can be referenced from the default implementation of functionDescription().

Good suggestion! I'll think about it more after the main implementation is in place. Also, @vtlim had some thoughts on this.
