
Expose Druid functions via INFORMATION_SCHEMA.ROUTINES table. #14346

Closed
abhishekrb19 opened this issue May 29, 2023 · 12 comments · Fixed by #14378

Comments

@abhishekrb19
Contributor

abhishekrb19 commented May 29, 2023

Description

Expose Druid functions and operators programmatically via the Druid SQL interface.

Motivation:

  • As a Druid user, I want to be able to programmatically retrieve metadata about registered functions so that I can inspect the system and write automation/tooling that deals with various functions.
  • As a Druid client, I want to be able to retrieve the full list of available functions from a Druid cluster (even from extensions) so I don't have to write code to deal with new functions, types, etc. For example, the Druid web console relies on the documented functions to parse them at build time to provide suggestions at runtime. However, this information can be incomplete as new functions can be loaded through extensions at runtime.

Design:

Add a new table INFORMATION_SCHEMA.ROUTINES that exposes SQL functions and operators. The INFORMATION_SCHEMA.ROUTINES table will include the following columns:

  1. ROUTINE_CATALOG: Name of the database. Always set to druid.
  2. ROUTINE_SCHEMA: Name of the schema. Always set to INFORMATION_SCHEMA.
  3. ROUTINE_NAME: Name of the function or operator. E.g., APPROX_COUNT_DISTINCT_DS_THETA
  4. ROUTINE_TYPE: Type of routine. Always set to FUNCTION.
  5. IS_AGGREGATOR: Whether the routine is an aggregator. Returns YES for aggregator functions; NO for scalar functions.
  6. SIGNATURES: Possible signatures for the routine as a string.

Note that the Druid-specific columns IS_AGGREGATOR and SIGNATURES are included in addition to the standard set (columns 1 - 4).

Example usage:
To see information about all the aggregator functions, including ones loaded from extensions, run the following SQL query:

SELECT "ROUTINE_NAME", "IS_DETERMINISTIC", "SIGNATURES"
FROM "INFORMATION_SCHEMA"."ROUTINES"
WHERE "ROUTINE_TYPE" = 'FUNCTION' AND "IS_AGGREGATOR" = 'YES'

Other alternatives: Custom sys tables were considered. However, INFORMATION_SCHEMA.ROUTINES is the SQL standard to expose stored procedures, routines, and built-in types. The proposal is based on the SQL-1999 specification - see relevant sections 20.45, 4.24, 20.69.

Implementation sketch:

Calcite knows about the registered operators, including the ones registered in extensions, so extracting this information from Calcite would be the best approach. We can use the DruidOperatorTable, which implements Calcite's operator table interface. The DruidOperatorTable is aware of all the registered operators at runtime, so we can wire it up to the new INFORMATION_SCHEMA table.
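To make the shape of that wiring concrete, here is a minimal self-contained sketch. Operator is a stub standing in for Calcite's SqlOperator, and buildRoutineRows stands in for the row-building logic that would live behind the new table; none of these names are the actual Druid or Calcite API.

```java
import java.util.*;

// Sketch only: Operator is a stand-in for Calcite's SqlOperator, and
// buildRoutineRows models how the operator table could be projected into
// INFORMATION_SCHEMA.ROUTINES rows. All names here are hypothetical.
class RoutinesSketch {
    // Stand-in for a registered operator: name, aggregator flag, signatures.
    record Operator(String name, boolean isAggregator, List<String> signatures) {}

    // Builds one ROUTINES row per registered operator, using the column
    // values from the proposal (catalog/schema/type are constants).
    static List<Map<String, String>> buildRoutineRows(List<Operator> operators) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (Operator op : operators) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("ROUTINE_CATALOG", "druid");
            row.put("ROUTINE_SCHEMA", "INFORMATION_SCHEMA");
            row.put("ROUTINE_NAME", op.name());
            row.put("ROUTINE_TYPE", "FUNCTION");
            row.put("IS_AGGREGATOR", op.isAggregator() ? "YES" : "NO");
            row.put("SIGNATURES", String.join("\n", op.signatures()));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Operator> ops = List.of(
            new Operator("SUM", true, List.of("SUM(expr)")),
            new Operator("LOWER", false, List.of("LOWER(expr)"))
        );
        System.out.println(buildRoutineRows(ops));
    }
}
```

In the real implementation, the operator list would come from the DruidOperatorTable at runtime, so extension-registered functions show up automatically.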

Future work:

In addition to the above columns, I think we could also add more columns over time, including:

  1. Description of the routine, if available. An outline for this is mentioned below.
  2. Return type
  3. Extension/module name

Getting this information may involve adding more functionality to DruidOperatorTable.

Note that this proposal is only for the Druid functions and operators. Exposing data types would be a separate proposal on its own.

@abhishekagarwal87
Contributor

Thank you for the proposal, Abhishek. What does a deterministic routine mean? How is the IS_DETERMINISTIC field supposed to be used?

@abhishekrb19
Contributor Author

abhishekrb19 commented May 30, 2023

@abhishekagarwal87, IS_DETERMINISTIC in this table denotes whether a function will always produce the same output for the same set of inputs. It's a standard column from the SQL spec that a lot of vendors expose. For Druid, this information can be determined from Calcite's SqlOperator here.

AFAIK, in Druid, all the functions documented here are deterministic by default. Found an old PR that attempted to add RAND(), which would have been non-deterministic, I think, but it wasn't merged. Perhaps someone has implemented a non-deterministic function in the wild in an extension?

One use case for IS_DETERMINISTIC: clients can decide whether the output of routines/functions can be cached. Generally, I think it'd be OK to cache the results of deterministic routines. It'd address one of the concerns for the RAND() function implementation in the linked PR above.

@vogievetsky
Contributor

Could you link the documentation that you are basing this on? I tried googling for "INFORMATION_SCHEMA.ROUTINES" and I keep finding docs from different databases that have more or fewer columns in the report. Which docs are you following?

@abhishekrb19
Contributor Author

abhishekrb19 commented May 31, 2023

@vogievetsky, the proposal is based on the SQL 1999 specification - http://web.cecs.pdx.edu/~len/sql1999.pdf. Specifically for INFORMATION_SCHEMA.ROUTINES, please see the relevant sections:

  • Section 20.45 - ROUTINES view
  • A mention in section 4.24 and section 20.69

The column definitions are sort of scattered throughout the spec. Then I was also looking at a few vendor implementations.

As a side note, the SQL 1999 spec notes several columns that may not apply to Druid. For example, CREATED and LAST_ALTERED information only applies to stored procedures. So a few follow-up questions:

  1. Should we add most/all columns from the spec to the table and mark them as "unused" (similar to what's documented for the INFORMATION_SCHEMA.COLUMNS and INFORMATION_SCHEMA.SCHEMATA tables)? Worth noting that the "unused" columns we document are an incomplete list from the spec, so I wonder if they're just there for legacy reasons.
  2. Should we selectively include only the columns that apply to Druid?

I think 2 makes sense, but let me know what your thoughts are.

@vogievetsky
Contributor

Thank you for the link

@abhishekrb19
Contributor Author

abhishekrb19 commented Jun 1, 2023

As a follow-up to the proposed design implementation, we can add a new column, ROUTINE_COMMENT, to the INFORMATION_SCHEMA.ROUTINES table. The new column allows developers to provide additional information or context about a routine/function. A few vendor implementations do something similar.

ROUTINE_COMMENT is optional; some functions may not have a description if a developer hasn't defined one - for example, functions that live in proprietary extensions rather than in Apache Druid.

So with those goals in mind, here's an implementation that I think will work:

  /**
   * @return an optional description of the SQL aggregator.
   */
  @Nullable
  default String functionDescription() {
    return null;
  }
  • We can override the default implementation for all aggregator functions already in Druid (including the ones defined in open-source extensions). We can reuse the descriptions from the Druid docs.
  • The default implementation will return null for functions in the wild (from custom extensions and the like) that are unaware of functionDescription(). It's entirely up to the developer to override the method for functions defined in extension code that's not part of Apache Druid.
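The opt-in behavior above can be sketched end to end. Note this is a toy: SqlAggregator is a stub standing in for Druid's real interface, the concrete class names are invented, and the description wording is illustrative rather than copied from the docs.

```java
// Minimal self-contained sketch of the proposed default method and an
// override. SqlAggregator here is a stub, not Druid's real interface, and
// the description text is illustrative.
interface SqlAggregator {
    /**
     * @return an optional description of the SQL aggregator, or null if
     * the implementation does not provide one.
     */
    default String functionDescription() {
        return null;
    }
}

// A built-in or open-source-extension aggregator that opts in by
// overriding the default with a human-readable description.
class ThetaSketchSqlAggregator implements SqlAggregator {
    @Override
    public String functionDescription() {
        return "Counts distinct values of an expression using a Theta sketch.";
    }
}

// A proprietary-extension aggregator that is unaware of the new method
// still compiles unchanged; the ROUTINES column would surface NULL for it.
class LegacyCustomSqlAggregator implements SqlAggregator {
}
```

Because the method has a default body, adding it is backward compatible: existing extension code keeps working without recompilation against the new interface.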

Applications:

A few high-level use cases that can use the description (the details are not fully fleshed out):

  • Generate documentation based on INFORMATION_SCHEMA.ROUTINES as the source of truth, such as the SQL functions page.
  • The Druid web-console or a custom web query editor can provide auto-completion support for functions that are available to provide suggestions.
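For instance, documentation tooling could pull the descriptions with a query along these lines (a sketch, assuming the ROUTINE_COMMENT column lands as proposed):

```sql
SELECT "ROUTINE_NAME", "ROUTINE_COMMENT"
FROM "INFORMATION_SCHEMA"."ROUTINES"
WHERE "ROUTINE_TYPE" = 'FUNCTION' AND "ROUTINE_COMMENT" IS NOT NULL
```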

Comments or feedback is appreciated!

@abhishekagarwal87
Contributor

@abhishekrb19 - the use case behind the column IS_DETERMINISTIC is not very clear to me. We can leave out this column and add it when it's actually necessary. The decision to cache the results is a bit more involved since the underlying data is constantly changing (there is real-time data).

The functionDescription idea is interesting. But then we either commit to it fully and fill in all the implementations of existing functions or we don't do it at all. Because, if we don't fully commit to it, a tool cannot rely on this API to get the description of the function. Are you going to use this function in your client? If you are going to do the work of adding documentation to all the scalar and aggregator functions, then I am +1 for the idea. Otherwise, this enhancement can be done later.

@abhishekrb19
Contributor Author

abhishekrb19 commented Jun 5, 2023

@abhishekagarwal87, thank you for the comments. Yes, removing IS_DETERMINISTIC sounds good. As a side note, we currently don't override Calcite's isDeterministic method, so it defaults to true for all functions.

As far as functionDescription is concerned, my thought is we can fully update the description for all functions in Apache Druid. The fallback null implementation proposed is only for proprietary ones in the wild, since we can't do anything about those but gracefully return a null value for the optional column. Also, I updated the comment with a few applications. Specifically, our client won't need the function description right off the bat; it's primarily for the Druid web console and documentation purposes, so I will follow up on that after the main work is complete.

@FrankChen021
Member

This is very useful for client-side tools. And the web console can also benefit from this change to implement simpler autocompletion suggestions.

I'm wondering for ROUTINE_TYPE, what does the FUNCTION_STAR or POSTFIX mean?

@FrankChen021
Member

As a follow-up to the proposed design implementation, we can add a new column, ROUTINE_COMMENT, to INFORMATION_SCHEMA.ROUTINES table.

I like this proposal.

But there's one thing: what's the relationship between the description in the code and the function description in the current markdown files? I don't think developers want to write these descriptions in two different places.

So maybe one way is that, during the mvn source generation phase, we can extract the descriptions from the markdown files and generate string constants that can be referenced from the default implementation of functionDescription().
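As a sketch of that idea, the source-generation step could emit a constants class like the one below. The class name, constant names, and description wording are all hypothetical; the real text would be extracted from the existing SQL docs markdown.

```java
// Hypothetical output of a Maven source-generation step that extracts
// function descriptions from the SQL docs markdown. Class name, constant
// names, and wording are assumed for illustration only.
final class GeneratedFunctionDescriptions {
    private GeneratedFunctionDescriptions() {
        // Not instantiable: generated constants holder.
    }

    static final String SUM = "Returns the sum of a set of values.";
    static final String COUNT = "Counts the number of rows.";
}
```

The default functionDescription() implementations could then return these constants, so the docs remain the single source of truth.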

@abhishekrb19
Contributor Author

I'm wondering for ROUTINE_TYPE, what does the FUNCTION_STAR or POSTFIX mean?

@FrankChen021, please see the updated description -- the ROUTINE_TYPE column will always be FUNCTION. Earlier, I thought we could return FUNCTION_STAR, FUNCTION_ID, SPECIAL, etc. -- information extracted from Calcite's SqlSyntax for an operator. However, those Calcite enums are not very useful, IMO. So we filter for the function-like Calcite syntaxes and always return FUNCTION for the ROUTINE_TYPE column. Let me know if that makes sense.

@abhishekrb19
Contributor Author

But there's one thing: what's the relationship between the description in the code and the function description in the current markdown files? I don't think developers want to write these descriptions in two different places.
So maybe one way is that, during the mvn source generation phase, we can extract the descriptions from the markdown files and generate string constants that can be referenced from the default implementation of functionDescription().

Good suggestion! I'll think about it more after the main implementation is in place. Also, @vtlim had some thoughts on this.
