Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support describe command in PPL #644

Closed
seankao-az opened this issue Jun 20, 2022 · 13 comments · Fixed by #646
Closed

[FEATURE] Support describe command in PPL #644

seankao-az opened this issue Jun 20, 2022 · 13 comments · Fixed by #646
Labels
enhancement New feature or request

Comments

@seankao-az
Copy link
Collaborator

seankao-az commented Jun 20, 2022

1. Overview

1.1 Introduction

Currently, users have to use either OpenSearch query DSL or SQL to get the schema for an index, and then manually analyze the query result to gather useful information.

This new feature provides users with a new command, describe, in PPL, so that they can not only retrieve metadata for an index, but also leverage PPL's pipe (|) syntax and other commands to further query the metadata.

1.2 Use Cases

Observability plugin

Queries for the observability plugin are associated with a datetime range specified in the datetime selection UI. We do this by inserting an additional where command into the PPL query, entered by the user, to filter out the datetime range. To achieve that, we would need to know the field names that contain the datetime info.

Currently, we first use OpenSearch query DSL to fetch the fields for an index, and then filter out time fields in client code.

Using this new feature would allow us to construct a PPL query to do these all at once.

2. Requirements

3. Out of Scope

  • Detect schemas conflict when querying multiple indices at once

4. Design

4.1 Proposed Design

Syntax:

describe <index>

describe is a top level command just like search. Piping into describe is not allowed.

Sample query:

describe opensearch_dashboards_sample_data_logs | where TYPE_NAME="timestamp" | fields COLUMN_NAME

4.2 Alternatives Considered

Alternative 1

Syntax:

describe source=<index>

This syntax is analogous to the search source=<index> command, but we decide not to include source= in the describe command because rather than fetching data from the index, we're fetching metadata of the index.

Alternative 2

Syntax:

<search command> [ | <some command> ] | describe [some options]

Getting metadata of an index requires reading a metadata table of the index, instead of fetching the index itself. The semantics of the syntax does not quite fit this context.

@seankao-az seankao-az added enhancement New feature or request untriaged labels Jun 20, 2022
@seankao-az
Copy link
Collaborator Author

seankao-az commented Jun 20, 2022

Regarding the command syntax, I have a couple of candidates. I need some comments and feedback.

Candidate 1

describe source=<index> [column_pattern=<column pattern>]

Examples:

  • describe source=web_logs
  • describe source=web_logs column_pattern="%time%"

Comments:

  • straightforward enough
  • distinguish admin statements (describe ...) from data query statements (search ...), which is the current SQL approach

Candidate 2

[search] source=<index> | describe [column_pattern=<column pattern>]

Examples:

  • source=web_logs | describe
  • source=web_logs | describe column_pattern="%time%"

Comments:

  • mixing admin statements and data query statements might not be a good idea

Candidate 3

describe source=<index> [column_pattern=<column pattern>] [type_name=<type name>]

Examples:

  • describe source=web_logs type_name="date"

Comments:

  • judging from the specific use case for the observability plugin, merely pattern matching in field name might not be enough. The timestamp field could have all kinds of names, e.g. timestamp, utc_time, etc., even names that don’t have time in it, e.g., date_nanos. Type filtering could help with this use case.

Candidate 4

A wilder one would be allowing piping out from the describe command, so that we can do something like this:
describe source=<index> | where type_name="date" | fields column_name
I'm not sure how feasible this is, though.

@acarbonetto
Copy link
Collaborator

My opinion:

  • I find candidate 4 to be the most PPL-like and extendable.
  • If you can't go with candidate 4, candidate 1 or 3 is fine.
  • I don't think candidate 2 is necessary, and may cause syntax issues as you mentioned.

@acarbonetto
Copy link
Collaborator

You may run into issues with the column_pattern, like: #259

@acarbonetto
Copy link
Collaborator

acarbonetto commented Jun 20, 2022

Currently, when I run SQL describe tables - if I use the columns like pattern I get no data back:

sql-cli -q "DESCRIBE TABLES LIKE \"calcs\" COLUMNS LIKE \"TABLE_NAME\";"   
fetched rows / total rows = 0/0
+-------------+---------------+--------------+---------------+-------------+-------------+---------------+-----------------+------------------+------------------+------------+-----------+--------------+-----------------+--------------------+---------------------+--------------------+---------------+-----------------+----------------+---------------+--------------------+--------------------+----------------------+
| TABLE_CAT   | TABLE_SCHEM   | TABLE_NAME   | COLUMN_NAME   | DATA_TYPE   | TYPE_NAME   | COLUMN_SIZE   | BUFFER_LENGTH   | DECIMAL_DIGITS   | NUM_PREC_RADIX   | NULLABLE   | REMARKS   | COLUMN_DEF   | SQL_DATA_TYPE   | SQL_DATETIME_SUB   | CHAR_OCTET_LENGTH   | ORDINAL_POSITION   | IS_NULLABLE   | SCOPE_CATALOG   | SCOPE_SCHEMA   | SCOPE_TABLE   | SOURCE_DATA_TYPE   | IS_AUTOINCREMENT   | IS_GENERATEDCOLUMN   |
|-------------+---------------+--------------+---------------+-------------+-------------+---------------+-----------------+------------------+------------------+------------+-----------+--------------+-----------------+--------------------+---------------------+--------------------+---------------+-----------------+----------------+---------------+--------------------+--------------------+----------------------|
+-------------+---------------+--------------+---------------+-------------+-------------+---------------+-----------------+------------------+------------------+------------+-----------+--------------+-----------------+--------------------+---------------------+--------------------+---------------+-----------------+----------------+---------------+--------------------+--------------------+----------------------+

If I don't include the COLUMNS LIKE, I get full data

@anirudha
Copy link
Collaborator

source is always the first cmd, that is a PPL syntax limitation for now which invalidates option, 4,3,1

Ideally a user should be able to return query data and describe data. or choose between it. Hence, Open 2 is best

source= | describe [column_pattern=]

Examples:

source=web_logs | describe
source=web_logs | describe column_pattern="%time%"

source=web_logs | describe pipe-passthrough=true ( i.e describe with existing query response ) [ default false ]

@anirudha
Copy link
Collaborator

@penghuo @mengweieric please review and comment

@anirudha
Copy link
Collaborator

column_pattern is good to have, may not be required for phase 1 release

@penghuo
Copy link
Collaborator

penghuo commented Jun 20, 2022

not necessary. if we consider DML. we could loose this limitation. I prefer add option 4, but remove source=

describe <index> | where type_name="date" | fields column_name

@seankao-az
Copy link
Collaborator Author

if we could allow the first cmd to be describe, I prefer option 4, as any cmd that works with search could potentially also work with describe. However, a lot of cmds only make sense for search queries, so supporting all of them for describe might be an overkill.

@acarbonetto
Copy link
Collaborator

You'd want to support anything that you get for free. Many commands would work out of the box because the describe command returns in a table format.

@seankao-az seankao-az mentioned this issue Jun 20, 2022
9 tasks
@seankao-az
Copy link
Collaborator Author

seankao-az commented Jun 22, 2022

Need feedback to decide what response format to use for PPL describe.

Currently all SQL responses are formatted by JdbcResponseFormatter (by default), while all PPL responses are formatted by SimpleJsonResponseFormatter (by default). This is what caused the mismatch in their responses for the DESCRIBE cmd.

The options are:

  1. Change protocol for PPL so that describe response matches that of SQL (both use Jdbc)
  2. Allow the mismatch

Personally leaning towards option 2

References

Relevant lines:

  • Format format = request.format();
    ResponseFormatter<QueryResult> formatter;
    if (format.equals(Format.CSV)) {
    formatter = new CsvResponseFormatter(request.sanitize());
    } else if (format.equals(Format.RAW)) {
    formatter = new RawResponseFormatter();
    } else {
    formatter = new JdbcResponseFormatter(PRETTY);
    }
    return new ResponseListener<QueryResponse>() {
    @Override
    public void onResponse(QueryResponse response) {
    sendResponse(channel, OK,
    formatter.format(new QueryResult(response.getSchema(), response.getResults())));
    }
  • Format format = pplRequest.format();
    ResponseFormatter<QueryResult> formatter;
    if (format.equals(Format.CSV)) {
    formatter = new CsvResponseFormatter(pplRequest.sanitize());
    } else if (format.equals(Format.RAW)) {
    formatter = new RawResponseFormatter();
    } else if (format.equals(Format.VIZ)) {
    formatter = new VisualizationResponseFormatter(pplRequest.style());
    } else {
    formatter = new SimpleJsonResponseFormatter(PRETTY);
    }
    return new ResponseListener<QueryResponse>() {
    @Override
    public void onResponse(QueryResponse response) {
    sendResponse(channel, OK, formatter.format(new QueryResult(response.getSchema(),
    response.getResults())));
    }

Example of mismatching results

Notice that the datarows, which contains the metadata of the queried index, are actually identical (for this example, at least).

diff --git a/ppl_result b/sql_result
index 29ae5b87..76f5f8d4 100644
--- a/ppl_result
+++ b/sql_result
@@ -2,99 +2,99 @@
   "schema": [
     {
       "name": "TABLE_CAT",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "TABLE_SCHEM",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "TABLE_NAME",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "COLUMN_NAME",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "DATA_TYPE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "TYPE_NAME",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "COLUMN_SIZE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "BUFFER_LENGTH",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "DECIMAL_DIGITS",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "NUM_PREC_RADIX",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "NULLABLE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "REMARKS",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "COLUMN_DEF",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "SQL_DATA_TYPE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "SQL_DATETIME_SUB",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "CHAR_OCTET_LENGTH",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "ORDINAL_POSITION",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "IS_NULLABLE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "SCOPE_CATALOG",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "SCOPE_SCHEMA",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "SCOPE_TABLE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "SOURCE_DATA_TYPE",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "IS_AUTOINCREMENT",
-      "type": "string"
+      "type": "keyword"
     },
     {
       "name": "IS_GENERATEDCOLUMN",
-      "type": "string"
+      "type": "keyword"
     }
   ],
   "datarows": [
@@ -802,5 +802,6 @@
     ]
   ],
   "total": 27,
-  "size": 27
+  "size": 27,
+  "status": 200
 }

@anirudha
Copy link
Collaborator

anirudha commented Jun 22, 2022

@penghuo could you review this pls

@penghuo
Copy link
Collaborator

penghuo commented Jun 22, 2022

PPL does not intend to support JDBC. The difference is only schema type, string is more make sense. By the way, most of columns is meaningless. I suppose frontend only care about COLUMN_NAME, DATA_TYPE.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants