How to implement support for multi-value fields (#1300)? #1733
-
Option 1: flatten values

Flatten values so that every scalar element of a multi-value field becomes its own row.

Query: SELECT * FROM dbg

Response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5
],
[
"doc_2",
3
],
[
"doc_2",
4
],
[
"doc_3",
1
],
[
"doc_3",
2
],
[
"doc_3",
3
],
[
"doc_3",
4
],
[
"doc_3",
5
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum |
|-------|-------|
| doc_1 | 5     |
| doc_2 | 3     |
| doc_2 | 4     |
| doc_3 | 1     |
| doc_3 | 2     |
| doc_3 | 3     |
| doc_3 | 4     |
| doc_3 | 5     |
Cons:
-
Option 2: expand values

Add extra columns to the entire result set if a doc has a field with multiple values.

Query: SELECT * FROM dbg

Response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
},
{
"name": "myNum[0]",
"type": "long"
},
{
"name": "myNum[1]",
"type": "long"
},
{
"name": "myNum[0][0]",
"type": "long"
},
{
"name": "myNum[0][1]",
"type": "long"
},
{
"name": "myNum[1][0]",
"type": "long"
},
{
"name": "myNum[1][1]",
"type": "long"
},
{
"name": "myNum[2]",
"type": "long"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5,
null,
null,
null,
null,
null,
null,
null
],
[
"doc_2",
null,
3,
4,
null,
null,
null,
null,
null
],
[
"doc_3",
null,
null,
null,
1,
2,
3,
4,
5
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum | myNum[0] | myNum[1] | myNum[0][0] | myNum[0][1] | myNum[1][0] | myNum[1][1] | myNum[2] |
|-------|-------|----------|----------|-------------|-------------|-------------|-------------|----------|
| doc_1 | 5     | null     | null     | null        | null        | null        | null        | null     |
| doc_2 | null  | 3        | 4        | null        | null        | null        | null        | null     |
| doc_3 | null  | null     | null     | 1           | 2           | 3           | 4           | 5        |
Cons:
-
Option 3: provide value as JSON

Add one extra column with raw JSON that contains the field.

Query: SELECT * FROM dbg

Response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
},
{
"name": "myNum.json",
"type": "string"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5,
"{ \"myNum\": 5 }"
],
[
"doc_2",
null,
"{ \"myNum\": [3, 4] }"
],
[
"doc_3",
null,
"{ \"myNum\": [[1, 2], [3, 4], 5] }"
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum | myNum.json                       |
|-------|-------|----------------------------------|
| doc_1 | 5     | { "myNum": 5 }                   |
| doc_2 | null  | { "myNum": [3, 4] }              |
| doc_3 | null  | { "myNum": [[1, 2], [3, 4], 5] } |
Cons:
Pros:
-
Option 4: ignore incompatible values

Return null for any value that does not match the declared column type.

Query: SELECT * FROM dbg

Response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5
],
[
"doc_2",
null
],
[
"doc_3",
null
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum |
|-------|-------|
| doc_1 | 5     |
| doc_2 | null  |
| doc_3 | null  |
Cons:
Pros:
-
Option 5: cast or convert

This implies Option 4 and Option 3, but allows the user to use a function (or functions) to convert multi-value fields.

Query: SELECT CAST(myNum AS JSON) AS `myNum.json`, myNum FROM dbg;

Response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
},
{
"name": "myNum.json",
"type": "string"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5,
"{ \"myNum\": 5 }"
],
[
"doc_2",
null,
"{ \"myNum\": [3, 4] }"
],
[
"doc_3",
null,
"{ \"myNum\": [[1, 2], [3, 4], 5] }"
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum | myNum.json                       |
|-------|-------|----------------------------------|
| doc_1 | 5     | { "myNum": 5 }                   |
| doc_2 | null  | { "myNum": [3, 4] }              |
| doc_3 | null  | { "myNum": [[1, 2], [3, 4], 5] } |
Pros:
-
Option 6: PartiQL

This implies Option 4, but allows the user to use PartiQL syntax to define how to deal with multiple values.

Query:

SELECT CASE WHEN (myNum IS ARRAY) THEN myNum[0]
            ELSE myNum END AS myNum
FROM dbg;

or

SELECT num AS myNum FROM dbg AS d, d.myNum[0] AS num

Example of the response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5
],
[
"doc_2",
3
],
[
"doc_3",
null
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum |
|-------|-------|
| doc_1 | 5     |
| doc_2 | 3     |
| doc_3 | null  |
Cons:
Pros:
-
Option 7: painless script

This implies Option 4, but allows the user to submit a Painless script along with the query, for example:

SELECT myNum, painless_script('...') ...

or

SELECT *, painless_script(myNum, '...') ...

or

{
    "query" : "SELECT * ...",
    "painless_script" : "..."
}

Response:

{
"schema": [
{
"name": "_doc",
"type": "string"
},
{
"name": "myNum",
"type": "long"
},
{
"name": "myNum.scripted",
"type": "long"
}
],
"total": 2,
"datarows": [
[
"doc_1",
5,
null
],
[
"doc_2",
null,
4
],
[
"doc_3",
null,
5
]
],
"size": 2,
"status": 200
}

Converted to a table, the response looks like this:

| _doc  | myNum | myNum.scripted |
|-------|-------|----------------|
| doc_1 | 5     | null           |
| doc_2 | null  | 4              |
| doc_3 | null  | 5              |
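For illustration only, a Painless script body that collapses myNum to its last element could look like the sketch below. The integration syntax (painless_script(...) or a separate request field) is deliberately left open above, and the doc-values access pattern shown is an assumption, not part of the proposal:

// Hypothetical script body: collapse the multi-value field to one scalar.
// doc['myNum'] exposes the field's doc values as a sorted list in Painless.
def vals = doc['myNum'];
// Return the last element, or null when the document has none,
// matching the sample response above (doc_2 -> 4, doc_3 -> 5).
return vals.size() == 0 ? null : vals[vals.size() - 1];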
Cons:
Pros:
-
Option 8: error on multiple values

This is a modification of Option 4, but once the SQL engine meets a field with multiple values, it returns an error instead of silently dropping data.

Query: SELECT * FROM dbg

Response:

{
"error": {
"reason": "Error occurred in SQL engine",
"details": "...",
"type": "SomeTypeOfException"
},
"status": 503
}

Cons:
Pros:
-
Now the highlight of the program comes.

Option X: let the user pick the option

This requires the following steps:
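For example, the chosen behaviour could be exposed through the plugin's existing query settings endpoint. The setting name below is purely hypothetical, not an existing option:

PUT _plugins/_query/settings
{
    "transient": {
        "plugins.sql.multivalue.behavior": "flatten"
    }
}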
-
Please feel free to update/edit, add options, and comment on them. My choice is to implement (in order):
-
Thank you @Yury-Fridlyand for starting this discussion. I like this layout. As to multi-valued fields, from my perspective:
If I'm using the REST endpoint, then I expect a JSON response, and dealing with this case should be straightforward. Mapping it for JDBC is the harder part. The one thing I dislike about this is that the schema could change based on the data returned, but I don't see anything in the JDBC documentation that expects it to be the same. This is where configurability could be helpful. If a client can say how to convert a multi-valued field to a single value, then they can get a consistent schema from query to query with simple types.
-
The suggestion is that we implement Option 5 first: #1300 (comment)
-
We can add support for arrays by including a flag in the index mapping. For existing and new mappings, we could use a meta property that marks a field as multi-valued.

In PPL, the JSON array is output. We should consider an overriding meta property that would behave like V2 in all cases (which forces the data type to be a single element). We could use the same meta property with a false value to force that behaviour.

Once multi-field support is available, we could have both options available to the user.
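A sketch of what such a meta property might look like; the array_fields key is purely illustrative, not an agreed-upon name:

PUT dbg
{
    "mappings": {
        "_meta": {
            "array_fields": ["myNum"]
        },
        "properties": {
            "myNum": { "type": "long" }
        }
    }
}

Mapping _meta is free-form metadata in OpenSearch, so the plugin could read such a flag without server-side changes.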
-
Note: in PostgreSQL, array columns are strictly defined; the element type is part of the table schema. We could take a similar approach.
Reference: https://www.postgresql.org/docs/current/arrays.html
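A minimal PostgreSQL sketch of that idea, using the sample data from this thread (the table layout is an assumption for illustration):

-- The element type is fixed in the schema, so a column can never change type.
CREATE TABLE dbg (
    _doc  text,
    mynum bigint[]
);

INSERT INTO dbg VALUES ('doc_1', ARRAY[5]);
INSERT INTO dbg VALUES ('doc_2', ARRAY[3, 4]);

-- PostgreSQL arrays are 1-based; out-of-range subscripts yield NULL.
SELECT _doc, mynum[1] FROM dbg;

Note that doc_3's ragged value [[1, 2], [3, 4], 5] has no direct equivalent here: PostgreSQL multidimensional arrays must be rectangular and uniformly typed.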
-
PartiQL also supports querying arrays using square brackets for arrays of literals.
Query whole array:

PartiQL> select array from root;
==='
<<
{
'array': [
[
1,
2
],
[
3,
4
],
5,
'NA'
]
}
>>
---

Query first index of array:

PartiQL> SELECT array[0] FROM root;
==='
<<
{
'_1': [
1,
2
]
}
>>
---

Query first index of first index of array:

PartiQL> SELECT array[0][0] FROM root;
==='
<<
{
'_1': 1
}
>>
---

Query third index of array:

PartiQL> SELECT array[2] FROM root;
==='
<<
{
'_1': 5
}
>>
---

With this implementation in the V2 engine we can avoid a breaking change. Without the user specifying array indexing in their SQL query, the current behaviour will be adhered to and the first value in the array will be returned. When the user uses square brackets, the responsibility for that index to be valid falls on the user. Following are some examples and edge cases to show this implementation in the SQL plugin.

Arrays of objects

"accounts": [{"id": 1}, {"id": 2}]

Query: SELECT accounts[0].id FROM people;
Response:

Example query that fails due to not indexing the array correctly; a missing value is returned.
Query: SELECT accounts.id[0] FROM people;
Response:

Example query that fails due to indexing the array out of bounds; a missing value is returned.
Query: SELECT accounts[2].id FROM people;
Response:

Example query that returns the whole array.
Query: SELECT accounts[:] FROM people;
Response:

Example query that returns the whole array without specifying the ending bracket. Whole-array support is returned by default when the user uses square brackets in any part of the path of the selected field.
Query: SELECT accounts[0].id FROM people;
Response:

Arrays of objects with inner arrays

"accounts": [{"id": [[1, 1], 2]}, {"id": [3, 4]}]

An example where we return the whole resulting array implicitly:
Query: SELECT accounts[0].id FROM people;
Response:

An example where we return the whole resulting array implicitly:
Query: SELECT accounts[0].id[0][0] FROM people;
Response:

Limitations:
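To make the first example above concrete, a plausible response shape under this proposal might be the following; this is a sketch assuming the plugin's usual schema/datarows format, not output from an actual run:

{
    "schema": [
        { "name": "accounts[0].id", "type": "long" }
    ],
    "datarows": [
        [ 1 ]
    ],
    "total": 1,
    "size": 1
}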
-
There is a method in the engine that collapses an array to a single value. If that logic is changed so that arrays are preserved, then given this data:

{
"name": "value1",
"value": [1]
}
{
"name": "value2",
"value": [2, 3]
}
{
"name": "value3",
"value": [
[4,5],
[6,7],
8
]
}

We can use the SQL query SELECT * FROM test1 (or, presumably, the equivalent PPL query source=test1) and get these results:

{
"schema": [
{
"name": "name",
"type": "string"
},
{
"name": "value",
"type": "long"
}
],
"datarows": [
[
"value1",
[
1
]
],
[
"value2",
[
2,
3
]
],
[
"value3",
[
[
4,
5
],
[
6,
7
],
8
]
]
],
"total": 3,
"size": 3
}

Need to investigate how this would impact SQL queries in particular. We may need to move the logic for collapsing an array closer to the JDBC connector. Some consumers of the JSON results would be able to use the data with arrays as-is.
-
Could we prioritize the basic case first?
-
Initially we will throw errors for operators and functions that do not handle multi-valued fields correctly. We should include the field that caused the failure in the error message.
-
The initial change has been merged in. Array values are preserved by default. They will fail with most operators and functions. At this point, a user should only specify a field with array values in the projection list.
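As an illustration of the described state (a sketch of the semantics, not captured output), using the dbg index from the original post:

-- Works: the raw array is preserved when the field appears only in the projection list.
SELECT myNum FROM dbg;

-- Fails: most operators and functions do not yet handle multi-valued fields,
-- and the error should name the offending field.
SELECT myNum + 1 FROM dbg;
SELECT ABS(myNum) FROM dbg;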
-
Here is a list of operators that need to be updated to handle arrays.
-
@penghuo Here are the next steps that I am thinking of:
Does this approach make sense?
-
Looking into COUNT, it looks like the aggregation can be performed in the OpenSearch engine. This produces the result of counting all elements in all array values.
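For example, a value_count aggregation over the sample dbg index returns 8 rather than 3, because every element of every array is counted; this is standard OpenSearch aggregation behaviour:

POST dbg/_search
{
    "size": 0,
    "aggs": {
        "myNum_count": {
            "value_count": { "field": "myNum" }
        }
    }
}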
-
If the sort order (ORDER BY clause) is applied to a multi-valued field, then the OpenSearch engine will only perform the sort based on the first element in each array.
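For comparison, the underlying search API lets the caller pick which element drives the sort via the mode parameter (min, max, sum, avg, median), which the plugin could potentially expose or push down:

POST dbg/_search
{
    "sort": [
        { "myNum": { "order": "asc", "mode": "min" } }
    ]
}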
-
There are limitations in the OpenSearch server that cause aggregations to behave differently from relational databases. The SQL plugin can work around this by not pushing down aggregations on multi-valued fields. Currently the mapping does not provide a way to distinguish a single-valued field from a multi-valued field; here is an example mapping for an index.
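For the sample index used in this thread, that mapping is simply the following; nothing in it indicates whether myNum holds one value or many:

{
    "mappings": {
        "properties": {
            "myNum": { "type": "long" }
        }
    }
}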
The SQL plugin could make use of additional information in the mapping to tell the two cases apart.
-
There is a multi-valued-fields PR out now for the OpenSearch repository that allows a user to add a multi-valued marker to a field's mapping.
For evaluating MIN, MAX, and COUNT:
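The semantics to pin down presumably look like the following (illustrative queries against the sample index; the intended answers are not spelled out here):

SELECT MIN(myNum) FROM dbg;   -- smallest element across all arrays, or per-document value?
SELECT MAX(myNum) FROM dbg;   -- largest element across all arrays?
SELECT COUNT(myNum) FROM dbg; -- all elements (8) or all documents (3)?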
-
I opened this topic to discuss how to implement a fix for #1300. Unlike an issue, a discussion allows opening threads to post replies to every message, which could be more useful here.
I want to share different ways to implement that below and collect your opinions and votes. Before that, let me clarify some requirements and describe why the legacy engine works incorrectly (given that we can't implement this in V2 the same way as in V1).

Requirements

- A column K in a table N can't change its type.
- All values of a column K in a table N should always be of the same type.
- The schema should be the same for all rows of N.

Sample data
An index called dbg (the name doesn't matter) with 3 docs; all docs have only one field. To simplify understanding the responses, I'll add _doc to them even though V1 doesn't provide it.

The mapping
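Given that every schema above reports myNum as long, the mapping is presumably:

{
    "mappings": {
        "properties": {
            "myNum": { "type": "long" }
        }
    }
}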
The docs
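As used throughout the options above:

doc_1: { "myNum": 5 }
doc_2: { "myNum": [3, 4] }
doc_3: { "myNum": [[1, 2], [3, 4], 5] }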
The query
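The same query is used in every option above:

SELECT * FROM dbg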
V1 response
It is erroneous even though it represents the data as-is. It violates the second requirement. The schema in that response is also not applicable to all rows.
V2 response
Or as a table:
This response is valid in terms of SQL, but loses the data.