Presto bugs #963

evyasonov · 2023-01-17T21:17:02Z

ok
Input: select approx_percentile(a, b, c, d) Output: The number of provided arguments (4) is greater than the maximum number of supported arguments (3). Line 1, Col: 36. select approx_percentile(a, b, c, d)
4 arguments are possible: https://prestodb.io/docs/current/functions/aggregate.html
ok
Input: select concat_ws(a, b) Output: select array_join(b, a)
convcat_ws has two signatures, replacing it with array_join is not correct for the case when all arguments are strings, such as select concat_ws('a', 'b', 'c'): https://trino.io/docs/current/functions/string.html
ok
Input: select from_unixtime(a, b, c) Output: The number of provided arguments (3) is greater than the maximum number of supported arguments (2). Line 1, Col: 29. select from_unixtime(a, b, c)
3 arguments are possible: https://prestodb.io/docs/current/functions/datetime.html
ok
Input: select from_utf8(a, b) Output: select from_utf8(a)
Two arguments are possible, the second argument can not be ignored: https://prestodb.io/docs/current/functions/string.html
ok
Input: select greatest(a) Output: Required keyword: 'expressions' missing for <class 'sqlglot.expressions.Greatest'>. Line 1, Col: 18. select greatest(a)
Input is valid both select greatest(1) and select greatest(a) FROM (values 1) as t(a) works
ok
Input: select least(a) Output: Required keyword: 'expressions' missing for <class 'sqlglot.expressions.Least'>. Line 1, Col: 15. select least(a)
Input is valid both select least(1) and select least(a) FROM (values 1) as t(a) works
ok
Input: select map() Output: Required keyword: 'keys' missing for <class 'sqlglot.expressions.Map'>. Line 1, Col: 12. select map()
Input is valid, it returns an empty map
ok
Input: select max(a, b) Output: The number of provided arguments (2) is greater than the maximum number of supported arguments (1). Line 1, Col: 16. select max(a, b)
Two arguments are possible: https://prestodb.io/docs/current/functions/aggregate.html
ok
Input: select min(a, b) Output: The number of provided arguments (2) is greater than the maximum number of supported arguments (1). Line 1, Col: 16. select min(a, b)
Two arguments are possible: https://prestodb.io/docs/current/functions/aggregate.html
ok
Input: select strpos(a, b) Output: select strpos(b, a)
Order of arguments should not be changed
ok
Input: select trim(a, b) Output: Expecting ). Line 1, Col: 14. select trim(a, b)
Unexpected error. Two arguments are not documented, but select trim('abaassqweabab', 'ab') works. And in `SHOW FUNCTIONS' it's told that

[
    {
        "Function":"trim",
        "Return Type":"char(x)",
        "Argument Types":"char(x)",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"removes whitespace from the beginning and end of a string"
    },
    {
        "Function":"trim",
        "Return Type":"char(x)",
        "Argument Types":"char(x), codepoints",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"remove the longest string containing only given characters from the beginning and end of a string"
    },
    {
        "Function":"trim",
        "Return Type":"varchar(x)",
        "Argument Types":"varchar(x)",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"removes whitespace from the beginning and end of a string"
    },
    {
        "Function":"trim",
        "Return Type":"varchar(x)",
        "Argument Types":"varchar(x), codepoints",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"remove the longest string containing only given characters from the beginning and end of a string"
    }
]

ok
Input: select truncate(a, b) Output: select truncate as (a, b)
Wrong output
ok
Input: select truncate(a) Output: select truncate as (a)
Wrong output
ok
Input: select var_pop(a) Output: select variance_pop(a)
There's no such function as variance_pop: https://prestodb.io/docs/current/functions/aggregate.html
ok
Input: select strpos(a, b, c) Output: select strpos(substr(b, c), a) + c - 1
Wrong output. select strpos('a a a a', 'a', 3) returns 5 while select strpos(substr('a', 3), 'a a a a') + 3 - 1 return 2.
The strpos should not be changed at all.
might be fixed by creating custom dialect
Input: select a | b Output: select bitwise_or(a, b)
Should rise an error, because there's no | in presto.
ok
Input print(sqlglot.transpile("select a from (select '1\n2' as a)", read="presto", write="presto", pretty=True)[0]), Output:

SELECT
  a
FROM (
  SELECT
    '1
    2' AS a
)

Should not indent multiline strings

ok
Input CREATE TABLE asd AS SELECT asd FROM asd WITH NO DATA Output: error
Should work correctly
?
Input

/*1*/ SELECT /*2*/ 1 /*3*/ AS /*4*/ a /*5*/ FROM /*6*/ asd /*7*/ INNER /*8*/ JOIN /*9*/ sdf /*10*/
ON /*11*/ a /*12*/ = /*13*/ b /*14*/ WHERE /*15*/ a /*16*/ > /*17*/ 0 /*18*/ GROUP  BY /*20*/ c /*21*/ HAVING
/*22*/ bool_or(/*23*/ a /*24*/ IS /*25*/ NOT /*26*/ DISTINCT FROM /*28*/ NULL /*29*/) ORDER  BY /*31*/
a /*32*/ DESC /*33*/

Output

/* 1 */
/* 2 */
SELECT
    1 /* 3 */ AS a /* 5 *//* 6 */

FROM asd /* 7 */
INNER JOIN sdf /* 10 */
    ON /* 13 */
    a /* 12 */ = b /* 14 *//* 15 */

WHERE
    /* 17 */
    a /* 16 */ > 0 /* 18 */
GROUP BY
    c /* 21 */
HAVING
    LOGICAL_OR(a /* 24 */ IS NOT DISTINCT FROM NULL /* 29 */)
ORDER BY
    a /* 32 */ DESC /* 33 */

Should not delete comments 4, 8, 9 11, 22, 23, 25, 26, 28, 31

declined
Input SELECT bool_or(a> 10) FROM asd AS T(a) GROUP /* asd */ BY GROUPING SETS(()) Output error
Input is valid, should work in case comments are presented beween special keywords
ok
Input SELECT bool_or(a> 10) FROM (values 1, 2, 15) AS T(a) Output error
Input is valid
might be fixed by creating custom dialect
Input print(sqlglot.transpile("SELECT 1a", read="spark", write="presto", pretty=False)[0]) works correctly, while print(sqlglot.transpile("SELECT 1a", read="presto", write="presto", pretty=False)[0]) returns wrong answer.
declined
Input SELECT row() Output error SELECT row()
Presto returns Function row not registered, so the parser also should throw an error
?
Input print(sqlglot.transpile("select 1a as 1a", read="spark", write="presto", pretty=False)[0]) Output select 1a as 1a
It transpiles valid spark sql into non-valid presto sql

Hint
You can probably use SHOW FUNCTIONS to get all available functions and check other dialects. You can use the script below

import pandas

# athena_v2_functions.csv and athena_v3_functions.csv were got by executing SHOW FUNCTIONS
v2 = pandas.read_csv('athena_v2_functions.csv').iloc[:, 1:].astype(str).applymap(lambda x: x.lower())
v3 = pandas.read_csv('athena_v3_functions.csv').iloc[:, 1:].astype(str).applymap(lambda x: x.lower())
df = v2.merge(v3, how = 'outer')

import re

def getNumberOfArguments(argTypes):
    while argTypes != re.sub('\([^\(\)]*\)', '', argTypes):
        argTypes = re.sub('\([^\(\)]*\)', '', argTypes)
    return len([x for x in argTypes.split(',') if len(x.strip()) > 0 and x.strip() != 'nan'])

df['Arguments Count'] = df['Argument Types'].map(getNumberOfArguments)

df.query("Function == 'concat'")
df.query("Function == 'rand'")

df = df[['Function', 'Arguments Count']].drop_duplicates()

df['sql'] = df.apply(
    lambda r: f"{r['Function']}({', '.join(chr(v) for v in range(ord('a'), ord('a') + r['Arguments Count']))})",
    axis = 1
)


import sqlglot

for sql in df['sql']:
    sql = f'SELECT {sql}'.lower()
    
    try:
        output = sqlglot.transpile(sql, read="presto", write="presto", pretty=False)[0].lower()
        if sql == output:
            continue
    except Exception as e:
        output = str(e).replace('\n', '')
    print(f'Input: `{sql}` Output: `{output}`\n')

The text was updated successfully, but these errors were encountered:

georgesittas · 2023-01-17T21:43:46Z

18, 21 are fixed.

20, 22 won't be fixed.

tobymao · 2023-01-17T21:48:10Z

#1 e861aa9

tobymao · 2023-01-17T21:59:05Z

#2 469bb11

evyasonov · 2023-01-17T22:08:46Z

#2 469bb11

presto also supports concat_ws, not only trino. It's not in the documentations, but it's in the code and it works: https://github.com/prestodb/presto/search?q=concat_ws

tobymao · 2023-01-17T22:09:27Z

@evyasonov thanks will update

tobymao · 2023-01-17T22:11:47Z

#2 6cf1949

tobymao · 2023-01-17T22:18:25Z

#3 c60cd3d

tobymao · 2023-01-17T22:22:34Z

#4 4f5c8a9

tobymao · 2023-01-17T22:25:10Z

5, 6, 7 cddac1b

tobymao · 2023-01-17T22:27:02Z

8, 9 650e123

tobymao · 2023-01-17T22:51:55Z

10 a1fb957

tobymao · 2023-01-17T23:02:11Z

11 57eb2ce

tobymao · 2023-01-17T23:06:17Z

12, 13 4ba4694

tobymao · 2023-01-17T23:08:05Z

14 588064e

tobymao · 2023-01-17T23:12:31Z

15 2c8a413

tobymao · 2023-01-17T23:23:28Z

16 won't fix, sqlglot is not a validator

evyasonov · 2023-01-17T23:26:06Z

16 won't fix, sqlglot is not a validator

Why not leave it as it is? I mean transform select a | b into select a | b

tobymao · 2023-01-17T23:29:05Z

because | is not valid in presto, so | makes it into the user intent

evyasonov · 2023-01-17T23:31:03Z

because | is not valid in presto, so | makes it into the user intent

Not agree. A user might easily misspell | with || which is valid in presto, so his intention is concat in this case. The best option is to leave it as it is. While bitwise_or it too unexpected,
Like I don't expect that sqlglot will try to fix my code....

tobymao · 2023-01-17T23:39:17Z

23 is won't fix,

only remaining issue is 17 which will be fixed later

evyasonov · 2023-01-17T23:40:58Z

@tobymao
What about 19?

Also, can you please explain why not to fix 22? I mean it works when it reads from spark, but does not work for presto. What is the problem?

tobymao · 2023-01-17T23:46:31Z

@georgesittas is looking at 19, but sqlglot is best effort with comments, we'll try to improve it but it won't get everything like group /* x */ by will never be supported

with 22, although presto doesn't support select 1a, sqlglot does and is more leniant

evyasonov · 2023-01-17T23:53:44Z

@tobymao
okay... but why not leaving '1a' as '1a' as you do in spark? I mean what is the logic behind different behaviour of spark vs presto?

tobymao · 2023-01-17T23:55:17Z

in spark, 1a is valid column syntax

evyasonov · 2023-01-18T00:04:56Z

@tobymao
okay... then there's another bug

24
print(sqlglot.transpile("select 1a as 1a", read="spark", write="presto", pretty=False)[0])
it transpiles valid spark sql into non-valid presto sql :)

evyasonov · 2023-01-18T00:09:51Z

@tobymao
Also, presto supports identifiers starts with digits in case it's quoted into "".
Other words, SELECT "1a" FROM (VALUES 1) AS T("1a") is a valid presto sql.
So if sqlglot is more leniant, isn't it logical to transpile 'SELECT 1a' into 'SELECT "1a"' and not into 'SELECT 1 AS a'?
What do you think?

tobymao · 2023-01-18T00:10:37Z

depends on the engine, many engines do select 1a == select 1 as a

evyasonov · 2023-01-18T00:17:19Z

In case of presto transpiling 1a into"1a" seems closer to user's intention. If this is a typo, then presto will return an error because column "1a" does not exist. If a user moves from hive/spark to presto, then it will be the same behaviour. The worst scenario is to transpile to 1 as a, because if this is a typo, presto will not return an error, but execute a query which might cost money (in case of athena or s3 and it might be a lot of money). Also, it will be much harder for a user to find this kind of bug from query data.
What do you think?

evyasonov · 2023-01-18T00:22:11Z

Also, in case if it's something like SELECT 1e1a1a... Currently it's transpiled into SELECT 1e1 AS a1a which seems weird. Why not 1 AS e1a1a? The option of "1e1a1a" looks much better in this case.

Does "other" engines do do select 1e1a1a == select 1 as e1a1a or == select 1e1 as a1a?

tobymao · 2023-01-18T00:23:57Z

transpiling SQL is extremely difficult. the current behavior matches postgres which is a pretty good standard. i'm not inclined to change this behavior right now, sorry.

in order for this project to be maintainable, we need to pick and choose our battles. in this case, i think the current behavior is fine.

evyasonov · 2023-01-18T00:28:25Z

ok,,, will create a new dialect called "my logical presto" this way:

class MyLogicalPresto(Presto):
    class Tokenizer(tokens.Tokenizer):
        IDENTIFIER_CAN_START_WITH_DIGIT = True

which will resolve risky behaviour

evyasonov · 2023-01-18T00:30:30Z

also, can you tell what should I add into "my logical presto" so it won't change | into bitwise_or?

tobymao · 2023-01-18T00:32:09Z

ok,,, will create a new dialect called "my logical presto" this way:
class MyLogicalPresto(Presto):
    class Tokenizer(tokens.Tokenizer):
        IDENTIFIER_CAN_START_WITH_DIGIT = True
which will resolve risky behaviour

that's a great solution

tobymao · 2023-01-18T00:34:03Z

also, can you tell what should I add into "my logical presto" so it won't change | into bitwise_or?

one way is to error by doing

Parser:
BITWISE.pop(TokenType.PIPE)

evyasonov · 2023-01-18T00:38:11Z

like this

class MyLogicalPresto(Presto):
    class Tokenizer(tokens.Tokenizer):
        IDENTIFIER_CAN_START_WITH_DIGIT = True
    class Parser(parser.Parser):
        BITWISE = parser.Parser.BITWISE.copy()
        BITWISE.pop(TokenType.PIPE)

?

tobymao · 2023-01-18T00:43:58Z

yea, then it should raise when you do x | y

evyasonov · 2023-01-18T00:44:18Z

ok, thx

evyasonov · 2023-01-18T00:45:31Z

P.S. and what about bug 24?)

evyasonov · 2023-01-19T20:29:45Z

@georgesittas any update regarding 19?
@tobymao any update regarding 24?

tobymao · 2023-01-19T20:37:39Z

nope

georgesittas · 2023-01-19T20:52:19Z

I've looked at 19 and I don't think it's worth fixing. Most of these comments are placed in bizarre spots that you wouldn't normally expect them to be, and others (such as the one following the AS token) could be fixed, but the code base would be cluttered unnecessarily to preserve them.

The way we've approached comments, there will always be cases that won't be handled correctly. It's a hard problem to solve completely, and as Toby mentioned in a previous message, SQLGlot follows a best-effort approach; it prioritizes preserving comments that are more frequent, e.g. in projection lists, table references in a FROM clause, etc.

evyasonov · 2023-01-19T21:44:58Z

@georgesittas
And is it possible to add a parameter which will allow to merge comments? Let's say if a user set this parameter to True, then first it reads the query, then it extracts all comments and moves it into one big comment in the beginning on the query. If it could be applied just to comments in bizarre spots, it would be perfect. But if not, then one big merge is also good.

My motivation here is that most of users don't expect that anything might be deleted from their queries without letting know. Especially if it's done in semi-automatic mode or automatic mode, so users don't review the output.

It will solve both 19 and 20. What do you think?

@tobymao
And what is the status of 24 in this case? Is "fixing cases when a user transpiles spark identifiers which starts with a digit to presto or other dialect which doesn't support identifiers which starts with a digit" in the roadmap?

georgesittas · 2023-01-19T22:50:48Z

It will solve both 19 and 20. What do you think?

That's not really any better. Especially for big queries, the comments will be moved out of context and they will essentially be useless. The point is to try and maintain the context in which they appear.

evyasonov · 2023-01-19T23:16:19Z

@georgesittas
Not agree. Because in half of cases moving comments out of context might make them useless, but in other half the comments still might be useful.

From my point of view, this is the priorities:

The Best - keep all coments in its context
Much Better - keep comments in its context as much as possible, but the rest move to other place and add context to it:

Before
line 45: GROUP /* the last grouping set is needed to maintain older versions */
line 46: BY

After:
line 1: /* GROUP / * the last grouping set is needed to maintain older versions * / BY */
line 46: GROUP BY

Better - keep comments in its context as much as possible, but the rest move to other place without adding any context to it
OK - move all comments into other place and add context to it
Worse - move all comments into other place without adding any context to it
Much Worse - keep comments in its context as much as possible, but delete the rest and let user know that X comments were deleted
The Worst - keep comments in its context as much as possible, but delete the rest and make a 🎉surpise 🎉for a user

What do you think?

georgesittas · 2023-01-20T16:09:59Z

This is not a priority right now, maybe we'll revisit this at some point in the future.

georgesittas mentioned this issue Jan 17, 2023

Fix presto values parsing #964

Merged

tobymao closed this as completed Jan 17, 2023

Presto bugs #963

Presto bugs #963

Comments

evyasonov commented Jan 17, 2023 • edited Loading

georgesittas commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

evyasonov commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023 • edited Loading

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

tobymao commented Jan 17, 2023

evyasonov commented Jan 17, 2023 • edited Loading

tobymao commented Jan 17, 2023

evyasonov commented Jan 17, 2023 • edited Loading

tobymao commented Jan 17, 2023

evyasonov commented Jan 17, 2023

tobymao commented Jan 17, 2023

evyasonov commented Jan 17, 2023

tobymao commented Jan 17, 2023

evyasonov commented Jan 18, 2023

evyasonov commented Jan 18, 2023 • edited Loading

tobymao commented Jan 18, 2023

evyasonov commented Jan 18, 2023

evyasonov commented Jan 18, 2023 • edited Loading

tobymao commented Jan 18, 2023

evyasonov commented Jan 18, 2023

evyasonov commented Jan 18, 2023

tobymao commented Jan 18, 2023

tobymao commented Jan 18, 2023

evyasonov commented Jan 18, 2023 • edited Loading

tobymao commented Jan 18, 2023

evyasonov commented Jan 18, 2023

evyasonov commented Jan 18, 2023

evyasonov commented Jan 19, 2023

tobymao commented Jan 19, 2023

georgesittas commented Jan 19, 2023

evyasonov commented Jan 19, 2023

georgesittas commented Jan 19, 2023

evyasonov commented Jan 19, 2023

georgesittas commented Jan 20, 2023 • edited Loading

evyasonov commented Jan 17, 2023 •

edited

Loading

tobymao commented Jan 17, 2023 •

edited

Loading

evyasonov commented Jan 17, 2023 •

edited

Loading

evyasonov commented Jan 17, 2023 •

edited

Loading

evyasonov commented Jan 18, 2023 •

edited

Loading

evyasonov commented Jan 18, 2023 •

edited

Loading

evyasonov commented Jan 18, 2023 •

edited

Loading

georgesittas commented Jan 20, 2023 •

edited

Loading