Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Presto bugs #963

Closed
evyasonov opened this issue Jan 17, 2023 · 45 comments
Closed

Presto bugs #963

evyasonov opened this issue Jan 17, 2023 · 45 comments

Comments

@evyasonov
Copy link

evyasonov commented Jan 17, 2023

  1. ok
    Input: select approx_percentile(a, b, c, d) Output: The number of provided arguments (4) is greater than the maximum number of supported arguments (3). Line 1, Col: 36. select approx_percentile(a, b, c, d)
    4 arguments are possible: https://prestodb.io/docs/current/functions/aggregate.html

  2. ok
    Input: select concat_ws(a, b) Output: select array_join(b, a)
    convcat_ws has two signatures, replacing it with array_join is not correct for the case when all arguments are strings, such as select concat_ws('a', 'b', 'c'): https://trino.io/docs/current/functions/string.html

  3. ok
    Input: select from_unixtime(a, b, c) Output: The number of provided arguments (3) is greater than the maximum number of supported arguments (2). Line 1, Col: 29. select from_unixtime(a, b, c)
    3 arguments are possible: https://prestodb.io/docs/current/functions/datetime.html

  4. ok
    Input: select from_utf8(a, b) Output: select from_utf8(a)
    Two arguments are possible, the second argument can not be ignored: https://prestodb.io/docs/current/functions/string.html

  5. ok
    Input: select greatest(a) Output: Required keyword: 'expressions' missing for <class 'sqlglot.expressions.Greatest'>. Line 1, Col: 18. select greatest(a)
    Input is valid both select greatest(1) and select greatest(a) FROM (values 1) as t(a) works

  6. ok
    Input: select least(a) Output: Required keyword: 'expressions' missing for <class 'sqlglot.expressions.Least'>. Line 1, Col: 15. select least(a)
    Input is valid both select least(1) and select least(a) FROM (values 1) as t(a) works

  7. ok
    Input: select map() Output: Required keyword: 'keys' missing for <class 'sqlglot.expressions.Map'>. Line 1, Col: 12. select map()
    Input is valid, it returns an empty map

  8. ok
    Input: select max(a, b) Output: The number of provided arguments (2) is greater than the maximum number of supported arguments (1). Line 1, Col: 16. select max(a, b)
    Two arguments are possible: https://prestodb.io/docs/current/functions/aggregate.html

  9. ok
    Input: select min(a, b) Output: The number of provided arguments (2) is greater than the maximum number of supported arguments (1). Line 1, Col: 16. select min(a, b)
    Two arguments are possible: https://prestodb.io/docs/current/functions/aggregate.html

  10. ok
    Input: select strpos(a, b) Output: select strpos(b, a)
    Order of arguments should not be changed

  11. ok
    Input: select trim(a, b) Output: Expecting ). Line 1, Col: 14. select trim(a, b)
    Unexpected error. Two arguments are not documented, but select trim('abaassqweabab', 'ab') works. And in `SHOW FUNCTIONS' it's told that

[
    {
        "Function":"trim",
        "Return Type":"char(x)",
        "Argument Types":"char(x)",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"removes whitespace from the beginning and end of a string"
    },
    {
        "Function":"trim",
        "Return Type":"char(x)",
        "Argument Types":"char(x), codepoints",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"remove the longest string containing only given characters from the beginning and end of a string"
    },
    {
        "Function":"trim",
        "Return Type":"varchar(x)",
        "Argument Types":"varchar(x)",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"removes whitespace from the beginning and end of a string"
    },
    {
        "Function":"trim",
        "Return Type":"varchar(x)",
        "Argument Types":"varchar(x), codepoints",
        "Function Type":"scalar",
        "Deterministic":"true",
        "Description":"remove the longest string containing only given characters from the beginning and end of a string"
    }
]
  1. ok
    Input: select truncate(a, b) Output: select truncate as (a, b)
    Wrong output

  2. ok
    Input: select truncate(a) Output: select truncate as (a)
    Wrong output

  3. ok
    Input: select var_pop(a) Output: select variance_pop(a)
    There's no such function as variance_pop: https://prestodb.io/docs/current/functions/aggregate.html

  4. ok
    Input: select strpos(a, b, c) Output: select strpos(substr(b, c), a) + c - 1
    Wrong output. select strpos('a a a a', 'a', 3) returns 5 while select strpos(substr('a', 3), 'a a a a') + 3 - 1 return 2.
    The strpos should not be changed at all.

  5. might be fixed by creating custom dialect
    Input: select a | b Output: select bitwise_or(a, b)
    Should rise an error, because there's no | in presto.

  6. ok
    Input print(sqlglot.transpile("select a from (select '1\n2' as a)", read="presto", write="presto", pretty=True)[0]), Output:

SELECT
  a
FROM (
  SELECT
    '1
    2' AS a
)

Should not indent multiline strings

  1. ok
    Input CREATE TABLE asd AS SELECT asd FROM asd WITH NO DATA Output: error
    Should work correctly

  2. ?
    Input

/*1*/ SELECT /*2*/ 1 /*3*/ AS /*4*/ a /*5*/ FROM /*6*/ asd /*7*/ INNER /*8*/ JOIN /*9*/ sdf /*10*/
ON /*11*/ a /*12*/ = /*13*/ b /*14*/ WHERE /*15*/ a /*16*/ > /*17*/ 0 /*18*/ GROUP  BY /*20*/ c /*21*/ HAVING
/*22*/ bool_or(/*23*/ a /*24*/ IS /*25*/ NOT /*26*/ DISTINCT FROM /*28*/ NULL /*29*/) ORDER  BY /*31*/
a /*32*/ DESC /*33*/ 

Output

/* 1 */
/* 2 */
SELECT
    1 /* 3 */ AS a /* 5 *//* 6 */

FROM asd /* 7 */
INNER JOIN sdf /* 10 */
    ON /* 13 */
    a /* 12 */ = b /* 14 *//* 15 */

WHERE
    /* 17 */
    a /* 16 */ > 0 /* 18 */
GROUP BY
    c /* 21 */
HAVING
    LOGICAL_OR(a /* 24 */ IS NOT DISTINCT FROM NULL /* 29 */)
ORDER BY
    a /* 32 */ DESC /* 33 */

Should not delete comments 4, 8, 9 11, 22, 23, 25, 26, 28, 31

  1. declined
    Input SELECT bool_or(a> 10) FROM asd AS T(a) GROUP /* asd */ BY GROUPING SETS(()) Output error
    Input is valid, should work in case comments are presented beween special keywords

  2. ok
    Input SELECT bool_or(a> 10) FROM (values 1, 2, 15) AS T(a) Output error
    Input is valid

  3. might be fixed by creating custom dialect
    Input print(sqlglot.transpile("SELECT 1a", read="spark", write="presto", pretty=False)[0]) works correctly, while print(sqlglot.transpile("SELECT 1a", read="presto", write="presto", pretty=False)[0]) returns wrong answer.

  4. declined
    Input SELECT row() Output error SELECT row()
    Presto returns Function row not registered, so the parser also should throw an error

  5. ?
    Input print(sqlglot.transpile("select 1a as 1a", read="spark", write="presto", pretty=False)[0]) Output select 1a as 1a
    It transpiles valid spark sql into non-valid presto sql

Hint
You can probably use SHOW FUNCTIONS to get all available functions and check other dialects. You can use the script below

import pandas

# athena_v2_functions.csv and athena_v3_functions.csv were got by executing SHOW FUNCTIONS
v2 = pandas.read_csv('athena_v2_functions.csv').iloc[:, 1:].astype(str).applymap(lambda x: x.lower())
v3 = pandas.read_csv('athena_v3_functions.csv').iloc[:, 1:].astype(str).applymap(lambda x: x.lower())
df = v2.merge(v3, how = 'outer')

import re

def getNumberOfArguments(argTypes):
    while argTypes != re.sub('\([^\(\)]*\)', '', argTypes):
        argTypes = re.sub('\([^\(\)]*\)', '', argTypes)
    return len([x for x in argTypes.split(',') if len(x.strip()) > 0 and x.strip() != 'nan'])

df['Arguments Count'] = df['Argument Types'].map(getNumberOfArguments)

df.query("Function == 'concat'")
df.query("Function == 'rand'")

df = df[['Function', 'Arguments Count']].drop_duplicates()

df['sql'] = df.apply(
    lambda r: f"{r['Function']}({', '.join(chr(v) for v in range(ord('a'), ord('a') + r['Arguments Count']))})",
    axis = 1
)


import sqlglot

for sql in df['sql']:
    sql = f'SELECT {sql}'.lower()
    
    try:
        output = sqlglot.transpile(sql, read="presto", write="presto", pretty=False)[0].lower()
        if sql == output:
            continue
    except Exception as e:
        output = str(e).replace('\n', '')
    print(f'Input: `{sql}` Output: `{output}`\n')
@georgesittas
Copy link
Collaborator

18, 21 are fixed.

20, 22 won't be fixed.

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

#1 e861aa9

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

#2 469bb11

@evyasonov
Copy link
Author

#2 469bb11

presto also supports concat_ws, not only trino. It's not in the documentations, but it's in the code and it works: https://github.com/prestodb/presto/search?q=concat_ws

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

@evyasonov thanks will update

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

#2 6cf1949

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

#3 c60cd3d

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

#4 4f5c8a9

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

5, 6, 7 cddac1b

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

8, 9 650e123

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

10 a1fb957

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

11 57eb2ce

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

12, 13 4ba4694

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

14 588064e

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

15 2c8a413

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

16 won't fix, sqlglot is not a validator

@evyasonov
Copy link
Author

evyasonov commented Jan 17, 2023

16 won't fix, sqlglot is not a validator

Why not leave it as it is? I mean transform select a | b into select a | b

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

because | is not valid in presto, so | makes it into the user intent

@evyasonov
Copy link
Author

evyasonov commented Jan 17, 2023

because | is not valid in presto, so | makes it into the user intent

Not agree. A user might easily misspell | with || which is valid in presto, so his intention is concat in this case. The best option is to leave it as it is. While bitwise_or it too unexpected,
Like I don't expect that sqlglot will try to fix my code....

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

23 is won't fix,

only remaining issue is 17 which will be fixed later

@tobymao tobymao closed this as completed Jan 17, 2023
@evyasonov
Copy link
Author

@tobymao
What about 19?

Also, can you please explain why not to fix 22? I mean it works when it reads from spark, but does not work for presto. What is the problem?

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

@georgesittas is looking at 19, but sqlglot is best effort with comments, we'll try to improve it but it won't get everything like group /* x */ by will never be supported

with 22, although presto doesn't support select 1a, sqlglot does and is more leniant

@evyasonov
Copy link
Author

@tobymao
okay... but why not leaving '1a' as '1a' as you do in spark? I mean what is the logic behind different behaviour of spark vs presto?

@tobymao
Copy link
Owner

tobymao commented Jan 17, 2023

in spark, 1a is valid column syntax

@evyasonov
Copy link
Author

@tobymao
okay... then there's another bug

24
print(sqlglot.transpile("select 1a as 1a", read="spark", write="presto", pretty=False)[0])
it transpiles valid spark sql into non-valid presto sql :)

@evyasonov
Copy link
Author

evyasonov commented Jan 18, 2023

@tobymao
Also, presto supports identifiers starts with digits in case it's quoted into "".
Other words, SELECT "1a" FROM (VALUES 1) AS T("1a") is a valid presto sql.
So if sqlglot is more leniant, isn't it logical to transpile 'SELECT 1a' into 'SELECT "1a"' and not into 'SELECT 1 AS a'?
What do you think?

@tobymao
Copy link
Owner

tobymao commented Jan 18, 2023

depends on the engine, many engines do select 1a == select 1 as a

@evyasonov
Copy link
Author

In case of presto transpiling 1a into"1a" seems closer to user's intention. If this is a typo, then presto will return an error because column "1a" does not exist. If a user moves from hive/spark to presto, then it will be the same behaviour. The worst scenario is to transpile to 1 as a, because if this is a typo, presto will not return an error, but execute a query which might cost money (in case of athena or s3 and it might be a lot of money). Also, it will be much harder for a user to find this kind of bug from query data.
What do you think?

@evyasonov
Copy link
Author

evyasonov commented Jan 18, 2023

Also, in case if it's something like SELECT 1e1a1a... Currently it's transpiled into SELECT 1e1 AS a1a which seems weird. Why not 1 AS e1a1a? The option of "1e1a1a" looks much better in this case.

Does "other" engines do do select 1e1a1a == select 1 as e1a1a or == select 1e1 as a1a?

@tobymao
Copy link
Owner

tobymao commented Jan 18, 2023

transpiling SQL is extremely difficult. the current behavior matches postgres which is a pretty good standard. i'm not inclined to change this behavior right now, sorry.

in order for this project to be maintainable, we need to pick and choose our battles. in this case, i think the current behavior is fine.

@evyasonov
Copy link
Author

ok,,, will create a new dialect called "my logical presto" this way:

class MyLogicalPresto(Presto):
    class Tokenizer(tokens.Tokenizer):
        IDENTIFIER_CAN_START_WITH_DIGIT = True

which will resolve risky behaviour

@evyasonov
Copy link
Author

also, can you tell what should I add into "my logical presto" so it won't change | into bitwise_or?

@tobymao
Copy link
Owner

tobymao commented Jan 18, 2023

ok,,, will create a new dialect called "my logical presto" this way:

class MyLogicalPresto(Presto):
    class Tokenizer(tokens.Tokenizer):
        IDENTIFIER_CAN_START_WITH_DIGIT = True

which will resolve risky behaviour

that's a great solution

@tobymao
Copy link
Owner

tobymao commented Jan 18, 2023

also, can you tell what should I add into "my logical presto" so it won't change | into bitwise_or?

one way is to error by doing

Parser:
BITWISE.pop(TokenType.PIPE)

@evyasonov
Copy link
Author

evyasonov commented Jan 18, 2023

like this

class MyLogicalPresto(Presto):
    class Tokenizer(tokens.Tokenizer):
        IDENTIFIER_CAN_START_WITH_DIGIT = True
    class Parser(parser.Parser):
        BITWISE = parser.Parser.BITWISE.copy()
        BITWISE.pop(TokenType.PIPE)

?

@tobymao
Copy link
Owner

tobymao commented Jan 18, 2023

yea, then it should raise when you do x | y

@evyasonov
Copy link
Author

ok, thx

@evyasonov
Copy link
Author

P.S. and what about bug 24?)

@evyasonov
Copy link
Author

@georgesittas any update regarding 19?
@tobymao any update regarding 24?

@tobymao
Copy link
Owner

tobymao commented Jan 19, 2023

nope

@georgesittas
Copy link
Collaborator

I've looked at 19 and I don't think it's worth fixing. Most of these comments are placed in bizarre spots that you wouldn't normally expect them to be, and others (such as the one following the AS token) could be fixed, but the code base would be cluttered unnecessarily to preserve them.

The way we've approached comments, there will always be cases that won't be handled correctly. It's a hard problem to solve completely, and as Toby mentioned in a previous message, SQLGlot follows a best-effort approach; it prioritizes preserving comments that are more frequent, e.g. in projection lists, table references in a FROM clause, etc.

@evyasonov
Copy link
Author

@georgesittas
And is it possible to add a parameter which will allow to merge comments? Let's say if a user set this parameter to True, then first it reads the query, then it extracts all comments and moves it into one big comment in the beginning on the query. If it could be applied just to comments in bizarre spots, it would be perfect. But if not, then one big merge is also good.

My motivation here is that most of users don't expect that anything might be deleted from their queries without letting know. Especially if it's done in semi-automatic mode or automatic mode, so users don't review the output.

It will solve both 19 and 20. What do you think?

@tobymao
And what is the status of 24 in this case? Is "fixing cases when a user transpiles spark identifiers which starts with a digit to presto or other dialect which doesn't support identifiers which starts with a digit" in the roadmap?

@georgesittas
Copy link
Collaborator

It will solve both 19 and 20. What do you think?

That's not really any better. Especially for big queries, the comments will be moved out of context and they will essentially be useless. The point is to try and maintain the context in which they appear.

@evyasonov
Copy link
Author

@georgesittas
Not agree. Because in half of cases moving comments out of context might make them useless, but in other half the comments still might be useful.

From my point of view, this is the priorities:

  • The Best - keep all coments in its context
  • Much Better - keep comments in its context as much as possible, but the rest move to other place and add context to it:
Before
line 45: GROUP /* the last grouping set is needed to maintain older versions */
line 46: BY

After:
line 1: /* GROUP / * the last grouping set is needed to maintain older versions * / BY */
line 46: GROUP BY
  • Better - keep comments in its context as much as possible, but the rest move to other place without adding any context to it
  • OK - move all comments into other place and add context to it
  • Worse - move all comments into other place without adding any context to it
  • Much Worse - keep comments in its context as much as possible, but delete the rest and let user know that X comments were deleted
  • The Worst - keep comments in its context as much as possible, but delete the rest and make a 🎉surpise 🎉for a user

What do you think?

@georgesittas
Copy link
Collaborator

georgesittas commented Jan 20, 2023

This is not a priority right now, maybe we'll revisit this at some point in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants