Fix: set quote_identifiers in qualify, add normalize flag in schema #1701

georgesittas · 2023-05-29T21:12:31Z

I guess I could've split this PR into two separate ones, as it introduces both a fix and a feature, but anyway.

Activated the quote_identifiers flag in the qualify rule by default, because this case is currently failing:

>>> from sqlglot.optimizer import optimize
>>> optimize("select * from tbl", dialect="snowflake", schema={"tbl": {'"a"': "int"}}).sql()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sqlglot/sqlglot/optimizer/optimizer.py", line 90, in optimize
    expression = rule(expression, **rule_kwargs)
  File "sqlglot/sqlglot/optimizer/annotate_types.py", line 31, in annotate_types
    return TypeAnnotator(schema, annotators, coerces_to).annotate(expression)
  File "sqlglot/sqlglot/optimizer/annotate_types.py", line 296, in annotate
    col.type = self.schema.get_column_type(source, col)
  File "sqlglot/sqlglot/schema.py", line 289, in get_column_type
    raise SchemaError(f"Unknown column type '{column_type}'")
sqlglot.errors.SchemaError: Unknown column type 'None'

Added a new flag in MappingSchema to control whether or not the normalization logic will kick in. This might be useful when e.g. we want to add (unquoted) names that are already normalized, and a 2nd normalization pass would mess them up.
Added and improved some type hints and comments.
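
To illustrate why a second normalization pass can be harmful, here is a minimal toy sketch. It is not sqlglot's MappingSchema: `ToySchema`, `add_column` and `normalize_name` are hypothetical names, and the uppercase-unless-quoted rule is an assumed simplification of Snowflake-style identifier folding.

```python
def normalize_name(name):
    """Toy Snowflake-style rule (assumption): quoted names keep their case,
    unquoted names are folded to uppercase."""
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        return name[1:-1]
    return name.upper()


class ToySchema:
    """Hypothetical stand-in for a column-type mapping; not sqlglot's MappingSchema."""

    def __init__(self, normalize=True):
        self.normalize = normalize
        self.columns = {}

    def add_column(self, name, column_type):
        # With normalize=False the caller's name is trusted and stored as-is.
        key = normalize_name(name) if self.normalize else name
        self.columns[key] = column_type


# The name `a` was already normalized elsewhere: it came from the quoted
# identifier "a", but the quotes are gone by the time it reaches the schema.
renormalized = ToySchema(normalize=True)
renormalized.add_column("a", "int")

as_is = ToySchema(normalize=False)
as_is.add_column("a", "int")

print(renormalized.columns)  # {'A': 'int'}: the second pass changed the key
print(as_is.columns)         # {'a': 'int'}: the key is stored unchanged
```

In this toy model the flag only decides whether names are rewritten on insertion; lookups would need to follow the same choice.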

sqlglot/dataframe/sql/dataframe.py

georgesittas · 2023-05-29T21:21:05Z

> did you add a test for this?

Yeah, I leveraged the one you wrote: https://github.com/tobymao/sqlglot/pull/1701/files#diff-739b55fbe772c7434981d6a9f1f6d0dfa499468cb5251b26c588589fb66108cdR713-R716

tobymao · 2023-05-29T21:21:22Z

what's the case for feature 2? when did you encounter a need for it?

why did annotate types fail?

georgesittas · 2023-05-29T21:37:09Z

> why did annotate types fail?

>>> from sqlglot.optimizer import optimize
>>> optimize("select * from tbl", dialect="snowflake", schema={"tbl": {'"a"': "int"}}).sql()
> sqlglot/optimizer/annotate_types.py(297)annotate()
-> col.type = self.schema.get_column_type(source, col)
(Pdb) col
(COLUMN this:
  (IDENTIFIER this: a, quoted: False), table:
  (IDENTIFIER this: TBL, quoted: False))

Notice how the column a is unquoted here (due to quote_identifiers=False in qualify and the fact that annotate_types runs before canonicalize). Passing this to get_column_type results in a failed lookup, because a is normalized into A and hence no longer matches what we have in the schema:

-> column_type = table_schema.get(normalized_column_name)
(Pdb) normalized_column_name
'A'
(Pdb) self.mapping
{'TBL': {'a': 'int'}}
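
The mismatch above can be reproduced without sqlglot. The following is a toy sketch under assumptions: `normalize` is a hypothetical helper that models Snowflake-style folding (unquoted names become uppercase, quoted names keep their case), and plain dicts stand in for the schema mapping.

```python
def normalize(name, quoted):
    """Toy Snowflake-style rule (assumption): unquoted identifiers fold to uppercase."""
    return name if quoted else name.upper()


# Schema as supplied by the user: the column key '"a"' is quoted, the table key is not.
mapping = {normalize("tbl", quoted=False): {normalize("a", quoted=True): "int"}}

table_schema = mapping[normalize("tbl", quoted=False)]

# The column reaches the lookup without its quotes, so it is folded to 'A'.
unquoted_lookup = table_schema.get(normalize("a", quoted=False))

# If the quote information had been preserved, the lookup would succeed.
quoted_lookup = table_schema.get(normalize("a", quoted=True))

print(mapping)          # {'TBL': {'a': 'int'}}
print(unquoted_lookup)  # None
print(quoted_lookup)    # int
```

The `None` result corresponds to the value that ends up in the "Unknown column type 'None'" error message.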

> when did you encounter a need for it?

Discussed in Slack.

tobymao · 2023-05-29T21:54:18Z

this is problematic. sqlmesh doesn’t quote things which means annotate types would fail. could we fix annotate types to work without quotes?

maybe the problem is the double normalize?

georgesittas · 2023-05-29T22:12:20Z

So the issue is not actually related directly to annotate_types -- it's related to the schema's get_column_type method, because the following raise is reached:

            column_type = table_schema.get(normalized_column_name)  # Fails because we have `a` instead of `A` in the schema

            if isinstance(column_type, exp.DataType):
                return column_type
            elif isinstance(column_type, str):
                return self._to_data_type(column_type.upper(), dialect=dialect)

            raise SchemaError(f"Unknown column type '{column_type}'")

> sqlmesh doesn’t quote things which means annotate types would fail. could we fix annotate types to work without quotes?

Why can't we add quotes over there as well?

> maybe the problem is the double normalize?

You mean in both normalize_identifiers and the schema? If so then yeah, that's the problem here, but I'm not sure yet if it's correct to remove the normalization logic for get_column_type.

tobymao · 2023-05-29T22:14:22Z

now that qualify normalizes everything. schema shouldn’t need to renormalize

georgesittas · 2023-05-29T22:17:55Z

Ok sounds interesting, I'll explore this idea a bit more tomorrow and see if we can get rid of the schema normalization logic altogether.

tobymao · 2023-05-29T22:23:42Z

or else just move identify before annotate types. as a stand-alone step

georgesittas · 2023-05-30T11:13:53Z

> now that qualify normalizes everything. schema shouldn’t need to renormalize

So I'm not sure about this one: a user may supply whatever names they want for the schema, and they may not necessarily match the normalized names qualify produces, so the lookups would fail. We need to keep the schema names and the ones produced by the optimizer in sync.
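
A rough sketch of the "keep both sides in sync" idea: if user-supplied schema names and the names used for lookups go through the same normalization function, they agree regardless of how the user spelled them. `normalize_name`, `build_schema` and `lookup` are hypothetical helpers, and the folding rule is an assumed Snowflake-style simplification rather than sqlglot's implementation.

```python
def normalize_name(name):
    """Toy Snowflake-style rule (assumption): quoted names keep their case,
    unquoted names are folded to uppercase."""
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        return name[1:-1]
    return name.upper()


def build_schema(user_columns):
    # Normalize whatever spelling the user supplied when the schema is built.
    return {normalize_name(name): column_type for name, column_type in user_columns.items()}


def lookup(schema, column_name):
    # Apply the same rule to the name being looked up.
    return schema.get(normalize_name(column_name))


user_columns = {"Price": "double", '"sku"': "text"}
schema = build_schema(user_columns)

print(schema)                   # {'PRICE': 'double', 'sku': 'text'}
print(lookup(schema, "price"))  # double: both spellings fold to PRICE
print(lookup(schema, '"sku"'))  # text: quoted name keeps its case on both sides
print(lookup(schema, "sku"))    # None: unquoted sku folds to SKU, which is a different name
```

Dropping normalization on only one side would break the first lookup, which is the concern raised in the comment above.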

sqlglot/optimizer/qualify.py

tests/fixtures/optimizer/canonicalize.sql

…obymao#1701)
* Fix: set quote_identifiers in qualify, add normalize flag in schema
* import typing as t
* Fixup
* PR feedback
* Use new quote_identifiers rule before annotate_types
* Reset quote_identifiers kwarg to False in optimize
* Formatting
* Set kwargs instead of positional arguments in qualify
* Include quote_identifiers rule in test_canonicalize
* Formatting
* PR feedback
* Remove copy arg from quote_identifiers

Fix: set quote_identifiers in qualify, add normalize flag in schema

2b886d6

georgesittas requested a review from tobymao May 29, 2023 21:12

import typing as t

a0bce4f

georgesittas force-pushed the jo/optimizer_fixup branch from 3bd0d6b to a0bce4f May 29, 2023 21:15

tobymao reviewed May 29, 2023

sqlglot/dataframe/sql/dataframe.py (outdated)

Fixup

a57fa4c

PR feedback

0545ee1

georgesittas added 6 commits May 30, 2023 20:43

Use new quote_identifiers rule before annotate_types

670e3a7

Reset quote_identifiers kwarg to False in optimize

55025ab

Formatting

3e6a9ee

Set kwargs instead of positional arguments in qualify

06f064a

Include quote_identifiers rule in test_canonicalize

1dbd984

Formatting

6dd3559

tobymao reviewed May 30, 2023

sqlglot/optimizer/qualify.py (outdated)

tobymao reviewed May 30, 2023

tests/fixtures/optimizer/canonicalize.sql

georgesittas added 2 commits May 30, 2023 21:11

PR feedback

d20e241

Remove copy arg from quote_identifiers

e0a3e3c

tobymao merged commit 910166c into main May 30, 2023

tobymao deleted the jo/optimizer_fixup branch May 30, 2023 19:39
