Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: set quote_identifiers in qualify, add normalize flag in schema #1701

Merged
merged 12 commits into from
May 30, 2023

Conversation

georgesittas
Copy link
Collaborator

I guess I could've split this PR into two separate ones, as it introduces both a fix and a feature, but anyway.

  1. Activated the quote_identifiers flag in the qualify rule by default, because this case is currently failing:
>>> from sqlglot.optimizer import optimize
>>> optimize("select * from tbl", dialect="snowflake", schema={"tbl": {'"a"': "int"}}).sql()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sqlglot/sqlglot/optimizer/optimizer.py", line 90, in optimize
    expression = rule(expression, **rule_kwargs)
  File "sqlglot/sqlglot/optimizer/annotate_types.py", line 31, in annotate_types
    return TypeAnnotator(schema, annotators, coerces_to).annotate(expression)
  File "sqlglot/sqlglot/optimizer/annotate_types.py", line 296, in annotate
    col.type = self.schema.get_column_type(source, col)
  File "sqlglot/sqlglot/schema.py", line 289, in get_column_type
    raise SchemaError(f"Unknown column type '{column_type}'")
sqlglot.errors.SchemaError: Unknown column type 'None'
  1. Added a new flag in MappingSchema to control whether or not the normalization logic will kick off. This might be useful when e.g. we want to add (unquoted) names that are already normalized, and a 2nd normalization pass would mess them up.

  2. Added, improved some type hints and comments.

@georgesittas georgesittas requested a review from tobymao May 29, 2023 21:12
@georgesittas
Copy link
Collaborator Author

georgesittas commented May 29, 2023

@tobymao
Copy link
Owner

tobymao commented May 29, 2023

what's the case for feature 2? when did you encounter a need for it?

why did annotate types fail?

@georgesittas
Copy link
Collaborator Author

georgesittas commented May 29, 2023

why did annotate types fail?

>>> from sqlglot.optimizer import optimize
>>> optimize("select * from tbl", dialect="snowflake", schema={"tbl": {'"a"': "int"}}).sql()
> sqlglot/optimizer/annotate_types.py(297)annotate()
-> col.type = self.schema.get_column_type(source, col)
(Pdb) col
(COLUMN this:
  (IDENTIFIER this: a, quoted: False), table:
  (IDENTIFIER this: TBL, quoted: False))

Notice how the column a is unquoted here (due to quoted_identifiers=False in qualify and the fact that annotate_types runs before canonicalize). Passing this in get_column_type will result in a failed lookup, because a is normalized into A and hence will not match what we have in the schema anymore:

-> column_type = table_schema.get(normalized_column_name)
(Pdb) normalized_column_name
'A'
(Pdb) self.mapping
{'TBL': {'a': 'int'}}

when did you encounter a need for it?

Discussed in Slack.

@tobymao
Copy link
Owner

tobymao commented May 29, 2023

this is problematic. sqlmesh doesn’t quote things which means annotate types would fail. could we fix annotate types to work without quotes?

maybe the problem is the double normalize?

@georgesittas
Copy link
Collaborator Author

georgesittas commented May 29, 2023

So the issue is not actually related directly to annotate_types -- it's related to the schema's get_column_type method, because the following raise is reached:

            column_type = table_schema.get(normalized_column_name)  # Fails because we have `a` instead of `A` in the schema

            if isinstance(column_type, exp.DataType):
                return column_type
            elif isinstance(column_type, str):
                return self._to_data_type(column_type.upper(), dialect=dialect)

            raise SchemaError(f"Unknown column type '{column_type}'")

sqlmesh doesn’t quote things which means annotate types would fail. could we fix annotate types to work without quotes?

Why can't we add quotes over there as well?

maybe the problem is the double normalize?

You mean in both normalize_identifiers and the schema? If so then yeah, that's the problem here, but I'm not sure yet if it's correct to remove the normalization logic for get_column_type.

@tobymao
Copy link
Owner

tobymao commented May 29, 2023

now that qualify normalizes everything. schema shouldn’t need to renormalize

@georgesittas
Copy link
Collaborator Author

Ok sounds interesting, I'll explore this idea a bit more tomorrow and see if we can get rid of the schema normalization logic altogether.

@tobymao
Copy link
Owner

tobymao commented May 29, 2023

or else just move identify before annotate types. as a stand-alone step

@georgesittas
Copy link
Collaborator Author

georgesittas commented May 30, 2023

now that qualify normalizes everything. schema shouldn’t need to renormalize

So I'm not sure about this one: a user may supply whatever names they want for the schema, and they may not necessarily match the normalized names qualify produces, so the lookups would fail. We need to keep the schema names and the ones produced by the optimizer in sync.

@tobymao tobymao merged commit 910166c into main May 30, 2023
@tobymao tobymao deleted the jo/optimizer_fixup branch May 30, 2023 19:39
adrianisk pushed a commit to adrianisk/sqlglot that referenced this pull request Jun 21, 2023
…obymao#1701)

* Fix: set quote_identifiers in qualify, add normalize flag in schema

* import typing as t

* Fixup

* PR feedback

* Use new quote_identifiers rule before annotate_types

* Reset quote_identifiers kwarg to False in optimize

* Formatting

* Set kwargs instead of positional arguments in qualify

* Include quote_identifiers rule in test_canonicalize

* Formatting

* PR feedback

* Remove copy arg from quote_identifiers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants