
[BUG] dask-sql creates all tables as lower case. #481

Open · brightsparc opened this issue Apr 20, 2022 · 5 comments · May be fixed by #482
Labels: bug (Something isn't working), needs triage (Awaiting triage by a dask-sql maintainer)


brightsparc commented Apr 20, 2022

The dask-sql create_table method registers tables under lower-case names, which means that any query executed against them must use the lower-case name.

What happened:

All tables are registered with lower-case names, so I am unable to run queries that use upper-case table names; this forces me to convert all table names in my queries to lower case.

What you expected to happen:

I would expect the casing to be preserved when tables are added to schemas, or at least, as in #177, for both the original and a lower-case version to be registered.

Minimal Complete Verifiable Example:

See the following code snippet, which creates a Dask DataFrame and then creates two tables, one named lower_case and another named UPPER_CASE.

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()

ddf = dd.from_pandas(pd.DataFrame({"key": ["value"]}), npartitions=1)

# create table with lower case
c.create_table("lower_case", ddf)
print(c.schema["root"].tables) 
c.sql("SELECT * FROM lower_case")

# Create table with upper case
c.create_table("UPPER_CASE", ddf)
print(c.schema["root"].tables) 
c.sql("SELECT * FROM UPPER_CASE")

The second call to print the tables lists both as lower case:

{'lower_case': <dask_sql.datacontainer.DataContainer object at 0x2d8432970>, 'upper_case': <dask_sql.datacontainer.DataContainer object at 0x2d49d9670>}

And the second SELECT fails to find the table registered with the upper-case name:

ParsingException: Can not parse the given SQL: From line 1, column 15 to line 1, column 24: Object 'UPPER_CASE' not found within 'root'; did you mean 'upper_case'?

Anything else we need to know?:

Perhaps there is a reason why everything was made lower() - but I can't seem to find it in the docs.

Environment:

  • dask-sql version: 2022.4.1
  • Python version: 3.9
  • Operating System: mac-osx arm64
  • Install method (conda, pip, source): conda
brightsparc added the bug and needs triage labels on Apr 20, 2022
charlesbluca (Collaborator) commented:

Perhaps there is a reason why everything was made lower()

Though it wasn't my decision, I would imagine this is to be in line with Postgres (the SQL engine which we primarily test against)? Spinning up a DB here, I see that we aren't able to create two distinct tables named new_table and NEW_TABLE, as both of these map to the all lowercase new_table within the schema.
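For concreteness, here is a hedged illustration of that Postgres folding behavior; it assumes a reachable local Postgres instance and the psycopg2 package, and the connection details are placeholders only:

import psycopg2  # assumption: psycopg2 is installed and a local Postgres is running

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder connection details
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE TABLE new_table (key text)")
try:
    # unquoted identifiers are folded to lower case, so this collides with new_table
    cur.execute("CREATE TABLE NEW_TABLE (key text)")
except psycopg2.errors.DuplicateTable:
    print("NEW_TABLE was folded to new_table, which already exists")

# a quoted identifier preserves case and creates a second, distinct table
cur.execute('CREATE TABLE "NEW_TABLE" (key text)')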

Perhaps the better change here is to reconsider the default value of True we've set for sql.identifier.case_sensitive - it seemed like table names weren't case sensitive on the DB I spun up, though by default they are for Dask-SQL. For your specific case, the following works:

import dask.config

with dask.config.set({"sql.identifier.case_sensitive": False}):
    c.sql("SELECT * FROM UPPER_CASE")

But maybe we should consider making that the default behavior (cc @ayushdg in case you have thoughts around this)
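For what it's worth, the same config key can also be set globally for the rest of the session instead of inside a with block (a minimal sketch reusing c and the tables from the example above):

import dask.config

# apply the setting for the rest of the session rather than in a scoped block
dask.config.set({"sql.identifier.case_sensitive": False})

c.sql("SELECT * FROM UPPER_CASE")  # now resolves against the lower-cased registration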

brightsparc (Author) commented Apr 20, 2022

Perhaps the better change here is to reconsider the default value of True we've set for sql.identifier.case_sensitive [...] For your specific case, the following works [...]

Thanks, making dask-sql case insensitive solves the issue from a query standpoint.

FWIW, I think Calcite's default unquoted casing is Casing.TO_UPPER for most dialects, with PostgreSQL being the exception at Casing.TO_LOWER. However, any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.
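To make that concrete, here is a hypothetical sketch of how quoted-identifier preservation could look from the Python side; this is not how dask-sql currently behaves, and it assumes create_table keeps the given casing and that quoted identifiers bypass case folding:

# hypothetical case-preserving behavior, not current dask-sql semantics
c.create_table("UPPER_CASE", ddf)        # registered as UPPER_CASE, casing kept

c.sql('SELECT * FROM "UPPER_CASE"')      # quoted identifier: matched exactly as written
c.sql("SELECT * FROM UPPER_CASE")        # unquoted: folded per dialect, so a Postgres-style
                                         # dialect would look up upper_case and fail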

charlesbluca (Collaborator) commented Apr 20, 2022

However any quoted identifier should be preserved, so in that case it would be useful to allow registering a table name as it was given.

I generally agree with this sentiment if this is the expected behavior for Postgres, though I think more consideration is needed on whether and how we should shift our case handling to this style, as it is a large change from our current handling (which essentially ignores quotes when parsing table names and uses sql.identifier.case_sensitive to decide whether casing should be respected). I will get back to this issue and #482 after some internal sync on the best path forward.
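As a point of reference, a rough model of the current resolution behavior described above (illustrative pseudologic only, not the actual dask-sql implementation):

def resolve_table(identifier, registered_tables, case_sensitive):
    # quotes are effectively ignored when parsing the table name
    name = identifier.strip('"')
    # create_table registers names lower-cased, so a case-insensitive lookup
    # has to fold the query identifier to lower case as well
    key = name if case_sensitive else name.lower()
    # a missing key surfaces as the "Object ... not found" ParsingException
    return registered_tables[key]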

Also, I will note that we are currently exploring an overhaul of our SQL parsing machinery from Calcite to DataFusion (check out https://github.com/dask-contrib/dask-sql/tree/datafusion-sql-planner and some of the recently opened DataFusion issues for additional context). Thus, while we can discuss here what an ideal casing behavior might be for dask-sql, I wouldn't anticipate any solution being long term until we have fully fleshed out and merged the changes developing in that branch.

brightsparc (Author) commented:

Thanks for this insight @charlesbluca. I did investigate using sql.identifier.case_sensitive and had some success validating that it let me work around this issue. However, in the last day I upgraded my dependencies and found that the setting was no longer being respected. Do you know of any issues that might interfere with dask-sql's case sensitivity? (I upgraded llvm and numba, but I'm not sure those are dependencies that would affect this project.)
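For reference, one quick way to confirm the effective value of the key in the environment running the queries (a minimal check; c is the Context from the original example):

import dask.config

# print the effective value of the key (dask-sql registers a default on import)
print(dask.config.get("sql.identifier.case_sensitive", default=None))

# re-check the failing query with the key explicitly disabled
with dask.config.set({"sql.identifier.case_sensitive": False}):
    print(c.sql("SELECT * FROM UPPER_CASE").compute())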

Thanks for sharing the work on DataFusion; it does look interesting. However, I will be sticking with the Calcite parser in my stack, so hopefully the supported parser syntax will remain consistent if you switch over to this solution. That said, dropping the Java dependency and the jpype1 interface would be appealing.

ayushdg (Collaborator) commented May 2, 2022

However, in the last day I upgraded my dependencies and found that the setting was no longer being respected.

To follow up: does this mean that in a newer environment the initial example posted still has issues even when case_sensitive is set to False?

however I will be sticking with the Calcite parser in my stack, so hopefully the supported parser syntax will remain consistent if you switch over to this solution.

Are you able to elaborate a bit further on this? Is the expectation here for dask-sql to be able to execute SQL queries that the Calcite parser understands, or is it to be able to directly accept a Calcite logical plan object?
