
feat(tsql): SPLIT_PART function and conversion to PARSENAME in tsql #4211

Merged · 3 commits · Oct 8, 2024

Conversation

daihuynh
Contributor

@daihuynh daihuynh commented Oct 4, 2024

This PR introduces the SPLIT_PART function, used in Spark and Databricks, as well as the corresponding conversion to T-SQL.

In Spark/Databricks:
SELECT SPLIT_PART('1.2.3', '.', 1)

becomes this in T-SQL:
SELECT PARSENAME('1.2.3', 3)

Or vice versa.
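To make the correspondence concrete, here is a rough Python emulation of the two functions' semantics (an illustration only, not SQLGlot's implementation; the helper names are made up):

```python
def split_part(s: str, delim: str, part: int) -> str:
    # Emulates Spark/Databricks SPLIT_PART: 1-based part number, counted from
    # the left; an out-of-range part number yields an empty string.
    parts = s.split(delim)
    return parts[part - 1] if 1 <= part <= len(parts) else ""

def parsename(s: str, part: int):
    # Emulates T-SQL PARSENAME: always splits on '.', 1-based part number
    # counted from the RIGHT, at most four parts (returns NULL/None otherwise).
    parts = s.split(".")
    if len(parts) > 4 or not 1 <= part <= len(parts):
        return None
    return parts[-part]

# The two agree once the part number is reversed:
assert split_part("1.2.3", ".", 1) == parsename("1.2.3", 3) == "1"
```

Note the asymmetry: SPLIT_PART works with any delimiter and any number of parts, while PARSENAME is limited to dots and four parts.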

@VaggelisD
Collaborator

Hey @daihuynh, thanks for the contribution!

Have you checked other dialects as well? Introducing a new expression such as exp.SplitPart impacts all SQLGlot dialects, since SPLIT_PART(...) is now parsed "implicitly". This is a problem if all arguments are required in Spark but optional in some other dialect, e.g. Snowflake.

@daihuynh
Contributor Author

daihuynh commented Oct 4, 2024

Hi @VaggelisD ,

I've just checked the Snowflake docs for this function and don't see that any of its arguments are optional. I think Snowflake and other Spark-family engines just use the function directly. However, I'm happy to update the PR if any SQL engine has a different implementation.
https://docs.snowflake.com/en/sql-reference/functions/split_part

@VaggelisD
Collaborator

@daihuynh From what I remember there are other dialects that have SPLIT_PART too, like DuckDB. Could you please do a pass over the important dialects (Postgres, Redshift, BigQuery, DuckDB, MySQL, Presto/Trino etc.) to make sure that their signatures are equivalent? This would cover semantics, argument count, whether all arguments are required as in Spark, and so on.

@daihuynh
Contributor Author

daihuynh commented Oct 4, 2024

All good @VaggelisD, I'll add unit tests for those dialects and update the expression's requirements if there are any variants.

Collaborator

@VaggelisD VaggelisD left a comment


I took a better look at the PR and I'm not sure if I agree with it:

  • SPLIT_PART is a function that is common across various engines and is used to choose a part after splitting any string by any delimiter
  • PARSENAME seems T-SQL specific and is used to split an object (e.g. a table name such as db.schema.foo), so the string must be (1) separated by dots and (2) have a max part count of 4.

So, representing PARSENAME as an exp.SplitPart node seems like a hacky solution and could work only as an extreme subset of it. Could we use T-SQL's STRING_SPLIT here?

Please let me know if I'm misunderstanding your approach. Also, feel free to share any context/examples regarding the need for this change.

@daihuynh
Contributor Author

daihuynh commented Oct 4, 2024

Hi @VaggelisD,

The context behind this PR is that my team's data modeller uses PARSENAME in many scripts when working with source databases; his approach is ETL.
Our cloud data platform uses ELT (load first, transform later), so I need to transpile those scripts to the Databricks dialect to run the transformations. Currently, the PARSENAME function passes through unchanged and (unsurprisingly) doesn't work in Databricks. Instead of finding and replacing it with SPLIT_PART in a text editor, I chose to make this PR.

With this PR:

  • From T-SQL to dialects that support SPLIT_PART: PARSENAME becomes SPLIT_PART with '.' as the delimiter and a reversed part number.
  • Vice versa: SPLIT_PART with a '.' delimiter can be converted to PARSENAME; otherwise it stays the same and requires a user-defined function in T-SQL. I agree that this is an extreme use case of SPLIT_PART.
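The "reversed part number" can be sketched in Python (a hypothetical helper for illustration, not SQLGlot's actual code):

```python
def parsename_part_number(part_count: int, split_part_number: int) -> int:
    # SPLIT_PART counts parts from the left, PARSENAME from the right, so over
    # k parts the left-hand part number n maps to the right-hand number k - n + 1.
    return part_count - split_part_number + 1

# SPLIT_PART('1.2.3', '.', 1) corresponds to PARSENAME('1.2.3', 3):
assert parsename_part_number(3, 1) == 3
```

Note this mapping needs the part count, which is only knowable at transpile time for literal strings; that is one reason the conversion is best-effort.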

The reason I didn't choose STRING_SPLIT in T-SQL is that it returns the split values as rows, and picking the value at a given index takes extra steps, which would change the structure of the original SQL query. Here is an example:

WITH CTE_SplitValues AS (
  SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS idx, value
  FROM STRING_SPLIT('1.2.3.4', '.')
)
SELECT value
FROM CTE_SplitValues
WHERE idx = 1
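The row-based shape of STRING_SPLIT is the crux. In Python terms (a rough emulation, not T-SQL itself):

```python
def string_split(s: str, delim: str):
    # Rough emulation of T-SQL STRING_SPLIT: one row per token. Its output order
    # is not guaranteed (SQL Server 2022 later added an optional ordinal
    # argument), which is why the query above numbers rows with ROW_NUMBER().
    yield from s.split(delim)

# Picking the part at index 1 requires numbering the rows first:
numbered = list(enumerate(string_split("1.2.3.4", "."), start=1))
value = next(v for idx, v in numbered if idx == 1)
assert value == "1"
```

By contrast, SPLIT_PART and PARSENAME are single scalar expressions, so they can replace each other in place without restructuring the query.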

Hope this gives you more context.

@daihuynh
Contributor Author

daihuynh commented Oct 4, 2024

@VaggelisD Here is the reference table for the SPLIT_PART function across the databases supported in SQLGlot. The first three parameters (string, string, integer) are required in all cases. Apache Drill supports four parameters, but the final parameter is optional.

| DB | Function | Reference |
| --- | --- | --- |
| Athena | Same as Trino/Presto | |
| BigQuery | Not supported | |
| ClickHouse | Not supported / no equivalent function | |
| Databricks | split_part(str, delim, partNum) | Link |
| Doris | VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field) | Link |
| Drill | SPLIT_PART(string, delimiter, start[, end]) | Link |
| DuckDB | split_part(string, separator, index) | Link |
| Hive | Not supported | |
| MySQL | SUBSTRING_INDEX(str, delim, count) | Link |
| Oracle | Not supported | |
| PostgreSQL | split_part(string text, delimiter text, field int) → text | Link |
| Presto | split_part(string, delimiter, index) → varchar | Link |
| PRQL | Not supported | |
| Redshift | SPLIT_PART(string, delimiter, position) | Link |
| RisingWave | split_part(input_string, delimiter_string, int_n) → varchar | Link |
| Snowflake | SPLIT_PART(string, delimiter, partNumber) | Link |
| Spark | SPLIT_PART(src, delimiter, partNumber) | Link |
| SQLite | Not supported | |
| StarRocks | VARCHAR split_part(VARCHAR content, VARCHAR delimiter, INT field) | Link |
| Tableau | SPLIT(string, delimiter, token number) | Link |
| Teradata | [TD_SYSFNLIB.]STRTOK(instring, delimiter, tokennum) | Link |
| Trino | split_part(string, delimiter, index) → varchar | Link |
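One entry worth flagging: MySQL's SUBSTRING_INDEX is not a drop-in equivalent of SPLIT_PART, since it returns the prefix up to the count-th delimiter rather than a single part. A Python emulation of its semantics (an illustration only, not MySQL itself):

```python
def substring_index(s: str, delim: str, count: int) -> str:
    # Emulates MySQL SUBSTRING_INDEX: everything before the count-th delimiter,
    # counted from the left for positive counts and from the right for negative.
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

# Unlike SPLIT_PART, this returns a prefix, not a single part:
assert substring_index("1.2.3", ".", 2) == "1.2"
# The common idiom for the n-th part nests two calls:
assert substring_index(substring_index("1.2.3", ".", 2), ".", -1) == "2"
```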

@georgesittas georgesittas requested review from VaggelisD and removed request for VaggelisD October 7, 2024 07:04
Collaborator

@VaggelisD VaggelisD left a comment


Hey @daihuynh,

Thanks for providing the context & the research. I can now see how your use case could be added as a best-effort transpilation attempt, but will wait for the other maintainers too to decide if this approach is good to go.

In the meantime, I did a pass over the PR and left a few comments.

Review comments on sqlglot/dialects/tsql.py and tests/dialects/test_tsql.py (outdated, resolved)
@VaggelisD
Collaborator

Hey @daihuynh, could you please let us know if you still have the bandwidth to work on this PR? Otherwise, we can take it to the finish line.

@daihuynh
Contributor Author

daihuynh commented Oct 8, 2024

Hi @VaggelisD, I just got back from a public holiday. I've pushed updated code incorporating @georgesittas's suggestions and yours. Hopefully it's better this time.

@VaggelisD
Collaborator

Ah I see, thank you for the immediate response, will go over the changes shortly.

@VaggelisD
Collaborator

@daihuynh Thanks for the PR! I will merge it and simplify it a bit wherever possible.

@VaggelisD VaggelisD merged commit 163e943 into tobymao:main Oct 8, 2024
6 checks passed
@daihuynh
Contributor Author

daihuynh commented Oct 8, 2024

Thanks @VaggelisD! I learned something new in this PR.
