Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MSSQL -> Snowflake] Normalization issue with ' in columns names #4729

Closed
philippeboyd opened this issue Jul 13, 2021 · 8 comments · Fixed by #5027
Closed

[MSSQL -> Snowflake] Normalization issue with ' in columns names #4729

philippeboyd opened this issue Jul 13, 2021 · 8 comments · Fixed by #5027

Comments

@philippeboyd
Copy link
Contributor

Enviroment

  • Airbyte version: 0.27.2-alpha
  • OS Version / Instance: GCP COS
  • Deployment: Docker
  • Source Connector and version: MSSQL 0.3.2
  • Destination Connector and version: Snowflake 0.3.10
  • Severity: High
  • Step where error happened: Reset

Current Behavior

It seems like the issue could be link to wrongfully prepared statements, thus SQL injection. The first syntax error appears to be syntax error line 34 at position 65 unexpected 't' which indicates that the value is probably something like don't. The single quote is not escaped in the database resulting in an SQL error during normalization.

Expected Behavior

The normalization process should be handled with proper prepared statements and not string injection inside the SQL queries.

Logs

LOG
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - �[31mCompleted with 2 errors and 0 warnings:�[0m
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - 
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - �[33mDatabase Error in model <REDACTED_TABLE_NAME_1> (models/generated/airbyte_tables/SAP_RAW_REF/<REDACTED_TABLE_NAME_1>.sql)�[0m
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   001003 (42000): SQL compilation error:
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 34 at position 65 unexpected 't'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 34 at position 67 unexpected 'Match'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 109 at position 19 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 111 at position 15 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 112 at position 19 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 114 at position 15 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 118 at position 19 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 120 at position 15 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 121 at position 19 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 123 at position 15 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 184 at position 41 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 186 at position 41 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 190 at position 41 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 192 at position 41 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 229 at position 14 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 230 at position 14 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 232 at position 14 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 233 at position 14 unexpected '02'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   parse error line 240 at position 8 near '<EOF>'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/SAP_RAW_REF/<REDACTED_TABLE_NAME_1>.sql
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - 
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - �[33mDatabase Error in model <REDACTED_TABLE_NAME_2> (models/generated/airbyte_tables/SAP_RAW_REF/<REDACTED_TABLE_NAME_2>.sql)�[0m
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   001003 (42000): SQL compilation error:
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 7 at position 58 unexpected '17'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 7 at position 69 unexpected 'Apr'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   syntax error line 9 at position 63 unexpected ')'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   parse error line 156 at position 8 near '<EOF>'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   parse error line 156 at position 8 near '<EOF>'.
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 -   compiled SQL at ../build/run/airbyte_utils/models/generated/airbyte_tables/SAP_RAW_REF/<REDACTED_TABLE_NAME_2>.sql
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - 
2021-07-13 13:53:46 INFO () LineGobbler(voidCall):85 - Done. PASS=22 WARN=0 ERROR=2 SKIP=0 TOTAL=24
2021-07-13 13:53:46 INFO () TemporalAttemptExecution(lambda$getWorkerThread$2):149 - Completing future exceptionally...
io.airbyte.workers.WorkerException: Normalization Failed.
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:69) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:36) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.airbyte.workers.WorkerException: Normalization Failed.
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:66) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	... 3 more
	Suppressed: io.airbyte.workers.WorkerException: Normalization process wasn't successful
		at io.airbyte.workers.normalization.DefaultNormalizationRunner.close(DefaultNormalizationRunner.java:151) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:60) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:36) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:832) [?:?]
2021-07-13 13:53:46 INFO () TemporalAttemptExecution(get):133 - Stopping cancellation check scheduling...
2021-07-13 13:53:46 WARN () POJOActivityTaskHandler$POJOActivityImplementation(execute):243 - Activity failure. ActivityId=6c954a2a-f6f3-3c42-ae90-5fba5637ec71, activityType=Normalize, attempt=1
java.util.concurrent.ExecutionException: io.airbyte.workers.WorkerException: Normalization Failed.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2063) ~[?:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.get(TemporalAttemptExecution.java:131) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.temporal.SyncWorkflow$NormalizationActivityImpl.normalize(SyncWorkflow.java:321) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:564) ~[?:?]
	at io.temporal.internal.sync.POJOActivityTaskHandler$POJOActivityInboundCallsInterceptor.execute(POJOActivityTaskHandler.java:277) ~[temporal-sdk-1.0.4.jar:?]
	at io.temporal.internal.sync.POJOActivityTaskHandler$POJOActivityImplementation.execute(POJOActivityTaskHandler.java:216) ~[temporal-sdk-1.0.4.jar:?]
	at io.temporal.internal.sync.POJOActivityTaskHandler.handle(POJOActivityTaskHandler.java:181) ~[temporal-sdk-1.0.4.jar:?]
	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:192) ~[temporal-sdk-1.0.4.jar:?]
	at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:154) ~[temporal-sdk-1.0.4.jar:?]
	at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:73) ~[temporal-sdk-1.0.4.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: io.airbyte.workers.WorkerException: Normalization Failed.
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:69) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:36) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	... 1 more
Caused by: io.airbyte.workers.WorkerException: Normalization Failed.
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:66) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:36) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
	... 1 more
	Suppressed: io.airbyte.workers.WorkerException: Normalization process wasn't successful
		at io.airbyte.workers.normalization.DefaultNormalizationRunner.close(DefaultNormalizationRunner.java:151) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:60) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at io.airbyte.workers.DefaultNormalizationWorker.run(DefaultNormalizationWorker.java:36) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at io.airbyte.workers.temporal.TemporalAttemptExecution.lambda$getWorkerThread$2(TemporalAttemptExecution.java:146) ~[io.airbyte-airbyte-workers-0.27.2-alpha.jar:?]
		at java.lang.Thread.run(Thread.java:832) [?:?]

Steps to Reproduce

  1. Have data containing non escaped single quote in MSSQL
  2. Reset the data with the appropriate tables from MSSQL to Snowflake
@philippeboyd philippeboyd added the type/bug Something isn't working label Jul 13, 2021
@ChristopheDuong
Copy link
Contributor

can you attach a catalog.json or log of your sync?
Basic Normalization actually handles special characters when generating the SQL files and should use quotes etc to handle those cases:

contains_non_alphanumeric = match(".*[^A-Za-z0-9_].*", input_name) is not None

@philippeboyd
Copy link
Contributor Author

@ChristopheDuong not sure what you're asking, the logs are there on the issue
image

@marcosmarxm
Copy link
Member

@philippeboyd you can access the source_catalog.json inside the /tmp/workspace/6/0/ from my example.
image

docker exec -ti airbyte-scheduler bash
cd /tmp...

root@d447b9781f87:/tmp/workspace/6/0# ls
destination_catalog.json  destination_config.json  logs.log  source_catalog.json  source_config.json

Could you share for your case?

@philippeboyd
Copy link
Contributor Author

@marcosmarxm apologies, but source_catalog.json and destination_catalog.json contain client information (table names and columns) that I cannot publicly share on Github... is there something specific that you're looking for inside those catalog json files?

@ChristopheDuong
Copy link
Contributor

@ChristopheDuong not sure what you're asking, the logs are there on the issue

Yes, I saw your logs but it's not enough info to see what was not parsed/interpreted correctly.

is there something specific that you're looking for inside those catalog json files?

it would helpful to have a glance at the names of your streams and columns to see why

was not applied properly... having the full catalog.json would let us run the normalization and see the generated SQL file to debug.

but source_catalog.json and destination_catalog.json contain client information (table names and columns)

Could you recreate a source database with a toy example without client information to illustrate your issue? where you can more easily share logs and catalogs?

Thanks

@philippeboyd
Copy link
Contributor Author

philippeboyd commented Jul 15, 2021

@ChristopheDuong from the doc of your DestinationNameTransformer.py file, it seems it's only used for the following

  • schema
  • table
  • column

My issue seems to be with the data itself (the value) of the column

What if my column's value contains a single quote '. That single quote doesn't get escaped which leads to the error specified in my logs above.

here's the SQL i'm referring too in build/run/airbyte_utils/models/generated/airbyte_tables/<redacted>.sql

image

Same issue with the 17 Apr (specified in the logs), the quote is not escaped

image

This can eventually lead to SQL injection attacks...

@ChristopheDuong
Copy link
Contributor

ChristopheDuong commented Jul 16, 2021

@ChristopheDuong from the doc of your DestinationNameTransformer.py file, it seems it's only used for the following

schema
table
column

My issue seems to be with the data itself (the value) of the column

  1. Yes, the python module "base-normalization"'s responsibility is to take as input a catalog.json and statically transform that into dbt SQL queries. This can be seen metaphorically as 'generating ("java") source files' for example in normalize/models/generated/airbyte_tables/<redacted>.sql
  2. Where dbt is a ("java") compiler that translates afterwards those source files into native low-level ("assembly") SQL for your target destination engine ("JVM") as observable in build/run/airbyte_utils/models/generated/airbyte_tables/<redacted>.sql

So this python script manipulates only database schema definitions, it never sees actual user data (not possible to do SQL injection attacks there)

It's only when the target destination engine executes the SQL-assembly code at runtime that it processes user data.

As a conclusion, the generated files that you posted screenshots of are really fishy because they seem to contain actual data. (thus, I am questioning if the produced catalog.json by your source is really valid?)

Do you really expect to have a table with columns named as follows?

2018 01, 02 V21 Shape Doesn't Match Finish Doesn't Match Apr '17 Apr '18 Apr '19 etc
true false true true false false true
true true false false false true true

or it should maybe rather look something like this instead:

Creation Date Reason Report day
2018 01, 02 V21 Shape Doesn't Match Apr '17
2018 01, 02 V21 Finish Doesn't Match Apr '18

@philippeboyd
Copy link
Contributor Author

@ChristopheDuong thanks for your answer as it brought up new questionings on my end... After investigating directly on our client's DB and after a few facepalms, I realized that it is indeed the columns' name that are problematic
image
image

The issue remains though. I managed to retrieve the destination_catalog.json (as it's a reset operation, I don't have the source catalog)

{
  "streams": [
    {
      "stream": {
        "name": "REDACTED",
        "json_schema": {
          "type": "object",
          "properties": {
            "Apr '17": {
              "type": "number"
            },
            "Apr '18": {
              "type": "number"
            },
            "Aug '17": {
              "type": "number"
            },
            "Dec '17": {
              "type": "number"
            },
            "Feb '17": {
              "type": "number"
            },
            "Feb '18": {
              "type": "number"
            },
            "Jan '17": {
              "type": "number"
            },
            "Jan '18": {
              "type": "number"
            },
            "Jul '17": {
              "type": "number"
            },
            "Jun '17": {
              "type": "number"
            },
            "Mar '17": {
              "type": "number"
            },
            "Mar '18": {
              "type": "number"
            },
            "May '17": {
              "type": "number"
            },
            "Nov '17": {
              "type": "number"
            },
            "Oct '17": {
              "type": "number"
            },
            "Sep '17": {
              "type": "number"
            },
            "Model Number": {
              "type": "string"
            }
          }
        },
        "supported_sync_modes": [
          "full_refresh",
          "incremental"
        ],
        "default_cursor_field": [],
        "source_defined_primary_key": [],
        "namespace": "<REDACTED>"
      },
      "sync_mode": "full_refresh",
      "cursor_field": [],
      "destination_sync_mode": "overwrite",
      "primary_key": []
    },
    ...

As to why it's done like that; no clue! Apologies for my previous comments that brought up confusion.

Hope this helps...

@ChristopheDuong ChristopheDuong added this to the Core - 2021-07-28 milestone Jul 21, 2021
@ChristopheDuong ChristopheDuong changed the title [MSSQL -> Snowflake] Normalization process wasn't successful [MSSQL -> Snowflake] Normalization issue with ' in columns names Jul 21, 2021
@ChristopheDuong ChristopheDuong self-assigned this Jul 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants