[BUG] NULL value filtering not working correctly #23282

em-daniil-terentyev · 2021-08-01T18:09:56Z

Hi everybody!

When I import data from Cosmos DB into Databricks and create a temporary view from it, the filter "someField IS NOT NULL" does not work correctly. The only way to make it work is to add one more condition, "LOWER(CAST(someField AS STRING)) <> 'null'", but that is not right, because this field never holds a string value: it contains either a JSON object or NULL.

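```python
# Connection settings for the Cosmos DB Spark 3 OLTP connector (placeholder values).
connectionConfig = {
    "spark.cosmos.accountEndpoint": "endpoint",
    "spark.cosmos.accountKey": "key",
    "spark.cosmos.database": "database",
    "spark.cosmos.container": "container",
    "spark.cosmos.read.inferSchema.enabled": "true",
}

# Load the container and expose it to Spark SQL as a temporary view.
# The parentheses let the method chain span several lines.
(
    spark.read
    .format("cosmos.oltp")
    .options(**connectionConfig)
    .load()
    .createOrReplaceTempView("tmp_cosmos_data")
)
```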

Apache Spark 3.1.1
Library: com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.2.0
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.52.0.23-CA-linux64 (build 1.8.0_282-b08)

If you can't reproduce it on your side by simply running the steps above, please let me know. To meet all the requirements of this bug report as well as our privacy requirements, I would need to create a separate Cosmos DB instance and so on. Please let me know if that is necessary.

Thanks in advance.

Regards,
Daniil.

joshfree · 2021-08-02T18:50:42Z

@kushagraThapar could you please help route this?

kushagraThapar · 2021-08-02T19:02:45Z

@moderakh - can you please take a look at this issue?

moderakh · 2021-08-02T19:08:48Z

@em-daniil-terentyev could you please provide more info on this?

> filter "someField IS NOT NULL" is not working correctly. The only way to make it work is to add one more condition "LOWER(CAST(someField AS STRING)) <> 'null'", but it's not correct, because for this field there is not data with string value, it contains json object or NULL.

What is the expected behaviour?
What behaviour do you see?
You provided the code for the DataFrame load; could you also provide the code for the filtering you are using?

em-daniil-terentyev · 2021-08-03T06:08:10Z

Hi, @moderakh,

Thanks for your response.

Here are answers for your questions. I hope it helps.

The expected behavior is to filter out all objects that either don't have someField or have it set to null, such as:
{ lala : "123", lolo : "123", someField : null }, { lala : "123", lolo : "123" }

In other words, it should work correctly, as it did in the previous version of the library, com.microsoft.azure.cosmosdb.spark.

I see that if I use only the filter "someField IS NOT NULL", the query returns objects that don't have someField. If I add "LOWER(CAST(someField AS STRING)) <> 'null'", it returns the same correct result that com.microsoft.azure.cosmosdb.spark returns with only the "someField IS NOT NULL" filter:
SELECT COUNT(*) FROM tmp_cosmos_data WHERE someField IS NOT NULL AND LOWER(CAST(someField AS STRING)) <> 'null'
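
To make the expected semantics concrete, here is a minimal sketch in plain Python (no Spark or Cosmos DB involved). The sample documents and the helper name `is_not_null` are assumptions made up for illustration; they only model the behaviour described above, where a missing field and an explicit null are both treated as NULL.

```python
# Hypothetical sample documents modeled on the shapes described in this comment.
docs = [
    {"lala": "123", "lolo": "123", "someField": None},      # explicit JSON null
    {"lala": "123", "lolo": "123"},                         # someField is undefined
    {"lala": "123", "lolo": "123", "someField": {"a": 1}},  # JSON object
]


def is_not_null(doc, field):
    """Model of the expected IS NOT NULL: a missing key and an explicit
    null both count as NULL, so neither passes the filter."""
    return doc.get(field) is not None


# Only the document that carries a real JSON object should survive.
kept = [d for d in docs if is_not_null(d, "someField")]
```

With this model, only the third document passes the filter, which matches the result expected from "someField IS NOT NULL" alone.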

Looking forward to a new version of the library that handles NULL values correctly.

Thanks in advance.

Regards,
Daniil.

em-daniil-terentyev · 2021-08-22T15:47:06Z

Hi, @moderakh.

Is there any news about this issue?
Maybe I can help with it somehow?

Regards,
Daniil.

Fixes: #23282. Cosmos DB is schema-less; Spark is schema-full. When reading data from Cosmos DB, the Spark connector translates both null and undefined values to a null Spark column value. Hence, from the Spark perspective, null and undefined values in Cosmos DB are the same. Expected behaviour: a null filter on a Spark column should be translated to a check for either a null value or an undefined value in the Cosmos DB query pushdown.
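
The translation described above can be sketched roughly as follows. This is an illustrative model only, not the connector's actual code (the connector is not written in Python, and the predicate text it generates may differ). The function name `null_filter_to_cosmos` and the container alias `c` are assumptions; `IS_DEFINED` and `IS_NULL` are Cosmos DB SQL type-checking functions.

```python
def null_filter_to_cosmos(field, negated=False, alias="c"):
    """Hypothetical helper: build a Cosmos DB SQL predicate for a Spark
    IS NULL / IS NOT NULL filter, treating undefined and null alike."""
    path = f"{alias}.{field}"
    if negated:
        # Spark IS NOT NULL: the property must exist and must not be null.
        return f"(IS_DEFINED({path}) AND NOT IS_NULL({path}))"
    # Spark IS NULL: the property is either missing or explicitly null.
    return f"(NOT IS_DEFINED({path}) OR IS_NULL({path}))"


is_null_predicate = null_filter_to_cosmos("someField")
is_not_null_predicate = null_filter_to_cosmos("someField", negated=True)
```

Checking both conditions matters because a predicate that tests only for null would let documents without the property slip through an IS NOT NULL filter, which is the symptom reported in this issue.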

em-daniil-terentyev · 2021-09-02T16:05:58Z

@moderakh, thanks a lot! :)

joshfree added the Client, Cosmos, and cosmos:spark3 labels Aug 2, 2021

ghost removed the needs-triage label Aug 2, 2021

chenrujun assigned moderakh Aug 6, 2021

moderakh added a commit to moderakh/azure-sdk-for-java that referenced this issue Aug 27, 2021

Fixes cosmos db spark null filtering: Azure#23282

b82ca12

moderakh mentioned this issue Aug 27, 2021

bugfix: spark null filter pushdown to cosmos DB #23804

Merged

moderakh closed this as completed in #23804 Aug 31, 2021

github-actions bot locked and limited conversation to collaborators Apr 11, 2023
