Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] NULL value filtering not working correctly #23282

Closed
em-daniil-terentyev opened this issue Aug 1, 2021 · 6 comments · Fixed by #23804
Closed

[BUG] NULL value filtering not working correctly #23282

em-daniil-terentyev opened this issue Aug 1, 2021 · 6 comments · Fixed by #23804
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. cosmos:spark3 Cosmos DB Spark3 OLTP Connector Cosmos customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@em-daniil-terentyev
Copy link

em-daniil-terentyev commented Aug 1, 2021

Hi everybody!

When I import data from cosmos db into databricks, and create temporary view from this data, filter "someField IS NOT NULL" is not working correctly. The only way to make it work is to add one more condition "LOWER(CAST(someField AS STRING)) <> 'null'", but it's not correct, because for this field there is not data with string value, it contains json object or NULL.

`connectionConfig = {
"spark.cosmos.accountEndpoint" : "endpoint",
"spark.cosmos.accountKey" : "key",
"spark.cosmos.database" : "database",
"spark.cosmos.container" : "containter",
"spark.cosmos.read.inferSchema.enabled" : "true"
}

spark
.read
.format("cosmos.oltp")
.options(**connectionConfig)
.load()
.createOrReplaceTempView("tmp_cosmos_data")`

Apache Spark 3.1.1
Library: com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.2.0
Operating System: Ubuntu 18.04.5 LTS
Java: Zulu 8.52.0.23-CA-linux64 (build 1.8.0_282-b08)

If you can't reproduce it on your side with simple execution of mentioned actions please let me know. To fill all the requirements mentioned in this bug report and to meet privacy requirements it's needed to create separate instance of Cosmos DB and so on. Please let me know if it's necessary.

Thanks in advance.

Regards,
Daniil.

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Aug 1, 2021
@joshfree joshfree added Client This issue points to a problem in the data-plane of the library. Cosmos cosmos:spark3 Cosmos DB Spark3 OLTP Connector labels Aug 2, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Aug 2, 2021
@joshfree
Copy link
Member

joshfree commented Aug 2, 2021

@kushagraThapar could you please help route this?

@kushagraThapar
Copy link
Member

@moderakh - can please take a look at this issue ?

@moderakh
Copy link
Contributor

moderakh commented Aug 2, 2021

@em-daniil-terentyev could you please provide more info on this?

filter "someField IS NOT NULL" is not working correctly. The only way to make it work is to add one more condition "LOWER(CAST(someField AS STRING)) <> 'null'", but it's not correct, because for this field there is not data with string value, it contains json object or NULL.

  • what's the expected behviour
  • what behaviour you see?
  • (you provided code for dataframe load) could you provide the code for the filtering which you are using as well?

@em-daniil-terentyev
Copy link
Author

em-daniil-terentyev commented Aug 3, 2021

Hi, @moderakh,

Thanks for your response.

Here are answers for your questions. I hope it helps.

  1. Expected behavior is to filter all objects, that doesn't have someField, or that assigned as null such as:
    { lala : "123", lolo : "123", someField : null }, { lala : "123", lolo : "123" }

In other words make it work in a correct way as it was in previous version of library com.microsoft.azure.cosmosdb.spark.

  1. I see that if i just use filter "someField IS NOT NULL" query returns objects, that doesn't have someField. If i add "LOWER(CAST(someField AS STRING)) <> 'null'" it returns the same correct result as returned by com.microsoft.azure.cosmosdb.spark with only "someField IS NOT NULL" filter

  2. SELECT COUNT(*) FROM tmp_cosmos_data WHERE someField IS NOT NULL AND LOWER(CAST(someField AS STRING)) <> 'null'

Looking forward to new version of library with correct treating of NULL values.

Thanks in advance.

Regards,
Daniil.

@em-daniil-terentyev
Copy link
Author

Hi, @moderakh.

Are there any news about this issue?
May be i can help somehow with this?

Regards,
Daniil.

moderakh added a commit to moderakh/azure-sdk-for-java that referenced this issue Aug 27, 2021
moderakh added a commit that referenced this issue Aug 31, 2021
Fixes: #23282

cosmos DB is schema-less, spark is schema-full.

when reading data from cosmos DB, spark connector translates both null and undefined values to null spark column value.
hence from the spark perspective null and not defined values in cosmos db are the same.

expected behaviour:

if there is a null spark filter on a column value, that should be translated to either null value or undefined value on the cosmos db query pushdown
@em-daniil-terentyev
Copy link
Author

em-daniil-terentyev commented Sep 2, 2021

@moderakh, thanks a lot! :)

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. cosmos:spark3 Cosmos DB Spark3 OLTP Connector Cosmos customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants