Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot query views with FILTER and SELECT operators, throws invalid_argument exception #354

Closed
sandeep-katta0102 opened this issue Mar 30, 2021 · 7 comments
Assignees

Comments

@sandeep-katta0102
Copy link

When I run viewdf.filter.select it throws INVALID_ARGUMENT: request failed: Row filter for table xxxx:tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 is invalid. Filter is '(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)'

Code


  val viewName = "XXXXX.tempdataset.covidconfirmview"
  val projectId = "XXXX"
  val viewDf = spark.read.format("bigquery").option("project", projectId).option("parentProject", projectId)
    .option("viewsEnabled", "true").option("table",viewName).load()

  viewDf.filter($"date" === "2021-03-29 17:00:00").show() // this works

  viewDf.filter($"date" === "2021-03-29 17:00:00").select("country").show() // throws Exception

I have created the view using the public dataset bigquery-public-data.covid19_italy.data_by_province

Saved the output of below results as view in BigQuery

SELECT
   *
FROM
  `bigquery-public-data.covid19_italy.data_by_province` covid19
WHERE
  EXTRACT(date from DATE) = DATE_SUB(CURRENT_DATE(),INTERVAL 1 day) 
ORDER BY
  confirmed_cases desc

Logs

21/03/30 10:10:32 INFO DirectBigQueryRelation: Querying table XXXXX.tempdataset.covidconfirmview, parameters sent from Spark: requiredColumns=[country], filters=[isnotnull(date#515),(date#515 = 1617037200000000)]
21/03/30 10:10:34 INFO DirectBigQueryRelation: Going to read from XXXXX.tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 columns=[country], filter='(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)'
21/03/30 10:10:34 INFO DirectBigQueryRelation: unhandledFilters:
21/03/30 10:10:34 INFO DirectBigQueryRelation: Querying table XXXXX.tempdataset.covidconfirmview, parameters sent from Spark: requiredColumns=[country], filters=[isnotnull(date#515),(date#515 = 1617037200000000)]
21/03/30 10:10:34 INFO DirectBigQueryRelation: Going to read from XXXXX.tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 columns=[country], filter='(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)'
21/03/30 10:10:34 WARN SQLExecution: Error executing delta metering
com.google.api.gax.rpc.InvalidArgumentException: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table XXXXX:tempdataset._sbc_79a9cc7eb80c419f94763550c7ead249 is invalid. Filter is '(date = TIMESTAMP '2021-03-29 17:00:00.0' AND date IS NOT NULL)'

@sandeep-katta0102
Copy link
Author

@davidrabinowitz could you please look into this, it is becoming seemingly difficult to query views using BigQuery Connector

@himanshukohli09
Copy link
Contributor

Hi Sandeep,

Please try the following query:

viewDf.filter($"date" === "2021-03-29 17:00:00").select("country", "date").show()

@sandeep-katta0102
Copy link
Author

@himanshukohli09

Thanks for the workaround, is it a bug or limitation ?

@himanshukohli09
Copy link
Contributor

himanshukohli09 commented Apr 21, 2021

This is currently the limitation. In case of views the fields in the filter should also be part of select.

@sandeep-katta0102
Copy link
Author

Could we document this limitation, I don't find any referrence which says as a limitation

https://github.com/GoogleCloudDataproc/spark-bigquery-connector#reading-data-from-a-bigquery-query

@himanshukohli09
Copy link
Contributor

himanshukohli09 commented Apr 26, 2021

Hi Sandeep,

Sure, I will add this limitation in the document(readme). Also, the reason for this limitation is as follows:

As mentioned here, the views have to be materialized before reading them. The materialized view is created with all the columns but if there is a select operator present then the materialized view is created with these columns only. Now, in this particular case the materialized view is created with a single column "country". As a result, when after materialization the filter operator is applied on column "date" it throws error saying "INVALID_ARGUMENT" as "date" column in not present in the materialized view.

Hence, in case of views if we want to use both select and filter then the columns on which filter is applied those columns should also be present in select.

himanshukohli09 added a commit to himanshukohli09/spark-bigquery-connector that referenced this issue May 6, 2021
…ew with different columns in select() and flter().
himanshukohli09 added a commit to himanshukohli09/spark-bigquery-connector that referenced this issue May 8, 2021
…ew with different columns in select() and flter().
@himanshukohli09
Copy link
Contributor

Hi Sandeep,

I have corrected this limitation and the PR has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants