Significance of tableName for output in metric.yaml file #482

Open
kiranbobba opened this issue Jun 7, 2022 · 2 comments


@kiranbobba

I am trying to understand the configuration we currently have. It is similar to the following.

job.yml
metrics:
  - metric.yaml
inputs:
  df_input:
    file:
      path: s3a://bucket1/database1/table1/*.csv
      format: csv
      options:
        header: true
        delimiter: ","
output:
  file:
    dir: s3a://bucket1/

metric.yaml
steps:
  - dataFrameName: df1
    sql:
      SELECT * FROM df_input

output:
  - dataFrameName: df1
    outputType: File
    format: parquet
    outputOptions:
      saveMode: Overwrite
      path: final/hive/database1/table1
      protectFromEmptyOutput: false
      tableName: database1.table1
      partitionBy:
        - as_of_date

What is the significance of tableName under output in the metric.yaml file? The sample config at https://github.com/YotpoLtd/metorikku/blob/master/config/metric_config_sample.yaml comments this property as "# save output to hive metastore (or any other catalog provider)". What does that mean? Does it issue "MSCK REPAIR", "ALTER TABLE ADD PARTITION", or something similar to update the Hive metastore? What are the prerequisites for this property to work? It worked for us on our old cluster but not on the new one.

Another question, indirectly linked to the one above. Suppose I have two metric files in my job.yml file. Can the second metric file access the data that the first one wrote to a file (on which a Hive external table is defined), assuming the tableName property of the output is not working in the first metric file? Is there an example that does this?

@kiranbobba

With the configuration above, the job throws the following error. It works fine when I remove the tableName property. What could be the reason?

Caused by: com.yotpo.metorikku.exceptions.MetorikkuWriteFailedException: Failed to write dataFrame: df1 to output: File on metric: metric
.......
.......
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'database1' not found;

@mike-vee

It means that you are creating a Hive external table. The error tells you that the database doesn't exist in the metastore, so you should create it beforehand.
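
For reference, a minimal sketch of creating the database beforehand. It assumes you can run SQL (for example through a Spark SQL or Hive session) against the same metastore the job uses on the new cluster, which this thread does not confirm. The name database1 is taken from the tableName value in the question.

```sql
-- List the databases the metastore currently knows about,
-- to confirm that database1 is missing on the new cluster.
SHOW DATABASES;

-- Create the database referenced by tableName (database1.table1).
-- IF NOT EXISTS makes the statement safe to run repeatedly.
CREATE DATABASE IF NOT EXISTS database1;
```

If SHOW DATABASES already lists database1 there, the job is probably pointed at a different metastore than the session used for this check.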
