
Iceberg table (Glue catalog) - unable to fetch, InputFormat cannot be null #359

Closed · JanKrl opened this issue Mar 15, 2024 · 8 comments
Labels: bug (Something isn't working)

Comments


JanKrl commented Mar 15, 2024

Describe the bug

When reading a source table (Iceberg) I get the following error:

Database Error in model stg__countries (models\staging\stg__countries.sql)
Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''/* {"app": "dbt", "dbt_version": "1.7.8", "profile_name": "project", "target_name": "aws", "node_id": "model.profile.stg__countries"} */

create table `dbtstack_warehouse`.`stg__countries`
using iceberg

TBLPROPERTIES (
         'format-version'='2'
)

LOCATION 's3://warehouse-bucket/dbtstack_warehouse\stg__countries'
as

select
        country_name
from
        dbtstack_warehouse.countries
''', use_arrow=False, location='s3://warehouse-bucket/'), AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table countries. StorageDescriptor#InputFormat cannot be null for table: countries (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
  compiled Code at target\run\project\models\staging\stg__countries.sql

After some googling, I found a suggestion to add glue_catalog before the table name. This results in:

Database Error in model stg__countries (models\staging\stg__countries.sql)
Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''/* {"app": "dbt", "dbt_version": "1.7.8", "profile_name": "project", "target_name": "aws", "node_id": "model.project.stg__countries"} */

create table `dbtstack_warehouse`.`stg__countries`
using iceberg

TBLPROPERTIES (
      'format-version'='2'
)

LOCATION 's3://warehouse-bucket/dbtstack_warehouse\stg__countries'
as

select
       country_name
from
        glue_catalog.dbtstack_warehouse.countries
''', use_arrow=False, location='s3://warehouse-bucket/'), AnalysisException: spark_catalog requires a single-part namespace, but got [glue_catalog, dbtstack_warehouse]
compiled Code at target\run\project\models\staging\stg__countries.sql

Steps To Reproduce

  1. Source table created in Athena
CREATE TABLE countries (id bigint, country_name string, category string)  LOCATION 's3://warehouse-bucket/countries'  TBLPROPERTIES ( 'table_type' = 'ICEBERG' )
  2. Inserted some test data
INSERT INTO countries values
(1, 'NL','asdasd'),
(2, 'PL','hhyth'),
(3, 'Moon','why not?')
  3. Table shows up in Glue Catalog:
  • Table format: Apache Iceberg
  • Input format: -
  • Output format: -
  • Serde serialization lib: -

As far as I can tell, this is the expected outcome (a boto3 check of what Glue returns for this table is sketched after these steps).

  4. Profiles.yml
project:
  outputs:
    aws:
      type: glue
      glue_version: "4.0"
      query-comment: DBT model
      role_arn: <<arn>>
      region: eu-central-1
      location: s3://warehouse-bucket/
      schema: dbtstack_warehouse
      session_provisioning_timeout_in_seconds: 120
      workers: 2
      worker_type: G.1X
      idle_timeout: 5
      datalake_formats: iceberg  
  target: aws

I also tried all sorts of additional configs based on what I found online:

conf: "--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.warehouse=s3://warehouse-bucket/dbtstack_warehouse/
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.warehouse=s3://warehouse-bucket/dbtstack_warehouse/" 
  5. source_tables.yml
version: 2

sources:
  - name: data_source
    schema: dbtstack_warehouse
    tables:
      - name: countries
  6. stg__countries.sql
{{ config(
	materialized='table',
	file_format='iceberg',
	table_properties={'format-version': '2'}
)}}

select
	country_name
from 
	glue_catalog.{{ source('data_source', 'countries') }}
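
For reference, here is a minimal boto3 sketch (outside dbt, using the database and table names from step 3 above) to confirm what Glue actually stores for the Iceberg table's storage descriptor, which is where the null InputFormat in the error comes from:

import boto3

# Inspect the Glue Catalog entry that Athena created for the Iceberg table.
# Database, table and region names are the ones used in this repro.
glue = boto3.client("glue", region_name="eu-central-1")
table = glue.get_table(DatabaseName="dbtstack_warehouse", Name="countries")["Table"]

print(table.get("Parameters", {}).get("table_type"))   # expected: ICEBERG
print(table["StorageDescriptor"].get("InputFormat"))   # expected: None (absent) for Iceberg tables
print(table["StorageDescriptor"].get("OutputFormat"))  # expected: None (absent) as well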

Expected behavior

The source table should have been read and DBT should have created a new table from the model.

System information

The output of dbt --version:

Core:
  - installed: 1.7.8
  - latest:    1.7.10 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - glue:  1.7.2 - Up to date!
  - spark: 1.7.1 - Up to date!

The operating system you're using:

  • Windows 11 Pro

The output of python --version:
Python 3.11.0

Additional context

  • Some names have been modified manually to remove potentially sensitive values
  • The intention is to run DBT in a container, as a scheduled Fargate task
JanKrl added the bug label on Mar 15, 2024

JanKrl commented Mar 15, 2024

Another relevant finding for this issue: when creating a new table from a seed, a non-Iceberg table is created:
Input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe


JanKrl commented Mar 19, 2024

This article about DBT and Glue doesn't mention this specifically, but it seems like dbt-glue is not able to read Iceberg tables (InputFormat cannot be null). In their setup they use Hive tables for the intermediate stage and Iceberg only for the final layer.
Furthermore, it doesn't work on Glue 4.0, but it seems to work on Glue 3.0.

Can anyone confirm my conclusion that Iceberg tables can be used only in the final stage of the processing pipeline?

@moryachok

Have the same issue with Iceberg. Maybe it's also related to the fact that I use Lake Formation.


eshetben commented Mar 28, 2024

hi, i have the same issue (dbt and dbt-glue 1.7, glue 4.0, with lake formation), so i tried replicating the dbt code and running it in a glue notebook, and i did get the exact same error in the notebook as well.

adding glue_catalog. to the table name did work for me in the notebook, but i couldn't really apply this solution to dbt, since i don't have control over that piece of code.

instead - i added these configs:

    .config("spark.sql.defaultCatalog", "glue_catalog") \
    .config("spark.sql.catalog.glue_catalog.default-namespace", "via_stage") \

that also worked in the notebook, since the job now used my catalog instead of the default one (named default).
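
for completeness, a minimal self-contained pyspark sketch of such a session (just a sketch: the catalog name, namespace and warehouse path are the examples from this issue, and the iceberg spark runtime jars are assumed to already be on the classpath, which is presumably what datalake_formats: iceberg takes care of on the glue side):

from pyspark.sql import SparkSession

# sketch only: a spark session whose default catalog is the iceberg glue catalog,
# so iceberg tables resolve without the glue_catalog. prefix
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://warehouse-bucket/")
    .config("spark.sql.defaultCatalog", "glue_catalog")
    .config("spark.sql.catalog.glue_catalog.default-namespace", "dbtstack_warehouse")
    .getOrCreate()
)

# with glue_catalog as the default catalog, the source table resolves without the prefix
spark.sql("select country_name from dbtstack_warehouse.countries").show()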

however, i still couldn't get dbt to work, even though i added these two configs in the profiles yaml.
i have a suspicion that dbt is not using these configs properly...

to conclude - i've identified 2 problems:

  1. session is using default catalog instead of glue_catalog
  2. configs might not be used properly by dbt

@eshetben

update - got it working, don't know why it didn't work before...

the solution was adding the default configs -

        --conf spark.sql.defaultCatalog=glue_catalog
        --conf spark.sql.catalog.glue_catalog.default-namespace=<schema>


JanKrl commented Apr 2, 2024

> update - got it working, don't know why it didn't work before...
>
> the solution was adding the default configs -
>
>         --conf spark.sql.defaultCatalog=glue_catalog
>         --conf spark.sql.catalog.glue_catalog.default-namespace=<schema>

When trying this I get the error:
Catalog 'glue_catalog' plugin class not found: spark.sql.catalog.glue_catalog is not defined. I tried with both Glue 3.0 and 4.0

This is my conf now:

--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.defaultCatalog=glue_catalog
--conf spark.sql.catalog.glue_catalog.default-namespace=<schema-name>


eshetben commented Apr 2, 2024

@JanKrl this is exactly what i have (only spark.sql.catalog.glue_catalog.warehouse might be missing) and it's working for me.

did you make sure to leave out the first --conf from the string? i made that mistake 😅 so no config was actually used

      conf: >
        spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
        --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        ...


JanKrl commented Apr 2, 2024

That did the trick!
Plus, I had to set --conf spark.sql.catalog.glue_catalog.warehouse=<s3-bucket> due to the error: IllegalArgumentException: Cannot initialize GlueCatalog because warehousePath must not be null

For the sake of clarity, here is the full config:

type: glue
glue_version: "3.0"
query-comment: DBT model for Iceberg tables
role_arn: <role-arn>
region: eu-central-1
location: <s3-bucket>
schema: <schema-name>
session_provisioning_timeout_in_seconds: 120
workers: 2
worker_type: G.1X
idle_timeout: 5
datalake_formats: iceberg
conf: >
  spark.sql.defaultCatalog=glue_catalog
  --conf spark.sql.catalog.glue_catalog.warehouse=<s3-bucket>
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
