[Question] Symlink Facet missing when creating Hive table for first time #3558

@steven13cooper

Hi,

I have been exploring OpenLineage for Spark lineage, reading the codebase and testing its behaviour. I was wondering whether the following scenario is a missing feature that could be implemented, a bug, or a technical limitation of what is possible with Spark and the logical plan.

When a table is created for the first time with saveAsTable, or whenever overwrite mode is used, the symlink facet that links to the destination Hive table is missing from the execute_insert_into_hadoop_fs_relation_command job. Is this because that step does not have the catalog details at that point and cannot access the earlier step in the plan that contains the table creation command? I am also not sure whether the columnLineage process, when it runs, is able to pick up the catalog details for the new table at the same time.

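# `spark` is an existing SparkSession (for example, the one provided by the pyspark shell)
df = spark.createDataFrame([
    (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
    schema=["age", "name"]
)

# first write creates the table: the event doesn't contain the symlink facet
df.write.mode("append").saveAsTable("a_database.b_table")

# second write appends to the existing table: the event now contains the symlink facet
df.write.mode("append").saveAsTable("a_database.b_table")

# overwrite mode: the event again doesn't contain the symlink facet
df.write.mode("overwrite").saveAsTable("a_database.b_table")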

insertInto works fine, but I'm guessing that is because the target table needs to already exist.
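
For anyone trying to reproduce this, a small helper like the one below could be used to check which output datasets in an emitted event lack the symlink facet. It is only a sketch: it assumes the events have been captured as JSON and follow the OpenLineage layout where the facet sits under `outputs[].facets.symlinks.identifiers`. The function name `outputs_missing_symlinks` and both sample events (namespaces, paths, metastore address) are hypothetical and hand-written for illustration, not captured from a real run.

```python
def outputs_missing_symlinks(event):
    """Return the names of output datasets that carry no symlink identifiers."""
    missing = []
    for dataset in event.get("outputs", []):
        facets = dataset.get("facets") or {}
        # The facet may be absent entirely or present with an empty identifier list.
        identifiers = (facets.get("symlinks") or {}).get("identifiers") or []
        if not identifiers:
            missing.append(dataset.get("name"))
    return missing


# Hypothetical event shaped like the first write: no symlink facet on the output.
first_write = {
    "eventType": "COMPLETE",
    "outputs": [
        {
            "namespace": "hdfs://namenode:8020",
            "name": "/warehouse/a_database.db/b_table",
            "facets": {"schema": {"fields": [{"name": "age"}, {"name": "name"}]}},
        }
    ],
}

# Hypothetical event shaped like the second write: symlink facet points at the Hive table.
second_write = {
    "eventType": "COMPLETE",
    "outputs": [
        {
            "namespace": "hdfs://namenode:8020",
            "name": "/warehouse/a_database.db/b_table",
            "facets": {
                "symlinks": {
                    "identifiers": [
                        {
                            "namespace": "hive://metastore:9083",
                            "name": "a_database.b_table",
                            "type": "TABLE",
                        }
                    ]
                }
            },
        }
    ],
}

print(outputs_missing_symlinks(first_write))
print(outputs_missing_symlinks(second_write))
```

Running the helper over the events from each of the three writes above should make it easy to see which jobs drop the facet.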

If it is possible to implement, or is an actual bug, I'll update this to a feature request or bug report respectively.

Labels: state:needs-triage (Requires triage to determine kind, area and priority)
