Description
Hi,
I have been exploring OpenLineage's Spark lineage support, looking into the codebase and testing out behaviours, and was wondering whether the following scenario is something that could be implemented but is currently missing, a bug, or a technical limitation of what is possible with the logical plan in Spark.
When creating a table for the first time using saveAsTable, or whenever the overwrite mode is used, the symlink facet linking to the destination Hive table is missing from the execute_insert_into_hadoop_fs_relation_command event. Is this because that specific step doesn't have the catalog details at that point and can't access the earlier step in the plan that contains the table-creation command? I'm not sure whether the columnLineage process, as it runs, is able to pick up the catalog details for the new table at the same time.
df = spark.createDataFrame(
    [(100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
    schema=["age", "name"],
)

# first write creates the table: event doesn't contain the symlink facet
df.write.mode("append").saveAsTable("a_database.b_table")

# table now exists: event will now contain the symlink facet
df.write.mode("append").saveAsTable("a_database.b_table")

# overwrite: event will again not contain the symlink facet
df.write.mode("overwrite").saveAsTable("a_database.b_table")

insertInto works fine, but I'm guessing that's because the target table needs to already exist.
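For anyone reproducing this, a minimal sketch of how I'm checking for the facet: assuming the emitted run events are captured as JSON (e.g. via the console transport), this hypothetical helper pulls the symlink identifiers out of each output dataset following the OpenLineage event shape (outputs → facets → symlinks → identifiers). The sample event in the usage note is illustrative, not a real capture.

```python
def output_symlink_identifiers(event: dict) -> list:
    """Collect symlink identifiers from every output dataset in a run event.

    Returns an empty list when no output carries the "symlinks" facet,
    which is the situation described above for first writes and overwrites.
    """
    identifiers = []
    for output in event.get("outputs", []):
        facet = output.get("facets", {}).get("symlinks")
        if facet:
            identifiers.extend(facet.get("identifiers", []))
    return identifiers
```

An appended write to an existing table then yields something like one identifier with namespace "hive", name "a_database.b_table", and type "TABLE", while the first write and the overwrite yield an empty list.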
If it turns out to be implementable, or is an actual bug, I'll update this issue to a feature request or bug report respectively.