spark-hive-recursion

This project builds a recursive (hierarchical) data set with Spark. It takes a Hive SELECT query that returns child and parent values and writes a flattened data set with four columns (child, parent, level, dp_proc_time) in the specified format.

  1. Build the package with sbt clean assembly
  2. Run the assembled package with spark-submit

sh //bin/spark-submit --class org.kaveh_hariri.utility.spark.hive_recursion.MainRun --master --conf --conf "SELECT child, parent FROM <hiveschema.hivetable>" "s3a:///" format (orc,parquet,etc)

This is a Spark remake of the Hive UDF described at https://blog.pythian.com/recursion-in-hive/. The original UDF did not function properly because of the distributed nature of these frameworks. This project works correctly because a distinct map of the child/parent values is distributed to each node using a broadcast variable.
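
The lookup-map idea can be sketched in plain Scala, without a Spark dependency. This is a minimal illustration, not the project's actual code: the names `FlattenSketch`, `Row`, `ancestors`, and `flatten` are hypothetical. It also makes three assumptions: `level` counts the steps from a child up to each ancestor, each child has a single parent, and cycles should stop the walk. The `dp_proc_time` column is left out for brevity.

```scala
object FlattenSketch {
  // One flattened output row (hypothetical shape; dp_proc_time omitted).
  final case class Row(child: String, parent: String, level: Int)

  // Walk upward from `child`, emitting one row per ancestor.
  // `seen` stops the walk if the input contains a cycle (an assumption
  // about the desired behavior, not taken from the project).
  def ancestors(child: String, parentOf: Map[String, String]): List[Row] = {
    @annotation.tailrec
    def loop(current: String, level: Int, seen: Set[String], acc: List[Row]): List[Row] =
      parentOf.get(current) match {
        case Some(p) if !seen(p) =>
          loop(p, level + 1, seen + p, Row(child, p, level) :: acc)
        case _ => acc.reverse
      }
    loop(child, 1, Set(child), Nil)
  }

  def flatten(pairs: Seq[(String, String)]): Seq[Row] = {
    // Distinct child -> parent map. In a Spark job this map would be wrapped
    // with sparkContext.broadcast(...) and read through .value on each
    // executor, so every node resolves ancestors against the full map.
    val parentOf: Map[String, String] = pairs.distinct.toMap
    pairs.map(_._1).distinct.flatMap(c => ancestors(c, parentOf))
  }
}
```

For example, `FlattenSketch.flatten(Seq(("c", "b"), ("b", "a")))` yields rows for c/b at level 1, c/a at level 2, and b/a at level 1. Because every lookup goes against the complete map, no row depends on data held by another partition, which is the property the broadcast variable provides in the distributed job.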