This project creates a recursive (flattened-hierarchy) data set: it takes a Hive SELECT query returning child and parent values and writes a flattened data set with four columns (child, parent, level, dp_proc_time) in the specified output format.
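As an illustration of the output shape, here is a minimal sketch (assumptions: `level` counts the hops from a child up to the given ancestor, and `dp_proc_time` is a processing timestamp, omitted below; neither is confirmed by this README):

```python
# Hypothetical input rows from the Hive query, as (child, parent) edges.
input_rows = [("b", "a"), ("c", "b")]

# Expected flattened output as (child, parent, level) -- one row per
# (child, ancestor) pair, dp_proc_time column omitted for brevity.
expected = [
    ("b", "a", 1),  # b is a direct child of a
    ("c", "b", 1),  # c is a direct child of b
    ("c", "a", 2),  # c is two levels below a, reached via b
]
```

The point of the flattening is that every ancestor relationship becomes an explicit row, so downstream queries need no recursion.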
- Build the assembly jar with `sbt clean assembly`
- Execute the assembled jar with `spark-submit`:
```sh
<spark_home>/bin/spark-submit \
  --class org.kaveh_hariri.utility.spark.hive_recursion.MainRun \
  --master <master> \
  --conf <key=value> --conf <key=value> \
  <path/to/assembly.jar> \
  "SELECT child, parent FROM <hiveschema.hivetable>" \
  "s3a://<output/path>" \
  <format (orc, parquet, etc.)>
```
This is a Spark remake of the Hive UDF described at https://blog.pythian.com/recursion-in-hive/ -- the original UDF did not function properly because of the distributed nature of these frameworks. This project works correctly because a distinct map of the child/parent values is distributed to every node via a broadcast variable, so each node can resolve full ancestor chains locally.
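The broadcast idea can be sketched as follows (a hedged illustration, not the project's actual code: the function and variable names are hypothetical, and a plain dict stands in for the broadcast variable; in real PySpark this would be `bc = sc.broadcast(parent_map)` with workers reading `bc.value`):

```python
# Distinct child -> parent map; on a cluster this would be broadcast
# once and shipped whole to every executor.
parent_map = {"b": "a", "c": "b", "d": "a"}

def flatten_partition(rows, bc_value):
    """Flatten one partition of (child, parent) rows.

    Because every executor holds the FULL parent map, any row's
    ancestor chain can be walked locally, with no shuffles and no
    dependence on how rows were partitioned (the failure mode of
    the original UDF approach).
    """
    out = []
    for child, _parent in rows:
        level, node = 1, child
        while node in bc_value:
            node = bc_value[node]
            out.append((child, node, level))
            level += 1
    return out

# Two "partitions" of the input, as Spark might split them across executors.
part1 = [("b", "a")]
part2 = [("c", "b"), ("d", "a")]
result = flatten_partition(part1, parent_map) + flatten_partition(part2, parent_map)
```

Note that `("c", "b")` and `("b", "a")` land in different partitions, yet the row `("c", "a", 2)` is still produced, because the full map is available everywhere.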