kaveh-hariri/spark-hive-recursion

spark-hive-recursion

This project builds a recursive (hierarchical) data set with Spark. It takes a Hive select query that returns child and parent values and writes a flattened data set with four columns (child, parent, level, dp_proc_time) in the specified output format.

  1. Build the package with sbt clean assembly
  2. Run the assembled package with spark-submit

sh //bin/spark-submit --class org.kaveh_hariri.utility.spark.hive_recursion.MainRun --master --conf --conf "SELECT child, parent FROM <hiveschema.hivetable>" "s3a:///" format (orc,parquet,etc)

This is a Spark remake of the Hive UDF described at https://blog.pythian.com/recursion-in-hive/. The original UDF did not function properly because of the distributed nature of these frameworks. This project works correctly because a distinct map of the child/parent values is distributed to each node using a broadcast variable.
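The idea of resolving the hierarchy from one shared child/parent map can be sketched in plain Python. This is only an illustration and not the project's code: the function name `flatten_hierarchy`, the `max_depth` limit, and the cycle guard are hypothetical. It also assumes that each child has a single parent and that `level` counts the hops from a child to an ancestor, neither of which the description above states. In Spark, the `parent_of` dictionary would be the value shipped to every node as a broadcast variable.

```python
from datetime import datetime, timezone


def flatten_hierarchy(pairs, max_depth=100):
    """Return (child, ancestor, level, dp_proc_time) rows for each ancestor of each child.

    Assumptions (not confirmed by the project): one parent per child, and
    level = number of hops from the child up to the ancestor.
    """
    # Distinct child -> parent lookup. In Spark this dict would be broadcast
    # so that every executor can walk the full hierarchy locally.
    parent_of = {child: parent for child, parent in pairs if parent is not None}
    proc_time = datetime.now(timezone.utc).isoformat()

    rows = []
    for child in parent_of:
        seen = {child}
        current, level = child, 0
        # Walk upward until a root is reached or the depth limit is hit.
        while current in parent_of and level < max_depth:
            ancestor = parent_of[current]
            if ancestor in seen:
                break  # cycle guard: stop instead of looping forever
            level += 1
            rows.append((child, ancestor, level, proc_time))
            seen.add(ancestor)
            current = ancestor
    return rows


if __name__ == "__main__":
    for row in flatten_hierarchy([("c", "b"), ("b", "a"), ("a", None)]):
        print(row)
```

Because every lookup needs the whole map, a child whose ancestors live in different partitions can only be resolved if each node holds a full copy, which is consistent with the broadcast approach described above.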
