Project Information

The project takes customer data and sales data and transforms them into combined customer-sales data for further processing. It currently performs groupBy operations on the following criteria:

Group by State
Group by State and Year
Group by State, Year and Month
Group by State, Year, Month and Day
Group by State, Year, Month, Day and Hour
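
The five grouping levels above can be pictured as one key that grows by a field at each level. The following is a minimal plain-Java sketch of that idea, not the project's Spark code: the Sale class, the "#" key separator, and the choice to sum a sale amount per group are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from the project source.
final class GroupingSketch {

    // Assumed shape of one joined customer-sales record.
    static final class Sale {
        final String state;
        final int year, month, day, hour;
        final double amount;

        Sale(String state, int year, int month, int day, int hour, double amount) {
            this.state = state; this.year = year; this.month = month;
            this.day = day; this.hour = hour; this.amount = amount;
        }
    }

    // Builds a key such as "AL#2017#8" from the first `depth` fields of
    // state, year, month, day, hour (depth 1 = state only, 5 = all five).
    static String key(Sale s, int depth) {
        if (depth < 1 || depth > 5) {
            throw new IllegalArgumentException("depth must be between 1 and 5");
        }
        Object[] parts = {s.state, s.year, s.month, s.day, s.hour};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            if (i > 0) sb.append('#');
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    // Sums the sale amounts per key at the requested grouping depth.
    static Map<String, Double> totals(List<Sale> sales, int depth) {
        Map<String, Double> out = new TreeMap<>();
        for (Sale s : sales) {
            out.merge(key(s, depth), s.amount, Double::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
            new Sale("AL", 2017, 8, 1, 10, 100.0),
            new Sale("AL", 2017, 8, 1, 11, 50.0),
            new Sale("AK", 2016, 2, 5, 9, 25.0));
        // Print the grouped totals for each of the five levels.
        for (int depth = 1; depth <= 5; depth++) {
            System.out.println(totals(sales, depth));
        }
    }
}
```

In a Spark job the same composite key would typically feed a pair-RDD reduction or a DataFrame groupBy; the sketch only shows how the key widens from level to level.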

To view the results for particular states, use the argument -state followed by a comma-separated list of states (for example, -state AL,AK).
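
As a rough illustration of how a comma-delimited -state value could be turned into a filter, here is a small plain-Java sketch. The class and method names are hypothetical, and trimming whitespace, ignoring case, and treating an empty list as "no filter" are assumptions rather than documented behaviour of the project.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the project's own filter logic is not shown in this README.
final class StateFilterSketch {

    // Splits a value such as "AL,AK" into a set of trimmed, upper-cased codes.
    // A null or empty value yields an empty set, treated here as "no filter".
    static Set<String> parseStates(String arg) {
        Set<String> states = new LinkedHashSet<>();
        if (arg == null) {
            return states;
        }
        for (String piece : arg.split(",")) {
            String code = piece.trim().toUpperCase(Locale.ROOT);
            if (!code.isEmpty()) {
                states.add(code);
            }
        }
        return states;
    }

    // Returns true when the record's state passes the filter.
    static boolean accepts(Set<String> filter, String state) {
        return filter.isEmpty() || filter.contains(state.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> filter = parseStates("AL,AK");
        System.out.println(accepts(filter, "AL"));
        System.out.println(accepts(filter, "TX"));
    }
}
```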

Installation

Clone the repository, import the project into your IDE, and update the Maven project to download the jar files the project requires. An IDE such as Eclipse does this automatically on import.

The project's default configuration, with default locations, is in the Properites.java file. Modify it to match your setup, or use the command-line parameters listed below.

Build a jar file from the project and submit it to your Spark cluster.

To build the project:

cd ~/SparkCustomerSales

mvn clean install

After updating the Maven resources, you can go ahead and import the project.

Input Arguments

-c is the customer file location; it can be on HDFS or the local file system

-s is the sales file location; it can be on HDFS or the local file system

-o is the output file location; it can be on HDFS or the local file system

-state is a state filter; provide a list of states delimited by ','

The HDFS or local file system URLs in the commands below can be changed as required.
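
To make the relationship between the defaults and the four flags concrete, here is a minimal plain-Java sketch of flag parsing in which command-line values override defaults. It is an assumption about how such parsing might look, not the project's driver code: the class name, the map-based representation, and the default path in main are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real driver's argument handling is not shown in this README.
final class ArgsSketch {

    // Starts from the given defaults and overrides them with any of
    // -c, -s, -o, -state found on the command line.
    static Map<String, String> parse(String[] args, Map<String, String> defaults) {
        Map<String, String> out = new HashMap<>(defaults);
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            switch (flag) {
                case "-c":
                case "-s":
                case "-o":
                case "-state":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("Missing value for " + flag);
                    }
                    out.put(flag, args[++i]); // consume the flag's value
                    break;
                default:
                    throw new IllegalArgumentException("Unknown argument: " + flag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("-o", "/tmp/output"); // placeholder default, not the project's
        System.out.println(parse(args, defaults));
    }
}
```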

Running the project on YARN

spark-submit --master yarn \
--class com.project.spark.driver.SparkDriver /home/hadoop/spark.jar \
-c hdfs://localhost:8020/user/hadoop/sparkjob/test-customer.txt \
-s hdfs://localhost:8020/user/hadoop/sparkjob/test-sales.txt \
-o hdfs://localhost:8020/user/hadoop/sparkjob/output \
-state AL,AK

Running the project in standalone or local

spark-submit --master local \
--class com.project.spark.driver.SparkDriver /home/hadoop/spark.jar \
-c hdfs://localhost:8020/user/hadoop/sparkjob/test-customer.txt \
-s hdfs://localhost:8020/user/hadoop/sparkjob/test-sales.txt \
-o hdfs://localhost:8020/user/hadoop/sparkjob/output \
-state AL,AK

Running the project with a large dataset

spark-submit --master local \
--executor-memory 4G \
--driver-memory 4G \
--class com.project.spark.driver.SparkDriver /home/hadoop/spark.jar \
-c hdfs://localhost:8020/user/hadoop/sparkjob/test-customer.txt \
-s hdfs://localhost:8020/user/hadoop/sparkjob/test-sales.txt \
-o hdfs://localhost:8020/user/hadoop/sparkjob/output \
-state AL,AK

Rakesh627/SparkCustomerSales