Big data projects consisting of a Java Spark submit application, a Spark Streaming application and a Spark ML application.
All projects run in Docker containers, and Java is used for their implementation.
The data being processed is uploaded to HDFS with one namenode and one datanode. Hadoop also runs inside a Docker container.
The data is processed with the Apache Spark platform running in Docker containers, with one master node and two worker nodes.
To build and run this app locally you will need a few things:
- Install Docker
- Clone the project
git clone git@github.com:JKostov/bigdata.git
- Create a folder for the dataset
mkdir big-data
- Change directory
cd big-data
- Download the dataset
wget 'https://public.boxcloud.com/d/1/b1!hhQ1MT3fnKstfvZBWVvvkHFhTpV7mkLH2QahJsn_yai2r57fxzZ2l09QoqPwJFlrS87VeaL8tm828iB-eHdgUzK68BZ9D24iDMWzPMKvy1CSld6cRdViKriLwbujJQl0-yrgde6HVpjkpxc4-4xiOWF-XEVNyEac8nUpMhEp1surgk64smNRDMJ2_4rQV57QMIVbAVDJcuarYSZHt9HJ3ZKnxYcB2iOcf6a1yng_bN2XZt0XyaboAbSWzWifPgRTEpJ06J3HFPAOxKX8QJqmkaj-gMzLEEZsvIR52WDPoN7c9Nvy62S-_8eJqMN2fuIikxL-WBLIgV5V7a7iZIsoS57v8HhM2PgwAD_-Evs3TYGNlHrElzFg7Hz-dMdO9xnZfMA8fNlK3wD0IcK4ixGdvZVjSoOxZKccfHeOeTmBJjxPR-kLWJqbgKemHKIyzxxvk5NWysh4OkIH3SoLtp8bpdcyWidF1yibvOfkDL6OBrR-0Yrz7GYgs1oGtDBXu000s6r3y8TaJ6EaoNDTgH6EieKewzYwisRCP4RMfkLbBrQer9pBjBdB02ygw713w-MkpQgrh-3BJ-uDYECSxngkx3-IFqe8m-l46jXOtPjPCCL_6u5uaiD7HZlrftbY6elFVtiCJR6JnFHtyA3wVk8LUWwptfy9bF8Ip13qxmRnn1CHNVGYzrlGFOav-O9WiUFyTvg82gvDRhEM5de7ktmMthMuukE-HQJzHPHphFd_WdfRPYCmobReubx7h0vK0MuMStwiYBc679vODR5FkVOrZsoYhLX1sjdHVJF6wZhlSEayfTOWp17-6PXSm0ql0LbGB-Oo4xZVkE3cgnVV0y1Xadhr1jCOPwYuKeBX3w-We0nvdBqcH_kiP2gAnLt_nueWaLe9sMi52rPPEcxEj8Ma6ZknpJBhE0WnwSmcdc_ihfvhpQWaAFniAG8rOTeQkNWRaHuW8swb18m1sUbJVaGP6eHLG9uhwJmFUyesjN7CNlbw09wK-E874ceQoIkj5X6INl8zBvvoEX7Fbl6PX4I3pU1ag3pUdOLww4kFcNIvoyE6ztBQHW-FB1OiJSslpoVu_xqYbhC5PWN3Ve7eDlTs_RBozikoZmzw7vLo3JRlWpwX5K0wLwRQsqsdcAPBRfiq8pHoPMYHvRd-AtHAoJOpAjMGLgYi_e7RU7BZRxOtdTEb0txQ7Xi8IHJgMXkC96RkEp46xLs3ThNRpqOKy6qAV008BOdnXPTDTWl7ys12Ukqy-N6GZ1yuq4Hm3mXnBfvnmWO2hOiQVE9dDQlVGb-QdBmrAgJ5FpxB5k_cSETnQC6Wlow./download'
- If the command doesn't work, download the dataset from this link
- Extract the dataset
tar xf download
- Rename the extracted dataset
mv TrafficWeatherEvent_Aug16_June19_Publish.csv traffic-data.csv
- Start the Docker containers
docker-compose up
- Wait for all containers to start, then run the next command, passing the project number (1, 2, 3 or 4) as an argument
./run.sh
./run.sh 1
./run.sh 2
./run.sh 3
./run.sh 4
The dataset contains traffic and weather events; this project filters it and uses only the traffic data. A traffic event is a spatiotemporal entity, associated with a location and a time. Dataset description link - read the traffic data part only. The following are the available traffic event types:
- Accident: Refers to a traffic accident, which can involve one or more vehicles.
- Broken-Vehicle: Refers to the situation when there is one (or more) disabled vehicle(s) in a road.
- Congestion: Refers to the situation when the speed of traffic is lower than the expected speed.
- Construction: Refers to an on-going construction, maintenance, or repairing project on a road.
- Event: Refers to situations such as a sports event, concert, or demonstration.
- Lane-Blocked: Refers to the cases when we have blocked lane(s) due to traffic or weather condition.
- Flow-Incident: Refers to all other types of traffic events. Examples are broken traffic light and animal in the road.
The data is stored in a CSV file with the following columns:
- EventId: This is the identifier of a record
- Type: The type of an event; examples are accident and congestion.
- Severity: The severity of an event, wherever applicable. For a traffic event, severity is a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the event) and 4 indicates a significant impact on traffic (i.e., long delay).
- TMC: Each traffic event has a Traffic Message Channel (TMC) code which provides a more detailed description on type of the event.
- Description: The natural language description of an event.
- StartTime (UTC): The start time of an event in UTC time zone.
- EndTime (UTC): The end time of an event in UTC time zone.
- TimeZone: The US-based timezone based on the location of an event (eastern, central, mountain, and pacific).
- LocationLat: The latitude in GPS coordinate.
- LocationLng: The longitude in GPS coordinate.
- Distance (mi): The length of the road extent affected by the event.
- AirportCode: The closest airport station to the location of a traffic event.
- Number: The street number in address record.
- Street: The street name in address record.
- Side: The relative side of a street (R/L) in address record.
- City: The city in address record.
- County: The county in address record.
- State: The state in address record.
- ZipCode: The zipcode in address record.
80% of the traffic data are congestion events.
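As a minimal sketch of how this CSV can be loaded, the snippet below reads it from HDFS with Spark and keeps only the traffic rows. The HDFS path, the Spark session setup and the way traffic rows are distinguished (here by matching the Type column against the traffic event types listed above) are assumptions; the actual projects may do this differently.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TrafficDataLoader {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("traffic-data-loader")
                .getOrCreate();

        // Read the CSV with a header row and let Spark infer the column types.
        Dataset<Row> events = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs://namenode:9000/data/traffic-data.csv"); // assumed path

        // Keep only the traffic events by matching the traffic event types listed above.
        Dataset<Row> traffic = events.filter(events.col("Type").isin(
                "Accident", "Broken-Vehicle", "Congestion", "Construction",
                "Event", "Lane-Blocked", "Flow-Incident"));

        traffic.select("EventId", "Type", "Severity", "City").show(10, false);

        spark.stop();
    }
}
```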
The implementation of the first project is in the `submit` folder, which contains one Java Maven application. It batch-processes the whole dataset and extracts the following data:
- The first query shows the traffic event types in one city, within the given timespan, that occurred more than N times (see the sketch after this list).
- The second query shows the length of the road affected by events in one city for the given timespan.
- The third query shows the cities with the maximum and minimum number of accidents in the given period.
- The last query shows the average congestion duration per city in the given period.
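A minimal sketch of how the first query could look with the Spark Dataset API, assuming the traffic data has already been loaded as shown earlier. The exact start-time column name and the filtering approach are assumptions; the real implementation in the `submit` folder may differ.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class FirstQuerySketch {

    /**
     * Traffic event types in one city, within the given timespan,
     * that occurred more than minOccurrences times.
     */
    public static Dataset<Row> eventTypesInCity(Dataset<Row> traffic,
                                                String city,
                                                String from,
                                                String to,
                                                long minOccurrences) {
        return traffic
                .filter(col("City").equalTo(city))
                // Column name assumed to match the CSV header; adjust if it differs.
                .filter(col("StartTime(UTC)").geq(from).and(col("StartTime(UTC)").leq(to)))
                .groupBy(col("Type"))
                .agg(count("*").alias("occurrences"))
                .filter(col("occurrences").gt(minOccurrences));
    }
}
```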
To run the first project, use the following command:
./run.sh 1
Internally, the script uses the `docker-compose-1.yaml` file.
The implementation of the second project is in the `streaming-producer` and `streaming-consumer` folders. It uses Apache Kafka to publish data from the `streaming-producer` to the `streaming-consumer` application. The `streaming-producer` is a Java Maven application that reads the file on HDFS line by line, keeps only the traffic data and sends it to a Kafka topic. The `streaming-consumer` reads the data from the Kafka topic, processes it, and stores the processed data in a MongoDB database.
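A minimal sketch of the producer side, assuming a Kafka broker at kafka:9092, a topic named traffic-events, the namenode address used earlier, and that traffic rows can be recognized by an EventId prefix; all of these are assumptions, and the real `streaming-producer` may differ.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TrafficProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed namenode address

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/traffic-data.csv")), StandardCharsets.UTF_8))) {

            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: traffic rows are recognizable by an EventId prefix;
                // the real producer may filter on the Type column instead.
                if (line.startsWith("T")) {
                    producer.send(new ProducerRecord<>("traffic-events", line));
                }
            }
        }
    }
}
```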
- The application finds the minimum, maximum and average congestion duration per city.
- The processing of each batch that the `streaming-consumer` receives from the Kafka topic is distributed among the workers. Each batch is processed independently of the previous one, so finding the min, max and avg values requires state that connects the batches. For that purpose the application uses stateful processing, and the current min, max and avg values are stored in that state, as sketched below.
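A minimal sketch of that stateful step, assuming the consumer uses the classic Spark Streaming updateStateByKey API over (city, congestion duration) pairs; the actual consumer may use a different state API, field names or units. Note that updateStateByKey requires a checkpoint directory to be configured on the streaming context.

```java
import java.io.Serializable;
import java.util.List;

import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class CongestionStateSketch {

    /** Running min/max/sum/count of congestion durations (in minutes) for one city. */
    public static class CongestionStats implements Serializable {
        public double min = Double.POSITIVE_INFINITY;
        public double max = Double.NEGATIVE_INFINITY;
        public double sum = 0;
        public long count = 0;

        public double avg() {
            return count == 0 ? 0 : sum / count;
        }
    }

    /**
     * Carries the min/max/avg across batches: the state for each city is updated
     * with the congestion durations seen in the current batch.
     */
    public static JavaPairDStream<String, CongestionStats> withState(
            JavaPairDStream<String, Double> durationsPerCity) {
        return durationsPerCity.<CongestionStats>updateStateByKey(
                (List<Double> newDurations, Optional<CongestionStats> state) -> {
                    CongestionStats stats = state.orElse(new CongestionStats());
                    for (double d : newDurations) {
                        stats.min = Math.min(stats.min, d);
                        stats.max = Math.max(stats.max, d);
                        stats.sum += d;
                        stats.count++;
                    }
                    return Optional.of(stats);
                });
    }
}
```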
To run the second project, use the following command:
./run.sh 2
Internally, the script uses the `docker-compose-2.yaml` file.
The implementation of the third project is in the `ml-batch` folder. It uses Spark ML.
- First, the data is filtered by type (traffic or weather)
- Then the weather events are joined with the traffic events by date-time and by location within 5 km
- The grouped data is transformed so it can be used for ML training, and it is split into 70% for training the ML model and 30% for predictions
- A RandomForestClassificationModel is trained
- The model uses `WeatherType`, `WeatherSeverity`, `Latitude` and `Longitude` to predict the traffic event type (a training sketch follows this list)
- After that, evaluation of the predictions is done, and the result is that the model is 70% accurate.
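A minimal sketch of such a training pipeline, assuming the joined data is already available as a Dataset<Row> with `WeatherType`, `WeatherSeverity`, `Latitude`, `Longitude` and a `TrafficType` label column; the column names, the indexing of the categorical columns and the model path are assumptions, and the actual code in `ml-batch` may differ.

```java
import java.io.IOException;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class TrafficModelTrainingSketch {

    public static void train(Dataset<Row> joined) throws IOException {
        // Turn the traffic event type into a numeric label (done outside the pipeline,
        // so the saved pipeline can later be applied to data without a label column).
        Dataset<Row> labeled = new StringIndexer()
                .setInputCol("TrafficType").setOutputCol("label")
                .fit(joined).transform(joined);

        // Index the categorical weather columns (assumed to be strings).
        StringIndexer weatherTypeIndexer = new StringIndexer()
                .setInputCol("WeatherType").setOutputCol("WeatherTypeIndex");
        StringIndexer weatherSeverityIndexer = new StringIndexer()
                .setInputCol("WeatherSeverity").setOutputCol("WeatherSeverityIndex");

        // Combine the four features into a single vector column.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{
                        "WeatherTypeIndex", "WeatherSeverityIndex", "Latitude", "Longitude"})
                .setOutputCol("features");

        RandomForestClassifier rf = new RandomForestClassifier()
                .setLabelCol("label")
                .setFeaturesCol("features");

        Pipeline pipeline = new Pipeline().setStages(
                new PipelineStage[]{weatherTypeIndexer, weatherSeverityIndexer, assembler, rf});

        // 70/30 split, as described above.
        Dataset<Row>[] splits = labeled.randomSplit(new double[]{0.7, 0.3});
        PipelineModel model = pipeline.fit(splits[0]);
        Dataset<Row> predictions = model.transform(splits[1]);

        // Evaluate the predictions on the held-out 30%.
        double accuracy = new MulticlassClassificationEvaluator()
                .setLabelCol("label")
                .setPredictionCol("prediction")
                .setMetricName("accuracy")
                .evaluate(predictions);
        System.out.println("Accuracy: " + accuracy);

        // Save the fitted pipeline to HDFS so the streaming project can reuse it (path assumed).
        model.write().overwrite().save("hdfs://namenode:9000/models/traffic-rf");
    }
}
```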
The implementation of the last project is in the `ml-streaming` folder. It uses the trained Spark ML model to predict values coming from a Kafka topic. After the third project trains the model on the dataset, it saves the model to the Hadoop HDFS. This project loads that model and uses it to predict values from the stream.
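A minimal sketch of that prediction step, assuming Structured Streaming reads raw CSV lines from the traffic-events topic and the model was saved as a pipeline at the path used in the previous sketch; the topic name, broker address, model path and the simplified line parsing are all assumptions.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class TrafficPredictionStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("ml-streaming-sketch")
                .getOrCreate();

        // Load the model that the ml-batch project saved to HDFS (path assumed).
        PipelineModel model = PipelineModel.load("hdfs://namenode:9000/models/traffic-rf");

        // Read raw CSV lines from the Kafka topic (broker and topic names assumed).
        Dataset<Row> lines = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "traffic-events")
                .load()
                .selectExpr("CAST(value AS STRING) AS line");

        // Very simplified parsing: split each line and pick the assumed feature positions.
        Dataset<Row> features = lines
                .withColumn("fields", split(col("line"), ","))
                .select(
                        col("fields").getItem(0).alias("WeatherType"),
                        col("fields").getItem(1).alias("WeatherSeverity"),
                        col("fields").getItem(2).cast("double").alias("Latitude"),
                        col("fields").getItem(3).cast("double").alias("Longitude"));

        // Apply the trained pipeline to each micro-batch and print the predictions.
        StreamingQuery query = model.transform(features)
                .select("WeatherType", "WeatherSeverity", "prediction")
                .writeStream()
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```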