Spatial Query on NYC Cab Data

This repo provides scripts to run multiple spatial queries on a large database that contains geographic data as well as real-time location data for customers of for-hire vehicle companies (Uber, Lyft, etc.). Most of the raw data comes from the NYC Taxi & Limousine Commission, with records starting in 2009.

Statistics through December 31, 2019:

2.63 billion total trips
1.69 billion taxi
935 million for-hire vehicle
291 GB of raw data
Database takes up 391 GB on disk with minimal indexes

Spatial Query

A spatial query is a special type of query supported by geodatabases and spatial databases. The queries differ from traditional SQL queries in that they allow for the use of points, lines, and polygons. The spatial queries also consider the relationship between these geometries.

Why Use Apache Spark

The database is large and mostly unstructured, so Spark SQL is a better fit than a traditional database. The goal of the project is to extract data from this database that can be used for operational (day-to-day) and strategic (long-term) decisions.

Which Spatial Queries Are Used

Assumption

A rectangle R represents a geographical boundary in a town or city, and a set of points P represents customers who request taxi cab service using your client firm’s app.

Range query: Given a query rectangle R and a set of points P, find all the points within R. You need to use the ‘ST_Contains’ function in this query.
Range join query: Given a set of rectangles R and a set of points P, find all (point, rectangle) pairs such that the point is within the rectangle.
Distance query: Given a fixed point location P and distance D (in kilometers), find all points that lie within a distance D from P. You need to use the ‘ST_Within’ function in this query.
Distance join query: Given two sets of points P1 and P2, and a distance D (in kilometers), find all (p1, p2) pairs such that p1 is within a distance D from p2 (i.e., p1 belongs to P1 and p2 belongs to P2). You need to use the ‘ST_Within’ function in this query.
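The two predicates named above can be sketched in plain Scala before they are wired into Spark. This is a minimal sketch, not the repository's implementation: it assumes a rectangle is encoded as the string "x1,y1,x2,y2" and a point as "x,y" (this README does not state the input format), and it measures distance as plain Euclidean distance in coordinate units rather than kilometers. The object name SpatialPredicates and its method names are hypothetical.

```scala
object SpatialPredicates {

  // Parses a comma-separated list of numbers, e.g. "1.5, 2.0" -> Array(1.5, 2.0).
  private def parse(s: String): Array[Double] =
    s.split(",").map(_.trim.toDouble)

  // ST_Contains logic: true if the point lies inside or on the rectangle.
  // ASSUMPTION: rectangle is "x1,y1,x2,y2" (two opposite corners), point is "x,y".
  def stContains(queryRectangle: String, pointString: String): Boolean = {
    val r = parse(queryRectangle)
    val p = parse(pointString)
    require(r.length == 4, "rectangle must have four coordinates")
    require(p.length == 2, "point must have two coordinates")
    // The corners may arrive in any order, so normalise to min/max first.
    val (minX, maxX) = (math.min(r(0), r(2)), math.max(r(0), r(2)))
    val (minY, maxY) = (math.min(r(1), r(3)), math.max(r(1), r(3)))
    p(0) >= minX && p(0) <= maxX && p(1) >= minY && p(1) <= maxY
  }

  // ST_Within logic: true if the two points are at most `distance` apart.
  // ASSUMPTION: Euclidean distance in the same units as the coordinates,
  // not a geodesic distance in kilometers.
  def stWithin(pointString1: String, pointString2: String, distance: Double): Boolean = {
    val a = parse(pointString1)
    val b = parse(pointString2)
    require(a.length == 2 && b.length == 2, "points must have two coordinates")
    math.hypot(a(0) - b(0), a(1) - b(1)) <= distance
  }
}
```

Under these assumptions, SpatialPredicates.stContains("0,0,10,10", "5,5") is true and SpatialPredicates.stWithin("0,0", "3,4", 5.0) is true. If D has to be interpreted in kilometers on latitude/longitude data, a geodesic distance such as the haversine formula would be needed in place of math.hypot.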

Requirements

Scala Version: 2.11.12
JDK Version: 11 (for details on Scala/JDK compatibility, see https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html)
Apache Spark and Spark SQL (an installation guide is available at https://spark.apache.org/docs/latest/)

Installation

Use IntelliJ IDEA with the Scala plug-in, or any other Scala IDE.
Replace the logic of the user-defined functions ST_Contains and ST_Within in SparkSQLExample.scala.
Append .master("local[*]") after .config("spark.some.config.option", "some-value") to tell Spark that the master is the local machine.
In some cases, you may need to open the build.sbt file and change % "provided" to % "compile" in order to debug your code in the IDE.
Run your code in the IDE.
To run the project JAR file on a Spark cluster, use the command "./bin/spark-submit ". For details, see the Spark documentation on submitting applications.
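The local-run steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of SparkSQLExample.scala: the object name LocalRunSketch, the application name, the view name point, the column name location, and the "x1,y1,x2,y2" / "x,y" string encodings are all hypothetical. It assumes Spark SQL is on the classpath (which is why the dependency scope may need to be "compile" when running from an IDE).

```scala
import org.apache.spark.sql.SparkSession

object LocalRunSketch {

  // Builds a session whose master is the local machine, using all cores.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("SpatialQuerySketch") // hypothetical application name
      .config("spark.some.config.option", "some-value")
      .master("local[*]") // appended after .config(...), as described above
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    import spark.implicits._

    // Register a placeholder ST_Contains so it can be called from SQL.
    // ASSUMPTION: rectangle is "x1,y1,x2,y2" and point is "x,y".
    spark.udf.register("ST_Contains", (rect: String, point: String) => {
      val r = rect.split(",").map(_.trim.toDouble)
      val p = point.split(",").map(_.trim.toDouble)
      p(0) >= math.min(r(0), r(2)) && p(0) <= math.max(r(0), r(2)) &&
      p(1) >= math.min(r(1), r(3)) && p(1) <= math.max(r(1), r(3))
    })

    // Tiny in-memory data set standing in for the real point input.
    Seq("5,5", "11,5").toDF("location").createOrReplaceTempView("point")

    // Range query: keep only the points inside the query rectangle.
    val inRange = spark.sql(
      "SELECT location FROM point WHERE ST_Contains('0,0,10,10', location)")
    inRange.show()

    spark.stop()
  }
}
```

Keeping the session construction in its own method makes it easy to drop the .master("local[*]") line again before packaging the JAR for spark-submit, where the master is normally supplied on the command line.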
