Uber Analytics Application
--------------------------

This project provides analytics over Uber trips.

Build
------
Prerequisites: Java and Maven.
The project is Mavenized and uses the standard Maven build process.
The application is started by running Main as a Java application.

Steps to build and run:
1) After cloning the project, navigate to the analytics-engine/ folder and run
mvn clean compile assembly:single
2) To start the service, run the following from the generated target/ folder:
java -cp analytics-engine-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.uber.analytics.Main

The above starts the server, which waits for HTTP requests to ingest data and serve query responses.

Scripts
--------
Various helper scripts are present under the scripts directory.

The scripts are written in Ruby, so you will need Ruby installed.
Each script contains comments describing its purpose; they are briefly explained below.

1) seed_data.rb -> generates a sample set of seed data to work with.

From the scripts folder, run:

ruby seed_data.rb

This writes the seed data to tmp/seed_data.txt (a sample seed data file is included in the test_data folder).

2) ingest.rb -> ingests the sample seed data from the test_data folder into the server.

From the scripts folder, run:

ruby ingest.rb

3) query.rb -> exercises every type of REST call exposed by the application
(run ingest.rb with the provided seed data before this).

From the scripts folder, run:

ruby query.rb

Read the comments in this file for more details.
It issues multiple HTTP REST queries to verify the ingested data and the correct working of the application.


@note: the application must be running while executing ingest.rb and query.rb.

Implementation
--------------

1) total number of trips -> uses an AtomicLong to track the total count across multiple threads
2) total number of trips within a date range -> uses a ConcurrentHashMap to store the trip count at
   individual-day granularity (the time component is trimmed); summing between dates is quick ~ O(n)
   for n days (sketched below)
3) total number of trips in the last hour -> keeps the trips of the last hour in a sorted set;
   on every ingest older entries are removed, and retrieval takes a subset of the sorted set ~ O(1)
   (sketched below)
4) total number of clients -> tracks a unique concurrent hash set of client ids; the size of the set
   is the number of clients ~ O(1)
5) total number of clients within a date range -> tracks a unique concurrent hash set per date,
   computes the union across the date range, and returns the size of that set ~ O(n) for n days
6) total miles per client -> keeps a running sum of miles per unique client ~ O(1)
7) total miles per client by date range -> keeps a running sum of miles per unique client per day
   and aggregates across the date range ~ O(n) for n days
8) average fare for a specific city -> the overall space is broken into an 18 * 36 grid, which yields
   648 "cities" in total, following the natural order of latitude (-90, +90) [18 blocks of 10 degrees
   each] and longitude (-180, +180) [36 blocks of 10 degrees each]. The city is computed from the
   latitude and longitude, and the average fare is tracked per city ~ O(1) lookup (sketched below)
9) average fare for a specific city by date range -> same as above, but the average is tracked per city
   per day and aggregated on request ~ O(n) for n days
10) median rating for a driver -> the ratings are stored in a balanced pair of min and max heaps;
    the median is read in O(1) by looking at the heads of the heaps (sketched below)
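
A minimal sketch of the per-day counters behind item 2, assuming LongAdder values inside a
ConcurrentHashMap; the class and method names here are illustrative, not the project's actual ones.

    import java.time.LocalDate;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Illustrative sketch, not the actual class used in the project.
    class DailyTripCounter {
        // One counter per day; the time component of the trip timestamp is trimmed away.
        private final ConcurrentHashMap<LocalDate, LongAdder> countsByDay = new ConcurrentHashMap<>();

        void recordTrip(LocalDate day) {
            countsByDay.computeIfAbsent(day, d -> new LongAdder()).increment();
        }

        // O(n) for n days in the range, matching the complexity noted above.
        long tripsBetween(LocalDate from, LocalDate to) {
            long total = 0;
            for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
                LongAdder c = countsByDay.get(d);
                if (c != null) total += c.sum();
            }
            return total;
        }
    }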
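
A sketch of the "trips in the last hour" structure from item 3, assuming a concurrent sorted map
keyed by ingest timestamp (the names are illustrative):

    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.LongAdder;

    // Illustrative sketch of the sorted structure for trips in the last hour.
    class LastHourTrips {
        private static final long HOUR_MS = 60 * 60 * 1000L;
        // Sorted by timestamp so stale entries can be pruned cheaply.
        private final ConcurrentSkipListMap<Long, LongAdder> byTimestamp = new ConcurrentSkipListMap<>();

        void recordTrip(long timestampMs) {
            byTimestamp.computeIfAbsent(timestampMs, t -> new LongAdder()).increment();
            // On every ingest, drop entries older than one hour, as described above.
            byTimestamp.headMap(timestampMs - HOUR_MS).clear();
        }

        long tripsInLastHour(long nowMs) {
            long total = 0;
            for (LongAdder c : byTimestamp.tailMap(nowMs - HOUR_MS).values()) total += c.sum();
            return total;
        }
    }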
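
A sketch of the grid mapping in item 8; the 10-degree block sizes follow the description above,
while the function name and the row-major cell numbering are assumptions:

    // Illustrative mapping from (lat, lon) to one of the 18 * 36 = 648 grid "cities".
    static int cityOf(double latitude, double longitude) {
        int latBlock = (int) Math.floor((latitude + 90.0) / 10.0);   // 0..17
        int lonBlock = (int) Math.floor((longitude + 180.0) / 10.0); // 0..35
        // Clamp the edge cases lat = +90 and lon = +180 into the last block.
        latBlock = Math.min(latBlock, 17);
        lonBlock = Math.min(lonBlock, 35);
        return latBlock * 36 + lonBlock; // 0..647
    }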
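
A sketch of the two-heap median tracker in item 10, using the standard max-heap/min-heap split
(again, names and synchronization strategy are illustrative):

    import java.util.PriorityQueue;

    // Illustrative two-heap median tracker for a driver's ratings.
    class MedianRating {
        // lower holds the smaller half (max-heap), upper the larger half (min-heap).
        private final PriorityQueue<Double> lower = new PriorityQueue<>((a, b) -> Double.compare(b, a));
        private final PriorityQueue<Double> upper = new PriorityQueue<>();

        synchronized void add(double rating) {
            if (lower.isEmpty() || rating <= lower.peek()) lower.offer(rating);
            else upper.offer(rating);
            // Rebalance so the heap sizes never differ by more than one.
            if (lower.size() > upper.size() + 1) upper.offer(lower.poll());
            else if (upper.size() > lower.size()) lower.offer(upper.poll());
        }

        // O(1): the median sits at the heads of the heaps (assumes at least one rating).
        synchronized double median() {
            if (lower.size() == upper.size()) return (lower.peek() + upper.peek()) / 2.0;
            return lower.peek();
        }
    }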


Future Work/Ideas
-----------------
The solution does not currently scale horizontally across machines, as it is limited to a single service.
This could be addressed by hashing the keyspace out to multiple machines.
For the sake of simplicity, all ingested data is stored in memory, assuming
infinite memory (which is obviously not a production design).
The right thing to do is to store all these hashes in a NoSQL database in sorted order (HBase-like).
That would allow easy date-range scans, and the queries could be optimized by distributing the keyspace
across a cluster of machines.
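
As a rough illustration of the keyspace-hashing idea (nothing like this exists in the code today;
shardFor and numNodes are hypothetical), a key could be routed to a node like so:

    // Hypothetical: route a key to one of numNodes analytics nodes.
    static int shardFor(String clientId, int numNodes) {
        // Math.floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(clientId.hashCode(), numNodes);
    }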
