The purpose of the project is to create an analytical database for statistical analysis of immigration data. The data can be sourced from the National Travel and Tourism Office (NTTO) website. It is collected under the International Visitor Arrivals Program, which provides the U.S. government and the public with the official monthly and annual counts of overseas visitor arrivals to the United States. NTTO manages the program in cooperation with the Department of Homeland Security, Customs and Border Protection (CBP). Apart from the U.S. government, the summarized data can be used by airlines, airports, U.S. destinations (states and cities), major attractions, state governments, and consultants.
Once the initial data is transformed into the data model described in the sections below, it will be possible to aggregate the immigration data by various fields and present it to all interested parties. Potential aggregation fields include:
- port of entry to the U.S.
- mode of arrival/transportation
- the first destination in the U.S.
- visa type
- visitors' age
Dataset | Udacity | Additional Data from Internet |
---|---|---|
Non-resident arrivals to the U.S. | yes | |
Airports and ports | yes | https://datahub.io/core/airport-codes#data |
Country and region of residence | yes | https://travel.trade.gov/research/programs/i94/1999-2018%20Region%20Dictionary.xlsx |
Arrival mode | yes | |
U.S. destination (state only) | yes | |
Visa category | yes | |
Visa type | no | https://www.uscis.gov/ilink/docView/AFM/HTML/AFM/0-0-0-1/0-0-0-27276/0-0-0-32908.html <br> https://travel.trade.gov/research/programs/i94/methodology.asp |
- Ports of Entry: the port data provided by Udacity was combined with the supplementary airport-codes data programmatically in the ETL code.
- Country and region of residence: region was added to the country data provided by Udacity.
- Visa type: only the data from 2 online data sources was used.
- The database is normalized to the third normal form.
- The database is represented by a star schema:
- i94records - Non-Immigrant I-94 Arrival/Departure Records as of 2016
- country - visitor's country of residence
- arrival_port - airport/port of arrival
- arrival_mode - means of transportation (air/land/sea)
- first_intend_addr - destination US state
- visa_category - general visa category (business/pleasure/student)
- visa_type - specific visa types and supporting sections of law
- dates - date definitions
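As an example of the kind of query this model is meant to support, the sketch below counts arrivals by mode of transportation and visa category. The column names (`arrival_mode_id`, `mode`, `visa_category_id`, `category`) and the `output_data/...` paths are assumptions for illustration, not the exact schema produced by etl.py.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("i94-aggregation-example").getOrCreate()

# Register the JSON output of the ETL as temporary views (paths are placeholders).
spark.read.json("output_data/i94records").createOrReplaceTempView("i94records")
spark.read.json("output_data/arrival_mode").createOrReplaceTempView("arrival_mode")
spark.read.json("output_data/visa_category").createOrReplaceTempView("visa_category")

# Count arrivals by mode of transportation and visa category.
spark.sql("""
    SELECT am.mode     AS arrival_mode,
           vc.category AS visa_category,
           COUNT(*)    AS arrivals
    FROM i94records r
    JOIN arrival_mode  am ON r.arrival_mode_id  = am.arrival_mode_id
    JOIN visa_category vc ON r.visa_category_id = vc.visa_category_id
    GROUP BY am.mode, vc.category
    ORDER BY arrivals DESC
""").show()
```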
Reading, processing and saving data is done by Apache Spark via the PySpark library. Spark was selected as the tool for this project due to the following features (a short usage sketch follows the list):
- quick installation
- fairly easy to use
- low-latency in-memory data processing
- highly scalable
- support for multiple programming languages, including Python, which was used for this project
- ability to incorporate SQL directly into the code
- support for functional programming
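A minimal sketch of what the Spark entry point might look like. The `saurfang:spark-sas7bdat` package is an assumption about how the SAS files are read, and the file paths are placeholders rather than the actual ones in etl.py.

```python
from pyspark.sql import SparkSession

# Build a Spark session; the spark-sas7bdat package (an assumption) lets Spark
# read the monthly I-94 files, which are delivered in SAS format.
spark = (
    SparkSession.builder
    .appName("i94-etl")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .getOrCreate()
)

# Dimension inputs are plain CSV files (placeholder path).
country = spark.read.csv("dim_data/country.csv", header=True)

# Fact input: one month of I-94 records in SAS format (placeholder path).
i94 = spark.read.format("com.github.saurfang.sas.spark").load(
    "fact_data/i94_apr16_sub.sas7bdat"
)

# SQL can be mixed directly into the Python code.
i94.createOrReplaceTempView("i94_raw")
spark.sql("SELECT COUNT(*) AS input_rows FROM i94_raw").show()
```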
- Read data for all dimension tables except the dates table from the corresponding CSV files.
- Merge the two datasets for ports of entry via an outer join on port_id. Pick a non-null port location from either source; if the location in one source is "No Port" or "Collapsed", take the valid port location from the other source (see the sketch after this list).
- Clean up dimension tables:
- remove duplicate rows
- check primary key column for NULL values
- remove records with primary key = NULL if any
- Check uniqueness of primary key by counting total and distinct key values
- Save dimension tables as json files.
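A hedged sketch of the ports-of-entry merge and cleanup described above. The column names (`port_id`, `location`), file names and output path are assumptions for illustration, not the actual ones used in etl.py.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Two ports-of-entry sources: Udacity data and the datahub.io airport codes
# (file and column names are assumed).
ports_udacity = spark.read.csv("dim_data/ports_udacity.csv", header=True)
ports_extra = spark.read.csv("dim_data/airport_codes.csv", header=True)

# Outer join on port_id and prefer a valid location: if one source has
# "No Port" or "Collapsed", fall back to the other source's value.
arrival_port = (
    ports_udacity.alias("u")
    .join(ports_extra.alias("e"), on="port_id", how="outer")
    .withColumn(
        "port_location",
        F.when(
            F.col("u.location").isNull()
            | F.col("u.location").rlike("No Port|Collapsed"),
            F.col("e.location"),
        ).otherwise(F.col("u.location")),
    )
    .select("port_id", "port_location")
)

# Clean up: drop duplicates and rows with a NULL primary key.
arrival_port = arrival_port.dropDuplicates().filter(F.col("port_id").isNotNull())

# Uniqueness check: total row count must equal distinct key count.
assert arrival_port.count() == arrival_port.select("port_id").distinct().count()

# Save the dimension table as JSON.
arrival_port.write.mode("overwrite").json("output_data/arrival_port")
```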
1. Read monthly data files in SAS format. For each input file, perform the following steps.
2. Count the number of rows to be compared to the number of output rows at the end.
3. Clean up the primary key:
    - remove duplicate rows
    - check the primary key column for NULL values
    - remove records with primary key = NULL, if any
4. Clean up foreign keys (a code sketch of steps 4-7 follows this list). For each foreign key:
    - count NULL values and save this count to be compared to the same count after the data is transformed
    - replace NULL values with the default value shown in the data dictionary
    - check correctness of values
        - for non-date columns:
            - perform a left join with the corresponding dimension table
            - any values that were not found in the dimension table are considered incorrect and should be replaced by a default value
        - for date columns:
            - compare the date against an upper limit: the arrival date should be before today's date and the departure date should be before 1/1/2060 (an arbitrary date in the future)
            - the arrival date cannot be NULL, while the departure date can be
5. Fix the incorrect values in foreign key columns found in step 4 by replacing them with the default values outlined in the data dictionary.
6. Convert the arrival and departure date columns from SAS format to regular date format (date = 1/1/1960 + SAS date).
7. Drop the original foreign key columns, since checking correctness, fixing incorrect values and converting dates to a new format resulted in new foreign key columns.
8. Validate the foreign key columns after the data is processed:
    - count NULL values again
    - only the departure date column can contain NULL values
9. Create the date dimension table by selecting distinct arrival dates from the fact table:
    - add year, month and day of the month columns by extracting them from the date column
    - save the date table as a JSON file
10. Count the number of rows to be compared to the original number of input rows as a completeness test.
11. Save the fact table as a JSON file.
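The foreign-key check and the SAS date conversion from steps 4-7 might look roughly like the sketch below. The `i94mode`, `arrdate` and `depdate` column names and the default value 9 are assumptions based on the I-94 data dictionary, and the input paths are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Inputs (placeholder paths): one month of fact data and one dimension table.
i94 = spark.read.json("fact_data/i94_onefile")
arrival_mode = spark.read.json("output_data/arrival_mode")

# Steps 4-5: left join with the dimension table; any foreign key value with no
# match is considered incorrect and replaced by a default (9 = "Not reported"
# is an assumed default from the data dictionary).
DEFAULT_MODE = 9
mode_dim = arrival_mode.select(F.col("arrival_mode_id").alias("dim_mode_id"))
i94 = (
    i94.join(mode_dim, i94["i94mode"] == mode_dim["dim_mode_id"], how="left")
    .withColumn("arrival_mode_id", F.coalesce(F.col("dim_mode_id"), F.lit(DEFAULT_MODE)))
    .drop("dim_mode_id", "i94mode")  # step 7: drop the original foreign key column
)

# Step 6: SAS dates are day counts from 1960-01-01, so date = 1960-01-01 + SAS value.
i94 = (
    i94.withColumn(
        "arrival_date",
        F.expr("date_add(to_date('1960-01-01'), CAST(arrdate AS INT))"),
    )
    .withColumn(
        "departure_date",
        F.expr("date_add(to_date('1960-01-01'), CAST(depdate AS INT))"),
    )
    .drop("arrdate", "depdate")
)

# Step 4 date limits: count violations of the upper bounds before fixing them.
bad_arrivals = i94.filter(F.col("arrival_date") >= F.current_date()).count()
bad_departures = i94.filter(
    F.col("departure_date") >= F.lit("2060-01-01").cast("date")
).count()
```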
- Dimension tables:
- duplicate rows
- primary key: non-null and uniqueness conditions
- Fact table:
- completeness check: row counts before and after processing must match
- primary key: non-null condition
- foreign keys:
- non-null if applicable
- consistency against referenced dimension table (incorrect values are replaced with a default value)
- dates are within meaningful limits
- dates are converted from SAS to regular format
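These checks could be wrapped in small PySpark helper functions, for instance as below; the function names are illustrative, not the ones used in etl.py.

```python
from pyspark.sql import functions as F

def check_primary_key(df, key_col):
    """Dimension/fact check: the key column must have no NULLs and no duplicates."""
    null_count = df.filter(F.col(key_col).isNull()).count()
    total = df.count()
    distinct = df.select(key_col).distinct().count()
    assert null_count == 0, f"{key_col}: {null_count} NULL values"
    assert total == distinct, f"{key_col}: {total - distinct} duplicate key values"

def check_completeness(rows_in, df_out):
    """Fact check: the row count must not change between input and output."""
    rows_out = df_out.count()
    assert rows_in == rows_out, f"row count changed: {rows_in} in, {rows_out} out"

# Example usage (names taken from the earlier sketches):
# check_primary_key(arrival_port, "port_id")
# check_completeness(input_row_count, i94)
```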
- If the data was increased by 100x: Spark should be able to handle the increased data size, and the output data should be saved in Redshift or another cloud database.
- If the pipelines were run on a daily basis by 7am: Airflow can be used to schedule the pipeline and enforce a Service Level Agreement (SLA) for the data delivery time. Each pipeline function call can become an Airflow operator, and the sequence of function calls can be converted into a DAG (see the sketch after this list).
- If the database needed to be accessed by 100+ people: the output data can be saved in Redshift or another database that controls data access.
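A rough sketch of how the ETL could be turned into an Airflow DAG with a 7am delivery target; `process_dimension_tables` and `process_fact_tables` are hypothetical names standing in for the existing pipeline functions in etl.py.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical wrappers around the existing pipeline functions in etl.py.
from etl import process_dimension_tables, process_fact_tables

default_args = {
    "owner": "i94",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # Tasks should finish within 2 hours of the 5am run, i.e. by 7am.
    "sla": timedelta(hours=2),
}

with DAG(
    dag_id="i94_daily_etl",
    default_args=default_args,
    start_date=datetime(2016, 1, 1),
    schedule_interval="0 5 * * *",  # run daily at 5am to leave room for the 7am SLA
    catchup=False,
) as dag:
    dims = PythonOperator(
        task_id="process_dimension_tables",
        python_callable=process_dimension_tables,
    )
    facts = PythonOperator(
        task_id="process_fact_tables",
        python_callable=process_fact_tables,
    )
    dims >> facts  # dimensions must exist before the fact table is validated
```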
File Name | Description |
---|---|
etl.py | complete ETL process |
dl.cfg | config file with AWS credentials |
README.md | |
star_schema.jpg | image for README file |
data_dict.jpg | image for README file |
dim_data folder | input data for all dimension tables |
fact_data folder | input data for fact tables (TEST only) |
output_data folder | output data for all tables (TEST only) |
- Add AWS credentials to dl.cfg (see the sketch after the command list).
- Specify the S3 bucket and S3 folder in main() to save output data to S3 when using the 'onefile' or 'allfiles' parameter.
- Command line:
  - `python ./etl.py test` - to run using a fact table with 1,000 rows
  - `python ./etl.py onefile` - to run using a fact table with ~4 million rows
  - `python ./etl.py allfiles` - to run using 12 fact tables with ~2-4 million rows in each file
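For reference, a minimal sketch of how etl.py might read dl.cfg; the section and key names below are assumptions about the config layout rather than the verified contents of the file.

```python
import configparser
import os

# Read AWS credentials from dl.cfg (section/key names are assumed, not verified).
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```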