Insight Data Engineering Coding Challenge
A task to find trends in H1B visa application processing over the years. It analyzes past years' data and computes two metrics for certified visa applications: Top 10 Occupations and Top 10 States.
The data is downloaded from the Office of Foreign Labor Certification Performance Data and converted into a semicolon (;) separated CSV file.
The task reads the data supplied as h1b_input.csv and uses simple dictionaries for book-keeping.
Language - Python3
The program reads the semicolon-separated CSV file and identifies only the essential columns. Since each year's file can use different column headers, the headers are first read and generalized by matching keywords found in them. The required columns are:
- CASE_NUMBER - Uniquely identifies the visa application. The file can contain duplicate case numbers; these rows are skipped by keeping a set of the unique case numbers already encountered.
- STATUS - Status of the visa application; only CERTIFIED applications are counted.
- SOC_NAME - Standard Occupational Classification name that denotes the occupation of the visa applicant.
- WORKSITE_STATE - State in which the applicant intends to work.
If a required header occurs in multiple columns (for example, both WORKSITE_STATE1 and WORKSITE_STATE2 may exist), the first instance is chosen.
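The header generalization described above can be sketched as keyword matching over the header row; the keyword sets and function names here are illustrative assumptions, not necessarily what the script uses:

```python
# Map each required field to keywords that may appear in a year's header.
# These keyword choices are assumptions for illustration.
HEADER_KEYWORDS = {
    "case_number": ("case_number", "case_no"),
    "status": ("status",),
    "soc_name": ("soc_name",),
    "worksite_state": ("worksite_state", "work_state"),
}

def find_columns(header_row):
    """Map each required field to the index of the FIRST matching column."""
    indices = {}
    for i, name in enumerate(header_row):
        lowered = name.lower()
        for field, keywords in HEADER_KEYWORDS.items():
            if field not in indices and any(k in lowered for k in keywords):
                indices[field] = i
    return indices

header = "CASE_NUMBER;CASE_STATUS;SOC_NAME;WORKSITE_STATE_1;WORKSITE_STATE_2".split(";")
columns = find_columns(header)
# WORKSITE_STATE_1 (index 3) wins over WORKSITE_STATE_2 because it comes first.
```

Keeping only the first match implements the "first instance is chosen" rule for repeated headers such as WORKSITE_STATE1/WORKSITE_STATE2.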
Once the required columns are identified, the script parses each row of the file to extract data. For every certified application, the worksite state is counted in a dictionary: the state becomes the key and its number of occurrences the value. An occupation dictionary is built the same way from SOC_NAME. Both dictionaries are then sorted, and the top 10 occupations and top 10 states are extracted.
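The counting and ranking steps above can be sketched as follows (function and variable names are illustrative, not the script's actual ones; ties are broken alphabetically, matching the sample output below):

```python
def analyze(rows, idx):
    """Count certified applications per state and per occupation.

    `rows` is an iterable of parsed CSV rows; `idx` maps each required
    field to its column index, as identified from the header.
    """
    seen = set()          # unique CASE_NUMBERs; duplicate rows are skipped
    states, occupations = {}, {}
    for row in rows:
        case = row[idx["case_number"]]
        if case in seen:
            continue
        seen.add(case)
        if row[idx["status"]].upper() != "CERTIFIED":
            continue
        state = row[idx["worksite_state"]]
        soc = row[idx["soc_name"]]
        states[state] = states.get(state, 0) + 1
        occupations[soc] = occupations.get(soc, 0) + 1
    return states, occupations

def top_10(counts):
    # Sort by descending count, then alphabetically, and keep the first 10.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:10]

idx = {"case_number": 0, "status": 1, "soc_name": 2, "worksite_state": 3}
rows = [
    ["A-1", "CERTIFIED", "SOFTWARE DEVELOPERS, APPLICATIONS", "CA"],
    ["A-1", "CERTIFIED", "SOFTWARE DEVELOPERS, APPLICATIONS", "CA"],  # duplicate
    ["A-2", "DENIED", "ACCOUNTANTS AND AUDITORS", "NY"],
    ["A-3", "CERTIFIED", "ACCOUNTANTS AND AUDITORS", "FL"],
]
states, occupations = analyze(rows, idx)
```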
The directory structure of the repo looks like:
├── README.md
├── run.sh
├── src
│   └── h1b_analytics.py
├── input
│   └── h1b_input.csv
├── output
│   ├── top_10_occupations.txt
│   └── top_10_states.txt
└── insight_testsuite
    ├── run_tests.sh
    └── tests
        ├── test_1
        │   ├── input
        │   │   └── h1b_input.csv
        │   └── output
        │       ├── top_10_occupations.txt
        │       └── top_10_states.txt
        └── test_2
            ├── input
            │   └── h1b_input.csv
            └── output
                ├── top_10_occupations.txt
                └── top_10_states.txt
The top 10 occupations and top 10 states of certified visa applications are written to top_10_occupations.txt and top_10_states.txt respectively.
The format of the result files is:
./output/top_10_occupations.txt:
TOP_OCCUPATIONS;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE
SOFTWARE DEVELOPERS, APPLICATIONS;6;60.0%
ACCOUNTANTS AND AUDITORS;1;10.0%
COMPUTER OCCUPATIONS, ALL OTHER;1;10.0%
COMPUTER SYSTEMS ANALYST;1;10.0%
DATABASE ADMINISTRATORS;1;10.0%
./output/top_10_states.txt:
TOP_STATES;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE
FL;2;20.0%
AL;1;10.0%
CA;1;10.0%
DE;1;10.0%
GA;1;10.0%
MD;1;10.0%
NJ;1;10.0%
TX;1;10.0%
WA;1;10.0%
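The PERCENTAGE column is each entry's count divided by the total number of certified applications, rounded to one decimal place. A sketch of producing lines in the format above (the helper name `format_lines` is hypothetical):

```python
def format_lines(header, top, total_certified):
    """Render a header line plus NAME;COUNT;PCT% lines, as in the sample output."""
    lines = [header]
    for name, count in top:
        pct = round(100.0 * count / total_certified, 1)
        lines.append(f"{name};{count};{pct}%")
    return lines

lines = format_lines(
    "TOP_STATES;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE",
    [("FL", 2), ("AL", 1)],
    10,  # total certified applications
)
```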
To run h1b_analytics.py, invoke the run.sh file, with h1b_input.csv placed inside the input folder (as shown in the directory structure above).
run.sh contains the command to run the Python script with the input and output file paths as arguments:
python3 ./src/h1b_analytics.py ./input/h1b_input.csv ./output/top_10_occupations.txt ./output/top_10_states.txt
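Inside the script, those three paths would typically be read from sys.argv; a minimal sketch under that assumption (the actual argument handling in h1b_analytics.py may differ):

```python
import sys

def parse_args(argv):
    """Expect exactly three paths: input CSV, occupations output, states output."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: h1b_analytics.py <input.csv> <occupations.txt> <states.txt>"
        )
    return argv[1], argv[2], argv[3]

# In the real script this would be parse_args(sys.argv); shown here with
# the same paths run.sh passes.
input_path, occ_path, state_path = parse_args(
    ["h1b_analytics.py", "./input/h1b_input.csv",
     "./output/top_10_occupations.txt", "./output/top_10_states.txt"]
)
```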