Insight Data Engineering Coding Challenge
A task to find trends in H1B visa application processing over the years. It analyzes past years' data and computes two metrics for certified visa applications: Top 10 Occupations and Top 10 States.
The data is downloaded from the Office of Foreign Labor Certification Performance Data and converted into a semicolon (;) separated CSV file.
The task reads the data supplied as h1b_input.csv and uses simple dictionaries for book-keeping.
Language - Python3
The program reads the semicolon-separated CSV file and identifies only the essential columns. Since each year's file can use different column headers, the headers are first read and generalized by matching keywords found in them. The required columns are:
- CASE_NUMBER - Uniquely identifies the visa application. The file can contain duplicate case numbers; these rows are skipped by keeping a set of the unique case numbers already encountered.
- STATUS - Status of the visa application; only CERTIFIED applications are counted.
- SOC_NAME - Standard Occupational Classification name that denotes the occupation of the visa applicant.
- WORKSITE_STATE - State in which the applicant intends to work.
If a required header occurs in multiple columns (for example, both WORKSITE_STATE1 and WORKSITE_STATE2 may exist), the first instance is chosen.
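The header generalization described above can be sketched as keyword matching over the header row; the keyword sets and function names here are illustrative assumptions, not necessarily what the script uses:

```python
# Map each required field to keywords that may appear in a year's header.
# These keyword choices are assumptions for illustration.
HEADER_KEYWORDS = {
    "case_number": ("case_number", "case_no"),
    "status": ("status",),
    "soc_name": ("soc_name",),
    "worksite_state": ("worksite_state", "work_state"),
}

def find_columns(header_row):
    """Map each required field to the index of the FIRST matching column."""
    indices = {}
    for i, name in enumerate(header_row):
        lowered = name.lower()
        for field, keywords in HEADER_KEYWORDS.items():
            if field not in indices and any(k in lowered for k in keywords):
                indices[field] = i
    return indices

header = "CASE_NUMBER;CASE_STATUS;SOC_NAME;WORKSITE_STATE_1;WORKSITE_STATE_2".split(";")
columns = find_columns(header)
# WORKSITE_STATE_1 (index 3) wins over WORKSITE_STATE_2 because it comes first.
```

Keeping only the first match implements the "first instance is chosen" rule for repeated headers such as WORKSITE_STATE1/WORKSITE_STATE2.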
Once the required columns are identified, the script parses each row of the file to extract data. For every certified application, the worksite state is counted in a dictionary: the state becomes the key and its number of occurrences the value. An occupation dictionary is built the same way from SOC_NAME. Both dictionaries are then sorted, and the top 10 occupations and top 10 states are extracted.
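The counting and ranking steps above can be sketched as follows (function and variable names are illustrative, not the script's actual ones; ties are broken alphabetically, matching the sample output below):

```python
def analyze(rows, idx):
    """Count certified applications per state and per occupation.

    `rows` is an iterable of parsed CSV rows; `idx` maps each required
    field to its column index, as identified from the header.
    """
    seen = set()          # unique CASE_NUMBERs; duplicate rows are skipped
    states, occupations = {}, {}
    for row in rows:
        case = row[idx["case_number"]]
        if case in seen:
            continue
        seen.add(case)
        if row[idx["status"]].upper() != "CERTIFIED":
            continue
        state = row[idx["worksite_state"]]
        soc = row[idx["soc_name"]]
        states[state] = states.get(state, 0) + 1
        occupations[soc] = occupations.get(soc, 0) + 1
    return states, occupations

def top_10(counts):
    # Sort by descending count, then alphabetically, and keep the first 10.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:10]

idx = {"case_number": 0, "status": 1, "soc_name": 2, "worksite_state": 3}
rows = [
    ["A-1", "CERTIFIED", "SOFTWARE DEVELOPERS, APPLICATIONS", "CA"],
    ["A-1", "CERTIFIED", "SOFTWARE DEVELOPERS, APPLICATIONS", "CA"],  # duplicate
    ["A-2", "DENIED", "ACCOUNTANTS AND AUDITORS", "NY"],
    ["A-3", "CERTIFIED", "ACCOUNTANTS AND AUDITORS", "FL"],
]
states, occupations = analyze(rows, idx)
```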
The directory structure of the repo looks like:
├── README.md
├── run.sh
├── src
│   └── h1b_analytics.py
├── input
│   └── h1b_input.csv
├── output
│   ├── top_10_occupations.txt
│   └── top_10_states.txt
└── insight_testsuite
    ├── run_tests.sh
    └── tests
        ├── test_1
        │   ├── input
        │   │   └── h1b_input.csv
        │   └── output
        │       ├── top_10_occupations.txt
        │       └── top_10_states.txt
        └── test_2
            ├── input
            │   └── h1b_input.csv
            └── output
                ├── top_10_occupations.txt
                └── top_10_states.txt
The top 10 occupations and top 10 states of certified visa applications are written to top_10_occupations.txt and top_10_states.txt respectively.
The format of the result files is:
./output/top_10_occupations.txt:
TOP_OCCUPATIONS;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE
SOFTWARE DEVELOPERS, APPLICATIONS;6;60.0%
ACCOUNTANTS AND AUDITORS;1;10.0%
COMPUTER OCCUPATIONS, ALL OTHER;1;10.0%
COMPUTER SYSTEMS ANALYST;1;10.0%
DATABASE ADMINISTRATORS;1;10.0%
./output/top_10_states.txt:
TOP_STATES;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE
FL;2;20.0%
AL;1;10.0%
CA;1;10.0%
DE;1;10.0%
GA;1;10.0%
MD;1;10.0%
NJ;1;10.0%
TX;1;10.0%
WA;1;10.0%
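The PERCENTAGE column is each entry's count divided by the total number of certified applications, rounded to one decimal place. A sketch of producing lines in the format above (the helper name `format_lines` is hypothetical):

```python
def format_lines(header, top, total_certified):
    """Render a header line plus NAME;COUNT;PCT% lines, as in the sample output."""
    lines = [header]
    for name, count in top:
        pct = round(100.0 * count / total_certified, 1)
        lines.append(f"{name};{count};{pct}%")
    return lines

lines = format_lines(
    "TOP_STATES;NUMBER_CERTIFIED_APPLICATIONS;PERCENTAGE",
    [("FL", 2), ("AL", 1)],
    10,  # total certified applications
)
```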
To run h1b_analytics.py, invoke the run.sh file, with h1b_input.csv placed inside the input folder (as shown in the directory structure above).
run.sh contains the command to run the Python script with the input and output file paths as arguments:
python3 ./src/h1b_analytics.py ./input/h1b_input.csv ./output/top_10_occupations.txt ./output/top_10_states.txt
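Inside the script, those three paths would typically be read from sys.argv; a minimal sketch under that assumption (the actual argument handling in h1b_analytics.py may differ):

```python
import sys

def parse_args(argv):
    """Expect exactly three paths: input CSV, occupations output, states output."""
    if len(argv) != 4:
        raise SystemExit(
            "usage: h1b_analytics.py <input.csv> <occupations.txt> <states.txt>"
        )
    return argv[1], argv[2], argv[3]

# In the real script this would be parse_args(sys.argv); shown here with
# the same paths run.sh passes.
input_path, occ_path, state_path = parse_args(
    ["h1b_analytics.py", "./input/h1b_input.csv",
     "./output/top_10_occupations.txt", "./output/top_10_states.txt"]
)
```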