Adil Gazder, Diego Rodriguez, Meron Gedrago, Yirang Liu
Submitted as part of the final project for IDS 706 (Data Engineering), Fall 2024, Duke University.
This link contains a walkthrough of the application and an explanation of its main data engineering features.
Overview: This project leverages Natural Language Processing (NLP) techniques to analyze and classify job descriptions from a comprehensive dataset of over 124,000 LinkedIn job postings spanning 2023-2024. By employing document classification algorithms such as Latent Dirichlet Allocation (LDA) and pre-trained language models, we aim to identify key role-specific requirements, including skills, experience, and industry focus. The project’s ultimate goal is to provide actionable insights into the evolving demands of various roles within the U.S. labor market, highlighting trends across job titles, industries, and locations.
Problem Statement: Current job seekers often face challenges in understanding and categorizing the critical attributes required in the modern U.S. job market. These include essential skills and industry-specific requirements. This lack of clarity can hinder effective job searches and career planning, necessitating a data-driven approach to elucidate these attributes.
Proposed Solution: We implemented an NLP framework to analyze and classify job descriptions using the following methodologies:
- Document Classification Algorithms: Utilize Latent Dirichlet Allocation (LDA) to uncover latent topics within job descriptions (a minimal sketch follows this list).
- Pre-trained Language Models: Leverage advanced NLP models (Google Gemini) to extract and categorize critical job attributes with high accuracy (see the second sketch below).
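To illustrate the topic-modeling step, here is a minimal LDA sketch using scikit-learn. The file name, the `description` column, the number of topics, and the vectorizer settings are illustrative assumptions, not the project's exact configuration.

```python
# Minimal LDA topic-modeling sketch (assumed: a CSV with a "description"
# column and 10 topics; not the project's exact settings).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

postings = pd.read_csv("linkedin_postings.csv")          # hypothetical file name
docs = postings["description"].dropna().astype(str)

# Bag-of-words representation, dropping very rare and very common terms
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
doc_term = vectorizer.fit_transform(docs)

# Fit LDA to uncover latent topics in the job descriptions
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(doc_term)

# Print the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {idx}: {', '.join(top_terms)}")
```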
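The second sketch shows how a pre-trained model such as Google Gemini can be prompted to extract job attributes. The prompt wording, model name, and output format are assumptions for illustration, not the project's production code.

```python
# Sketch of attribute extraction with the Google Gemini API via the
# google-generativeai package (prompt and model name are illustrative).
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def extract_attributes(job_description: str) -> str:
    """Ask Gemini to pull skills, experience, and industry from a posting."""
    prompt = (
        "Extract the required skills, years of experience, and industry "
        "from the following job description. Return the answer as JSON.\n\n"
        + job_description
    )
    response = model.generate_content(prompt)
    return response.text

if __name__ == "__main__":
    sample = "Seeking a data engineer with 3+ years of Python and AWS experience."
    print(extract_attributes(sample))
```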
Languages: Python, HTML
Infrastructure: Makefile, Docker, YAML, AWS CLI
The deployed website (using AWS App Runner) can be accessed here: https://jdwhnfkggk.us-west-2.awsapprunner.com/
We also tested the website's performance using Locust and achieved 564 requests per second (RPS) with our deployment. This ceiling was attributed to the Google Gemini API's rate limits and App Runner's response limitations.
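As a rough illustration, a Locust load test for the deployed endpoint can be defined as shown below; the route path and wait times are assumptions, not the exact test we ran.

```python
# Minimal Locust load-test sketch (path and wait times are illustrative;
# run with: locust -f locustfile.py --host <App Runner URL>).
from locust import HttpUser, task, between

class JobAppUser(HttpUser):
    # Simulated users pause 1-3 seconds between requests
    wait_time = between(1, 3)

    @task
    def load_home_page(self):
        # Hit the landing page of the deployed App Runner service
        self.client.get("/")
```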
Since multiple language models are running alongside API calls to Google Gemini, load times may be longer than usual as the document size increases. Additionally, because we are using an AWS resource, memory capacity can occasionally be consumed rapidly.