Welcome to DataHacks 2020! Out of hundreds of applicants, you’ve been selected because you display true potential for solving complex problems and exude a passion for comprehending and transforming data. Let’s begin the Hackathon!
- Anaconda: Python 3.7, Graphical Installer recommended but not required
- Slack: desktop app for Mac or Windows
- Live site: Schedule is listed here.
- Slack Channel: Our main tool of communication.
- Devpost: The website where you submit your report
- Visual Studio Code: a text editor (one option among many)
- GitHub Desktop: If you don’t know GitHub or don’t have a GitHub account, please look at this post first
- Each team consists of up to THREE people (≤ 3)
- For the beginner track, ONLY beginners (students who have not taken DSC 80 or any CSE/COGS/DSC upper-div class) can form a team
- Each team picks one track to work on
- You have 24 hours until Sunday noon to work on your dataset
- Follow the prompt/README file for each track
- Prepare a report of reasonable length that covers all of your findings
- Zip your report (PDF) and code, and submit them as a group to Devpost (link above; come up with an appropriate team name!)
- Judges will read through your reports and pick the top three teams from each track
- The nine selected teams will go on stage and present their findings (maximum five minutes per team)
- Judges will announce one winner per track based on the presentations
This is a beginner-friendly track! We will give you a dataset that contains San Diego housing information from the 1970s. You may work on problems such as the relationship between housing quality and ethnic groups/genders, predictions of housing prices based on given conditions, etc. Don't worry if you don't have any knowledge of data science (including Python, pandas, EDA, machine learning...). We'll have a series of workshops to help you build your project!
Find more information here.
In this challenge, we’re interested in using data visualization and NLP (Natural Language Processing) to analyze chronic illnesses through accumulated survey data. The data was retrieved from the CDC website. It is real-world data and can be messy, so preprocessing may be required to extract trends and patterns. The end goal is to create a report that has at least 3 data visualizations and uses NLP to send an important message about a specific chronic illness to an audience. To follow good data science practice, make sure your report also describes what you did (cleaning, processing, any modeling, etc.) and that this is submitted as well. This prompt is relatively open-ended: other data may be incorporated as deemed necessary, and the message you decide to convey is up to you (however, do make sure to back it up with evidence and visuals).
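If you are unsure where to start with the NLP part, one simple first step is counting which terms appear most often after basic cleaning. The sketch below is only an illustration under stated assumptions: the `responses` list, the `STOP_WORDS` set, and the helper names `tokenize` and `term_counts` are all made up here, and the actual CDC data will have its own structure and wording.

```python
import re
from collections import Counter

# Hypothetical free-text responses, invented for this example.
# The real track data comes from the CDC and will look different.
responses = [
    "Diabetes makes daily exercise harder.",
    "I manage my diabetes with diet and exercise!",
    "  Asthma symptoms get worse in winter  ",
    "",  # messy data: an empty answer
]

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"i", "my", "with", "and", "in", "the", "a", "get", "makes"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def term_counts(texts):
    """Count how often each remaining token appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


counts = term_counts(responses)
print(counts.most_common(3))
```

Counts like these can feed directly into a bar chart, which would count toward the required visualizations. A real project would likely use a fuller stop-word list and possibly a dedicated NLP library.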
Find more information here.
Over the past decade, the transportation industry has become one of the most promising areas for careers in Data Science and/or Data Engineering. At UBER, the world’s most popular ride-share service, data scientists have access to billions of rows of data and are expected to showcase mastery over processing, visualizing, and analyzing the company’s data. In this track, you will have the opportunity to work with real-world UBER time-series data from the San Francisco area, spanning the first and second quarters of 2019. The time-series data is centered on travel times for UBER trips in the overall San Francisco area.
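A common first look at travel-time data is to group trips by hour of day and compare average travel times. The following is a minimal sketch under assumptions: the `trips` rows, their timestamp format, and the function name `mean_travel_time_by_hour` are invented for illustration, and the real dataset's columns and units may differ.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up rows for illustration only: (trip start timestamp, travel time in seconds).
# The real dataset's column names, format, and units may differ.
trips = [
    ("2019-01-07 08:15:00", 1260),
    ("2019-01-07 08:45:00", 1380),
    ("2019-01-07 14:05:00", 900),
    ("2019-04-02 08:30:00", 1320),
    ("2019-04-02 14:40:00", 960),
]


def mean_travel_time_by_hour(rows):
    """Group trips by hour of day and return {hour: mean travel time in seconds}."""
    by_hour = defaultdict(list)
    for timestamp, seconds in rows:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        by_hour[hour].append(seconds)
    # Sort by hour so the result reads in chronological order.
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


hourly = mean_travel_time_by_hour(trips)
for hour, avg in hourly.items():
    print(f"{hour:02d}:00  {avg:.0f} s")
```

The same grouping idea extends to day of week or quarter. With a dataset of realistic size, a library such as pandas would be a more practical choice than plain Python loops.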
Find more information here.