Challenge
The goal for today is somewhat open ended, but follows a logical sequence of stages that anyone in the data science community would recognize as the typical order of affairs in a new exploratory data analysis project.
The expectation is that participants will demonstrate not only their skill in data analysis and coding, but also their knowledge and ingenuity regarding the possible directions a project can take.
The following stages are to be completed in this competition. They follow a roughly progressive sequence, but are not necessarily chronological. It is up to each participant to assess whether sufficient work has been done in each area.
Stage 1: Data Extraction
The details of how the data are structured are provided on the Dataset wiki page. Even with the impressive computational power afforded by a dedicated HPC system, loading the entire dataset into memory is unlikely to be an option.
In this stage of the competition, participants are expected to filter and extract relevant portions of the dataset into smaller, logically composed subsets (e.g., by region, by variable, or by time). This stage will likely be revisited throughout the competition as you explore new avenues of inquiry.
Given the mere eight hours available, points will be awarded for practicality and creativity. Sequences of command-line tools used, or custom code developed for these purposes, should be documented appropriately.
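As a minimal sketch of what this extraction might look like in Python with pandas, streaming the file in chunks rather than loading it whole (the file name and column names here are assumptions; substitute the real schema from the Dataset wiki page):

```python
import pandas as pd

# Hypothetical layout: one large CSV of daily observations with columns
# like station_id, state, date, element, value. Adjust to the actual
# schema described on the Dataset wiki page.
SOURCE = "observations.csv"

# Read the file in million-row chunks and keep only Indiana daily
# maximum temperatures, writing a much smaller subset to disk.
chunks = pd.read_csv(SOURCE, chunksize=1_000_000, parse_dates=["date"])
subset = pd.concat(
    chunk[(chunk["state"] == "IN") & (chunk["element"] == "TMAX")]
    for chunk in chunks
)
subset.to_csv("indiana_tmax.csv", index=False)
```

Equivalent filtering with command-line tools such as awk or grep is just as valid; the point is that only the slice you actually need ever sits in memory at once.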
Stage 2: Data Quality
One of the more difficult aspects of data science is assessing the quality of the data available for a given purpose. The expectation is that participants will investigate the quality of the data present, either in full or in part.
Any specific line of inquiry (in visualization or in pursuing particular insights) is expected to have been assessed for missing data. To the degree that missing data are imputed, points will be awarded for reasoning about why a particular method is or is not appropriate.
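A sketch of how that assessment might begin, again assuming the hypothetical subset and column names from the extraction example above:

```python
import pandas as pd

df = pd.read_csv("indiana_tmax.csv", parse_dates=["date"])

# Quantify missingness before deciding whether imputation is defensible.
print(f"Missing values: {df['value'].isna().mean():.1%}")

# Time-based interpolation is often reasonable for short gaps in a smooth
# daily series, but it can invent structure across long gaps, so cap the
# number of consecutive values it is allowed to bridge. Shown here for a
# single station; a full analysis would repeat this per station.
one_station = (
    df[df["station_id"] == df["station_id"].iloc[0]]
      .sort_values("date")
      .set_index("date")
)
one_station["value_filled"] = one_station["value"].interpolate(
    method="time", limit=3
)
```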
Stage 3: Summary Statistics and Insights
This stage includes informative summary statistics relevant to the portion of the dataset in question. Given that the dataset itself is a summary statistic, this would be a higher level of aggregation, e.g., by region or over time.
This stage is also a catch-all for any number of insights you might pursue within the data; Stage 5 will target something specific.
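One possible aggregation, sketched under the same hypothetical schema: rolling daily station-level values up to a statewide monthly summary.

```python
import pandas as pd

df = pd.read_csv("indiana_tmax.csv", parse_dates=["date"])

# Aggregate daily observations one level up: statewide monthly summaries.
monthly = (
    df.set_index("date")
      .resample("MS")["value"]
      .agg(["mean", "std", "min", "max", "count"])
)
print(monthly.tail())
```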
Stage 4: Visualization
Out of the billions of numerical values that compose the raw data, charts and visualizations, both classic and exotic, can be the most compelling output of a data science project, especially to the degree that they allow you to convey a story to an audience that would otherwise find the data inaccessible.
This dataset provides multiple dimensions across both space and time. Points will be awarded for the density of information carried by a visualization and, in particular, its quality and elegance. Better graphics capture the attention of more people.
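As a starting point only, here is a minimal matplotlib sketch of a statewide time series, built on the hypothetical subset from the earlier examples; the competition clearly rewards going well beyond this.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("indiana_tmax.csv", parse_dates=["date"])

# Statewide monthly mean of the daily maximum temperature.
monthly = df.set_index("date")["value"].resample("MS").mean()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(monthly.index, monthly.values, linewidth=0.8)
ax.set_xlabel("Year")
ax.set_ylabel("Daily maximum temperature (monthly mean)")
ax.set_title("Indiana statewide daily maximum temperature")
fig.tight_layout()
fig.savefig("indiana_tmax_monthly.png", dpi=150)
```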
Stage 5: Questions
In an effort to provide some concrete dimension to the competition, the following questions may be answered for points (a sketch of one possible approach follows the list).
Here in Indiana:
- What is the first recorded observation in the data?
- What is the highest recorded temperature, wind gust speed, and total precipitation in a single day?
- Where and when did these extreme values occur (referring to the previous question), and to what degree were they anomalous?
- Visualize the spread of temperature observations in the state of Indiana and how they relate to the extreme values on record for those calendar days of the year:
  a) over the last 50 years.
  b) over the last 10 years.
  c) over the last 6 months.
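As mentioned above, here is a sketch of how the first two questions, and one way of judging "anomalous," might be approached under the same hypothetical schema as the earlier examples:

```python
import pandas as pd

df = pd.read_csv("indiana_tmax.csv", parse_dates=["date"])

# First recorded observation in this subset.
first = df.loc[df["date"].idxmin()]
print("First observation:", first["date"].date(), "at", first["station_id"])

# Highest recorded daily maximum temperature, and where/when it occurred.
hottest = df.loc[df["value"].idxmax()]
print("Record:", hottest["value"], "at", hottest["station_id"],
      "on", hottest["date"].date())

# One way to gauge how anomalous the record is: a z-score against every
# observation that shares its calendar day of the year.
same_day = df[df["date"].dt.dayofyear == hottest["date"].dayofyear]["value"]
z = (hottest["value"] - same_day.mean()) / same_day.std()
print(f"z-score vs. same calendar day: {z:.1f}")
```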
Previous: Dataset | Next: Submission