This is the workspace for the Microsoft 2018 OneWeek Hackathon project Find Arctic Seals with Deep Learning. Other background materials (presentations, etc.) can be found in our Arctic Seals Hackathon Team.
To get write access to this repo, submit a request here.
The data
directory contains the following dataset files from NOAA:
train.csv
(5,256 records): Hotspot detection data for which we have all corresponding imagery data (see below). Currently all of these hotspots refer to images in dataset ArcticSealsData01.test.csv
(1,368 records): Same format and distrbution oftrain.csv
, suitable for cross-validation.
Each record in the CSV files refers to a hotspot that the NOAA thermal detection system picked up and that was classified by a human into either "Animal" (true positive) or "Anomaly" (false positive). Each hotspot is unique (no duplicates). The column schema is as follows:
hotspot_id
: unique IDtimestamp
: GMT/UTC timestamp (always corresponds to thermal image timestamp)filt_thermal16
: Filename of the 16-bit PNG containing the raw FLIR image datafilt_thermal8
: Filename of the 8-bit JPG containing the annotated FLIR image data (hotspots circled)filt_color
: Filename of the 8-bit JPG containing a color image taken at or near the same time as the thermal image. The timestamp encoded in the filename may be different from the thermal timestamp by up to 60 seconds (but typically less than 1 second).x_pos
/y_pos
: Location of the hotspot in the thermal imagethumb_*
: Bounding box of the hotspot in the color image. NOTE: some of these values are negative, as the bounding box is always 512x512 even if the hotspot is at the edge of the image.hotspot_type
: "Animal" or "Anomaly"species_id
: "Bearded Seal", "Ringed Seal", "UNK Seal", "Polar Bear" or "NA" (for anomalies)
In the data
directory there is also a raw.csv
(15,454 records) containing all hotspot detections from the NOAA 2016 survey flights (includes more seals but also more types of animals, more anomalies, hotspots marked as duplicates, etc.). We do not yet have the imagery corresponding to all of these hotspots, only about 2.5TB out of 19TB.
The actual image files are located in Azure storage, grouped into datasets each containing thousands of either color or thermal images. You can get these as .tar archives or .vhdx virtual disks; each contains the same data.
ArcticSealsData01_Color
(88GB): tar vhdxArcticSealsData02_Color
(89GB): tar vhdxArcticSealsData03_Color
(269GB): tar vhdxArcticSealsData04_Color
(648GB): tar vhdxArcticSealsData05_Color
(627GB): tar vhdxArcticSealsData06_Color
(535GB): tar vhdxArcticSealsData07_Color
(219GB): tar vhdx
The thermal data, since it's relatively small, has been combined into fewer files. Note that there is more thermal data than we have corresponding color data for.
ArcticSealsData01_Thermal
(1GB): tar vhdxArcticSealsData02-07_Thermal
(31GB): tar vhdxArcticSealsData08-99_Thermal
(41GB): tar vhdx
In Windows, you can easily mount the .vhdx files on your machine by double-clicking them.
The timestamp pattern embedded in the filenames has two possible forms - you may see, for example, either 160408_020848.724
or 20160408020848.724GMT
. In all cases you should use the filename-embedded timestamp to sequence/correlate images, not whatever timestamp your file system claims.
We also have the ArcticSealsData01 files as individual files in Azure storage that can be accessed as shown below. However, if you are going to do any bulk operations it's more efficient to download the tar/vhdx files.
Finally, if you want to use Azure Storage Explorer (for example) to access the entire blob container, use this connection string:
BlobEndpoint=https://arcticseals.blob.core.windows.net/;SharedAccessSignature=sv=2017-11-09&ss=b&srt=sco&sp=rl&se=2019-06-13T07:12:17Z&st=2018-06-13T23:12:17Z&spr=https&sig=2v7zAzhq2cw1%2BWseuNAKiTp5Qc4zzBclw3LqdDnANYg%3D
The project is meant to accomodate many different approaches, frameworks, languages, etc. Linux is the primary supported dev environment, though some GUI tools are Windows-only.
Hackathon members are welcome to add whatever code you like to this repo, but please follow these guidelines:
- Put your source code in its own directory inside the
src
directory. - Add a
README.md
file to your code directory explaining what your code does and how it works. - If there are dependencies that need to be installed and/or build steps that need to be performed, add any necessary code to the
build.bat
script to run the relevant package manager commands, compile steps, etc., to ensure your code is fully runnable locally.- Alternatively, it is also ok if your code only builds from within an IDE; if so just make a note of that in your
README.md
.
- Alternatively, it is also ok if your code only builds from within an IDE; if so just make a note of that in your
- If applicable, add a script that runs your code to the root directory. If it takes command line arguments, please show help text if it is run without arguments.
Additional notes from 10/16/2018 NOAA sync meeting.
- Filename timestamps with very small millisecond values are actually for the previous second. Metadata for a given image will have the correct timestamp.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.