Merge pull request #34 from neurodata/dalabengba
Complete pre-print draft
Showing 31 changed files with 3,625 additions and 187 deletions.
@@ -1,18 +1,9 @@
-# rf-dn-paper
-Random forests (RF) and deep networks (DN) are two of the primary machine
-learning methods in current literature, yet they are known to yield different
-levels of performance on different data modalities. In particular, RF (and
-other tree based ensemble methods) is one of the leading means for tabular
-data, while DNs are known to perform best on structured data. We wish to
-further explore and establish the conditions and domains in which each
-approach excels, particularly in the context of sample size. To address these
-issues, our team is analyzing the performance of these models across different
-settings using varying model parameters and architectures. For our data, we
-are focusing on well-known image, tabular, and audio datasets to observe
-performance from well-documented sources. For image data, we used CIFAR-10/100
-with fixed sample sizes across different numbers of classes. For tabular data,
-we used OpenML-CC18, a collection of datasets representing a benchmark for
-machine learning analysis. For audio data, we used UrbanSound8k dataset across
-ten classes employing Mel-frequency cepstrum as a feature representation for
-audio. The goal of this project is to observe novel trends in model
-classification accuracy visible across a range of sample sizes.
+# When are Deep Networks really better than Random Forests at small sample sizes?
+
+[![arXiv](https://img.shields.io/badge/arXiv-2108.13637-red.svg?style=flat)](https://arxiv.org/abs/2108.13637)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+[![License](https://img.shields.io/badge/License-MIT-blue)](https://opensource.org/licenses/MIT)
+
+**This is preliminary work. More details will be available.**
+
+Random forests (RF) and deep networks (DN) are two of the most popular machine learning methods in the current scientific literature and yield differing levels of performance on different data modalities. We wish to further explore and establish the conditions and domains in which each approach excels, particularly in the context of sample size and feature dimension. To address these issues, we tested the performance of these approaches across tabular, image, and audio settings using varying model parameters and architectures. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found RF to excel at tabular and structured data (image and audio) with small sample sizes, whereas DN performed better on structured data with larger sample sizes. Although we plan to continue updating this technical report in the coming months, we believe the current preliminary results may be of interest to others.
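The new abstract above describes sweeping training-set size for each model family and comparing accuracy. Below is a minimal sketch of that kind of experiment using scikit-learn's `RandomForestClassifier` and `MLPClassifier` as stand-ins; the dataset, architectures, and sample-size grid are illustrative assumptions, not the paper's actual protocol.

```python
# Hedged sketch: compare a random forest and a small deep network on
# increasing training subsets, holding the test set fixed so only the
# sample-size effect varies. All settings here are illustrative.
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixels so the MLP trains reasonably
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=10_000, test_size=2_000, random_state=0
)

for n in [100, 1_000, 10_000]:  # up to the report's 10,000-sample cap
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    dn = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=200, random_state=0)
    rf.fit(X_train[:n], y_train[:n])
    dn.fit(X_train[:n], y_train[:n])
    print(n, "RF:", rf.score(X_test, y_test), "DN:", dn.score(X_test, y_test))
```

Keeping the evaluation set fixed while the training subset grows is what lets a plot of accuracy versus sample size show where one model family overtakes the other.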
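For the tabular setting, the old README names OpenML-CC18 as the benchmark. A hedged sketch of pulling that suite with the `openml` package follows; suite ID 99 is OpenML-CC18's identifier on OpenML, and the three-task slice is arbitrary, just to keep the example small.

```python
# Hedged sketch: fetch a few OpenML-CC18 tasks and inspect their shapes.
import openml

suite = openml.study.get_suite(99)  # 99 = OpenML-CC18 benchmark suite
for task_id in suite.tasks[:3]:     # arbitrary slice for illustration
    task = openml.tasks.get_task(task_id)
    X, y = task.get_X_and_y()
    print(task_id, X.shape)
```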
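For the audio setting, the diff mentions Mel-frequency cepstrum features for UrbanSound8k. Below is a sketch of that featurization step using `librosa`; the file path is hypothetical, and averaging coefficients over time is one common way to get a fixed-length vector per clip, not necessarily the paper's exact choice.

```python
# Hedged sketch: MFCC features for one audio clip via librosa.
import librosa

# Hypothetical path into the UrbanSound8K download; adjust to your copy.
y, sr = librosa.load("UrbanSound8K/audio/fold1/example.wav")
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, n_frames)
features = mfcc.mean(axis=1)  # one 40-dim vector per clip
```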