forked from trangdata/pmlblite
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
143 lines (102 loc) · 5.49 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
output: rmarkdown::github_document
---
[![vignette](https://img.shields.io/badge/-Vignette-green?logo=spinnaker)](https://epistasislab.github.io/pmlb/using-r.html)
[![documentation](https://img.shields.io/badge/-Documentation-purple?logo=read-the-docs)](https://epistasislab.github.io/pmlb/r-ref.html)
`r badger::badge_cran_download("pmlbr", "grand-total", "blue")`
```{r, include = FALSE}
knitr::opts_chunk$set(
fig.path = "man/figures/"
)
```
# pmlbr <img src="man/figures/logo.png" align="right" height="139"/>
![Lifecycle
](https://img.shields.io/badge/lifecycle-maturing-blue.svg?style=flat)
![R
%>%= 3.1.0](https://img.shields.io/badge/R->%3D3.1.0-blue.svg?style=flat)
![Dependencies
](https://img.shields.io/badge/dependencies-none-brightgreen.svg?style=flat)
[![R build status](https://github.com/EpistasisLab/pmlbr/workflows/R-CMD-check/badge.svg)](https://github.com/EpistasisLab/pmlbr/actions)
**pmlbr** is an R interface to the [Penn Machine Learning Benchmarks](https://epistasislab.github.io/pmlb/) (PMLB) data repository, a large collection of curated benchmark datasets for evaluating and comparing supervised machine learning algorithms.
These datasets cover a broad range of applications including binary/multi-class classification and regression problems as well as combinations of categorical, ordinal, and continuous features.
This repository is originally forked from [makeyourownmaker/pmlblite](https://github.com/makeyourownmaker/pmlblite).
We thank the **pmlblite**'s author for releasing the source code under the [GPL-2 license](https://github.com/makeyourownmaker/pmlblite/blob/be763f7011b21e71e3eaf6d3ca6b794d405507cd/LICENSE) so that others could reuse the software.
## Install
This package works for any recent version of R.
You can install the released version of **pmlbr** from CRAN with:
```{r, eval=FALSE}
install.packages("pmlbr")
```
Or the development version from GitHub with remotes:
```{r, eval=FALSE}
# install.packages('remotes') # uncomment to install remotes
library(remotes)
remotes::install_github("EpistasisLab/pmlbr")
```
## Usage
The core function of this package is `fetch_data` that allows us to download data from the PMLB repository.
For example:
``` {r}
library(pmlbr)
# Download features and labels for penguins dataset in single data frame
penguins <- fetch_data("penguins")
str(penguins)
# Download features and labels for penguins dataset in separate data structures
penguins <- fetch_data("penguins", return_X_y = TRUE)
head(penguins$x) # data frame
head(penguins$y) # vector
```
Let's check other available datasets and their summary statistics:
``` {r}
# Dataset names
head(classification_dataset_names, 9)
head(regression_dataset_names, 9)
# Dataset summaries
head(summary_stats)
```
Selecting a subset of datasets that satisfy certain conditions is straight forward with `dplyr`.
For example, if we need datasets with fewer than 100 observations for a classification task:
```{r warning=FALSE, message=FALSE}
library(dplyr)
summary_stats %>%
filter(n_instances < 100, task == "classification") %>%
pull(dataset)
```
### Dataset format
All data sets are stored in a common format:
* First row is the column names
* Each following row corresponds to an individual observation
* The target column is named `target`
* All columns are tab (`\t`) separated
* All files are compressed with `gzip` to conserve space
This R library includes summaries of the classification and regression data sets but does **not**
store any of the PMLB data sets. The data sets can be downloaded using the `fetch_data` function which
is similar to the corresponding PMLB python function.
Further info:
``` {r}
?fetch_data
?summary_stats
```
### Citing
If you use PMLB in a scientific publication, please consider citing one of the following papers:
Joseph D. Romano, Le, Trang T., William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore.
[PMLB v1.0: an open source dataset collection for benchmarking machine learning methods.](https://arxiv.org/abs/2012.00058) _arXiv preprint_ arXiv:2012.00058 (2020).
Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore (2017).
[PMLB: a large benchmark suite for machine learning evaluation and comparison](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0154-4).
BioData Mining 10, page 36.
## Roadmap
* Add tests
## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Integration of other data repositories are particularly welcome.
## Alternatives
* [Penn Machine Learning Benchmarks](https://github.com/EpistasisLab/pmlb)
* [OpenML](https://www.openml.org/search?type=data)
Approximately 2,500 datasets - available for download using [R module](https://github.com/openml/openml-r)
* [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/datasets)
* [mlbench: Machine Learning Benchmark Problems](https://cran.r-project.org/package=mlbench)
* [Rdatasets: An archive of datasets distributed with R](https://vincentarelbundock.github.io/Rdatasets/)
* [datasets.load: Visual interface for loading datasets in RStudio from all installed (unloaded) packages](https://cran.r-project.org/package=datasets.load)
* [stackoverflow: How do I get a list of built-in data sets in R?](https://stackoverflow.com/questions/33797666/how-do-i-get-a-list-of-built-in-data-sets-in-r)
## License
[GPL-2](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)