A DataVil project.
FrameX is a lightweight dataset-fetching library for fast prototyping, tutorial creation, and experimentation.
Built on top of Polars.
To get started, install the library with:
pip install framex
import framex as fx
iris = fx.load("iris")
which returns a polars DataFrame
Therefore, you can use all the polars functions and methods on the returned DataFrame.
iris.head()
shape: (5, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f32 ┆ f32 ┆ f32 ┆ f32 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
iris = fx.load("iris", lazy=True)
which returns a polars LazyFrame
Both of these operations create local copies of the datasets by default (cache=True).
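To make the caching behaviour concrete, here is a minimal sketch of a read-through cache: the first load fetches and stores a local copy, and later loads reuse it. This is not FrameX's actual implementation. The helper load_cached, the fake_fetch stand-in, the ".bin" file name, and the temporary cache directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_cached(
    name: str,
    fetch: Callable[[str], bytes],
    cache_dir: Path,
    cache: bool = True,
) -> bytes:
    """Return dataset bytes, reusing a local copy when one exists (hypothetical helper)."""
    path = cache_dir / f"{name}.bin"
    if path.exists():
        # A local copy already exists, so no download is needed.
        return path.read_bytes()
    data = fetch(name)
    if cache:
        # Store a local copy for the next call.
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return data


calls = []


def fake_fetch(name: str) -> bytes:
    """Stand-in for a network download; records each call."""
    calls.append(name)
    return f"contents of {name}".encode()


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache"
    first = load_cached("iris", fake_fetch, cache_dir)
    second = load_cached("iris", fake_fetch, cache_dir)

print(calls)  # the fetcher ran only once
```

Passing cache=False in this sketch skips writing the local copy, so every call would fetch again.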
To see the list of available datasets, run:
fx.available()
{'remote': ['iris', 'mpg', 'netflix', 'starbucks', 'titanic'], 'local': ['titanic']}
PS: shortened for clarity.
which returns a dictionary of both locally and remotely available datasets.
To see only local or remote datasets, run:
fx.available("local")
fx.available("remote")
{'local': ['titanic']}
{'remote': ['iris', 'mpg', 'netflix', 'starbucks', 'titanic']}
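Since the result is a plain dictionary, ordinary Python is enough to work with it. The sketch below finds the remote datasets that have no local copy yet. It assumes the return value has the 'remote' and 'local' keys shown in the sample output, and it uses that sample as a literal instead of calling fx.available() so that it runs offline.

```python
# Literal copied from the sample output above; with FrameX installed this
# would be `available = fx.available()`.
available = {
    "remote": ["iris", "mpg", "netflix", "starbucks", "titanic"],
    "local": ["titanic"],
}

local = set(available["local"])

# Remote datasets without a local copy, in their original order.
not_cached = [name for name in available["remote"] if name not in local]
print(not_cached)
```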
To get information on a dataset, run:
fx.about("mpg") # basically the same as `fx.about("mpg", mode="print")`
which will print the information on the dataset as the following:
NAME : mpg
SOURCE : https://www.kaggle.com/datasets/uciml/autompg-dataset
LICENSE : CC0: Public Domain
ORIGIN : Kaggle
OG NAME : autompg-dataset
Or you can get the information as a single-row polars DataFrame by running:
row = fx.about("mpg", mode="row")
print(row)
which will print the information on the dataset as an ASCII table like the following:
shape: (1, 4)
┌──────┬─────────────────────────────────┬────────────────────┬────────┐
│ name ┆ source ┆ license ┆ origin │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════╪═════════════════════════════════╪════════════════════╪════════╡
│ mpg ┆ https://www.kaggle.com/dataset… ┆ CC0: Public Domain ┆ Kaggle │
└──────┴─────────────────────────────────┴────────────────────┴────────┘
Alternatively, you can simply treat row as a polars DataFrame in your code.
If you need the file links:
url_pokemon = fx.get_url("pokemon")
By default, the format is "feather".
Optionally, you can specify the format of the dataset.
url_pokemon_csv = fx.get_url("pokemon", format="csv")
The framex CLI has an overhead of around 400 milliseconds due to imports. Even so, operations still take less than a second unless they are bottlenecked by download speed.
To see all the available commands, run:
fx -h
usage: fx [-h] [--version]
{get,bring,about,list,show,describe} ...
Framex CLI
positional arguments:
{get,bring,about,list,show,describe}
get Get dataset(s)
bring Bring dataset(s) from the cache to the
current working directory or to a
specified directory.
about Info about dataset(s)
list List available datasets
show Show a preview of a single dataset
describe Describe (or summarize) a dataset
options:
-h, --help show this help message and exit
--version, -v Show version
Get a single dataset (to the current directory):
fx get iris
or get multiple datasets:
fx get iris mpg titanic
which will download dataset(s) to the current directory.
To get the datasets into the cache directory:
fx get iris mpg titanic --cache
or to a specific directory:
fx get iris mpg titanic --dir data
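If you want to drive these commands from a script, one option is to assemble the argument list in Python and hand it to subprocess. The sketch below uses only the --cache and --dir flags documented above. The helpers build_get_command and run_get are hypothetical and not part of FrameX. Because the two flags are only shown separately here, the sketch refuses to combine them, which is an assumption of the sketch and not a documented rule.

```python
import shutil
import subprocess


def build_get_command(names, cache=False, directory=None):
    """Assemble an `fx get` argument list (hypothetical helper, not part of FrameX)."""
    if cache and directory is not None:
        # Assumption: the flags are only documented separately, so do not mix them.
        raise ValueError("choose either cache=True or directory, not both")
    cmd = ["fx", "get", *names]
    if cache:
        cmd.append("--cache")
    elif directory is not None:
        cmd += ["--dir", str(directory)]
    return cmd


def run_get(names, **kwargs):
    """Run the command, but only if the fx executable is on PATH."""
    if shutil.which("fx") is None:
        raise RuntimeError("fx executable not found; install framex first")
    return subprocess.run(build_get_command(names, **kwargs), check=True)


print(build_get_command(["iris", "mpg", "titanic"], directory="data"))
```

Calling run_get(["iris"], cache=True) would then execute fx get iris --cache, provided the CLI is installed.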
To get the names of the available datasets on the remote server:
fx list
This will list all available datasets on the remote server.
To get the names of the available datasets that include "dia":
fx list dia
Locally available datasets: (feather, parquet, csv, other)
Remote datasets:
diamonds
To get information on a dataset or datasets, run:
fx about mpg iris
To show a preview of a single dataset:
fx show iris
To describe (or summarize) a dataset:
fx describe iris
For more parameters, see the help for a subcommand:
fx get --help
Bring a dataset to the current directory from cache:
fx bring iris
or bring multiple datasets:
fx bring iris mpg titanic
which will bring dataset(s) to the current directory from cache directory.