Just like any other argument, the dataset argument is used by the experiment to retrieve a full local copy of the dataset.
This means datasets can be fetched efficiently, by any machine, in a reproducible way.
Together this creates a full version control solution for all your data,
one that is both machine and environment agnostic.
- Dataset is a collection of files: e.g. a folder with all the subdirectories and files included in the dataset
- Differential storage: efficient use of storage and network
- Flexible: supports addition / removal / merge of files and datasets
- Descriptive, transparent & searchable: supports projects, names, descriptions, tags and searchable fields
- Simple interface (CLI and programmatic)
- Accessible: get a copy of the dataset files from anywhere, on any machine
- Create a dataset
clearml-data create --project <my_project> --name <my_dataset_name>
- Add local files to the dataset
clearml-data add --files ~/datasets/best_dataset/
- Close the dataset and upload the files (Optional: use --storage to specify a storage target such as s3://bucket, gs://, azure://, or /mnt/shared/)
clearml-data close --id <dataset_id>
from argparse import ArgumentParser
from clearml import Dataset, Task
# add a command line interface, so it is easy to use
parser = ArgumentParser()
parser.add_argument('--dataset', default='aayyzz', type=str, help='Dataset ID to train on')
args = parser.parse_args()
# create a task, so that later we can override the argparse arguments from the UI
task = Task.init(project_name='examples', task_name='dataset demo')
# get a local copy of the dataset
dataset_folder = Dataset.get(dataset_id=args.dataset).get_local_copy()
# go over the files in `dataset_folder` and train your model
Creating datasets from code is especially helpful when preprocessing is done on raw data and we want to store both the preprocessing code and the resulting dataset in a single Task.
from clearml import Dataset
# Preprocessing code here
dataset = Dataset.create(dataset_name='dataset name', dataset_project='dataset project')
dataset.add_files('/path_to_data')
# upload the dataset files to storage and finalize this version
dataset.upload()
dataset.finalize()
- Create a new dataset (specify the parent dataset id)
clearml-data create --name <improved_dataset> --parents <existing_dataset_id>
- Get a mutable copy of the current dataset
clearml-data get --id <created_dataset_id> --copy ~/datasets/working_dataset
- Change / add / remove files from the dataset folder
vim ~/datasets/working_dataset/everything.csv
Folder sync mode updates a dataset according to changes in the folder content.
This is useful when there is a single point of truth, either a local or a network folder, that gets updated periodically.
When using clearml-data sync and specifying a parent dataset, the folder changes are reflected in a new dataset version.
This saves the time of manually adding / removing files. A programmatic sketch of the same extend-and-sync flow follows the CLI steps below.
- Sync local changes
clearml-data sync --id <created_dataset_id> --folder ~/datasets/working_dataset
- Upload files (Optional: use --storage to specify a storage target such as s3://bucket, gs://, azure://, or /mnt/shared/)
clearml-data upload --id <created_dataset_id>
- Close dataset
clearml-data close --id <created_dataset_id>
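The same extend-and-sync flow can also be done from Python. The following is a minimal sketch, assuming a placeholder parent dataset ID and a local folder acting as the source of truth; the SDK calls used here (Dataset.create with parent_datasets, sync_folder, upload, finalize) should be checked against the ClearML version you are using.
from clearml import Dataset
# create a new dataset version on top of an existing parent
# ('<existing_dataset_id>' and the paths below are placeholders)
child = Dataset.create(
    dataset_name='improved_dataset',
    dataset_project='dataset project',
    parent_datasets=['<existing_dataset_id>'],
)
# sync the new version with the local source-of-truth folder,
# then upload the changed files and close the version
child.sync_folder(local_path='/path/to/working_dataset')
child.upload()
child.finalize()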
- search : Search a dataset based on project / name / description / tag etc.
- list : List the file directory content of a dataset (no need to download a copy of the dataset)
- verify : Verify a local copy of a dataset (verify the dataset files SHA2 hash)
- create : Create a new dataset (supports extending / inheriting multiple parents)
- delete : Delete a dataset
- add : Add local files to a dataset
- sync : Sync a dataset with a local folder (the source of truth being the local folder)
- remove : Remove files from a dataset (no need to download a copy of the dataset)
- get : Get a local copy of the dataset (either readonly --link, or writable --copy)
- upload : Upload the dataset (use --storage to specify a storage target such as S3 / GS / Azure / Folder, default: file server)
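Most of these operations are also available from the Python SDK. The sketch below is illustrative only and assumes the Dataset methods list_datasets, list_files, and verify_dataset_hash behave as described; check the exact names and signatures for your ClearML version.
from clearml import Dataset
# search: find dataset versions by project / partial name / tags (assumed signature)
datasets = Dataset.list_datasets(dataset_project='dataset project', partial_name='best', tags=['production'])
# list: show the file content of a dataset without downloading a copy
ds = Dataset.get(dataset_id='<dataset_id>')
for entry in ds.list_files():
    print(entry)
# verify: check a local copy against the stored SHA2 hashes
# (assumed to return the files that failed verification)
failed = ds.verify_dataset_hash(local_copy_path='/path/to/local_copy')
print('files that failed verification:', failed)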
Each dataset instance stores the collection of files added / modified relative to the previous version (its parent).
When requesting a copy of the dataset, all parent datasets in the graph are downloaded and merged into a new folder containing all the changes introduced along the dataset DAG.
Implementation details:
The dataset differential snapshot is stored in a single zip file, for efficient storage and network bandwidth. A local cache is built into the process, making sure each dataset is downloaded only once. The dataset stores the SHA2 hash of every file it contains. To speed up dataset fetching, only file sizes are verified automatically; the SHA2 hashes are verified only on the user's request.
The design supports multiple parents per dataset, essentially merging all parents based on their order. To improve storage and speed for deep dataset DAGs, dataset squashing was introduced: a user can squash a dataset, merging down all the changes introduced along the DAG and creating a new flat version with no parent datasets.
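Squashing can also be triggered programmatically. This is a sketch that assumes a Dataset.squash class method accepting a new dataset name and the IDs of the versions to merge; verify the exact signature for your ClearML version.
from clearml import Dataset
# merge a chain of dataset versions into a single flat version
# ('<dataset_id_a>' / '<dataset_id_b>' are placeholder version IDs)
flat_dataset = Dataset.squash(
    dataset_name='my_dataset_squashed',
    dataset_ids=['<dataset_id_a>', '<dataset_id_b>'],
)
print(flat_dataset.id)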
A dataset is represented as a special Task in the system. It is of type data-processing, with a special tag dataset.
- The full log (calls / CLI) of the dataset creation process can be found in the "Execution" section.
- A listing of the dataset differential snapshot, i.e. a summary of files added / modified / removed and details of the files in the differential snapshot (location / size / hash), is available in the Artifacts section.
- The full dataset listing (all files included) is available in the Configuration section under Dataset Content. This allows you to quickly compare two dataset contents and visually see the difference.
- The dataset genealogy DAG and a change-set summary table are visualized in Results / Plots.