We implemented `DatasetDb`, a dedicated database for storing datasets that can be easily processed in PyTorch. It provides a simple interface to access, iterate over, and create datasets. It is based on SQLite, so it avoids loading the entire dataset content into memory, which makes it well suited for multiprocess training. If you're interested in understanding how it was implemented, please go to the Dataset structure section.
A `DatasetDb` can be instantiated using a Python context manager to make sure that the underlying database connection is correctly closed. This can be done as follows:

```python
from emma_datasets.db import DatasetDb

with DatasetDb("path/to/dataset.db") as db:
    # now you can use the db...
    ...
```

The connection to the database will be closed automatically when the `with` block exits.
Each dataset is composed of several examples. Each example in this library is represented as a tuple `(data_id, example_id, data)`:

- `data_id` is the instance index;
- `example_id` is the identifier used by the dataset to represent the current instance;
- `data` is a byte representation of the instance's content.

By default, the instance content is assumed to be JSON, so the `DatasetDb` will return a Python object when reading from the underlying SQLite database.
To access the data, you can iterate over the examples as follows:

```python
from emma_datasets.db import DatasetDb

with DatasetDb("path/to/dataset.db") as db:
    for data_id, example_id, data in db:
        # do something with the fields...
        ...
```
You can access a specific instance using either type of identifier. The `DatasetDb` can be used as a Python dictionary:

```python
from emma_datasets.db import DatasetDb

with DatasetDb("path/to/dataset.db") as db:
    # the `data_id` has to be of type `int`
    data_id = 150
    instance = db[data_id]

    # the `example_id` has to be of type `str`
    example_id = "pretraining_150"
    instance = db[example_id]
```
The previous examples are useful if you are just interested in exploring the data. However, one important use case for EMMA is to use the data to train a PyTorch model. We can use the `DatasetDb` as follows:

```python
from emma_datasets.db import DatasetDb
from torch.utils.data import Dataset


class EmmaPretrainingDataset(Dataset):
    def __init__(self, db_path):
        self.db = DatasetDb(db_path)

    def __len__(self):
        # Don't worry, this is extremely efficient because we have an index on the primary key :)
        return len(self.db)

    def __getitem__(self, index):
        instance = self.db[index]

        # I'm assuming you have a way to transform your raw JSON data to tensors
        tensors = transform(instance)

        return tensors
```
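Because the data are read from SQLite on demand rather than held in memory, such a dataset combines well with a multi-worker `DataLoader`. The snippet below is a minimal usage sketch: the path, batch size, and worker count are placeholders, and it assumes the `transform` used in `__getitem__` returns tensors that PyTorch's default collation can batch.

```python
from torch.utils.data import DataLoader

# Placeholder path and loader settings; adjust them to your setup.
dataset = EmmaPretrainingDataset("path/to/dataset.db")
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    # feed the batch to your model / training step
    ...
```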
We can create a `DatasetDb` using a similar API, which is described in the following code snippet:

```python
from emma_datasets.db import DatasetDb

num_instances = 10

with DatasetDb("path/to/dataset.db", readonly=False) as db:
    for data_id in range(num_instances):
        # this is just an example, you can use any Python object
        instance = {"caption": "This is a caption"}
        example_id = f"instance_{data_id}"

        db[(data_id, example_id)] = instance
```
In this snippet, we assume that the dataset creation happens once, so that we are able to assign one unique `data_id` to each instance. In general, `data_id` represents an index that goes from `0` to `N-1`, where `N` is the number of datapoints in the dataset.

When writing to the database, you may want to adjust the parameter `batch_size` (default: `512`). This is the number of instances retained in the write cache before its content is flushed to the database.
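For example, a larger cache trades memory for fewer write flushes. The sketch below assumes `batch_size` is accepted by the `DatasetDb` constructor; treat the parameter placement as an assumption rather than the confirmed API.

```python
from emma_datasets.db import DatasetDb

# Assumption: `batch_size` is a constructor parameter of DatasetDb.
with DatasetDb("path/to/dataset.db", readonly=False, batch_size=1024) as db:
    db[(0, "instance_0")] = {"caption": "This is a caption"}
```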
Thanks to Python's built-in SQLite integration, we were able to implement this in pure Python. The actual implementation can be found in the storage module.
SQLite is a powerful and efficient relational database that we use for storing the dataset we are interested in. We assume that a dataset is composed of `N` data points `[x_1, x_2, ..., x_N]`. In order to represent it in a relational database, we define a table `dataset` that has the following columns:

- `data_id`: an instance counter for all the database instances (defined as `INTEGER PRIMARY KEY`);
- `example_id`: an identifier for the instance (defined as `TEXT`);
- `data`: the instance's content in bytes (defined as `BLOB`, see Storage types for details).
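For illustration only, here is what an equivalent schema would look like if you created it yourself with Python's built-in `sqlite3` module; this is a sketch of the table described above, not the library's actual DDL.

```python
import sqlite3

# Illustrative only: a table with the same three columns described above.
conn = sqlite3.connect("path/to/dataset.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS dataset (
        data_id INTEGER PRIMARY KEY,
        example_id TEXT,
        data BLOB
    )
    """
)
conn.commit()
conn.close()
```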
The underlying SQLite database cannot store arbitrary Python objects out of the box. Therefore, we serialise all the instance data to bytes and store them in a `BLOB` field. At the moment, we support two different storage types:

- `TorchStorage`: uses the default PyTorch serialisation format and can be used to store any PyTorch/Python object. For more details, refer to the official documentation.
- `JsonStorage`: uses the orjson library, which also supports NumPy serialisation.
By default, `JsonStorage` is used as the serialisation type for all the instances. If you're interested in storing actual PyTorch tensors, you can change the serialisation format as follows:

```python
from emma_datasets.db import DatasetDb, StorageType

db = DatasetDb("/path/to/dataset.db", storage_type=StorageType.torch)
```
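As a follow-up, the sketch below writes an actual tensor once torch serialisation is selected. It assumes that `readonly=False` and `storage_type` can be combined in the constructor, and the tensor shape and identifiers are arbitrary placeholders.

```python
import torch

from emma_datasets.db import DatasetDb, StorageType

# Assumption: `readonly` and `storage_type` can be passed together to the constructor.
with DatasetDb("/path/to/dataset.db", readonly=False, storage_type=StorageType.torch) as db:
    db[(0, "instance_0")] = {"features": torch.zeros(10, 768)}
```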