Batch loading
As the package is aimed at machine learning applications, it includes a batch loader. After loading your dataset you need to initialize the batch data splits (subsets). The default split is 70% train, 20% validate, and 10% test.
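If the default proportions suit you, the call can be made without arguments:

```lua
data = Dataframe('my_large_data.csv')
-- No arguments: uses the default 70/20/10 train/validate/test split
data:create_subsets()
```

To request different proportions, pass a Df_Dict with the desired split instead.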
desired_split = {['train'] = 0.5,
                 ['validate'] = 0.25,
                 ['test'] = 0.25}
data = Dataframe('my_large_data.csv')
data:create_subsets(Df_Dict(desired_split))
This will create a random split according to the requested proportions. If you want to reuse the same dataset in multiple projects, it is a good idea to save the initialized dataset and only run the create_subsets function the first time the dataset is required:
my_csv = '/home/max/data/my_large_data.csv'
local serialized = my_csv:gsub("[.]csv$", ".t7")
if (paths.filep(serialized)) then
  dataset = torch.load(serialized, "binary")
else
  dataset = Dataframe(my_csv)
  -- Init batch subsets
  dataset:create_subsets(Df_Dict({['train'] = 0.8,
                                  ['validate'] = 0.1,
                                  ['test'] = 0.1}))
  -- Save the initialized dataset
  print("Saving initialized dataset to: " .. serialized)
  torch.save(serialized, dataset)
end
The get_batch function in the Df_Subset retrieves a Batchframe with data. The Batchframe has a custom to_tensor function that returns two tensors: (1) data and (2) labels. The input data is commonly not in CSV format but png, mpg, etc., so the function takes a function as the load_data_fn argument. That function receives the entire row as input, may use it any way it wishes, and must return a tensor. The per-row tensors are then concatenated so that the first dimension is specific to each row.
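As a minimal sketch of such a loader (assuming the torch `image` package and a `Filename` column holding image paths, both of which are illustrative assumptions):

```lua
local image = require 'image'

-- Hypothetical row-loader: reads the image named in the row's
-- 'Filename' column and returns it as a 3D tensor (channels x H x W).
-- to_tensor concatenates these so the batch tensor becomes
-- N x channels x H x W, with one slice per row.
local function load_row_image(row)
  return image.load(row["Filename"], 3, 'float')
end
```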
By default all numerical columns are returned as the label tensor, but you can specify exactly which columns to use through the label_columns argument. In addition to the label tensor, a table with the names of each label column is returned as the third value, since the tensor itself doesn't contain column names.
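A sketch of selecting label columns, assuming `batch` comes from get_batch as described above and that the dataset has a column named `Label` (both the column name and the `loadTrainingImage` helper are illustrative):

```lua
local data, label, names = batch:to_tensor{
  load_data_fn = function(row) return loadTrainingImage(row["Filename"]) end,
  label_columns = Df_Array("Label")  -- hypothetical label column
}
-- names holds the label column names matching the label tensor's columns
```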
A simple example of how this can be put together:
local dataset = Dataframe("my.csv")
dataset:create_subsets()

local stop = false
local batch
while (not stop) do
  batch, stop = dataset["/train"]:get_batch(100)
  local data, label = batch:to_tensor{
    load_data_fn = function(row) return loadTrainingImage(row["Filename"]) end
  }
  -- Wrap the tensors in the dataset interface that nn trainers
  -- expect: a size() function and [i] indexing to {input, target}
  data = {data = data:cuda(),
          label = label:cuda()}
  function data:size() return self.data:size(1) end
  setmetatable(data,
               {__index = function(t, i)
                  return {t.data[i], t.label[i]}
                end})
  local err = trainer:train(data)
end