Max Gordon edited this page Sep 1, 2016 · 4 revisions

As the package is aimed at Machine Learning applications, we have added a batch loader.

Init

After loading your dataset you need to initialize the batch data splits. The default split is 70% in train, 20% in validate, and 10% in test.

desired_split = {['train'] = 0.5,
                 ['validate'] = 0.25,
                 ['test'] = 0.25}
data = Dataframe('my_large_data.csv')
data:create_subsets(Df_Dict(desired_split))

This will create a random split according to the requested proportions. If you want to reuse the same split in multiple projects, save the dataset once it has been split and run the create_subsets function only the first time the dataset is required:

my_csv = '/home/max/data/my_large_data.csv'
-- gsub returns two values; the extra parentheses keep only the new string
my_t7 = (my_csv:gsub("[.]csv$", ".t7"))
if (paths.filep(my_t7)) then
    dataset = torch.load(my_t7, "binary")
else
    dataset = Dataframe(my_csv)
    -- Init batch
    dataset:create_subsets{subsets = Df_Dict({['train'] = 0.8,
                                              ['validate'] = 0.1,
                                              ['test'] = 0.1})}
    -- Save the initialized dataset
    print("Saving initialized dataset to: " .. my_t7)
    torch.save(my_t7, dataset)
end
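
To make the idea of a random split by proportions concrete, here is a minimal sketch in plain Lua. It is not how create_subsets is implemented; the helper name split_indices, the seed argument, and the rule that the last subset absorbs any rounding remainder are all assumptions made for illustration.

```lua
-- Hypothetical helper, not part of torch-dataframe: assign the row indices
-- 1..n_rows to named subsets according to the requested proportions.
local function split_indices(n_rows, proportions, seed)
  math.randomseed(seed or 42)

  -- Fisher-Yates shuffle of the row indices
  local idx = {}
  for i = 1, n_rows do idx[i] = i end
  for i = n_rows, 2, -1 do
    local j = math.random(i)
    idx[i], idx[j] = idx[j], idx[i]
  end

  -- Sort the subset names so the assignment order is deterministic
  local names = {}
  for name in pairs(proportions) do names[#names + 1] = name end
  table.sort(names)

  local subsets, start = {}, 1
  for k, name in ipairs(names) do
    local stop
    if k == #names then
      stop = n_rows  -- the last subset absorbs any rounding remainder
    else
      local size = math.floor(proportions[name] * n_rows + 0.5)
      stop = math.min(n_rows, start + size - 1)
    end
    subsets[name] = {}
    for i = start, stop do
      table.insert(subsets[name], idx[i])
    end
    start = stop + 1
  end
  return subsets
end

local s = split_indices(100, {train = 0.5, validate = 0.25, test = 0.25})
print(#s.train, #s.validate, #s.test)
```

Every row index ends up in exactly one subset, which is the property that matters when the saved split is reused later.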

Loading a batch

Loading input data

The get_batch function in the Df_Subset retrieves a Batchframe with data. The Batchframe has a custom to_tensor function that returns two tensors: (1) data and (2) labels. The input data is commonly not stored in the CSV itself but in separate png, mpg, etc. files, and to_tensor therefore takes a function as the load_data_fn argument. That function receives the entire row as input, can use it in any way it wishes, and returns a tensor. The per-row tensors are then concatenated so that the first dimension indexes the rows.
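
The row-by-row loading and stacking can be sketched in plain Lua, with nested tables standing in for tensors. This only illustrates the idea and is not the Batchframe code; the rows, the id column, and the fake 2x2 "image" are made up for the example.

```lua
-- Plain-Lua stand-in: nested tables play the role of tensors here.
-- Hypothetical loader: a real one would read the file named in the row
-- (e.g. an image) and return its contents as a tensor.
local function load_data_fn(row)
  -- Pretend every file decodes to a 2x2 "image" filled with the row id
  return {{row.id, row.id}, {row.id, row.id}}
end

-- Hypothetical rows, as a batch might hand them to the loader
local rows = {
  {id = 1, Filename = "a.png"},
  {id = 2, Filename = "b.png"},
  {id = 3, Filename = "c.png"},
}

-- Stack the per-row results so that the first index selects the row
local data = {}
for i, row in ipairs(rows) do
  data[i] = load_data_fn(row)
end

print(#data, #data[1], #data[1][1])
```

The result has one entry per row along the first index, followed by whatever shape the loader returned for a single row.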

Label data

By default all numerical columns are returned as the label tensor, but you can also specify exactly which columns to use through the label_columns argument. Since a tensor cannot hold column names, a table with the name of each label column is returned as the third value.
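
The default column selection and the accompanying table of names can be illustrated in plain Lua. This is a sketch under assumptions, not the library's code: get_labels, the example rows, and the alphabetical ordering of the default columns are hypothetical.

```lua
-- Hypothetical helper: collect label values from row-wise data and return
-- them together with the column names they came from.
local function get_labels(rows, label_columns)
  local names = label_columns
  if names == nil then
    -- Default: every column whose value in the first row is a number
    names = {}
    for name, value in pairs(rows[1]) do
      if type(value) == "number" then names[#names + 1] = name end
    end
    table.sort(names)
  end

  local labels = {}
  for i, row in ipairs(rows) do
    labels[i] = {}
    for j, name in ipairs(names) do
      labels[i][j] = row[name]
    end
  end
  return labels, names
end

local rows = {
  {Filename = "a.png", age = 31, weight = 70.5},
  {Filename = "b.png", age = 45, weight = 82.0},
}

-- Default: all numerical columns become labels
local labels, names = get_labels(rows)
print(table.concat(names, ", "))

-- Explicit selection of a single label column
local only_age, age_names = get_labels(rows, {"age"})
```

Returning the names next to the values lets the caller map each label column back to its source column.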

Example

A simple example of how this can be put together:

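local dataset = Dataframe("my.csv")
dataset:create_subsets()

local batch
local stop = false
while (not stop) do
  -- Assign to the outer `stop`: a `local` here would shadow it and the loop would never end
  batch, stop = dataset["/train"]:get_batch(100)
  -- loadTrainingImage is a user-supplied function that returns a tensor for one file
  local data, label = batch:
    to_tensor{load_data_fn = function(row) return loadTrainingImage(row["Filename"]) end}
  data = {data = data:cuda(),
          label = label:cuda()}
  -- Expose size() and numeric indexing so that the trainer can iterate over the batch
  function data:size() return self.data:size(1) end
  setmetatable(data,
      {__index = function(t, i)
          return {t.data[i], t.label[i]}
      end});
  local err = trainer:train(data)
end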