Dataframe's inner workings

This section explains how torch-dataframe stores its data. It is useful if you wish to contribute to the project.

Core data

The core data is saved in the self.dataset table. Each entry is a Dataseries object designated by a string name in the self.dataset table; all column names are trimmed. All columns are required to be of the same length, and missing elements, i.e. nil elements, are substituted with nan (0/0) values. As a helper element there is self.n_rows, which corresponds to the number of rows in the dataset. The column order is maintained by the table self.column_order.
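
For illustration, here is a minimal sketch of what these fields contain. It assumes the package is loaded with require 'Dataframe' and that the helper classes (Df_Dict etc.) are exposed globally, as in the project README; the column data is made up:

local Dataframe = require 'Dataframe'

local df = Dataframe.new()
df:load_table{data = Df_Dict{score = {1, 2, 0/0}, name = {"A", "B", "C"}}}

-- self.n_rows holds the number of rows
print(df.n_rows)                       -- 3
-- self.column_order is a plain Lua table with the column names in order
print(df.column_order[1], df.column_order[2])
-- self.dataset maps each (trimmed) column name to a Dataseries
print(torch.type(df.dataset["score"])) -- Dataseries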

Dataseries and columns

All the data is stored using the Dataseries class. The class uses tds.Vec for boolean and string variables and torch.*Tensor for all numerical data. All operations that can be performed on a column are available directly on the Dataseries. The Dataframe generally has a wrapper that takes the column name, retrieves the column using get_column and then calls the corresponding column function, i.e. df:count_na("my_col_name"), df:get_column("my_col_name"):count_na() and df["$my_col_name"]:count_na() all do the same thing.
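
The three equivalent call forms above as a runnable sketch (column name and data are made up):

local Dataframe = require 'Dataframe'

local df = Dataframe.new()
df:load_table{data = Df_Dict{my_col_name = {1, 0/0, 3}}}

-- All three calls count the missing (nan) values in the same column
print(df:count_na("my_col_name"))
print(df:get_column("my_col_name"):count_na())
print(df["$my_col_name"]:count_na())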

Categoricals

The information on the categorical variables is available in the categorical table within a Dataseries. The keys correspond to the categorical levels and the values to their positions, e.g. {["Male"] = 1, ["Female"] = 2}.
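
A minimal sketch, assuming as_categorical accepts a column name as in the package docs (the data is made up):

local Dataframe = require 'Dataframe'

local df = Dataframe.new()
df:load_table{data = Df_Dict{sex = {"Male", "Female", "Male"}}}
df:as_categorical('sex')

-- The level-to-position lookup described above, e.g. {["Male"] = 1, ["Female"] = 2}
local series = df:get_column('sex')
for level, pos in pairs(series.categorical) do
	print(level, pos)
end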

Df_Subset

The Df_Subset class is a thin wrapper around the Dataframe object. It contains only the indexes, and possibly a label column, that it uses to provide the samplers with the data they need. The core concept is that we want to be able to call get_batch on a subset and get a random sample from that set of elements.
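
A hedged sketch of the typical flow; it assumes the default split names from create_subsets and the "/subset" index form used in the project README, and the data is made up:

local Dataframe = require 'Dataframe'

local df = Dataframe.new()
df:load_table{data = Df_Dict{x = {1, 2, 3, 4, 5, 6},
                             y = {0, 1, 0, 1, 0, 1}}}

-- Split the rows into subsets; each Df_Subset stores only row indexes
df:create_subsets()

-- Retrieve the train subset and draw a random batch of two of its rows
local train = df["/train"]
local batch = train:get_batch(2)
print(torch.type(batch)) -- Batchframe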

Batchframe

The Batchframe class is returned by get_batch. Its purpose is to make the to_tensor conversion more convenient by allowing both a data tensor and a label tensor to be loaded per row.
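
Continuing the subset sketch from the previous section, a hedged example of the conversion. The data_columns/label_columns parameter names are assumptions made for illustration; check Batchframe:to_tensor's argcheck doc for the actual signature:

-- batch is the Batchframe from the Df_Subset sketch above
-- NOTE: the parameter names below are assumptions, used only for illustration
local data, labels = batch:to_tensor{
	data_columns  = Df_Array("x"),
	label_columns = Df_Array("y")
}
print(data:size(1), labels:size(1)) -- one data row and one label row per batch element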

Argcheck

The Torch Dataframe class uses argcheck in all its functions. The package tries to follow the recommended call structure; here's an example:

Dataframe.has_column = argcheck{
	doc = [[
<a name="Dataframe.has_column">
### Dataframe.has_column(@ARGP)

@ARGT

Checks if column is present in the dataset

_Return value_: boolean
]],
	{name="self", type="Dataframe"},
	{name="column_name", type="string", doc="The column to check"},
	call=function(self, column_name)
	for _,v in pairs(self.column_order) do
		if (v == column_name) then
			return true
		end
	end
	return false
end}

Note that each public function§ should have a doc argument explaining the function. The doc should have the following structure:

<a name="Dataframe.function_name">
### Dataframe.function_name(@ARGP)

@ARGT

_Return value_: void

§ There is no public/private distinction in a Torch class, but private functions are denoted with a leading _ and should not be included in the API docs.
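
As an illustration, a private helper (the function name below is hypothetical) follows the same argcheck pattern but uses a leading underscore and omits the markdown doc block:

local argcheck = require 'argcheck'
require 'Dataframe'

-- Hypothetical private helper: the leading underscore marks it as internal,
-- so it is left out of the generated API docs
Dataframe._assert_not_empty = argcheck{
	{name="self", type="Dataframe"},
	call=function(self)
	assert(self.n_rows > 0, "The Dataframe has no rows")
end}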

Overloading

The argcheck functionality allows the same function to have different parameter sets. This is done through overload, where it is important to note that the doc is appended to the original set of docs. See the example:

Dataframe.get_mode = argcheck{
	doc =  [[
<a name="Dataframe.get_mode">
### Dataframe.get_mode(@ARGP)

Gets the mode for a Dataseries. A mode is defined as the most frequent value.
Note that if two or more values are equally common then there are several modes.
The mode is useful as it can be viewed as an algorithm's most naive guess where
it always guesses the same value.

@ARGT

_Return value_: Table or Dataframe
]],
	{name="self", type="Dataframe"},
	{name='column_name', type='string', doc='column to inspect'},
	{name='normalize', type='boolean', default=false,
	 doc=[[
	 	If True then the object returned will contain the relative frequencies of
		the unique values.]]},
	{name='dropna', type='boolean', default=true,
	 doc="Don’t include counts of NaN (missing values)."},
	{name='as_dataframe', type='boolean', default=true,
	 doc="Return a dataframe"},
	call=function(self, column_name, normalize, dropna, as_dataframe)
	self:assert_has_column(column_name)

	return self:get_column(column_name):get_mode{
		normalize = normalize,
		dropna = dropna,
		as_dataframe = as_dataframe
	}
end}

Dataframe.get_mode = argcheck{
	doc =  [[

@ARGT

]],
	overload=Dataframe.get_mode,
	{name="self", type="Dataframe"},
	{name="columns", type="Df_Array", doc="The columns of interest", opt=true},
	{name='normalize', type='boolean', default=false,
	 doc=[[
	 	If True then the object returned will contain the relative frequencies of
		the unique values.]]},
	{name='dropna', type='boolean', default=true,
	 doc="Don’t include counts of NaN (missing values)."},
	{name='as_dataframe', type='boolean', default=true,
	 doc="Return a dataframe"},
	call=function(self, columns, normalize, dropna, as_dataframe)
	if (columns) then
		columns = columns.data
	else
		columns = self:get_numerical_colnames()
	end

	local modes = {}
	if (as_dataframe) then
		modes = Dataframe.new()
	end

	for i = 1,#columns do
		local cn = columns[i]
		local value =
			self:get_mode{column_name = cn,
			              normalize = normalize,
			              dropna = dropna,
			              as_dataframe = as_dataframe}
		if (as_dataframe) then
			value:add_column{
				column_name = 'Column',
				pos = 1,
				default_value = cn,
				type = "string"
			}
			modes:append(value)
		else
			modes[cn] = value
		end
	end

	return modes
end}

Helper classes

Due to limitations in the Lua language, the package uses helper classes to separate regular named-argument tables from tables that are themselves passed as argument values. The three classes are:

  • Df_Array - contains only values and no keys
  • Df_Dict - a dictionary table with named keys mapping to values. The values can be atomics or arrays.
  • Df_Tbl - a raw table wrapper that doesn't even copy the original data. Useful when you want speed.

In general, the concept is to pass the simplest possible argument.
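
For example, a minimal sketch of the intended usage (the column data is made up; get_mode is the function shown in the Overloading section above):

local Dataframe = require 'Dataframe'

local df = Dataframe.new()

-- Df_Dict wraps a key/value table so it is not mistaken for named arguments
df:load_table{data = Df_Dict{col1 = {1, 2, 2}, col2 = {"A", "B", "B"}}}

-- Df_Array wraps a plain list of values, here the columns of interest
local modes = df:get_mode{columns = Df_Array("col1")}

-- Df_Tbl would simply reference an existing table without copying it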

Technical explanation

The background is how Lua handles the a{} call syntax: it simply wraps up the table and passes it as the first argument. As there is no hint that this has happened, argcheck has a difficult time handling ambiguity such as this:

function a(b)
  print(b)
end

a("my_csv")
a({b="my_csv"})
a{b="my_csv"}

a{b={col1={1,2,3},
     col2={"A", "B", "C"}}}
a({col1={1,2,3},
   col2={"A", "B", "C"}})

Produces the output:

my_csv	
{
  b : "my_csv"
}
{
  b : "my_csv"
}
{
  b : 
    {
      col1 : 
        {
          1 : 1
          2 : 2
          3 : 3
        }
      col2 : 
        {
          1 : "A"
          2 : "B"
          3 : "C"
        }
    }
}
{
  col1 : 
    {
      1 : 1
      2 : 2
      3 : 3
    }
  col2 : 
    {
      1 : "A"
      2 : "B"
      3 : "C"
    }
}
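
By wrapping the table in a helper class the ambiguity disappears: a Df_Dict is a torch class with its own type, so an argcheck rule can match it explicitly instead of guessing whether the table holds named arguments. A hedged sketch with a made-up function name:

local argcheck = require 'argcheck'
require 'Dataframe' -- loads the Df_Dict helper class

-- Hypothetical function for illustration: one overload takes a csv path,
-- the other an in-memory table wrapped in a Df_Dict
local describe_source = argcheck{
	{name="csv_path", type="string"},
	call=function(csv_path)
	return "csv: " .. csv_path
end}

describe_source = argcheck{
	overload=describe_source,
	{name="data", type="Df_Dict"},
	call=function(data)
	return "in-memory table"
end}

print(describe_source("my_csv"))
print(describe_source{data = Df_Dict{col1 = {1, 2, 3}, col2 = {"A", "B", "C"}}})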