# Dataframe's inner workings
This section explains how torch-dataframe stores its data. It is useful if you wish to contribute to the project.
The core data is saved in the `self.dataset` table. Each entry is a `Dataseries` object, designated by a string name in the `self.dataset` table; all column names are trimmed. All columns are required to be of the same length, and missing elements, i.e. `nil` elements, are substituted with `nan` (`0/0`) values. As a helper there is `self.n_rows`, which corresponds to the number of rows in the dataset. The column order is maintained by the table `self.column_order`.
All the data is stored using the `Dataseries` class. The class uses `tds.Vec` for boolean and string variables and `torch.*Tensor` for all numerical data. All the operations that can be performed on a column are available directly on the `Dataseries`. The `Dataframe` generally has a wrapper that takes the column name, retrieves the column using `get_column` and then calls the corresponding column function, i.e. `df:count_na("my_col_name")`, `df:get_column("my_col_name"):count_na()` and `df["$my_col_name"]:count_na()` all do the same thing.
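A simplified, illustrative version of such a wrapper could look like the sketch below; the actual implementations also wrap the function with argcheck for argument validation:

```lua
-- Illustrative sketch of the delegation pattern: the Dataframe method
-- simply looks up the Dataseries and forwards the call to it.
function Dataframe:count_na(column_name)
	local series = self:get_column(column_name) -- retrieves the Dataseries
	return series:count_na()                    -- delegate to the column
end
```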
The information on the categorical variables is available in the `categorical` table within a `Dataseries`. The keys correspond to the categorical levels and the values to their positions, e.g. `{["Male"] = 1, ["Female"] = 2}`.
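For instance, a categorical column with the levels above would be stored along the following lines (illustrative sketch):

```lua
-- Illustrative: the levels are kept in the categorical table while the
-- actual data holds the numerical positions.
local categorical = {["Male"] = 1, ["Female"] = 2}

-- A stored value of 2 therefore translates back to "Female":
for level, pos in pairs(categorical) do
	if (pos == 2) then print(level) end
end
```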
The `Df_Subset` class is a thin wrapper around the `Dataframe` object. It contains only indexes, and possibly a label, that it uses in order to provide the samplers with the data they need. The core concept is that we want to be able to call `get_batch` on a subset and get a random sample from that set of elements.
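A typical flow could look like the sketch below; the function names follow the torch-dataframe API, but consult the API docs for the exact signatures and defaults:

```lua
local df = Dataframe("my_data.csv")
df:create_subsets()                   -- split into e.g. train/validate/test
local train = df:get_subset('train')  -- a Df_Subset: indexes (+ labels) only
local batch = train:get_batch(10)     -- random sample of 10 rows
```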
The `Batchframe` class is returned by the `get_batch` method. Its purpose is to make the `to_tensor` conversion more convenient by allowing loading of both a `data` tensor and a `label` tensor per row.
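A hedged sketch of what this enables; the retriever arguments shown here (`load_data_fn`, `load_label_fn`) are illustrative placeholders, so check the Batchframe docs for the exact parameter names:

```lua
-- Illustrative only: convert a batch into (data, label) tensors by
-- supplying per-row loader functions (hypothetical parameter names).
local data, label = batch:to_tensor{
	load_data_fn  = function(row) return torch.Tensor({row["weight"]}) end,
	load_label_fn = function(row) return torch.Tensor({row["target"]}) end
}
```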
The Torch Dataframe class uses argcheck in all its functions. The package tries to follow the recommended call structure; here is an example:
```lua
Dataframe.has_column = argcheck{
	doc = [[
<a name="Dataframe.has_column">
### Dataframe.has_column(@ARGP)

@ARGT

Checks if column is present in the dataset

_Return value_: boolean
]],
	{name="self", type="Dataframe"},
	{name="column_name", type="string", doc="The column to check"},
	call=function(self, column_name)
	for _,v in pairs(self.column_order) do
		if (v == column_name) then
			return true
		end
	end

	return false
end}
```
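One convenience of argcheck is that such a function accepts both positional and named arguments; a brief illustration (assuming a loaded dataframe `df` with a `score` column):

```lua
df:has_column("score")                -- positional call
df:has_column{column_name = "score"}  -- named call, same result
```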
Note that each public function§ should have a doc argument explaining the function. The argument explanation should have the structure:
```
<a name="Dataframe.function_name">
### Dataframe.function_name(@ARGP)

@ARGT

_Return value_: void
```
§ There is no public/private distinction in Torch classes, but private functions are denoted with a leading `_` and should not be included in the API docs.
The argcheck functionality allows the same function to have different parameters. This is done through `overload`, where it is important to note that the doc is appended to the original set of docs. See the example:
```lua
Dataframe.get_mode = argcheck{
	doc = [[
<a name="Dataframe.get_mode">
### Dataframe.get_mode(@ARGP)

Gets the mode for a Dataseries. A mode is defined as the most frequent value.
Note that if two or more values are equally common then there are several modes.
The mode is useful as it can be viewed as any algorithm's most naive guess where
it always guesses the same value.

@ARGT

_Return value_: Table or Dataframe
]],
	{name="self", type="Dataframe"},
	{name='column_name', type='string', doc='column to inspect'},
	{name='normalize', type='boolean', default=false,
	 doc=[[
	 	If True then the object returned will contain the relative frequencies of
		the unique values.]]},
	{name='dropna', type='boolean', default=true,
	 doc="Don't include counts of NaN (missing values)."},
	{name='as_dataframe', type='boolean', default=true,
	 doc="Return a dataframe"},
	call=function(self, column_name, normalize, dropna, as_dataframe)
	self:assert_has_column(column_name)

	return self:get_column(column_name):get_mode{
		normalize = normalize,
		dropna = dropna,
		as_dataframe = as_dataframe
	}
end}
```
```lua
Dataframe.get_mode = argcheck{
	doc = [[
@ARGT
]],
	overload=Dataframe.get_mode,
	{name="self", type="Dataframe"},
	{name="columns", type="Df_Array", doc="The columns of interest", opt=true},
	{name='normalize', type='boolean', default=false,
	 doc=[[
	 	If True then the object returned will contain the relative frequencies of
		the unique values.]]},
	{name='dropna', type='boolean', default=true,
	 doc="Don't include counts of NaN (missing values)."},
	{name='as_dataframe', type='boolean', default=true,
	 doc="Return a dataframe"},
	call=function(self, columns, normalize, dropna, as_dataframe)
	if (columns) then
		columns = columns.data
	else
		columns = self:get_numerical_colnames()
	end

	local modes = {}
	if (as_dataframe) then
		modes = Dataframe.new()
	end

	for i = 1,#columns do
		local cn = columns[i]
		local value =
			self:get_mode{column_name = cn,
			              normalize = normalize,
			              dropna = dropna,
			              as_dataframe = as_dataframe}
		if (as_dataframe) then
			value:add_column{
				column_name = 'Column',
				pos = 1,
				default_value = cn,
				type = "string"
			}
			modes:append(value)
		else
			modes[cn] = value
		end
	end

	return modes
end}
```
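With the overload in place the function can be invoked either per column or over several columns at once; an illustrative call (the column names are assumptions):

```lua
-- Single column: resolves to the first signature
local mode = df:get_mode{column_name = "score", as_dataframe = false}
-- Several columns via a Df_Array: resolves to the overloaded signature
local modes = df:get_mode{columns = Df_Array("score", "age")}
```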
Due to limitations in the Lua language the package uses helper classes for separating regular table arguments from tables passed in as arguments. The three classes are:

- `Df_Array` - contains only values and no keys
- `Df_Dict` - a dictionary table that has named keys that map to the values. The values can be atomics or arrays.
- `Df_Tbl` - a raw table wrapper that doesn't even copy the original data. Useful when you want speed.

In general the concept is to pass the simplest possible argument.
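An illustration of the three wrappers:

```lua
local arr  = Df_Array(1, 2, 3)           -- values only, no keys
local dict = Df_Dict{a = 1, b = {2, 3}}  -- named keys mapping to values
local tbl  = Df_Tbl({x = 1})             -- raw wrapper, no copy of the data
```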
The background is how Lua handles `{}` arguments: a table literal passed without parentheses is simply wrapped up and passed as the function's first argument. As there is no hint that this has happened, argcheck has a difficult time handling ambiguity, as in this example:
```lua
function a(b)
	print(b)
end

a("my_csv")
a({b="my_csv"})
a{b="my_csv"}
a{b={col1={1,2,3},
    col2={"A", "B", "C"}}}
a({col1={1,2,3},
   col2={"A", "B", "C"}})
```
Produces the output:
```
my_csv
{
  b : "my_csv"
}
{
  b : "my_csv"
}
{
  b :
    {
      col1 :
        {
          1 : 1
          2 : 2
          3 : 3
        }
      col2 :
        {
          1 : "A"
          2 : "B"
          3 : "C"
        }
    }
}
{
  col1 :
    {
      1 : 1
      2 : 2
      3 : 3
    }
  col2 :
    {
      1 : "A"
      2 : "B"
      3 : "C"
    }
}
```