Add first implementation of augmentedNN to predict selectivity #1473
Conversation
Force-pushed from 35692f3 to 8c6ec93
I went through the code before the tests. The high-level structure looks reasonable to me. I think we still need more comments and documentation to get a better understanding of the code.
For the questions I have, please directly add comments to the code if possible. I'll take a second pass along with the tests part after they're addressed.
float learn_rate, int batch_size, int epochs);
/**
 * Train the Tensorflow model
 * @param mat: Contiguous time-series data
time-series data? Looks like you need to update the documentation here.
using TfFloatIn = TfSessionEntityInput<float>;
using TfFloatOut = TfSessionEntityOutput<float>;

class AugmentedNN : public BaseTFModel {
We need higher-level documentation on what this thing is doing. That is, from a DBMS perspective, what is this AugmentedNN modeling? What is it trying to predict? What is the input data, and what is the output data?
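For illustration only, pulling from the PR description at the bottom of this thread (range-predicate selectivity, bounds as input, a selectivity estimate as output), a possible shape for such a header comment could be:

```cpp
/**
 * AugmentedNN: a neural-network model used by the brain to estimate the
 * selectivity of range predicates, i.e. queries of the form
 *   SELECT * FROM table WHERE c1 >= l1 AND c1 <= u1 AND ...
 * Input:  one row per query, holding the predicate bounds [l1, u1, ...].
 * Output: the predicted selectivity (fraction of tuples satisfying the
 *         predicates) for each input row.
 */
class AugmentedNN : public BaseTFModel {
```

The exact wording is up to the author; the point is to state what is being modeled, what goes in, and what comes out.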
 * However instead of applying backprop it obtains predicted values.
 * Then the validation loss is calculated for the relevant sequence
 * - this is a function of segment and horizon.
 */
Same issue with this function. Looks like the comments are outdated.
// Function to generate the args string to feed the python model
std::string ConstructModelArgsString() const;
// Attributes needed for the Seq2Seq LSTM model(set by the user/settings.json)
int ncol_;
Don't use the abbreviation in variable names. Instead, use column_number_, or at least column_num_ if n means number here.
// Attributes needed for the Seq2Seq LSTM model(set by the user/settings.json)
int ncol_;
int order_;
int nneuron_;
Same here. And we need comments on what these member variables are.
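To make the request concrete, here is a hypothetical sketch of commented declarations. The meanings are my guesses (number of input columns, jump-activation order, hidden-layer width) and need to be confirmed or corrected by the author:

```cpp
// Number of columns in the input matrix (guess: the predicate bounds
// per query); renamed from ncol_ per the comment above.
int column_num_;
// Order used by the jump activation in the hidden layers (assumed).
int order_;
// Number of neurons per hidden layer (assumed); renamed from nneuron_.
int neuron_num_;
```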
float ValidateEpoch(const matrix_eig &mat);

void Fit(const matrix_eig &X, const matrix_eig &y, int bsz) override;
matrix_eig Predict(const matrix_eig &X, int bsz) const override;
I think it's better to comment these functions as well. What are they doing? What exactly are the matrices in the function arguments? I think sometimes the matrix is the input, sometimes the output, and sometimes both.
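As a non-authoritative example of what such comments could look like, using only the behaviour stated in the PR description below (Fit() runs backpropagation on labelled data, Predict() only produces outputs):

```cpp
/**
 * Train the model by running backpropagation on labelled data.
 * @param X   input matrix, one training sample per row
 * @param y   target output for each row of X (here, the true selectivities)
 * @param bsz mini-batch size
 */
void Fit(const matrix_eig &X, const matrix_eig &y, int bsz) override;

/**
 * Run inference only; no weights are updated.
 * @param X   input matrix, one query per row
 * @param bsz mini-batch size
 * @return    predicted output for each row of X
 */
matrix_eig Predict(const matrix_eig &X, int bsz) const override;
```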
src/brain/modelgen/AugmentedNN.py (Outdated)
self.optimize

@staticmethod
def jumpActivation(k):
We should use the same naming convention. Something like jump_activation() for functions.
src/brain/modelgen/AugmentedNN.py (Outdated)
def jumpActivation(k):
    def jumpActivationk(x):
        return tf.pow(tf.maximum(0.0, 1-tf.exp(-x)), k)
    return jumpActivationk
After all, what is this jump activation thing with different orders on each layer? Is there any reference? Or can you explain the intuition behind it?
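For discussion, restating the Python above in math form (nothing new, just the same function):

$$f_k(x) = \bigl(\max(0,\ 1 - e^{-x})\bigr)^k$$

Larger k pushes small positive inputs toward 0 while large inputs stay near 1, so presumably the "jump" refers to the increasingly step-like shape, but a reference or a sentence of intuition in the code would still help.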
Thanks for adding the documentation! I have a few more comments.
I think it's better to create another selectivity folder under the brain directory (for both the header and source directories) and put your new stuff there. For the tests, you should put the tests for the new model in a different file, and put TestingAugmentedNNUtil in a file called testing_augmented_nn_util.h(cpp).
Also, I prefer using all lowercase augmented_nn in the file names instead of augmentedNN, but that's minor.
test/brain/testing_forecast_util.cpp (Outdated)
size_t split_point =
    data.rows() - static_cast<size_t>(data.rows() * val_split);

// Split into train/test data
This is splitting into train/validate data, and the test data is separate, right?
Right. I'll update the comment here.
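For reference, a minimal sketch of the split being discussed, assuming matrix_eig is an Eigen float matrix typedef (this is illustrative, not the PR's exact code):

```cpp
#include <Eigen/Dense>

using matrix_eig =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Split `data` row-wise into a training part and a validation part;
// `val_split` is the fraction of rows held out for validation.
void TrainValidateSplit(const matrix_eig &data, float val_split,
                        matrix_eig &train, matrix_eig &validate) {
  const Eigen::Index split_point =
      data.rows() - static_cast<Eigen::Index>(data.rows() * val_split);
  train = data.topRows(split_point);
  validate = data.bottomRows(data.rows() - split_point);
}
```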
test/brain/model_test.cpp (Outdated)
}

TEST_F(ModelTests, DISABLED_AugmentedNNSkewedTest) {
I know we probably cannot enable them in the CI, but have you run these tests locally?
Yes, I've tested them.
 * SELECT * FROM table WHERE c1 >= l1 AND c1 <= u1
 * AND c2 >= l2 AND c2 <= u2
 * AND ...
 * Input is [l1, u1, l2, u2, ...]
From the tests, it looks like right now we're doing only one pair of predicates, just [l1, u1], right?
Right.
Force-pushed from f54b380 to a716df1
It looks like you have a duplicated Python script. Other than that it looks good to me.
Force-pushed from 1457242 to 813321d
…b#1473)
* add first implementation of augmentedNN to predict selectivity for range predicates
* add first implementation of augmentedNN to predict selectivity
* add first implementation of augmentedNN to predict selectivity
* add comments and modify variable names
* rename some variables
* create brain/selectivity; create new test file for augmented_nn.
* remove duplicated files
* check if travis is ok
The model is an initial implementation to predict selectivity for range predicates. It can be applied to queries like:
SELECT * FROM table WHERE c >= l AND c <= u.
I implemented the model in augmentedNN.py and the C++ wrapper code in augmentedNN.cpp, taking LSTM.py and LSTM.cpp as a reference.
Hyperparameters, especially the number of training epochs, need to be discussed based on real system experiments.
Test cases for the model are also added. The test cases include a uniform-distribution dataset and a skewed-distribution dataset.
There are two classes defined.

AugmentedNN (in augmentedNN.cpp). This class is just like class TimeSeriesLSTM.
* Fit(): applies backpropagation.
* Predict(): returns the predictions for the input.
* TrainEpoch(): trains for one epoch.
* ValidateEpoch(): uses one epoch for validation.

TestingAugmentedNNUtil (in testing_forecast_util.cpp)
* GetData(): generates data for training and testing. The dataset has either a uniform or a skewed distribution.
* Test(): calls the APIs mentioned above to train and test the model.

Btw, in testing_forecast_util.cpp, the argument of matrix_eig::bottomRows was wrong. It should be the number of rows counted from the bottom of the matrix_eig. I've modified it. Please check if I am right.
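To spell out the bottomRows point (restating Eigen's documented behaviour, with `data` and `k` as placeholders):

```cpp
// Eigen's bottomRows(n) takes a row *count* measured from the bottom,
// not a starting row index.
matrix_eig last_k = data.bottomRows(k);
// Equivalent block() expression, starting at row rows() - k:
matrix_eig same = data.block(data.rows() - k, 0, k, data.cols());
```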