For basic understanding of the package, take a look at "intro" and "method" first.
Three classes are currently available: DenseGrowNetBase, ElasticNetLoss, and CustomLinearLayer.
Example DenseGrowNetBase initialization:
from dense_grownet import DenseGrowNetBase
model = DenseGrowNetBase(input_size, output_size, is_first_model=False)
# is_first_model=True will initialize a 1-layer neural network, which is essentially a generalized linear model or if False, will initialize
# a 2-layer neural network with ReLU activation, where the first layer is 'looks-linear' initialized and the second layer is zero-initialized
# Any model initialized with is_first_model=false will require passing in prev_output, or previous models' predictions
predictions = model(inputs, prev_output)
# To extract intermediate features generated by the first layer of a non-first model for concatenation to the next model's inputs, do
extracted_features = model.extract_features(inputs, prev_output)
Example ElasticNetLoss initialization:
import torch.nn as nn
from dense_grownet import ElasticNetLoss
criterion = ElasticNetLoss(criterion=nn.CrossEntropyLoss(), l1_lambda=0.01, l2_lambda=0.01)
# Loss is calculated as criterion_loss + l1_lambda * weights_l1_norm + l2_lambda * weights_l2_norm, with criterion as any desired loss function
Example CustomLinearLayer initialization:
from dense_grownet import CustomLinearLayer
linear_layer_2 = CustomLinearLayer(input_size, output_size, init="looks_linear")
# Two init mode supported: "zero" and "looks-linear". Look at method section for example of "looks-linear" initialization
Tree-based models and, particularly, gradient-boosted decision trees (GDBTs) have long been and continue to be state-of-the-art on medium-sized tabular data, despite extensive deep-learning research on such data types. Certain inductive biases of tree-based models, such as their rotationally variant learning procedure, which extracts information based on features' orientation, and their robustness against uninformative features, contribute to their strong performance on tabular data, in contrast to MLPs' rotationally invariant learning procedure and susceptibility to uninformative features (1). Gradient boosting's inductive bias can be explained as the bias towards explaining the largest proportion of variance through simpler interaction terms, with the contribution to variance decreasing as the order of interaction increases, rather than a large amount of high-order interaction terms, each explaining a small amount of variance. See more explanation at (2). This project aims to improve neural networks' performance on tabular data through a focus on investigating and applying the inductive biases of tree-based models, particularly, GDBTs, on MLP-like neural nets.
A gradient boosting technique for neural networks is proposed, with superior performance on tabular data and applicability to other forms of data and neural networks. The first, base model is a one-layer zero-initialized neural network, which is a generalized linear model. A dedicated solver can be used for this step as we are only interested in the raw predictions of the first model. The second model, with the first layer "looks linear" initialized (3), the second layer zero-initialized, and the activation function belonging to the ReLU-family function, is then trained to predict the residuals or correct the error of the previous model. 'looks linear' initialization can be described as initializing in the following pattern: [[1 0 ... 0 0], [-1 0 ... 0 0], [0 1 ... 0 0], [0 -1 ... 0 0], ..., [0 0 ... 1 0], [0 0 ... -1 0], [0 0 ... 0 1], [0 0 ... 0 -1]], where there are 2N neurons for N features, and the second, final layer can easily replicate linear inputs as ReLU(0, x) - ReLU(0, -x) = x. For further explanation of 'looks linear' initialization, see (4). For each model after the second model, use the original features and the intermediate features generated by all previous models as input features for the current model. Initialize the next k models in a similar fashion as the second model. For further clarification, see the diagram above. Adjust regularization, learning rate, and epoch as appropriate.
The 'looks linear' initialization preserves the original orientation of features at initialization. This alleviates the downside of a rotationally invariant learning procedure, where "intuitively, to remove uninformative features, a rotationally invariant algorithm has to first find the original orientation of the features, and then select the least informative ones: the information contained in the orientation of the data is lost" (1). Moreover, this achieves dynamical isometry, where the singular values of the network's input-output Jacobian concentrate near 1, which "has been shown to dramatically speed up learning", avoid vanishing/exploding gradients, and appears to improve generalization performance (5). Gradient boosting for neural networks where each model is only trained on the original features proved infeasible to train as training loss decreases much more slowly with each subsequent model. To address this, intermediate features from previous models were used along with the original features as input for subsequent models, which significantly improved training speed. The reuse of intermediate features appears to offer slightly greater generalization performance compared to training each model solely on the original features based on a preliminary experiment.
Elastic net regularization, which is L1 + L2 regularization, should be included as part of the training procedure, where both L1 and L2 regularization promote smaller weights. L1 regularization corresponds to the assumption of the Laplace distribution of weights, where weights tend to be sparse and are penalized proportionally based on their sizes. L2 regularization corresponds to the normal distribution of weights, where weights are penalized quadratically based on their sizes. It is known that L1 regularization is rotationally invariant and logistic regression with L1 regularization is robust against uninformative features, where sample complexity grows only logarithmically with the number of irrelevant features (6). One can, then, see that L1 regularization is also desirable to include as part of the learning procedure, as it mirrors the inductive biases that contribute to tree-based models' strong performance on tabular data.
Softplus appears to improve generalization and trainability, provided that the curvature of softplus is appropriately small, eg. softplus(4 * x) / 4. This is because softplus has a tendency to mimic a linear or identity activation. If most of your values are concentrated at a very small range around 0, say from -0.2 to 0.2, softplus will look more like a linear or identity function rather than a piecewise linear function like ReLU, see example below.
Contact: nhatbui@tamu.edu (would be great if someone is looking to discuss, collaborate, or act as a mentor on this research project :D )
Cool read:
- https://stats.stackexchange.com/questions/339054/what-values-should-initial-weights-for-a-relu-network-be
- https://proceedings.neurips.cc/paper/2020/file/f3f27a324736617f20abbf2ffd806f6d-Paper.pdf
- https://www.fast.ai/posts/2018-07-02-adam-weight-decay.html
- https://sebastianraschka.com/blog/2022/deep-learning-for-tabular-data.html
- https://mindfulmodeler.substack.com/p/inductive-biases-of-the-random-forest
- https://sebastianraschka.com/blog/2022/deep-learning-for-tabular-data.html
- https://openreview.net/forum?id=UgBo_nhiHl (not the same thing as what I'm doing, this was discovered after)
Keywords spam: GrowNet, DenseNet, ResNet, neural networks, deep learning, dynamical isometry, 'looks linear' initialization, ReLU, Softplus, activation function, gradient boosting, inductive bias, decision trees, regularization, L1, L2, elastic net
To do:
- Create an object-oriented library, publish it to PyPI, and run benchmarks
- Experiment with string length distance regularization
- Smooth versions of ReLUs and their advantages in README need to be included. This includes improved robustness (7) and smoother optimization landscapes (8) with little accuracy trade-off.
- Look at deep learning significance testing. https://deep-significance.readthedocs.io/en/latest
- Check out the convex optimization of a two-layer ReLU neural network.
- Experiment with adaptive softplus and exponentially increasing step size, learning rate, or exponentially growing gradient descent. https://arxiv.org/abs/2205.07999
-
Why do tree-based models still outperform deep learning on typical tabular data?
-
Tim Goodman's explanation of gradient boosting's inductive bias
-
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
https://proceedings.mlr.press/v70/balduzzi17b/balduzzi17b.pdf
-
"looks-linear" initialization explanation
https://www.reddit.com/r/MachineLearning/comments/5yo30r/comment/desyjot/
-
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
-
Feature selection, L1 vs. L2 regularization, and rotational invariance
-
Smooth Adversarial Training
-
Reproducibility in Deep Learning and Smooth Activations
https://research.google/blog/reproducibility-in-deep-learning-and-smooth-activations/