
Research on tabular data deep learning, primarily through mirroring tree-based models' inductive biases, in particular those of gradient-boosted decision trees (GBDTs)


cats256/DenseGrowNet



image

Quick Start

For a basic understanding of the package, read the Intro and Method sections first.

Three classes are currently available: DenseGrowNetBase, ElasticNetLoss, and CustomLinearLayer.

Example DenseGrowNetBase initialization:

from dense_grownet import DenseGrowNetBase

# is_first_model=True initializes a 1-layer neural network, which is essentially a generalized linear model.
# is_first_model=False initializes a 2-layer neural network with ReLU activation, where the first layer is
# 'looks-linear' initialized and the second layer is zero-initialized.
model = DenseGrowNetBase(input_size, output_size, is_first_model=False)

# Any model initialized with is_first_model=False requires prev_output, the previous models' predictions
predictions = model(inputs, prev_output)

# Extract the intermediate features generated by the first layer of a non-first model, for concatenation to the next model's inputs
extracted_features = model.extract_features(inputs, prev_output)

Example ElasticNetLoss initialization:

import torch.nn as nn
from dense_grownet import ElasticNetLoss

criterion = ElasticNetLoss(criterion=nn.CrossEntropyLoss(), l1_lambda=0.01, l2_lambda=0.01)
# Loss is calculated as criterion_loss + l1_lambda * weights_l1_norm + l2_lambda * weights_l2_norm, with criterion as any desired loss function
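As a worked example of the formula in the comment above, here is a minimal standard-library sketch of the arithmetic. It is not the package's implementation: `elastic_net_penalty` and `elastic_net_loss` are hypothetical names, the weights are a flat list, and the L2 term is taken literally as the (unsquared) L2 norm. Many elastic net formulations use the squared norm instead, and the text above does not say which one ElasticNetLoss uses.

```python
import math


def elastic_net_penalty(weights, l1_lambda, l2_lambda):
    """Return l1_lambda * ||w||_1 + l2_lambda * ||w||_2 for a flat list of weights.

    Assumption: the L2 term is the unsquared Euclidean norm.
    """
    l1_norm = sum(abs(w) for w in weights)
    l2_norm = math.sqrt(sum(w * w for w in weights))
    return l1_lambda * l1_norm + l2_lambda * l2_norm


def elastic_net_loss(criterion_loss, weights, l1_lambda=0.01, l2_lambda=0.01):
    """Add the elastic net penalty to an already-computed criterion loss."""
    return criterion_loss + elastic_net_penalty(weights, l1_lambda, l2_lambda)


# Illustrative values: ||w||_1 = 7 and ||w||_2 = 5, so the penalty is 0.07 + 0.05.
weights = [3.0, -4.0, 0.0]
total = elastic_net_loss(0.5, weights)
print(total)
```

With these illustrative values the penalty adds 0.12 to a criterion loss of 0.5.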

Example CustomLinearLayer initialization:

from dense_grownet import CustomLinearLayer

linear_layer_2 = CustomLinearLayer(input_size, output_size, init="looks_linear")
# Two init modes are supported: "zero" and "looks-linear". See the Method section for an example of "looks-linear" initialization

Intro

Tree-based models, particularly gradient-boosted decision trees (GBDTs), have long been and remain state-of-the-art on medium-sized tabular data, despite extensive deep-learning research on this data type. Certain inductive biases of tree-based models contribute to their strong performance on tabular data: their learning procedure is not rotationally invariant, so it extracts information from the features' orientation, and they are robust to uninformative features. MLPs, in contrast, have a rotationally invariant learning procedure and are susceptible to uninformative features (1). Gradient boosting's inductive bias can be described as a bias towards explaining the largest proportion of variance through simpler interaction terms, with the contribution to variance decreasing as the order of interaction increases, rather than through a large number of high-order interaction terms that each explain a small amount of variance; see (2) for more explanation. This project aims to improve neural networks' performance on tabular data by investigating the inductive biases of tree-based models, particularly GBDTs, and applying them to MLP-like neural nets.

Method

A gradient boosting technique for neural networks is proposed, with superior performance on tabular data and applicability to other forms of data and neural networks. The first, base model is a one-layer zero-initialized neural network, which is a generalized linear model. A dedicated solver can be used for this step, as we are only interested in the raw predictions of the first model. The second model is then trained to predict the residuals, that is, to correct the error, of the previous model. Its first layer is 'looks linear' initialized (3), its second layer is zero-initialized, and its activation function belongs to the ReLU family.

'Looks linear' initialization sets the first-layer weights in the following pattern: [[1 0 ... 0 0], [-1 0 ... 0 0], [0 1 ... 0 0], [0 -1 ... 0 0], ..., [0 0 ... 1 0], [0 0 ... -1 0], [0 0 ... 0 1], [0 0 ... 0 -1]], where there are 2N neurons for N features. The second, final layer can then easily replicate linear inputs, since ReLU(x) - ReLU(-x) = x. For further explanation of 'looks linear' initialization, see (4).

For each model after the second, use the original features and the intermediate features generated by all previous models as input features for the current model. Initialize the next k models in the same fashion as the second model. For further clarification, see the diagram above. Adjust regularization, learning rate, and number of epochs as appropriate.
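The 'looks linear' pattern and the identity ReLU(x) - ReLU(-x) = x can be checked numerically. The following is a minimal standard-library sketch, not the package's CustomLinearLayer: `looks_linear_weights`, `matvec`, and `relu` are hypothetical helpers written only for this illustration, and the input vector is arbitrary.

```python
def looks_linear_weights(n_features):
    """Build the 2N x N pattern: rows come in (+e_i, -e_i) pairs, one pair per feature."""
    rows = []
    for i in range(n_features):
        for sign in (1.0, -1.0):
            row = [0.0] * n_features
            row[i] = sign
            rows.append(row)
    return rows


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    """Apply ReLU elementwise."""
    return [max(0.0, v) for v in vector]


x = [0.5, -2.0, 3.0]
w1 = looks_linear_weights(len(x))   # 6 neurons for 3 features
hidden = relu(matvec(w1, x))        # pairs (ReLU(x_i), ReLU(-x_i))

# Each feature is recovered as ReLU(x_i) - ReLU(-x_i).
recovered = [hidden[2 * i] - hidden[2 * i + 1] for i in range(len(x))]
print(hidden)
print(recovered)
```

Each input survives in exactly one neuron of its pair, so a following linear layer with weights +1 and -1 on that pair reproduces the input unchanged.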

Explanation

The 'looks linear' initialization preserves the original orientation of the features at initialization. This alleviates the downside of a rotationally invariant learning procedure, where "intuitively, to remove uninformative features, a rotationally invariant algorithm has to first find the original orientation of the features, and then select the least informative ones: the information contained in the orientation of the data is lost" (1). Moreover, it achieves dynamical isometry, where the singular values of the network's input-output Jacobian concentrate near 1, which "has been shown to dramatically speed up learning", avoids vanishing/exploding gradients, and appears to improve generalization performance (5).

Gradient boosting for neural networks in which each model is trained only on the original features proved infeasible, as the training loss decreases much more slowly with each subsequent model. To address this, intermediate features from previous models were used along with the original features as input for subsequent models, which significantly improved training speed. Based on a preliminary experiment, reusing intermediate features also appears to offer slightly better generalization than training each model solely on the original features.
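The feature reuse described above can be sketched at initialization time, with training omitted. This is a hedged standard-library illustration rather than the DenseGrowNetBase code. It assumes that each non-first model adds its correction to the previous prediction, that every non-first model uses a 'looks linear' first layer with twice as many neurons as inputs, and that intermediate features are the first-layer activations. All helper names are hypothetical.

```python
def looks_linear_weights(n_inputs):
    """2N x N pattern of (+e_i, -e_i) row pairs."""
    rows = []
    for i in range(n_inputs):
        for sign in (1.0, -1.0):
            row = [0.0] * n_inputs
            row[i] = sign
            rows.append(row)
    return rows


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def hidden_features(inputs):
    """First-layer activations of a non-first model at initialization."""
    return relu(matvec(looks_linear_weights(len(inputs)), inputs))


def correction(hidden, n_outputs=1):
    """Zero-initialized second layer: the correction starts at exactly 0."""
    w2 = [[0.0] * len(hidden) for _ in range(n_outputs)]
    return matvec(w2, hidden)


x = [0.5, -2.0]        # original features (arbitrary values)
prev_output = [0.3]    # stand-in for the first (linear) model's prediction

model_inputs = list(x)
input_widths = []
for _ in range(3):     # models 2, 3 and 4
    input_widths.append(len(model_inputs))
    h = hidden_features(model_inputs)
    delta = correction(h)
    # Assumed additive update; training of this model would happen here.
    prev_output = [p + d for p, d in zip(prev_output, delta)]
    # The next model sees the original features plus all intermediate features so far.
    model_inputs = model_inputs + h

print(input_widths)
print(prev_output)
```

Because the second layer starts at zero, adding an untrained model leaves the ensemble prediction unchanged. Under the width assumption made here, the input size triples with each added model, which is worth keeping in mind when choosing k.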

Additional

Elastic net regularization, which combines L1 and L2 regularization, should be included in the training procedure; both penalties promote smaller weights. L1 regularization corresponds to a Laplace prior on the weights: weights tend to be sparse and are penalized in proportion to their absolute values. L2 regularization corresponds to a normal (Gaussian) prior on the weights: weights are penalized quadratically in their size. It is known that L1 regularization is not rotationally invariant, and that logistic regression with L1 regularization is robust against uninformative features, with sample complexity growing only logarithmically in the number of irrelevant features (6). L1 regularization is therefore desirable as part of the learning procedure, as it mirrors the inductive biases that contribute to tree-based models' strong performance on tabular data.
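The difference in how the two penalties react to rotation can be seen on a two-dimensional weight vector. The sketch below uses only the standard library. The vector and the 45-degree angle are arbitrary illustrative choices, and the helper names are hypothetical. It illustrates the norms themselves, not the full sample-complexity argument in (6).

```python
import math


def rotate(vector, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * vector[0] - s * vector[1], s * vector[0] + c * vector[1]]


def l1_norm(vector):
    return sum(abs(v) for v in vector)


def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))


sparse = [1.0, 0.0]                    # axis-aligned weight vector using one feature
rotated = rotate(sparse, math.pi / 4)  # same length, spread over both features

print(l2_norm(sparse), l2_norm(rotated))  # the L2 penalty is unchanged by the rotation
print(l1_norm(sparse), l1_norm(rotated))  # the L1 penalty grows for the rotated vector
```

The L2 norm cannot tell the two vectors apart, while the L1 norm is smaller for the axis-aligned one. That preference for axis-aligned, sparse weights is what breaks rotational invariance.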

Additional Notes

Softplus appears to improve generalization and trainability, provided that the curved region of softplus is kept appropriately narrow, e.g. softplus(4 * x) / 4. This is because plain softplus has a tendency to mimic a linear or identity activation: if most of your values are concentrated in a very small range around 0, say from -0.2 to 0.2, softplus will look more like a linear function than a piecewise linear function like ReLU. See the example below.

image
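The same comparison can be made numerically. This is a minimal standard-library sketch: the sample points in [-0.2, 0.2] follow the range mentioned above, while the function names and the two gap measures are assumptions made for this illustration only.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x, beta=1.0):
    """Scaled softplus: softplus(beta * x) / beta."""
    return math.log1p(math.exp(beta * x)) / beta


def tangent_at_zero(x, beta=1.0):
    """Tangent line of the scaled softplus at x = 0 (value log(2) / beta, slope 1/2)."""
    return math.log(2.0) / beta + 0.5 * x


xs = [-0.2, -0.1, 0.0, 0.1, 0.2]

# Largest distance to ReLU on the sample points.
gap_to_relu_plain = max(abs(softplus(x) - relu(x)) for x in xs)
gap_to_relu_sharp = max(abs(softplus(x, beta=4.0) - relu(x)) for x in xs)

# Largest distance to a straight line on the same points.
gap_to_line_plain = max(abs(softplus(x) - tangent_at_zero(x)) for x in xs)
gap_to_line_sharp = max(abs(softplus(x, 4.0) - tangent_at_zero(x, 4.0)) for x in xs)

print(gap_to_relu_plain, gap_to_relu_sharp)
print(gap_to_line_plain, gap_to_line_sharp)
```

On this range, plain softplus stays much closer to a straight line than the scaled version does, while softplus(4 * x) / 4 stays closer to ReLU. In PyTorch, the scaled form corresponds to torch.nn.Softplus(beta=4).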

Contact: nhatbui@tamu.edu (would be great if someone is looking to discuss, collaborate, or act as a mentor on this research project :D )

Cool read:

Keywords spam: GrowNet, DenseNet, ResNet, neural networks, deep learning, dynamical isometry, 'looks linear' initialization, ReLU, Softplus, activation function, gradient boosting, inductive bias, decision trees, regularization, L1, L2, elastic net

To do:

References:

  1. Why do tree-based models still outperform deep learning on typical tabular data?

    https://openreview.net/pdf?id=Fp7__phQszn

  2. Tim Goodman's explanation of gradient boosting's inductive bias

    https://stats.stackexchange.com/questions/173390/gradient-boosting-tree-vs-random-forest#comment945015_174020

  3. The Shattered Gradients Problem: If resnets are the answer, then what is the question?

    https://proceedings.mlr.press/v70/balduzzi17b/balduzzi17b.pdf

  4. "looks-linear" initialization explanation

    https://www.reddit.com/r/MachineLearning/comments/5yo30r/comment/desyjot/

  5. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

    https://arxiv.org/pdf/1711.04735

  6. Feature selection, L1 vs. L2 regularization, and rotational invariance

    https://icml.cc/Conferences/2004/proceedings/papers/354.pdf

  7. Smooth Adversarial Training

    https://arxiv.org/abs/2006.14536

  8. Reproducibility in Deep Learning and Smooth Activations

    https://research.google/blog/reproducibility-in-deep-learning-and-smooth-activations/
