add `NaiveInfluenceFunction` (#1186) #1214

vivekmig · 2023-11-29T22:49:08Z

Summary:

Overview

This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper "Understanding Black-box Predictions via Influence Functions".

NaiveInfluenceFunction: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. "Learning Augmentation Network via Influence Functions", "Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics", "Achieving Fairness at No Utility Cost via Data Reweighting with Influence"
ArnoldiInfluenceFunction: This is a computationally efficient implementation described in the paper "Scaling Up Influence Functions" by Schioppa et al. These slides give a brief summary of it.

This diff is rebased on top of D41324297, which implements the new API.

Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here.

What is the "infinitesimal" influence score

More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by \nabla_\theta L(x)' H^{-1} \nabla_\theta L(z), where \nabla_\theta L(x) is the gradient of the loss, considering only training example x with respect to (a subset of) model parameters \theta, \nabla_\theta L(z) is the analogous quantity for a test example z, and H is the Hessian of the (subset of) model parameters at a given model checkpoint.

What the two implementations have in common

Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix R such that H^{-1} \approx RR', where k is small. In particular, let L be the matrix of width k whose columns contain the top-k eigenvectors of H, and let V be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let R=LV^{-1}L'. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors.
This approximation is useful for several reasons:

It avoids numerical issues associated with inverting small eigenvalues
Since the influence score is given by \nabla_\theta L(x)' H^{-1} \nabla_\theta L(z), which is approximated by (\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R), we can compute an "influence embedding" for a given example x, \nabla_\theta L(x)' R, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional.
Even for large models, we can store R in memory, provided k is small. This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute \nabla_\theta L(x) and then multiplying by R'. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples.

The implementations differ in how they compute the top-k eigenvalues / eigenvectors.

How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors

It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the _set_projections_naive_influence_function method for more details.

How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors

The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors.

This implementation does incur some one-time overhead in __init__, where it runs the Arnoldi iteration to calculate R. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example.

Unlike NaiveInfluenceFunction, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (torch.autograd.functional.hvp) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains.

High-level organization of the two implementations

Because of the common logic of the two implementations, they share the same high-level organization.

Both implementations accept a hessian_dataset initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by hessian_dataset.
in __init__, NaiveInfluenceFunction and ArnoldiInfluenceFunction both compute R using private helper methods _set_projections_naive_influence_function and _set_projections_arnoldi_influence_function, respectively.
R is used by their respective compute_intermediate_quantities methods to compute influence embeddings.
Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the _influence and self_influence methods for both implementations call the _influence_helper_intermediate_quantities_influence_function and _self_influence_helper_intermediate_quantities_influence_function helper functions, which both assume the implementation implements the compute_intermediate_quantities method.

Reason for inheritance structure

InfluenceFunctionBase refers to any implementation that computes the "infinitesimal" influence score (as opposed to TracInCPBase, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. IntermediateQuantitiesInfluenceFunction refers to implementations of InfluenceFunctionBase that implement the compute_intermediate_quantities method. The reason we don't let NaiveInfluenceFunction and ArnoldiInfluenceFunction directly inherit from InfluenceFunctionBase is that their implementations of influence and self_influence are actually identical (though for logging reasons, we cannot just move those methods into IntermediateQuantitiesInfluenceFunction). In the future, there may be implementations of InfluenceFunctionBase that do not inherit from IntermediateQuantitiesInfluenceFunction, i.e. the LISSA approach of Koh et al.

Key helper methods

captum._utils._stateless.functional_call is copy pasted from Pytorch 13.0 implementation so that the user does not need to use the latest Pytorch version, and turns a Pytorch module into a function whose inputs are the parameters of the module (represented as a dictionary). This function is used to compute the Hessian in NaiveInfluenceFunction, and Hessian-vector products in ArnoldiInfluenceFunction.
_compute_dataset_func is used by NaiveInfluenceFunction to compute the Hessian over hessian_dataset. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that torch.autograd.functional.hessian, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, flattened into a 1D tensor (and a batch). This function is given by the factory returned by naive_influnce_function._flatten_forward_factory.
_parameter_arnoldi performs the Arnoldi iteration and is used by ArnoldiInfluenceFunction. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensor to tuple of tensor, because the "vector" in this case represents a parameter setting, which Pytorch represents as a tuple of tensor. Therefore, all the operations work with tuple of tensors, which required defining various operations for tuple of tensors in captum.influence._utils.common. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it.
_parameter_distill takes the output of _parameter_distill, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute R. It is used by ArnoldiInfluenceFunction.

Tests

We create a new test file tests.influence._core.test_arnoldi_influence.py, which defines the class TestArnoldiInfluence implementing the following tests:

Tests used only by `NaiveInfluenceFunction`, i.e. appear in this diff:

test_matches_linear_regression compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known - linear regression. Different reductions for loss function - 'mean', 'sum', 'none' are tested. Here, we test the following implementation:
-- NaiveInfluenceFunction with projection_dim=None, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
test_flatten_unflattener: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This tests checks that flattening and unflattening a tuple of tensors gives the original tensor.
test_top_eigen: a common operation is finding the the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since torch.linalg.eig doesn't sort the eigenvalues, we make a wrapper that does do it. This checks that the wrapper is working properly.

Tests used only by `ArnoldiInfluenceFunction`, i.e. appear in next diff:

test_parameter_arnoldi checks that _parameter_arnoldi is correct. In particular, it checks that the top-k eigenvalues of the restriction of A to a Krylov subspace (the H returned by _parameter_arnoldi) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that _parameter_arnoldi implements.
test_parameter_distill checks that _parameter_distill is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of A. This is the property we require of distill, because we use the top eigenvectors (and eigenvalues) of (implicitly-defined) A to calculate a low-rank approximation of its inverse.
test_matches_linear_regression where the implementation tested is the following:
-- ArnoldiInfluenceFunction with arnoldi_dim and projection_dim set to a large value. The Krylov subspace should contain the largest eigenvectors because arnoldi_dim is large, and projection_dim is not too large relative to arnoldi_dim, but still large on an absolute level.
When projection_dim is small, ArnoldiInfluenceFunction and NaiveInfluenceFunction should produce the same influence scores, provided arnoldi_dim for ArnoldiInfluenceFunction is large, since in this case, the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in test_compare_implementations_trained_NN_model_and_data and test_compare_implementations_random_model_and_data for a trained and untrained 2-layer NN, respectively.

Minor changes / functionalities / tests

test_tracin_intermediate_quantities_aggregate, test_tracin_self_influence, test_tracin_identity_regression are applied to both implementations
_set_active_params now extracts the layers to consider when computing gradients and sets their requires_grad. This refactoring is done since the same logic is used by TracInCPBase and InfluenceFunctionBase.
some helpers are moved from tracincp to captum.influence._utils.common
a separate test_loss_fn initialization argument is supported, and both implementations are now tested in TestTracinRegression.test_tracin_constant_test_loss_fn
compute_intermediate_quantities for both implementations support the aggregate option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to get_random_model_and_data. The specific model (and its parameters) are specified by the model_type argument. Before, the method only supports the random 2-layer NN. Now, it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
TracInCP and implementations of InfluenceFunctionBase all accept a sample_wise_grads_per_batch option, and have the same requirements on the loss function. Thus, _check_loss_fn_tracincp, which previously performed those checks, is renamed _check_loss_fn_sample_wise_grads_per_batch and moved to captum.influence._utils.common. Similarly, those implementations all need to compute the jacobian, with the method depending on sample_wise_grads_per_batch. The jacobian computation is moved to helper function _compute_jacobian_sample_wise_grads_per_batch.

Reviewed By: NarineK

Differential Revision: D40541294

facebook-github-bot · 2023-11-29T22:49:22Z