add NaiveInfluenceFunction (#1186) #1214
Conversation
This pull request was exported from Phabricator. Differential Revision: D40541294
Summary:

# Overview

This diff, along with D42006733, implements 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf).

- `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground truth" (though note that influence scores themselves are an approximation of the effect of removing a training example and then retraining). Several papers use this approach, e.g. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf).
- `ArnoldiInfluenceFunction`: a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it.

This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 implementations above are split across 2 diffs for easier review, though they are jointly described here.

# What is the "infinitesimal" influence score

The "infinitesimal" influence score approximately answers the following question: if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change? Mathematically, this influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x`, with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the loss with respect to the (subset of) model parameters at a given model checkpoint.

# What the two implementations have in common

Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny matrix `R` of width k, where k is small, such that `H^{-1} \approx RR'`. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k-by-k diagonal matrix whose diagonal contains the corresponding eigenvalues. Both implementations let `R = LV^{-1/2}`, so that `RR' = LV^{-1}L' \approx H^{-1}`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons:

- It avoids the numerical issues associated with inverting small eigenvalues.
- Since the influence score `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)` is approximated by `(\nabla_\theta L(x)' R)(\nabla_\theta L(z)' R)'`, we can compute an "influence embedding" for a given example `x`, namely `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot product of their respective embeddings. Because k is small, e.g. 50, these influence embeddings are low-dimensional.
- Even for large models, we can store `R` in memory, provided k is small.

This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the earlier LiSSA approach of Koh et al., which, to compute the influence score involving a given example, needs to compute Hessian-vector products involving on the order of 10^4 examples. A minimal sketch of this shared logic follows.
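To make the shared logic concrete, here is a minimal sketch in PyTorch (for illustration only; `make_projection` is a hypothetical name, not Captum's API) that forms `R = LV^{-1/2}` from an explicit toy Hessian and scores examples via their influence embeddings:

```python
import torch

def make_projection(hessian: torch.Tensor, k: int) -> torch.Tensor:
    """Form R = L V^{-1/2} from the top-k eigenpairs of a symmetric Hessian,
    so that R R' = L V^{-1} L' approximates H^{-1}."""
    eigenvalues, eigenvectors = torch.linalg.eigh(hessian)  # ascending order
    top_eigenvalues = eigenvalues[-k:]
    top_eigenvectors = eigenvectors[:, -k:]  # eigenvectors are columns
    return top_eigenvectors * top_eigenvalues.rsqrt()  # scale each column

p, k = 10, 3
A = torch.randn(p, p)
hessian = A @ A.T + p * torch.eye(p)  # symmetric positive-definite toy Hessian

R = make_projection(hessian, k)

grad_train = torch.randn(p)  # stand-in for \nabla_\theta L(x)
grad_test = torch.randn(p)   # stand-in for \nabla_\theta L(z)

# influence embeddings; their dot product approximates the influence score
embedding_train = grad_train @ R
embedding_test = grad_test @ R
score = embedding_train @ embedding_test
```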
The implementations differ in how they compute the top-k eigenvalues / eigenvectors.

# How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors

It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See the documentation of the `_set_projections_naive_influence_function` method for more details.

# How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors

The key novelty of the approach by Schioppa et al. is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, e.g. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, e.g. 50. Finally, it expresses the eigenvectors in the original basis. This approach is justified by a property of the Arnoldi iteration: the Krylov subspace it returns tends to contain the top eigenvectors. A simplified sketch of this procedure appears at the end of this section.

This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculating influence scores is quick, requiring only a backwards pass and a multiplication per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and PyTorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vectors to be tuples of tensors. Avoiding flattening / unflattening parameters brings scalability gains.
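The following is a simplified sketch of that procedure on ordinary 1D vectors (the actual `_parameter_arnoldi` / `_parameter_distill` work with tuples of tensors; the names here are illustrative only). `matvec` is the only access to the operator, which in practice would be a Hessian-vector product:

```python
import torch

def arnoldi(matvec, dim: int, n_iters: int):
    """Return an orthonormal Krylov basis Q (dim x n_iters) and the
    restriction H = Q' A Q of the implicit operator A to that subspace."""
    vecs = [torch.randn(dim)]
    vecs[0] = vecs[0] / vecs[0].norm()
    H = torch.zeros(n_iters + 1, n_iters)
    for j in range(n_iters):
        w = matvec(vecs[j])
        for i in range(j + 1):  # orthogonalize w against the existing basis
            H[i, j] = torch.dot(vecs[i], w)
            w = w - H[i, j] * vecs[i]
        H[j + 1, j] = w.norm()
        vecs.append(w / H[j + 1, j])
    return torch.stack(vecs[:n_iters], dim=1), H[:n_iters, :n_iters]

def top_k_eigen(matvec, dim: int, n_iters: int, k: int):
    Q, H = arnoldi(matvec, dim, n_iters)
    # for a symmetric operator the restriction is (numerically) symmetric;
    # eigh returns eigenvalues in ascending order
    vals, vecs = torch.linalg.eigh((H + H.T) / 2)
    # Ritz pairs: approximate eigenpairs of A, expressed in the original basis
    return vals[-k:], Q @ vecs[:, -k:]

# usage, with an explicit toy matrix standing in for the implicit Hessian:
A = torch.randn(100, 100)
A = A @ A.T  # symmetric, so the top eigenvalues are real
top_vals, top_vecs = top_k_eigen(lambda v: A @ v, dim=100, n_iters=40, k=5)
```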
# High-level organization of the two implementations

Because of their common logic, the two implementations share the same high-level organization:

- Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which in practice is computed not over the entire training data, but over a subset of it, specified by `hessian_dataset`.
- In `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using the private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively.
- `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings.
- Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods of both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method.

# Reason for inheritance structure

`InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a given base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot simply move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, e.g. the LiSSA approach of Koh et al.

# Key helper methods

- `captum._utils._stateless.functional_call` is copied from the [PyTorch 1.13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest PyTorch version. It turns a PyTorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`.
- `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches and then summing them. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influence_function._flatten_forward_factory`; a sketch of this flattening pattern appears after this list.
- `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensors to tuple of tensors, because the "vector" in this case represents a parameter setting, which PyTorch represents as a tuple of tensors. Therefore, all the operations work with tuples of tensors, which required defining various operations for tuples of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it.
- `_parameter_distill` takes the output of `_parameter_arnoldi`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`.
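Below is a minimal sketch of that flattening pattern (hypothetical helper and variable names, with modern PyTorch's `torch.func.functional_call` standing in for the copied helper), so that `torch.autograd.functional.hessian` sees a function of a single 1D tensor and therefore returns a 2D Hessian:

```python
import torch
from torch.autograd.functional import hessian

def flatten_forward_factory(model, loss_fn, batch):
    """Return a loss function of a single 1D tensor of flattened parameters."""
    names = [name for name, _ in model.named_parameters()]
    shapes = [p.shape for p in model.parameters()]
    numels = [p.numel() for p in model.parameters()]

    def flat_loss(flat_params: torch.Tensor) -> torch.Tensor:
        # unflatten the 1D tensor back into per-parameter tensors
        pieces = flat_params.split(numels)
        params = {name: piece.view(shape)
                  for name, piece, shape in zip(names, pieces, shapes)}
        # evaluate the model functionally with these parameters
        inputs, targets = batch
        outputs = torch.func.functional_call(model, params, (inputs,))
        return loss_fn(outputs, targets)

    return flat_loss

# usage on a toy linear model and batch:
model = torch.nn.Linear(3, 1)
batch = (torch.randn(8, 3), torch.randn(8, 1))
flat_loss = flatten_forward_factory(model, torch.nn.MSELoss(reduction="sum"), batch)
flat_params = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
H = hessian(flat_loss, flat_params)  # 2D tensor, (num_params, num_params)
```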
# Tests

We create a new test file, `tests/influence/_core/test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence`, implementing the following tests.

#### Tests used only by `NaiveInfluenceFunction`, i.e. appearing in this diff:

- `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known: linear regression. Different reductions for the loss function ('mean', 'sum', 'none') are tested. (A sketch of the analytic comparison appears at the end of this section.) Here, we test the following implementation:
  - `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
- `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and then unflattening it (the inverse operation). This test checks that flattening and then unflattening a tuple of tensors gives back the original tensors.
- `test_top_eigen`: a common operation is finding the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we provide a wrapper that does. This test checks that the wrapper works properly.

#### Tests used only by `ArnoldiInfluenceFunction`, i.e. appearing in the next diff:

- `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements.
- `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of the (implicitly-defined) `A` to calculate a low-rank approximation of its inverse.
- `test_matches_linear_regression`, where the implementation tested is the following:
  - `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to a large value. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large in absolute terms.
- When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case the top-k eigenvalues / eigenvectors found by the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and an untrained 2-layer NN, respectively.
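For reference, here is a minimal sketch of the kind of analytic comparison `test_matches_linear_regression` relies on (toy data; the names are illustrative, not the test's actual setup). For linear regression with squared-error loss and `reduction="sum"`, the per-example gradient and the dataset Hessian have closed forms, so the influence score `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)` can be computed directly:

```python
import torch

torch.manual_seed(0)
n, d = 32, 5
X = torch.randn(n, d)
theta_true = torch.randn(d)
y = X @ theta_true + 0.1 * torch.randn(n)

# optimally-trained parameters: the least-squares solution
theta = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)

# per-example loss (x' theta - y)^2 has gradient 2 (x' theta - y) x,
# and the Hessian summed over the dataset is 2 X' X
residuals = X @ theta - y
grads = 2 * residuals.unsqueeze(1) * X  # (n, d) per-example gradients
H = 2 * X.T @ X

# analytic influence of every training example on every test example
# (here the training set doubles as the test set for simplicity)
scores = grads @ torch.linalg.solve(H, grads.T)  # (n, n)
```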
# Minor changes / functionalities / tests

- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, and `test_tracin_identity_regression` are applied to both implementations.
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done because the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- Some helpers are moved from `tracincp` to `captum.influence._utils.common`.
- A separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`.
- `compute_intermediate_quantities` for both implementations supports the `aggregate` option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- Given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`. The specific model (and its parameters) is specified by the `model_type` argument. Before, the method only supported the random 2-layer NN. Now, it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the Jacobian, with the method depending on `sample_wise_grads_per_batch`. The Jacobian computation is moved to the helper function `_compute_jacobian_sample_wise_grads_per_batch`.

Reviewed By: NarineK

Differential Revision: D40541294
a3c44f6
to
f8e88f2
Compare
This pull request was exported from Phabricator. Differential Revision: D40541294 |
1 similar comment
This pull request was exported from Phabricator. Differential Revision: D40541294 |
Summary: # Overview This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf). - `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf) - `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it. This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here. # What is the "infinitesimal" influence score More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x` with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the (subset of) model parameters at a given model checkpoint. # What the two implementations have in common Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let `R=LV^{-1}L'`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons: - It avoids numerical issues associated with inverting small eigenvalues - Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional. - Even for large models, we can store `R` in memory, provided k is small. 
This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples. The implementations differ in how they compute the top-k eigenvalues / eigenvectors. # How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details. # How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors. This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains. # High-level organization of the two implementations Because of the common logic of the two implementations, they share the same high-level organization. - Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by `hessian_dataset`. - in `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively. - `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings. - Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method. 
# Reason for inheritance structure `InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, i.e. the LISSA approach of Koh et al. # Key helper methods - `captum._utils._stateless.functional_call` is copy pasted from [Pytorch 13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest Pytorch version, and turns a Pytorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`. - `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influnce_function._flatten_forward_factory`. - `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensor to tuple of tensor, because the "vector" in this case represents a parameter setting, which Pytorch represents as a tuple of tensor. Therefore, all the operations work with tuple of tensors, which required defining various operations for tuple of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it. - `_parameter_distill` takes the output of `_parameter_distill`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`. # Tests We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests: #### Tests used only by `NaiveInfluenceFunction`, i.e. 
appear in this diff: - `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known - linear regression. Different reductions for loss function - 'mean', 'sum', 'none' are tested. Here, we test the following implementation: -- `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues. - `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This tests checks that flattening and unflattening a tuple of tensors gives the original tensor. - `test_top_eigen`: a common operation is finding the the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we make a wrapper that does do it. This checks that the wrapper is working properly. #### Tests used only by `ArnoldiInfluenceFunction`, i.e. appear in next diff: - `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements. - `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of (implicitly-defined) `A` to calculate a low-rank approximation of its inverse. - `test_matches_linear_regression` where the implementation tested is the following: -- `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to a large value. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large on an absolute level. - When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case, the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and untrained 2-layer NN, respectively. # Minor changes / functionalities / tests - `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, `test_tracin_identity_regression` are applied to both implementations - `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done since the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`. - some helpers are moved from `tracincp` to `captum.influence._utils.common` - a separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn` - `compute_intermediate_quantities` for both implementations support the `aggregate` option. 
This means that both implementations can be used with D40386079, the validation influence FAIM workflow. - given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`. The specific model (and its parameters) are specified by the `model_type` argument. Before, the method only supports the random 2-layer NN. Now, it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD. - `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the jacobian, with the method depending on `sample_wise_grads_per_batch`. The jacobian computation is moved to helper function `_compute_jacobian_sample_wise_grads_per_batch`. Reviewed By: NarineK Differential Revision: D40541294
f8e88f2
to
b2fa19f
Compare
This pull request was exported from Phabricator. Differential Revision: D40541294 |
Summary: # Overview This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf). - `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf) - `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it. This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here. # What is the "infinitesimal" influence score More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x` with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the (subset of) model parameters at a given model checkpoint. # What the two implementations have in common Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let `R=LV^{-1}L'`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons: - It avoids numerical issues associated with inverting small eigenvalues - Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional. - Even for large models, we can store `R` in memory, provided k is small. 
This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples. The implementations differ in how they compute the top-k eigenvalues / eigenvectors. # How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details. # How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors. This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains. # High-level organization of the two implementations Because of the common logic of the two implementations, they share the same high-level organization. - Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by `hessian_dataset`. - in `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively. - `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings. - Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method. 
# Reason for inheritance structure `InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, i.e. the LISSA approach of Koh et al. # Key helper methods - `captum._utils._stateless.functional_call` is copy pasted from [Pytorch 13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest Pytorch version, and turns a Pytorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`. - `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influnce_function._flatten_forward_factory`. - `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensor to tuple of tensor, because the "vector" in this case represents a parameter setting, which Pytorch represents as a tuple of tensor. Therefore, all the operations work with tuple of tensors, which required defining various operations for tuple of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it. - `_parameter_distill` takes the output of `_parameter_distill`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`. # Tests We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests: #### Tests used only by `NaiveInfluenceFunction`, i.e. 
appear in this diff: - `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known - linear regression. Different reductions for loss function - 'mean', 'sum', 'none' are tested. Here, we test the following implementation: -- `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues. - `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This tests checks that flattening and unflattening a tuple of tensors gives the original tensor. - `test_top_eigen`: a common operation is finding the the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we make a wrapper that does do it. This checks that the wrapper is working properly. #### Tests used only by `ArnoldiInfluenceFunction`, i.e. appear in next diff: - `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements. - `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of (implicitly-defined) `A` to calculate a low-rank approximation of its inverse. - `test_matches_linear_regression` where the implementation tested is the following: -- `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to a large value. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large on an absolute level. - When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case, the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and untrained 2-layer NN, respectively. # Minor changes / functionalities / tests - `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, `test_tracin_identity_regression` are applied to both implementations - `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done since the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`. - some helpers are moved from `tracincp` to `captum.influence._utils.common` - a separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn` - `compute_intermediate_quantities` for both implementations support the `aggregate` option. 
This means that both implementations can be used with D40386079, the validation influence FAIM workflow. - given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`. The specific model (and its parameters) are specified by the `model_type` argument. Before, the method only supports the random 2-layer NN. Now, it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD. - `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the jacobian, with the method depending on `sample_wise_grads_per_batch`. The jacobian computation is moved to helper function `_compute_jacobian_sample_wise_grads_per_batch`. Reviewed By: NarineK Differential Revision: D40541294
b2fa19f
to
58dfaa1
Compare
This pull request was exported from Phabricator. Differential Revision: D40541294 |
Summary: # Overview This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf). - `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf) - `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it. This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here. # What is the "infinitesimal" influence score More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x` with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the (subset of) model parameters at a given model checkpoint. # What the two implementations have in common Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let `R=LV^{-1}L'`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons: - It avoids numerical issues associated with inverting small eigenvalues - Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional. - Even for large models, we can store `R` in memory, provided k is small. 
This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples. The implementations differ in how they compute the top-k eigenvalues / eigenvectors. # How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details. # How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors. This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains. # High-level organization of the two implementations Because of the common logic of the two implementations, they share the same high-level organization. - Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by `hessian_dataset`. - in `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively. - `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings. - Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method. 
# Reason for inheritance structure `InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, i.e. the LISSA approach of Koh et al. # Key helper methods - `captum._utils._stateless.functional_call` is copy pasted from [Pytorch 13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest Pytorch version, and turns a Pytorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`. - `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influnce_function._flatten_forward_factory`. - `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensor to tuple of tensor, because the "vector" in this case represents a parameter setting, which Pytorch represents as a tuple of tensor. Therefore, all the operations work with tuple of tensors, which required defining various operations for tuple of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it. - `_parameter_distill` takes the output of `_parameter_distill`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`. # Tests We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests: #### Tests used only by `NaiveInfluenceFunction`, i.e. 
appear in this diff: - `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known - linear regression. Different reductions for loss function - 'mean', 'sum', 'none' are tested. Here, we test the following implementation: -- `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues. - `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This tests checks that flattening and unflattening a tuple of tensors gives the original tensor. - `test_top_eigen`: a common operation is finding the the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we make a wrapper that does do it. This checks that the wrapper is working properly. #### Tests used only by `ArnoldiInfluenceFunction`, i.e. appear in next diff: - `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements. - `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of (implicitly-defined) `A` to calculate a low-rank approximation of its inverse. - `test_matches_linear_regression` where the implementation tested is the following: -- `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to a large value. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large on an absolute level. - When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case, the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and untrained 2-layer NN, respectively. # Minor changes / functionalities / tests - `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, `test_tracin_identity_regression` are applied to both implementations - `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done since the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`. - some helpers are moved from `tracincp` to `captum.influence._utils.common` - a separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn` - `compute_intermediate_quantities` for both implementations support the `aggregate` option. 
# Minor changes / functionalities / tests

- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, and `test_tracin_identity_regression` are applied to both implementations.
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done because the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- Some helpers are moved from `tracincp` to `captum.influence._utils.common`.
- A separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`.
- `compute_intermediate_quantities` for both implementations supports the `aggregate` option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- Given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`. The specific model (and its parameters) is specified by the `model_type` argument. Previously, the method supported only a random 2-layer NN; now it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the Jacobian, with the method depending on `sample_wise_grads_per_batch`. The Jacobian computation is moved to the helper function `_compute_jacobian_sample_wise_grads_per_batch` (a naive baseline for what it computes is sketched below).

Reviewed By: NarineK
Differential Revision: D40541294
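For context on the `sample_wise_grads_per_batch` distinction mentioned in the summary above, a naive per-example gradient computation might look like the following sketch; the helper name `per_sample_grads` is hypothetical, and `_compute_jacobian_sample_wise_grads_per_batch` uses a more efficient strategy when `sample_wise_grads_per_batch=True`.

```python
import torch

def per_sample_grads(model, loss_fn, xs, ys):
    # One backward pass per example: simple and general, but slow. The
    # efficient path instead requires a 'sum' or 'mean' loss reduction and
    # recovers all per-example gradients from a single pass.
    grads = []
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads.append(torch.autograd.grad(loss, tuple(model.parameters())))
    return grads
```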
58dfaa1 to d22993e
This pull request was exported from Phabricator. Differential Revision: D40541294
Summary: Pull Request resolved: pytorch#1214 Pull Request resolved: pytorch#1186 Differential Revision: https://internalfb.com/D40541294 fbshipit-source-id: 2af0b6ab515e8718898657d8d320c2fdc60312f3
Summary: Pull Request resolved: pytorch#1214 Pull Request resolved: pytorch#1186 Reviewed By: NarineK Differential Revision: D40541294 fbshipit-source-id: d193de2b00032c006fdd21bec90995aae49b7ed1
Summary: Pull Request resolved: pytorch#1214 Pull Request resolved: pytorch#1186 Differential Revision: https://internalfb.com/D40541294 fbshipit-source-id: aacd31a067a69d3f36b202da9225a4dabcde51c4
Summary: # Overview This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf). - `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf) - `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it. This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here. # What is the "infinitesimal" influence score More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x` with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the (subset of) model parameters at a given model checkpoint. # What the two implementations have in common Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let `R=LV^{-1}L'`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons: - It avoids numerical issues associated with inverting small eigenvalues - Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional. - Even for large models, we can store `R` in memory, provided k is small. 
This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples. The implementations differ in how they compute the top-k eigenvalues / eigenvectors. # How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details. # How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors. This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains. # High-level organization of the two implementations Because of the common logic of the two implementations, they share the same high-level organization. - Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by `hessian_dataset`. - in `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively. - `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings. - Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method. 
# Reason for inheritance structure `InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, i.e. the LISSA approach of Koh et al. # Key helper methods - `captum._utils._stateless.functional_call` is copy pasted from [Pytorch 13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest Pytorch version, and turns a Pytorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`. - `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influnce_function._flatten_forward_factory`. - `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensor to tuple of tensor, because the "vector" in this case represents a parameter setting, which Pytorch represents as a tuple of tensor. Therefore, all the operations work with tuple of tensors, which required defining various operations for tuple of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it. - `_parameter_distill` takes the output of `_parameter_distill`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`. # Tests We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests: #### Tests used only by `NaiveInfluenceFunction`, i.e. 
# Tests

We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests.

#### Tests used only by `NaiveInfluenceFunction` (i.e., appearing in this diff)

- `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known: linear regression. Different loss function reductions ('mean', 'sum', 'none') are tested. Here, we test the following implementation:
  - `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
- `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This test checks that flattening and then unflattening a tuple of tensors gives back the original tuple.
- `test_top_eigen`: a common operation is finding the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we make a wrapper that does (see the sketch following the test lists). This test checks that the wrapper works properly.

#### Tests used only by `ArnoldiInfluenceFunction` (i.e., appearing in the next diff)

- `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements.
- `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of the (implicitly-defined) `A` to calculate a low-rank approximation of its inverse.
- `test_matches_linear_regression`, where the implementation tested is the following:
  - `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to large values. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, while `projection_dim` is not too large relative to `arnoldi_dim` but is still large in absolute terms.
- When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and an untrained 2-layer NN, respectively.
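For instance, the sorted-eigendecomposition wrapper that `test_top_eigen` exercises could look roughly like this (a sketch assuming a nearly symmetric matrix with positive eigenvalues; the real wrapper also handles tolerances):

```python
import torch

# Hedged sketch of a sorted top-k eigendecomposition wrapper, since
# torch.linalg.eig does not order its output. Assumes a (near-)symmetric
# matrix with positive eigenvalues; real code needs tolerance handling.
def top_eigen(A: torch.Tensor, k: int):
    vals, vecs = torch.linalg.eig(A)     # complex-typed output
    vals, vecs = vals.real, vecs.real    # safe for symmetric A
    order = torch.argsort(vals, descending=True)
    return vals[order][:k], vecs[:, order[:k]]

# With top-k eigenvalues v and eigenvectors L of the Hessian H, one choice of
# low-rank factor is R = L @ torch.diag(v.rsqrt()), so that
# R R' = L V^{-1} L' ~= H^{-1}.
```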
# Minor changes / functionalities / tests

- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, and `test_tracin_identity_regression` are applied to both implementations.
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done because the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- Some helpers are moved from `tracincp` to `captum.influence._utils.common`.
- A separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`.
- `compute_intermediate_quantities` for both implementations supports the `aggregate` option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- Given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`; the specific model (and its parameters) is specified by the `model_type` argument. Previously, the method only supported the random 2-layer NN; now, it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the Jacobian, with the method depending on `sample_wise_grads_per_batch`; the Jacobian computation is moved to the helper function `_compute_jacobian_sample_wise_grads_per_batch`.
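Putting it together, here is a hedged usage sketch of the new class; the argument names follow the summary above, and the exact constructor signature and defaults should be read off the diff itself.

```python
import torch
from torch.utils.data import TensorDataset

# Hedged usage sketch; argument names follow the summary above and may not
# match the final API exactly.
from captum.influence import NaiveInfluenceFunction

net = torch.nn.Linear(10, 1)
train_dataset = TensorDataset(torch.randn(100, 10), torch.randn(100, 1))
torch.save(net.state_dict(), "checkpoint.pt")  # checkpoint whose Hessian is used

influence = NaiveInfluenceFunction(
    net,
    train_dataset,
    "checkpoint.pt",
    loss_fn=torch.nn.MSELoss(reduction="sum"),
    hessian_dataset=train_dataset,  # subset over which the Hessian is computed
    projection_dim=None,            # None: exact inverse Hessian, no low-rank step
)

test_batch = (torch.randn(5, 10), torch.randn(5, 1))
scores = influence.influence(test_batch)  # shape (5, 100): test x train scores
```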
Summary: Pull Request resolved: pytorch#1214 Pull Request resolved: pytorch#1186 # Overview This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf). - `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf) - `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it. This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here. # What is the "infinitesimal" influence score More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x` with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the (subset of) model parameters at a given model checkpoint. # What the two implementations have in common Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let `R=LV^{-1}L'`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons: - It avoids numerical issues associated with inverting small eigenvalues - Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional. 
- Even for large models, we can store `R` in memory, provided k is small. This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples. The implementations differ in how they compute the top-k eigenvalues / eigenvectors. # How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details. # How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors. This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains. # High-level organization of the two implementations Because of the common logic of the two implementations, they share the same high-level organization. - Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by `hessian_dataset`. - in `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively. - `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings. 
- Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method. # Reason for inheritance structure `InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, i.e. the LISSA approach of Koh et al. # Key helper methods - `captum._utils._stateless.functional_call` is copy pasted from [Pytorch 13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest Pytorch version, and turns a Pytorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`. - `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influnce_function._flatten_forward_factory`. - `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensor to tuple of tensor, because the "vector" in this case represents a parameter setting, which Pytorch represents as a tuple of tensor. Therefore, all the operations work with tuple of tensors, which required defining various operations for tuple of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it. - `_parameter_distill` takes the output of `_parameter_distill`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. 
# High-level organization of the two implementations

Because of their common logic, the two implementations share the same high-level organization.
- Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which in practice is computed not over the entire training data, but over a subset of it, specified by `hessian_dataset`.
- In `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` compute `R` using the private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively.
- `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings.
- Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods of both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, both of which assume the implementation provides the `compute_intermediate_quantities` method.

# Reason for inheritance structure

`InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` inherit directly from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot simply move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, e.g. the LISSA approach of Koh et al.

# Key helper methods

- `captum._utils._stateless.functional_call` is copied from the [PyTorch 1.13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need the latest PyTorch version. It turns a PyTorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`.
- `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters *flattened* into a 1D tensor (and a batch). This function is produced by the factory returned by `naive_influence_function._flatten_forward_factory`. A sketch of the flatten / unflatten round trip appears after this list.
- `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensors to tuple of tensors, because the "vector" in this case represents a parameter setting, which PyTorch represents as a tuple of tensors. Therefore, all the operations work with tuples of tensors, which required defining various operations for tuples of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it.
- `_parameter_distill` takes the output of `_parameter_arnoldi`, and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`.
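Below is a minimal sketch of the flatten / unflatten round trip exercised by `test_flatten_unflattener`; the helper names here are illustrative, not captum's actual functions. Flattening concatenates a tuple of tensors into one 1D tensor, and unflattening inverts it given reference shapes.

```python
import torch

def flatten(tensors):
    # Concatenate a tuple of tensors into a single 1D tensor.
    return torch.cat([t.reshape(-1) for t in tensors])

def unflatten(flat, like):
    # Invert `flatten`, reading shapes off a reference tuple of tensors.
    out, offset = [], 0
    for t in like:
        n = t.numel()
        out.append(flat[offset:offset + n].reshape(t.shape))
        offset += n
    return tuple(out)

params = (torch.randn(2, 3), torch.randn(4))
roundtrip = unflatten(flatten(params), params)
assert all(torch.equal(a, b) for a, b in zip(params, roundtrip))
```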
# Tests

We create a new test file, `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence`, implementing the following tests:

#### Tests used only by `NaiveInfluenceFunction`, i.e. appearing in this diff:
- `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known: linear regression. Different reductions for the loss function ('mean', 'sum', 'none') are tested. Here, we test the following implementation:
-- `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
- `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This test checks that flattening and then unflattening a tuple of tensors gives back the original tensors.
- `test_top_eigen`: a common operation is finding the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we provide a wrapper that does. This test checks that the wrapper works properly.

#### Tests used only by `ArnoldiInfluenceFunction`, i.e. appearing in the next diff:
- `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements.
- `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of the (implicitly-defined) `A` to calculate a low-rank approximation of its inverse.
- `test_matches_linear_regression`, where the implementation tested is the following:
-- `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to large values. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large in absolute terms.
- When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data`, for a trained and an untrained 2-layer NN, respectively.

# Minor changes / functionalities / tests

- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, and `test_tracin_identity_regression` are applied to both implementations.
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done since the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- Some helpers are moved from `tracincp` to `captum.influence._utils.common`.
- A separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`.
- `compute_intermediate_quantities` for both implementations supports the `aggregate` option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- Given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`; the specific model (and its parameters) is specified by the `model_type` argument. Previously, the method only supported the random 2-layer NN; now it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the Jacobian, with the method depending on `sample_wise_grads_per_batch`; the Jacobian computation is moved to the helper function `_compute_jacobian_sample_wise_grads_per_batch`.

Reviewed By: NarineK

Differential Revision: D40541294

fbshipit-source-id: 9b4609e27e504ebe24c9a811356d16a9d6a376de
Summary: Pull Request resolved: pytorch#1214 Pull Request resolved: pytorch#1186 # Overview This diff, along with D42006733, implement 2 different implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf). - `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing then retraining). Several papers actually use this approach, i.e. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf) - `ArnoldiInfluenceFunction`: This is a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it. This diff is rebased on top of D41324297, which implements the new API. Again, note that the 2 above implementations are implemented across 2 diffs, for easier review, though they are jointly described here. # What is the "infinitesimal" influence score More details on the "infinitesimal" influence score: This "infinitesimal" influence score approximately answers the question if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change. Mathematically, the aforementioned influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x` with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the (subset of) model parameters at a given model checkpoint. # What the two implementations have in common Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k matrix whose diagonals contain the corresponding eigenvalues. Both implementations let `R=LV^{-1}L'`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons: - It avoids numerical issues associated with inverting small eigenvalues - Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, i.e. 50, these influence embeddings are low-dimensional. 
- Even for large models, we can store `R` in memory, provided k is small. This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which to compute the influence score involving a given example, need to compute Hessian-vector products involving on the order of 10^4 examples. The implementations differ in how they compute the top-k eigenvalues / eigenvectors. # How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See documentation of the `_set_projections_naive_influence_function` method for more details. # How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, i.e. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, i.e. 50. Finally, it expresses the eigenvectors in the original basis. This approach for finding the top-k eigenvalues / eigenvectors is justified by the property of the Arnoldi iteration, that the Krylov subspace it returns tends to contain the top eigenvectors. This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and Pytorch's Hessian-vector implementation (`torch.autograd.functional.hvp`) allows the input and output vector to be a tuple of tensors. Avoiding flattening / unflattening parameters brings scalability gains. # High-level organization of the two implementations Because of the common logic of the two implementations, they share the same high-level organization. - Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is in practice, computed not over the entire training data, but over a subset of it, which is specified by `hessian_dataset`. - in `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively. - `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings. 
# High-level organization of the two implementations

Because of their common logic, the two implementations share the same high-level organization.

- Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which in practice is computed not over the entire training data, but over a subset of it, specified by `hessian_dataset`.
- In `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using the private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively.
- `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings.
- Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods of both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method.

# Reason for inheritance structure

`InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus, the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot simply move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, e.g. the LISSA approach of Koh et al.

# Key helper methods

- `captum._utils._stateless.functional_call` is copied from the [PyTorch 1.13 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need the latest PyTorch version. It turns a PyTorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`.
- `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches and then summing them. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influence_function._flatten_forward_factory`; a sketch of this flattening trick appears after this list.
- `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensors to tuple of tensors, because the "vector" in this case represents a parameter setting, which PyTorch represents as a tuple of tensors. Therefore, all the operations work with tuples of tensors, which required defining various operations for tuples of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it.
- `_parameter_distill` takes the output of `_parameter_arnoldi` and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`.
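The flattening trick used by `_compute_dataset_func` can be sketched on a toy model as follows. The model, data, and helper names here are stand-ins, not Captum's code; the point is only that `torch.autograd.functional.hessian` returns a genuine 2D tensor when given a function of a single 1D tensor.

```python
import math

import torch
from torch.autograd.functional import hessian

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
shapes = [p.shape for p in model.parameters()]  # [(1, 4), (1,)]

def flat_loss(flat_params: torch.Tensor) -> torch.Tensor:
    # unflatten the 1D parameter vector, then evaluate the loss "functionally"
    params, offset = [], 0
    for shape in shapes:
        n = math.prod(shape)
        params.append(flat_params[offset:offset + n].view(shape))
        offset += n
    weight, bias = params
    return torch.nn.functional.mse_loss(x @ weight.T + bias, y)

flat = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
H = hessian(flat_loss, flat)  # a 2D tensor of shape (num_params, num_params)
```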
# Tests

We create a new test file, `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence`, implementing the following tests:

#### Tests used only by `NaiveInfluenceFunction`, i.e. appearing in this diff:

- `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known: linear regression. Different reductions for the loss function ('mean', 'sum', 'none') are tested. Here, we test the following implementation:
  - `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
- `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This test checks that flattening and then unflattening a tuple of tensors recovers the original tensors.
- `test_top_eigen`: a common operation is finding the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we add a wrapper that does. This test checks that the wrapper works properly.

#### Tests used only by `ArnoldiInfluenceFunction`, i.e. appearing in the next diff:

- `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements.
- `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `distill`, because we use the top eigenvectors (and eigenvalues) of the (implicitly-defined) `A` to calculate a low-rank approximation of its inverse.
- `test_matches_linear_regression`, where the implementation tested is the following:
  - `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to large values. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large in absolute terms.
- When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data` for a trained and an untrained 2-layer NN, respectively.
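For reference, here is a hedged sketch of the kind of analytic "ground truth" that `test_matches_linear_regression` compares against, assuming a least-squares loss with 'sum' reduction; the setup is illustrative, not the test's actual code. For this loss, the gradient and Hessian have closed forms, so the "infinitesimal" influence score can be computed exactly.

```python
import torch

torch.manual_seed(0)
X = torch.randn(20, 3)
y = X @ torch.randn(3) + 0.1 * torch.randn(20)
w = torch.linalg.pinv(X) @ y  # least-squares optimum (influence functions assume optimality)

def grad(x: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # gradient of the squared error (x . w - target)^2 with respect to w
    return 2 * (x @ w - target) * x

H = 2 * X.T @ X  # Hessian of the summed loss over the training set
H_inv = torch.linalg.inv(H)

x_test, y_test = torch.randn(3), torch.tensor(0.5)
score = grad(X[0], y[0]) @ H_inv @ grad(x_test, y_test)  # exact influence score
```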
# Minor changes / functionalities / tests

- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, and `test_tracin_identity_regression` are applied to both implementations.
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done because the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- Some helpers are moved from `tracincp` to `captum.influence._utils.common`.
- A separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`.
- `compute_intermediate_quantities` for both implementations supports the `aggregate` option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- Given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`; the specific model (and its parameters) is specified by the `model_type` argument. Previously, the method only supported a random 2-layer NN. Now, it also supports an optimally-trained linear regression and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the Jacobian, with the method depending on `sample_wise_grads_per_batch`. The Jacobian computation is moved to the helper function `_compute_jacobian_sample_wise_grads_per_batch` (see the sketch after this comment).

Differential Revision: https://internalfb.com/D40541294 fbshipit-source-id: d07705649ebd8e8b596cb73ef7d56968492983b5
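Finally, a minimal sketch of the per-sample gradients that a helper like `_compute_jacobian_sample_wise_grads_per_batch` provides: one gradient row per example. Here they are computed with a naive per-example loop for clarity; this is not Captum's vectorized implementation, and the model and data are stand-ins.

```python
import torch

model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss(reduction="none")
x, y = torch.randn(8, 4), torch.randn(8, 1)

rows = []
for i in range(len(x)):  # one backward pass per example
    model.zero_grad()
    loss_fn(model(x[i : i + 1]), y[i : i + 1]).sum().backward()
    rows.append(torch.cat([p.grad.reshape(-1) for p in model.parameters()]))
jacobian = torch.stack(rows)  # shape (batch_size, num_params)
```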
Summary:
# Overview
This diff, along with D42006733, adds two implementations that both calculate the "infinitesimal" influence score as defined in the paper ["Understanding Black-box Predictions via Influence Functions"](https://arxiv.org/pdf/1703.04730.pdf).
- `NaiveInfluenceFunction`: a computationally slow but exact implementation that is useful for obtaining "ground-truth" (though, note that influence scores themselves are an approximation of the effect of removing a training example and then retraining). Several papers actually use this approach, e.g. ["Learning Augmentation Network via Influence Functions"](https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Learning_Augmentation_Network_via_Influence_Functions_CVPR_2020_paper.pdf), ["Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics"](https://openreview.net/forum?id=RUzSobdYy0V), ["Achieving Fairness at No Utility Cost via Data Reweighting with Influence"](https://proceedings.mlr.press/v162/li22p/li22p.pdf)
- `ArnoldiInfluenceFunction`: a computationally efficient implementation described in the paper ["Scaling Up Influence Functions"](https://arxiv.org/pdf/2112.03052.pdf) by Schioppa et al. These [slides](https://docs.google.com/presentation/d/1yJ86FkJO1IZn7YzFYpkJUJUBqaLynDJCbCWlKKglv-w/edit#slide=id.p) give a brief summary of it.

This diff is rebased on top of D41324297, which implements the new API. Note that the two implementations above are split across two diffs for easier review, though they are jointly described here.

# What is the "infinitesimal" influence score
This "infinitesimal" influence score approximately answers the question: if a given training example were infinitesimally down-weighted and the model re-trained to optimality, how much would the loss on a given test example change? Mathematically, the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, where `\nabla_\theta L(x)` is the gradient of the loss, considering only training example `x`, with respect to (a subset of) model parameters `\theta`, `\nabla_\theta L(z)` is the analogous quantity for a test example `z`, and `H` is the Hessian of the loss with respect to the (subset of) model parameters at a given model checkpoint.

# What the two implementations have in common
Both implementations compute a low-rank approximation of the inverse Hessian, i.e. a tall and skinny (with width k) matrix `R` such that `H^{-1} \approx RR'`, where k is small. In particular, let `L` be the matrix of width k whose columns contain the top-k eigenvectors of `H`, and let `V` be the k by k diagonal matrix whose diagonal contains the corresponding eigenvalues. Both implementations let `R = LV^{-1/2}`, so that `RR' = LV^{-1}L' \approx H^{-1}`. Thus, the core computational step is computing the top-k eigenvalues / eigenvectors. This approximation is useful for several reasons:
- It avoids numerical issues associated with inverting small eigenvalues.
- Since the influence score is given by `\nabla_\theta L(x)' H^{-1} \nabla_\theta L(z)`, which is approximated by `(\nabla_\theta L(x)' R) (\nabla_\theta L(z)' R)'`, we can compute an "influence embedding" for a given example `x`, `\nabla_\theta L(x)' R`, such that the influence score of one example on another is approximately the dot-product of their respective embeddings. Because k is small, e.g. 50, these influence embeddings are low-dimensional.
- Even for large models, we can store `R` in memory, provided k is small. This means influence embeddings (and thus influence scores) can be efficiently computed by doing a backwards pass to compute `\nabla_\theta L(x)` and then multiplying by `R'`. This is orders of magnitude faster than the previous LISSA approach of Koh et al, which, to compute the influence score involving a given example, needs to compute Hessian-vector products involving on the order of 10^4 examples.
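To make the shared recipe concrete, here is a minimal numerical sketch in plain PyTorch (not the Captum API; the stand-in Hessian and gradients are made up for illustration) of building `R` from the top-k eigenpairs and recovering the low-rank influence score from embedding dot-products:

```python
import torch

torch.manual_seed(0)
d, k = 10, 4

# Stand-in for the Hessian: a random symmetric positive-definite matrix.
A = torch.randn(d, d)
H = A @ A.T + d * torch.eye(d)

# torch.linalg.eigh returns eigenvalues in ascending order, so the top-k
# eigenpairs are the last k.
eigenvalues, eigenvectors = torch.linalg.eigh(H)
V = eigenvalues[-k:]      # top-k eigenvalues (the diagonal of V)
L = eigenvectors[:, -k:]  # corresponding eigenvectors, as columns

# R is tall and skinny (d x k), with RR' = L V^{-1} L', the approximation of
# H^{-1} that keeps only the top-k curvature directions.
R = L * V.rsqrt()

grad_x = torch.randn(d)  # stand-in for \nabla_\theta L(x)
grad_z = torch.randn(d)  # stand-in for \nabla_\theta L(z)

# The dot-product of the two k-dimensional influence embeddings reproduces
# grad_x' (L V^{-1} L') grad_z, the low-rank approximation of the influence score.
embedding_x = grad_x @ R
embedding_z = grad_z @ R
low_rank_score = grad_x @ (L @ torch.diag(V.reciprocal()) @ L.T) @ grad_z
print(torch.allclose(embedding_x @ embedding_z, low_rank_score, atol=1e-5))  # True
```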
The implementations differ in how they compute the top-k eigenvalues / eigenvectors.

# How `NaiveInfluenceFunction` computes the top-k eigenvalues / eigenvectors
It is "naive" in that it computes the top-k eigenvalues / eigenvectors by explicitly forming the Hessian, converting it to a 2D tensor, computing its eigenvectors / eigenvalues, and then sorting. See the documentation of the `_set_projections_naive_influence_function` method for more details.

# How `ArnoldiInfluenceFunction` computes the top-k eigenvalues / eigenvectors
The key novelty of the approach by Schioppa et al is that it uses the Arnoldi iteration to find the top-k eigenvalues / eigenvectors of the Hessian without explicitly forming the Hessian. In more detail, the approach first runs the Arnoldi iteration, which only requires the ability to compute Hessian-vector products, to find a Krylov subspace of moderate dimension, e.g. 200. It then finds the top-k eigenvalues / eigenvectors of the restriction of the Hessian to the subspace, where k is small, e.g. 50. Finally, it expresses the eigenvectors in the original basis. This approach is justified by a property of the Arnoldi iteration: the Krylov subspace it returns tends to contain the top eigenvectors.

This implementation does incur some one-time overhead in `__init__`, where it runs the Arnoldi iteration to calculate `R`. After that overhead, calculation of influence scores is quick, only requiring a backwards pass and multiplication, per example. Unlike `NaiveInfluenceFunction`, this implementation does not flatten any parameters, as the 2D Hessian is never formed, and PyTorch's Hessian-vector product implementation (`torch.autograd.functional.hvp`) allows the input and output vectors to be tuples of tensors. Avoiding flattening / unflattening parameters brings scalability gains.
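For intuition, here is a bare-bones sketch of the Arnoldi iteration over a flat vector space, driven only by an implicit matrix-vector product (the actual `_parameter_arnoldi` helper, described below, instead works with tuples of parameter tensors; all names here are illustrative):

```python
import torch

def arnoldi(matvec, dim, n_iter):
    # Build an orthonormal basis Q for the Krylov subspace of an implicitly
    # defined operator, plus the (Hessenberg) restriction h of the operator
    # to that subspace.
    Q = torch.zeros(dim, n_iter + 1)
    h = torch.zeros(n_iter + 1, n_iter)
    q0 = torch.randn(dim)
    Q[:, 0] = q0 / q0.norm()
    for j in range(n_iter):
        v = matvec(Q[:, j])
        for i in range(j + 1):  # Gram-Schmidt against the existing basis
            h[i, j] = Q[:, i] @ v
            v = v - h[i, j] * Q[:, i]
        h[j + 1, j] = v.norm()
        Q[:, j + 1] = v / h[j + 1, j]
    return Q, h

torch.manual_seed(0)
A = torch.randn(50, 50)
A = A @ A.T  # symmetric, so its eigenvalues are real
Q, h = arnoldi(lambda v: A @ v, dim=50, n_iter=20)

# The top eigenvalues of the restriction approximate the operator's top
# eigenvalues, and Q maps the restriction's eigenvectors back to the
# original basis.
top_from_krylov = torch.sort(torch.linalg.eigvals(h[:20, :20]).real).values[-3:]
top_exact = torch.linalg.eigvalsh(A)[-3:]
print(top_from_krylov)
print(top_exact)  # the two should roughly agree
```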
# High-level organization of the two implementations
Because the two implementations share much of their logic, they have the same high-level organization.
- Both implementations accept a `hessian_dataset` initialization argument. This is because "infinitesimal" influence scores depend on the Hessian, which is, in practice, computed not over the entire training data but over a subset of it, specified by `hessian_dataset`.
- In `__init__`, `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` both compute `R` using the private helper methods `_set_projections_naive_influence_function` and `_set_projections_arnoldi_influence_function`, respectively.
- `R` is used by their respective `compute_intermediate_quantities` methods to compute influence embeddings.
- Because influence scores (and self-influence scores) are computed by first computing influence embeddings, the `_influence` and `self_influence` methods for both implementations call the `_influence_helper_intermediate_quantities_influence_function` and `_self_influence_helper_intermediate_quantities_influence_function` helper functions, which both assume the implementation implements the `compute_intermediate_quantities` method.

# Reason for inheritance structure
`InfluenceFunctionBase` refers to any implementation that computes the "infinitesimal" influence score (as opposed to `TracInCPBase`, which computes the checkpoint-based definition of influence score). Thus, the different "base" implementations implement differently-defined influence scores, and children of a base implementation compute the same influence score in different ways. `IntermediateQuantitiesInfluenceFunction` refers to implementations of `InfluenceFunctionBase` that implement the `compute_intermediate_quantities` method. The reason we don't let `NaiveInfluenceFunction` and `ArnoldiInfluenceFunction` directly inherit from `InfluenceFunctionBase` is that their implementations of `influence` and `self_influence` are actually identical (though for logging reasons, we cannot just move those methods into `IntermediateQuantitiesInfluenceFunction`). In the future, there may be implementations of `InfluenceFunctionBase` that do *not* inherit from `IntermediateQuantitiesInfluenceFunction`, e.g. the LISSA approach of Koh et al.

# Key helper methods
- `captum._utils._stateless.functional_call` is copy-pasted from the [PyTorch 1.13.0 implementation](https://github.com/pytorch/pytorch/blob/17202b363780a06ae07e5cecceffaae6418ad6f8/torch/nn/utils/stateless.py) so that the user does not need to use the latest PyTorch version. It turns a PyTorch `module` into a function whose inputs are the parameters of the `module` (represented as a dictionary). This function is used to compute the Hessian in `NaiveInfluenceFunction`, and Hessian-vector products in `ArnoldiInfluenceFunction`.
- `_compute_dataset_func` is used by `NaiveInfluenceFunction` to compute the Hessian over `hessian_dataset`. This is done by calculating the Hessian over individual batches, and then summing them up. One complication is that `torch.autograd.functional.hessian`, which we use to compute Hessians, does not return the Hessian as a 2D tensor unless the function we seek the Hessian of accepts a 1D tensor. Therefore, we need to define a function of the model's parameters whose input is the parameters, *flattened* into a 1D tensor (and a batch). This function is given by the factory returned by `naive_influence_function._flatten_forward_factory` (see the sketch after this list).
- `_parameter_arnoldi` performs the Arnoldi iteration and is used by `ArnoldiInfluenceFunction`. It differs from a "traditional" implementation in that the Hessian-vector function it accepts does not map from 1D tensor to 1D tensor. Instead, it maps from tuple of tensors to tuple of tensors, because the "vector" in this case represents a parameter setting, which PyTorch represents as a tuple of tensors. Therefore, all the operations work with tuples of tensors, which required defining various operations for tuples of tensors in `captum.influence._utils.common`. This method returns a basis for the Krylov subspace, and the restriction of the Hessian to it.
- `_parameter_distill` takes the output of `_parameter_arnoldi` and returns the (approximate) top-k eigenvalues / eigenvectors of the Hessian. This is what is needed to compute `R`. It is used by `ArnoldiInfluenceFunction`.
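The flatten-then-differentiate pattern can be illustrated with a toy sketch (assumed for illustration; `unflatten` and `loss_on_batch` below are stand-ins, not the actual factory code):

```python
import torch

# Wrap the loss as a function of a single flat 1D parameter tensor, so that
# torch.autograd.functional.hessian returns a plain 2D tensor, then sum the
# per-batch Hessians over the dataset.
model = torch.nn.Linear(3, 1)
params = list(model.parameters())
shapes = [p.shape for p in params]
sizes = [p.numel() for p in params]

def unflatten(flat):
    # Inverse of flattening: split the 1D tensor back into parameter tensors.
    return [t.view(s) for t, s in zip(flat.split(sizes), shapes)]

def loss_on_batch(flat_params, x, y):
    w, b = unflatten(flat_params)
    return torch.nn.functional.mse_loss(x @ w.T + b, y, reduction="sum")

flat = torch.cat([p.detach().flatten() for p in params])
batches = [(torch.randn(8, 3), torch.randn(8, 1)) for _ in range(3)]

H = sum(
    torch.autograd.functional.hessian(lambda f: loss_on_batch(f, x, y), flat)
    for x, y in batches
)
print(H.shape)  # torch.Size([4, 4]): 3 weights + 1 bias for Linear(3, 1)
```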
# Tests
We create a new test file `tests.influence._core.test_arnoldi_influence.py`, which defines the class `TestArnoldiInfluence` implementing the following tests:

#### Tests used only by `NaiveInfluenceFunction`, i.e. appearing in this diff:
- `test_matches_linear_regression` compares the influence scores and self-influence scores produced by a given implementation with analytically-calculated counterparts for a model where the exact influence scores are known - linear regression. Different reductions for the loss function - 'mean', 'sum', 'none' - are tested. Here, we test the following implementation:
-- `NaiveInfluenceFunction` with `projection_dim=None`, i.e. we use the inverse Hessian, not a low-rank approximation of it. In this case, the influence scores should equal the analytically calculated ones, modulo numerical issues.
- `test_flatten_unflattener`: a common operation is flattening a tuple of tensors and unflattening it (the inverse operation). This test checks that flattening and then unflattening a tuple of tensors gives back the original tensors.
- `test_top_eigen`: a common operation is finding the top eigenvectors / eigenvalues of a possibly non-symmetric matrix. Since `torch.linalg.eig` doesn't sort the eigenvalues, we make a wrapper that does. This test checks that the wrapper is working properly.
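Such a sorting wrapper might look like the following sketch (illustrative; not the actual Captum helper):

```python
import torch

def sorted_eig(A):
    # torch.linalg.eig returns eigenvalues in no particular order (and as
    # complex tensors), so sort the eigenpairs by the eigenvalues' real parts.
    eigenvalues, eigenvectors = torch.linalg.eig(A)
    order = torch.argsort(eigenvalues.real)
    return eigenvalues[order], eigenvectors[:, order]

torch.manual_seed(0)
A = torch.randn(5, 5)  # possibly non-symmetric
vals, vecs = sorted_eig(A)
# Columns of `vecs` stay aligned with `vals`: A v = lambda v still holds.
print(torch.allclose(A.to(vecs.dtype) @ vecs[:, -1], vals[-1] * vecs[:, -1], atol=1e-5))
```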
#### Tests used only by `ArnoldiInfluenceFunction`, i.e. appearing in the next diff:
- `test_parameter_arnoldi` checks that `_parameter_arnoldi` is correct. In particular, it checks that the top-`k` eigenvalues of the restriction of `A` to a Krylov subspace (the `H` returned by `_parameter_arnoldi`) agree with those of the original matrix. This is a property we expect of the Arnoldi iteration that `_parameter_arnoldi` implements.
- `test_parameter_distill` checks that `_parameter_distill` is correct. In particular, it checks that the eigenvectors corresponding to the top eigenvalues it returns agree with the top eigenvectors of `A`. This is the property we require of `_parameter_distill`, because we use the top eigenvectors (and eigenvalues) of the (implicitly-defined) `A` to calculate a low-rank approximation of its inverse.
- `test_matches_linear_regression`, where the implementation tested is the following:
-- `ArnoldiInfluenceFunction` with `arnoldi_dim` and `projection_dim` set to a large value. The Krylov subspace should contain the largest eigenvectors because `arnoldi_dim` is large, and `projection_dim` is not too large relative to `arnoldi_dim`, but still large on an absolute level.
- When `projection_dim` is small, `ArnoldiInfluenceFunction` and `NaiveInfluenceFunction` should produce the same influence scores, provided `arnoldi_dim` for `ArnoldiInfluenceFunction` is large, since in this case the top-k eigenvalues / eigenvectors for the two implementations should agree. This agreement is tested in `test_compare_implementations_trained_NN_model_and_data` and `test_compare_implementations_random_model_and_data`, for a trained and an untrained 2-layer NN, respectively.

# Minor changes / functionalities / tests
- `test_tracin_intermediate_quantities_aggregate`, `test_tracin_self_influence`, and `test_tracin_identity_regression` are applied to both implementations.
- `_set_active_params` now extracts the layers to consider when computing gradients and sets their `requires_grad`. This refactoring is done since the same logic is used by `TracInCPBase` and `InfluenceFunctionBase`.
- Some helpers are moved from `tracincp` to `captum.influence._utils.common`.
- A separate `test_loss_fn` initialization argument is supported, and both implementations are now tested in `TestTracinRegression.test_tracin_constant_test_loss_fn`.
- `compute_intermediate_quantities` for both implementations supports the `aggregate` option. This means that both implementations can be used with D40386079, the validation influence FAIM workflow.
- Given the aforementioned tests, testing now generates multiple kinds of models / data. The ability to do so is added to `get_random_model_and_data`, with the specific model (and its parameters) specified by the `model_type` argument. Previously, the method only supported the random 2-layer NN. Now, it also supports an optimally-trained linear regression, and a 2-layer NN trained with SGD.
- `TracInCP` and implementations of `InfluenceFunctionBase` all accept a `sample_wise_grads_per_batch` option, and have the same requirements on the loss function. Thus, `_check_loss_fn_tracincp`, which previously performed those checks, is renamed `_check_loss_fn_sample_wise_grads_per_batch` and moved to `captum.influence._utils.common`. Similarly, those implementations all need to compute the Jacobian, with the method depending on `sample_wise_grads_per_batch`. The Jacobian computation is moved to the helper function `_compute_jacobian_sample_wise_grads_per_batch`.

Reviewed By: NarineK

Differential Revision: D40541294
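A sketch of the kind of sorted-eigendecomposition wrapper that `test_top_eigen` checks. This version is illustrative, not the library code, and assumes we sort by the real part of the eigenvalues in descending order:

```python
import torch


def _top_eigen(A: torch.Tensor, k: int):
    # torch.linalg.eig returns complex eigenvalues in no guaranteed order,
    # so sort by real part (descending) and keep the top-k pairs.
    vals, vecs = torch.linalg.eig(A)
    order = torch.argsort(vals.real, descending=True)[:k]
    return vals[order], vecs[:, order]


A = torch.randn(10, 10)
H = A @ A.T  # symmetric PSD, like a Gauss-Newton Hessian; eigenvalues are real
top_vals, top_vecs = _top_eigen(H, k=3)
print(top_vals.real)  # sorted in descending order
```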
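The idea behind `test_matches_linear_regression`: for linear regression with squared-error loss, per-example gradients and the Hessian have closed forms, so the influence score can be computed exactly and compared against the implementation's output. A hedged sketch, assuming a bias-free model and a 'sum' reduction (the variable names are not the test's actual code):

```python
import torch

torch.manual_seed(0)
n, d = 20, 5
X, y = torch.randn(n, d), torch.randn(n)

# Optimal parameters of least-squares linear regression.
theta = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)

# With loss L(theta) = sum_i (x_i' theta - y_i)^2 / 2:
residuals = X @ theta - y           # (n,)
grads = residuals.unsqueeze(1) * X  # per-example gradients, shape (n, d)
H = X.T @ X                         # Hessian of the summed loss

# Pairwise influence scores: influence[i, j] = grads[i]' H^{-1} grads[j].
influence = grads @ torch.linalg.solve(H, grads.T)
```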
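And a sketch of the property `test_parameter_arnoldi` checks, using a textbook Arnoldi iteration on plain 1D tensors. Captum's `_parameter_arnoldi` is the analogous routine for tuples of tensors; this standalone version is an illustration under those assumptions, not the library code:

```python
import torch


def arnoldi(matvec, dim: int, n_iter: int):
    # Build an orthonormal basis Q of the Krylov subspace and the restriction
    # H = Q' A Q (upper Hessenberg), given only matrix-vector products.
    q = torch.randn(dim)
    Q = [q / torch.linalg.norm(q)]
    H = torch.zeros(n_iter + 1, n_iter)
    for j in range(n_iter):
        v = matvec(Q[j])
        for i in range(j + 1):  # modified Gram-Schmidt against previous basis
            H[i, j] = torch.dot(Q[i], v)
            v = v - H[i, j] * Q[i]
        H[j + 1, j] = torch.linalg.norm(v)
        Q.append(v / H[j + 1, j])
    return torch.stack(Q, dim=1), H


A = torch.randn(50, 50)
A = A @ A.T  # symmetric, as a Hessian would be
Q, H = arnoldi(lambda v: A @ v, dim=50, n_iter=30)

# Top eigenvalues of the restriction should approximate those of A.
top_restricted = torch.linalg.eigvalsh(H[:30, :30]).flip(0)[:5]
top_exact = torch.linalg.eigvalsh(A).flip(0)[:5]
print(top_restricted, top_exact)  # these should be close
```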
Summary: Pull Request resolved: pytorch#1214; Pull Request resolved: pytorch#1186. The summary text matches the description above.

Differential Revision: https://www.internalfb.com/diff/D40541294?entry_point=27

fbshipit-source-id: cd94a98782d0aa2f012c9cf36e31ed13d58dc1d4
This pull request has been merged in bd1b4c6.