modify tracin self influence helpers #994

99warriors · 2022-07-14T20:19:41Z

Summary:
Rename `TracInCP._self_influence_batch_tracincp` and `TracInCP._self_influence_batches_tracincp_fast` to `self_influence`. The method is now public and accepts a DataLoader yielding batches, in addition to a single batch as before. External functions can call the modified helper to compute self influence.

The helper is also changed to improve efficiency by reducing the number of times checkpoints are loaded. Even when it computes self influence scores for a DataLoader yielding many batches, it loads each checkpoint only once per call. This is because the helper now iterates over checkpoints in the outer loop and over batches in the inner loop (the reverse of the previous order). `influence` calls this helper when run in self influence mode.

We cannot simply increase the batch size to reduce the number of checkpoint loads: for large models (precisely those for which loading checkpoints is expensive), the model takes up so much memory that the batch size must stay small.
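The effect of swapping the loop order can be illustrated with a minimal pure-Python sketch. It is not the Captum implementation: `LoadCounter`, `per_example_score`, `batches_outer`, and `checkpoints_outer` are hypothetical names, the "model" is a single scalar weight, and the score is a toy stand-in for a squared gradient norm. It only shows that both orders produce the same per-example totals while the number of checkpoint loads differs.

```python
from typing import List, Sequence


class LoadCounter:
    """Hypothetical stand-in for expensive checkpoint loading; counts calls."""

    def __init__(self) -> None:
        self.loads = 0

    def load(self, checkpoint: float) -> float:
        self.loads += 1
        return checkpoint  # assumption: the "model" is just a scalar weight


def per_example_score(weight: float, x: float) -> float:
    # Toy contribution of one checkpoint to an example's self influence.
    return (weight * x) ** 2


def batches_outer(
    checkpoints: Sequence[float],
    batches: Sequence[Sequence[float]],
    counter: LoadCounter,
) -> List[float]:
    """Previous order: every batch triggers a load of every checkpoint."""
    scores: List[float] = []
    for batch in batches:
        batch_scores = [0.0] * len(batch)
        for ckpt in checkpoints:
            weight = counter.load(ckpt)
            for i, x in enumerate(batch):
                batch_scores[i] += per_example_score(weight, x)
        scores.extend(batch_scores)
    return scores


def checkpoints_outer(
    checkpoints: Sequence[float],
    batches: Sequence[Sequence[float]],
    counter: LoadCounter,
) -> List[float]:
    """New order: load each checkpoint once, then sweep over all batches.

    `batches` must be re-iterable (a list here; a DataLoader also is),
    because it is traversed once per checkpoint.
    """
    scores = [0.0] * sum(len(batch) for batch in batches)
    for ckpt in checkpoints:
        weight = counter.load(ckpt)
        offset = 0
        for batch in batches:
            for i, x in enumerate(batch):
                scores[offset + i] += per_example_score(weight, x)
            offset += len(batch)
    return scores


checkpoints = [0.5, 1.0, 2.0]
batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

old_counter, new_counter = LoadCounter(), LoadCounter()
old_scores = batches_outer(checkpoints, batches, old_counter)
new_scores = checkpoints_outer(checkpoints, batches, new_counter)

print(old_counter.loads, new_counter.loads)  # 9 loads versus 3 loads
print(old_scores == new_scores)
```

With 3 checkpoints and 3 batches, the batch-outer order performs one load per (batch, checkpoint) pair, while the checkpoint-outer order performs one load per checkpoint regardless of how many batches the loader yields.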

Minor change: for the `influence_src_dataset` argument of every `__init__`, add a description of the assumptions we make about the batches yielded by the DataLoader.

Reviewed By: NarineK

Differential Revision: D35603078

facebook-github-bot · 2022-07-14T20:20:11Z

This pull request was exported from Phabricator. Differential Revision: D35603078

NarineK · 2022-07-21T05:43:59Z

cc: @99warriors it looks like some of the tests related to the progress bar are failing. Do you mind looking into it?

facebook-github-bot · 2022-07-22T22:21:57Z

This pull request was exported from Phabricator. Differential Revision: D35603078

facebook-github-bot · 2022-07-22T22:42:55Z

This pull request was exported from Phabricator. Differential Revision: D35603078

facebook-github-bot · 2022-07-23T22:50:29Z

This pull request was exported from Phabricator. Differential Revision: D35603078

NarineK · 2022-07-25T22:27:30Z

captum/influence/_core/tracincp.py

@@ -95,7 +96,7 @@ class TracInCPBase(DataInfluence):
    def __init__(
        self,
        model: Module,
-        influence_src_dataset: Union[Dataset, DataLoader],
+        train_dataset: Union[Dataset, DataLoader],


@99warriors, do you mind also fixing this naming in the tutorials? I remember that we used these explicitly as argument names.
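Why the tutorials matter here can be shown with a small sketch: callers that pass the old name as a keyword fail once the parameter is renamed, while positional callers and callers using the new keyword are unaffected. `TracInSketch` is a hypothetical stand-in class, not Captum's `TracInCPBase`, and only mirrors the two-parameter shape of the diff above.

```python
class TracInSketch:
    """Hypothetical class whose __init__ parameter was renamed to train_dataset."""

    def __init__(self, model: object, train_dataset: list) -> None:
        self.model = model
        self.train_dataset = train_dataset


def build_with_old_keyword() -> str:
    """Call __init__ the way code written before the rename would."""
    try:
        TracInSketch(model=None, influence_src_dataset=[1, 2, 3])
    except TypeError as err:
        return f"TypeError: {err}"
    return "ok"


old_result = build_with_old_keyword()  # the old keyword is rejected
new_instance = TracInSketch(model=None, train_dataset=[1, 2, 3])  # new keyword
positional_instance = TracInSketch(None, [1, 2, 3])  # positional calls still work

print(old_result)
```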

facebook-github-bot · 2022-07-26T03:58:40Z

This pull request was exported from Phabricator. Differential Revision: D35603078

facebook-github-bot · 2022-07-29T04:29:04Z

This pull request was exported from Phabricator. Differential Revision: D35603078

facebook-github-bot · 2022-07-31T23:58:05Z

This pull request was exported from Phabricator. Differential Revision: D35603078

facebook-github-bot added the `cla signed` and `fb-exported` labels Jul 14, 2022

99warriors force-pushed the export-D35603078 branch from 5e8177d to d1d78d2 Compare July 22, 2022 22:21

99warriors force-pushed the export-D35603078 branch from d1d78d2 to 7e068c2 Compare July 22, 2022 22:42

99warriors force-pushed the export-D35603078 branch from 7e068c2 to 0e0cbe2 Compare July 23, 2022 22:50

NarineK reviewed Jul 25, 2022

99warriors force-pushed the export-D35603078 branch from 0e0cbe2 to c96a460 Compare July 26, 2022 03:58

99warriors force-pushed the export-D35603078 branch from c96a460 to 535c7b4 Compare July 29, 2022 04:29

99warriors force-pushed the export-D35603078 branch from 535c7b4 to 4e647df Compare July 31, 2022 23:58

facebook-github-bot closed this in 1a10252 Aug 1, 2022
