
[Feature] Detect anomalous parameters #1547

Merged · 11 commits · Dec 14, 2021

Conversation

@jshilong (Collaborator) commented Dec 1, 2021

Motivation

Sometimes anomalous parameters appear during the training phase, mainly in two cases:

  • Some parameters are not used in the forward pass due to an incorrect implementation, which causes an error when calling loss.backward()
  • Some parameters are not used to produce the loss, which causes a deadlock when calling loss.backward()

In both cases, the anomalous parameters are not included in the computational graph rooted at the loss; a sketch of this check is given below.
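
For illustration, here is a minimal sketch of that check (the helper name find_anomalous_params is hypothetical, and the details may differ from the code merged in this PR): traverse the autograd graph starting from loss.grad_fn, collect every parameter reachable from it, and report model parameters that require gradients but were never visited.

```python
import logging

import torch
import torch.nn as nn


def find_anomalous_params(loss: torch.Tensor, model: nn.Module, logger=None):
    """Report parameters that are not in the graph rooted at ``loss``.

    A sketch of the idea only; the hook merged in this PR may differ.
    """
    logger = logger or logging.getLogger(__name__)
    params_in_graph = set()
    visited = set()

    def traverse(grad_fn):
        # Walk the autograd graph; AccumulateGrad leaf nodes expose the
        # corresponding parameter through their ``variable`` attribute.
        if grad_fn is None or grad_fn in visited:
            return
        visited.add(grad_fn)
        if hasattr(grad_fn, 'variable'):
            params_in_graph.add(grad_fn.variable)
        for parent, _ in grad_fn.next_functions:
            traverse(parent)

    traverse(loss.grad_fn)
    for name, param in model.named_parameters():
        if param.requires_grad and param not in params_in_graph:
            logger.error(f'{name} with shape {tuple(param.shape)} '
                         'is not in the computational graph')
```

Calling such a helper right before loss.backward() in a training step would log every parameter that neither contributed to the loss nor would receive a gradient.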

Modification

I added a debug option named detect_anomalous_params to OptimizerHook, which can help you find anomalous parameters; a usage example is shown below.
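
For example, in a downstream config that builds OptimizerHook from optimizer_config (e.g., an MMDetection-style config), the option could be enabled roughly like this; grad_clip=None is only a placeholder here:

```python
# Debug-only setting: detecting anomalous parameters adds an extra traversal
# of the autograd graph on every training iteration.
optimizer_config = dict(grad_clip=None, detect_anomalous_params=True)
```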

BC-breaking (Optional)

None

@jshilong jshilong requested a review from ZwwWayne December 1, 2021 11:20
@jshilong jshilong changed the title [Feature]Detect anomalous parameters [Feature] Detect anomalous parameters Dec 1, 2021
@jshilong (Collaborator, Author) commented Dec 2, 2021

Ping @ZwwWayne

@zhouzaida (Collaborator) commented:

Once the above comments are resolved and CI passes, the PR can be merged.

@ZwwWayne (Collaborator) commented Dec 8, 2021

Can we check whether it slows down the training speed if we set detect_anomalous_params=True?
This PR can be merged after checking the influence on training speed and updating this information in the docstring.

@zhouzaida (Collaborator) commented:

> Can we check whether it slows down the training speed if we set detect_anomalous_params=True? This PR can be merged after checking the influence on training speed and updating this information in the docstring.

It probably slows down the training speed, but the flag should only be enabled for debugging.

@zhouzaida (Collaborator) commented:

Please merge the upstream master branch into this PR, which will resolve the error when building the documentation.

@jshilong (Collaborator, Author) commented Dec 9, 2021

> Can we check whether it slows down the training speed if we set detect_anomalous_params=True? This PR can be merged after checking the influence on training speed and updating this information in the docstring.

> It probably slows down the training speed, but the flag should only be enabled for debugging.

It is only used for debugging.

@jshilong (Collaborator, Author) commented Dec 9, 2021

> Can we check whether it slows down the training speed if we set detect_anomalous_params=True? This PR can be merged after checking the influence on training speed and updating this information in the docstring.

I will add a notice in the docstring to indicate that this is only used for debugging.

@zhouzaida (Collaborator) commented:

> I will add a notice in the docstring to indicate that this is only used for debugging.

Hi @jshilong, we will merge the PR after you add the notice in the docstring.

@jshilong (Collaborator, Author) commented:

> I will add a notice in the docstring to indicate that this is only used for debugging.
>
> Hi @jshilong, we will merge the PR after you add the notice in the docstring.

Done

@ZwwWayne ZwwWayne merged commit 22e73d6 into open-mmlab:master Dec 14, 2021