Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] seed is unsafe in TF parallel training #4440

Closed
njzjz opened this issue Nov 28, 2024 · 0 comments · Fixed by #4479
Closed

[BUG] seed is unsafe in TF parallel training #4440

njzjz opened this issue Nov 28, 2024 · 0 comments · Fixed by #4479
Labels

Comments

@njzjz
Copy link
Member

njzjz commented Nov 28, 2024

Bug summary

Per https://numpy.org/doc/stable/reference/random/parallel.html#sequence-of-integer-seeds

For example, it is common to see users add the worker ID to the root seed, especially with the legacy RandomState code.

# UNSAFE! Do not do this!
worker_seed = root_seed + worker_id
rng = np.random.RandomState(worker_seed)

It is true that for any one run of a parallel program constructed this way, each worker will have distinct streams. However, it is quite likely that multiple invocations of the program with different seeds will get overlapping sets of worker seeds. It is not uncommon (in the author’s self-experience) to change the root seed merely by an increment or two when doing these repeat runs. If the worker seeds are also derived by small increments of the worker ID, then subsets of the workers will return identical results, causing a bias in the overall ensemble of results.

Unlucky, our TF codes use such the logic, as found in #4435 (comment)

seed = jdata["training"].get("seed", None)
if seed is not None:
# avoid the same batch sequence among workers
seed += run_opt.my_rank
seed = seed % (2**32)
dp_random.seed(seed)

DeePMD-kit Version

devel

Backend and its version

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

See above

Steps to Reproduce

See above

Further Information, Files, and Links

No response

@njzjz njzjz added the bug label Nov 28, 2024
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 19, 2024
@njzjz njzjz linked a pull request Dec 19, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Dec 25, 2024
Fix #4440.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced seed handling to support both single integers and lists for
improved randomness in distributed training.
	- Added logging for neighbor statistics calculation during training.

- **Bug Fixes**
- Improved error handling in data loading processes to ensure
robustness.

- **Documentation**
- Updated documentation for methods related to seed and batch size
management for clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@njzjz njzjz closed this as completed Dec 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant