fix_bug simplify.py #1114

Merged
merged 1 commit into from
Jan 15, 2023

Vibsteamer
Collaborator

This fixes the crash of dpgen simplify with "labeled true" that occurs when the last iteration picked all candidate frames but still had a non-zero number of failed frames.

output files of the last iter:

ls iter.000002/01.model_devi/data.rest/*
...
iter.000002/01.model_devi/data.rest/my_elements:
box.raw  coord.raw  energy.raw  force.raw  type_map.raw  type.raw  virial.raw
...

There is no set.000 inside.

As a result, model_devi of iter.000003 breaks because no labeled systems can be loaded; iter.000003/01.model_devi/data.rest.old is a symlink to iter.000002/01.model_devi/data.rest:

os.symlink(os.path.abspath(rest_data_path), os.path.join(work_path, rest_data_name + ".old"))

Clues
rest_idx only contains the indices of the candidates that were not picked in this iteration:

rest_idx = idx[iter_pick_number:]

/dpgen/simplify/simplify.py

def post_model_devi(iter_index, jdata, mdata):
    ...
    counter = {"candidate": sys_candinate.get_nframes(), "accurate": sys_accurate.get_nframes(), "failed": sys_failed.get_nframes()}  # <-- failed frames are counted separately
    ...
    # candinate: pick up randomly
    iter_pick_number = jdata['iter_pick_number']
    idx = np.arange(counter['candidate'])  # <-- indexes candidates only
    assert(len(idx) == len(labels))
    np.random.shuffle(idx)
    pick_idx = idx[:iter_pick_number]
    rest_idx = idx[iter_pick_number:]  # <-- empty when all candidates are picked

However, "rest_systems" for the next iteration should contain both the "not-picked candidates" and the "sys_failed" frames of this iteration:

rest_systems += sys_failed

/dpgen/simplify/simplify.py

def post_model_devi(iter_index, jdata, mdata):
    ...
    for j in rest_idx:  # <-- not-picked candidates
        sys_name, sys_id = labels[j]
        rest_systems.append(sys_candinate[sys_name][sys_id])
    rest_systems += sys_failed  # <-- failed frames are added as well

Thus, the size passed to set_size should be rest_systems.get_nframes().
I did not find the size check in the deleted "if" line necessary; I think it was only a safeguard against set_size = 0 when the size of rest_idx was passed.
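
The mismatch can be illustrated with a minimal sketch that uses plain NumPy instead of dpdata. The function name `split_candidates` and the frame counts are hypothetical; only the slicing logic mirrors the excerpt of post_model_devi quoted above.

```python
import numpy as np


def split_candidates(n_candidate, n_failed, iter_pick_number, seed=0):
    """Hypothetical stand-in for the index bookkeeping in post_model_devi."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_candidate)
    rng.shuffle(idx)
    pick_idx = idx[:iter_pick_number]
    rest_idx = idx[iter_pick_number:]
    # The rest data holds the not-picked candidates plus the failed frames,
    # so its frame count is not len(rest_idx) alone.
    n_rest_frames = len(rest_idx) + n_failed
    return pick_idx, rest_idx, n_rest_frames


# All 5 candidates are picked, but 3 failed frames still remain.
pick_idx, rest_idx, n_rest = split_candidates(
    n_candidate=5, n_failed=3, iter_pick_number=10
)
print(len(pick_idx), len(rest_idx), n_rest)  # prints: 5 0 3
```

Here len(rest_idx) is 0 while three frames are still carried over, which is the situation in which a size derived from rest_idx no longer describes the data that is written out.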

When it breaks, the errors look like this.
Standard output of dpgen:

INFO:dpgen:-------------------------iter.000003 task 05--------------------------
Traceback (most recent call last):
  File "/opt/anaconda3/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.8/site-packages/dpgen/main.py", line 185, in main
    args.func(args)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpgen/simplify/simplify.py", line 535, in gen_simplify
    run_iter(args.PARAM, args.MACHINE)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpgen/simplify/simplify.py", line 508, in run_iter
    post_model_devi(ii, jdata, mdata)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpgen/simplify/simplify.py", line 250, in post_model_devi
    sys_entire = dpdata.MultiSystems(type_map = type_map).from_deepmd_npy(os.path.join(work_path, rest_data_name + ".old"), labeled=labeled)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpdata/system.py", line 1465, in from_format
    return self.from_fmt_obj(ff(), file_name, **kwargs)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpdata/system.py", line 1188, in from_fmt_obj
    system = LabeledSystem().from_fmt_obj(fmtobj, dd, **kwargs)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpdata/system.py", line 1078, in from_fmt_obj
    data = fmtobj.from_labeled_system(file_name, **kwargs)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpdata/plugins/deepmd.py", line 60, in from_labeled_system
    return dpdata.deepmd.comp.to_system_data(file_name, type_map=type_map, labels=True)
  File "/opt/anaconda3/lib/python3.8/site-packages/dpdata/deepmd/comp.py", line 50, in to_system_data
    data['cells'] = np.concatenate(all_cells, axis = 0)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
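
The last frame of this traceback can be reproduced in isolation. The sketch below assumes, based on the traceback only, that `all_cells` ends up as an empty list when no set.* directory is found; it is not the dpdata code itself.

```python
import numpy as np

# Assumption: nothing was loaded because the system directory has no set.* folder.
all_cells = []

try:
    np.concatenate(all_cells, axis=0)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```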

iter.000003/01.model_devi/model_devi.log:

WARNING:tensorflow:From /opt/deepmd-kit-2.1.5/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS.
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/opt/deepmd-kit-2.1.5/lib/python3.10/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
  _bootstrap._exec(spec, module)
Traceback (most recent call last):
  File "/opt/deepmd-kit-2.1.5/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/deepmd-kit-2.1.5/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 576, in main
    make_model_devi(**dict_args)
  File "/opt/deepmd-kit-2.1.5/lib/python3.10/site-packages/deepmd/infer/model_devi.py", line 199, in make_model_devi
    dp_data = DeepmdData(system, set_prefix, shuffle_test=False, type_map=tmap)
  File "/opt/deepmd-kit-2.1.5/lib/python3.10/site-packages/deepmd/utils/data.py", line 51, in __init__
    self.mixed_type = self._check_mode(self.dirs[0])  # mixed_type format only has one set
IndexError: list index out of range
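
The IndexError has the same root cause. As a sketch, assuming the list of set directories is built by globbing for `set.*` inside the system directory (an assumption about deepmd-kit, not confirmed here), a directory that contains only .raw files yields an empty list:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as system_dir:
    # Mimic the data.rest layout from the listing above: raw files, no set.000.
    for name in ("box.raw", "coord.raw", "type.raw"):
        open(os.path.join(system_dir, name), "w").close()
    set_dirs = sorted(glob.glob(os.path.join(system_dir, "set.*")))

print(set_dirs)  # prints: []

try:
    set_dirs[0]
    raised = False
except IndexError:
    raised = True
```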

Signed-off-by: Wanrun Jiang <58099845+Vibsteamer@users.noreply.github.com>

@codecov-commenter

codecov-commenter commented Jan 14, 2023

Codecov Report

Base: 46.06% // Head: 46.07% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (237b639) compared to base (b14063e).
Patch coverage: 0.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #1114   +/-   ##
=======================================
  Coverage   46.06%   46.07%           
=======================================
  Files          82       82           
  Lines       14452    14451    -1     
=======================================
  Hits         6658     6658           
+ Misses       7794     7793    -1     
Impacted Files Coverage Δ
dpgen/simplify/simplify.py 0.00% <0.00%> (ø)

☔ View full report at Codecov.

@njzjz
Member

njzjz commented Jan 14, 2023

If there is no data picked, how do you continue the next training procedure?

@Vibsteamer
Collaborator Author

> If there is no data picked, how do you continue the next training procedure?

The code already splits that situation into two cases:

if counter['candidate'] == 0 and counter['failed'] > 0:
    raise RuntimeError('no candidate but still have failed cases, stop. You may want to refine the training or to increase the trust level hi')

if(counter['candidate'] == 0) :
    dlog.info("no candidate")

In the first case, the training scheme or simplify_param seems to need adjusting; in the second, the simplify task just finishes (converged).

Neither case seems related to the bug discussed above.

@njzjz
Member

njzjz commented Jan 14, 2023

I see, you mean no unpicked data, not no picked data.

@wanghan-iapcm wanghan-iapcm merged commit 3e1891d into deepmodeling:devel Jan 15, 2023