Add func #29
Conversation
src/openqdc/datasets/ani.py
Outdated
@@ -50,6 +60,25 @@ def read_raw_entries(self):
        samples = read_qc_archive_h5(raw_path, self.__name__, self.energy_target_names, self.force_target_names)
        return samples

    @property
Why aren't these stats computed on the fly with numpy/torch, or at loading time? It would take less than 15s for most datasets, no? Doing it this way will cause problems if the data changes.
Because the on-the-fly computation with the original script was too slow for some datasets.
Also, we talked about precomputing the statistics.
Why would a dataset's data change?
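One possible middle ground between the two positions above is to compute the statistics lazily on first access and cache them on the instance. The following is only a minimal sketch under assumptions: `EnergyStatsSketch` and `energy_stats` are hypothetical names, not part of openqdc, and the energies are assumed to fit in memory as a `(n_samples, n_methods)` array.

```python
from functools import cached_property

import numpy as np


class EnergyStatsSketch:
    """Hypothetical stand-in for a dataset class; not the openqdc API."""

    def __init__(self, energies):
        # Assumed shape: (n_samples, n_methods)
        self.energies = np.asarray(energies, dtype=np.float64)

    @cached_property
    def energy_stats(self):
        # Computed once on first access, then stored on the instance.
        return {
            "mean": self.energies.mean(axis=0),
            "std": self.energies.std(axis=0),
        }
```

This keeps the statistics tied to the loaded data, at the cost of paying the computation once per process; for datasets where that is too slow, precomputed values would still be needed.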
src/openqdc/datasets/base.py
Outdated
        # calculation per molecule formation energy statistics
        e = []
        for i in range(len(self.__energy_methods__)):
            e.append(converted_energy_data[:, i] - np.array(list(map(lambda x: x.sum(), matrixs[i]))))
        E = np.array(e).T
Do you know a better way to do this?
Nikhil checked the timing, and for GEOM it takes around 3 min. Also, switching to the pkl files seems to cause issues when loading GEOM in my notebook kernel, probably because the memory usage increased.
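One option that avoids the Python-level `map` over molecules is a segmented sum with `np.bincount`. This is a sketch under assumptions, not the code in this PR: it assumes the per-atom reference energies are available as one flat `(n_total_atoms, n_methods)` array ordered molecule by molecule, together with the atom count of each molecule. The function names and argument layout are hypothetical.

```python
import numpy as np


def per_molecule_sums(per_atom_values, n_atoms_per_molecule):
    """Sum a flat per-atom array into one value per molecule."""
    n_atoms = np.asarray(n_atoms_per_molecule, dtype=np.int64)
    values = np.asarray(per_atom_values, dtype=np.float64)
    # Molecule index of every atom, e.g. counts [2, 1] -> [0, 0, 1]
    mol_ids = np.repeat(np.arange(len(n_atoms)), n_atoms)
    return np.bincount(mol_ids, weights=values, minlength=len(n_atoms))


def formation_energies(total_energies, per_atom_ref, n_atoms_per_molecule):
    """Subtract summed per-atom reference energies from total energies.

    total_energies: (n_molecules, n_methods)
    per_atom_ref:   (n_total_atoms, n_methods), assumed layout
    """
    total = np.asarray(total_energies, dtype=np.float64)
    ref = np.asarray(per_atom_ref, dtype=np.float64)
    sums = np.stack(
        [per_molecule_sums(ref[:, i], n_atoms_per_molecule) for i in range(ref.shape[1])],
        axis=1,
    )
    return total - sums
```

The remaining loop runs over energy methods only, so its length is small; the per-molecule work happens inside NumPy.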
src/openqdc/datasets/base.py
Outdated
    @property
    def numbers(self):
        if hasattr(self, "_numbers"):
            return self._numbers
        self._numbers = np.array(list(set(self.data["atomic_inputs"][..., 0])), dtype=np.int32)
        self._numbers = np.unique(self.data["atomic_inputs"][..., 0]).astype(np.int32)
Use pandas.unique here; it is much faster than np.unique on large arrays.
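A minimal sketch of that suggestion, assuming the atomic numbers sit in the first column of an `atomic_inputs` array as in the diff above. The helper name `unique_atomic_numbers` is hypothetical. Note that `pandas.unique` returns values in order of appearance rather than sorted, so the sketch sorts the (small) result explicitly to match what `np.unique` would give.

```python
import numpy as np
import pandas as pd


def unique_atomic_numbers(atomic_inputs):
    """Return the sorted unique atomic numbers as int32."""
    # First column is assumed to hold the atomic numbers.
    z = np.asarray(atomic_inputs)[..., 0].ravel()
    # pd.unique is hash-based and does not sort; sort the few unique values afterwards.
    return np.sort(pd.unique(z)).astype(np.int32)
```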
src/openqdc/datasets/base.py
Outdated
        """
        if self.__average_nb_atoms__ is None:
            logger.info(PROPERTY_NOT_AVAILABLE_ERROR)
            return 1
Why 1? It should either raise an error or return None.
It was a WIP patch to test different datasets on the multihead branch. Now we can just raise an error.
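For illustration, raising instead of returning a placeholder could look roughly like the sketch below. `StatisticsNotAvailableError`, `DatasetSketch` and the property name `average_n_atoms` are hypothetical names introduced here; only `__average_nb_atoms__` comes from the diff. The exception deliberately does not derive from `AttributeError`, since raising that inside a property can be masked by `hasattr` and `__getattr__`.

```python
class StatisticsNotAvailableError(RuntimeError):
    """Hypothetical error for a statistic that was never precomputed."""


class DatasetSketch:
    """Hypothetical stand-in for the dataset base class."""

    __average_nb_atoms__ = None

    @property
    def average_n_atoms(self):
        if self.__average_nb_atoms__ is None:
            # Fail loudly rather than silently returning a placeholder value.
            raise StatisticsNotAvailableError(
                f"average number of atoms is not available for {type(self).__name__}"
            )
        return self.__average_nb_atoms__
```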
The last commits address numerous issues:
Add statistics of datasets, some improvements and fixes to downloading issues.