Merge remote-tracking branch 'origin/icml_push' into docstring-addn

valence-labs · Jul 17, 2024 · 863914a
2 parents 98ebb91 + d291515
Showing 35 changed files with 835 additions and 367 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -16,8 +16,8 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
-        os: ["ubuntu-latest"]
+        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        os: ["ubuntu-latest", "macos-latest", "windows-latest"]
 
     runs-on: ${{ matrix.os }}
     timeout-minutes: 30

diff --git a/LICENSE b/LICENSE
diff --git a/docs/API/basedataset.md b/docs/API/basedataset.md
@@ -0,0 +1 @@
+::: openqdc.datasets.base
diff --git a/docs/API/formats.md b/docs/API/formats.md
@@ -0,0 +1 @@
+::: openqdc.datasets.structure
diff --git a/docs/assets/StorageView.png b/docs/assets/StorageView.png
diff --git a/docs/data_storage.md b/docs/data_storage.md
@@ -0,0 +1,33 @@
+## Dataset structure
+
+For a dataset with N geometries, M atoms across all geometries, ne energy labels,
+and nf force labels, we use zarr or memory-mapped arrays of various sizes:
+
+- (M, 5) for atomic numbers (1),
+charges (1), and positions (3) of individual geometries;
+
+- (N, 2) for the beginning and end indices of
+each geometry in the previous array;
+
+- (N, ne) for the energy labels of each geometry, extendable to
+store other geometry-level QM properties such as HOMO-LUMO gap;
+
+- (M, nf, 3) for the force labels
+of each geometry, extendable to store other atom-level QM properties.
+
+
+Memory-mapped files give efficient access to data stored on disk or in the cloud without reading
+it all into memory. This enables training on machines with less RAM than the dataset size,
+accommodates concurrent reads in multi-GPU training, and allows efficient indexing,
+batching and iteration.
+
+![](assets/StorageView.png)
+
+
+## Formats
+
+We currently support the following formats:
+
+1) Zarr : https://zarr.readthedocs.io/en/stable/index.html
+
+2) Memmap : https://numpy.org/doc/stable/index.html
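
Note on `docs/data_storage.md`: the array layout described above can be sketched with plain NumPy arrays. This is a minimal illustration, not OpenQDC's implementation. The variable names, the helper `get_geometry`, and the half-open `[start, end)` index convention are assumptions made for the example, and all numeric values are placeholders.

```python
import numpy as np

# Toy dataset: N=2 geometries, M=5 atoms in total, ne=1 energy label, nf=1 force label.
# Columns of atomic_inputs: atomic number, charge, x, y, z. Coordinates are made up.
atomic_inputs = np.array(
    [
        [8, 0, 0.0, 0.0, 0.0],  # geometry 0, atom 0
        [1, 0, 1.0, 0.0, 0.0],  # geometry 0, atom 1
        [1, 0, 0.0, 1.0, 0.0],  # geometry 0, atom 2
        [1, 0, 0.0, 0.0, 0.0],  # geometry 1, atom 0
        [1, 0, 0.7, 0.0, 0.0],  # geometry 1, atom 1
    ],
    dtype=np.float32,
)  # shape (M, 5)
position_idx_range = np.array([[0, 3], [3, 5]], dtype=np.int32)  # shape (N, 2)
energies = np.array([[-1.0], [-2.0]], dtype=np.float64)  # shape (N, ne), placeholders
forces = np.zeros((5, 1, 3), dtype=np.float32)  # shape (M, nf, 3), placeholders


def get_geometry(i):
    """Slice out geometry i using its start and end offsets (end assumed exclusive)."""
    start, end = position_idx_range[i]
    block = atomic_inputs[start:end]
    return {
        "atomic_numbers": block[:, 0].astype(np.int32),
        "charges": block[:, 1].astype(np.int32),
        "positions": block[:, 2:],
        "energies": energies[i],
        "forces": forces[start:end],
    }


water = get_geometry(0)
h2 = get_geometry(1)
print(water["positions"].shape, h2["atomic_numbers"])
```

Because every per-geometry lookup is a contiguous slice, the same indexing would work unchanged if the arrays were opened with `np.memmap` or as Zarr arrays instead of being held in memory.
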
diff --git a/docs/dataset_upload.md b/docs/dataset_upload.md
@@ -1,5 +1,69 @@
-# TODO
+# How to Add a Dataset to OpenQDC
 
-Ask Cristian for now.
+Do you think OpenQDC is missing an important dataset, or that your dataset would be a good fit for OpenQDC?
+If so, you can contribute it to the OpenQDC repository in one of two ways:
 
-- Open PR
+1. Open a PR that adds the new dataset
+2. Request the new dataset through the Google Form
+
+## OpenQDC PR Guidelines
+
+Implement your dataset in the OpenQDC repository by following the guidelines below:
+
+### Dataset class
+
+- The dataset class should be implemented in the `openqdc/datasets` directory.
+- The dataset class should inherit from the `openqdc.datasets.base.BaseDataset` class.
+- Add your `dataset.py` file to the `openqdc/datasets/potential` or `openqdc/datasets/interaction/` directory based on the type of energy.
+- Implement the following for your dataset:
+  - Add the metadata of the dataset:
+    - Docstrings for the dataset class. They should include links and references to the dataset, a short description and, if possible, the sampling strategy used to generate the dataset.
+    - `__links__`: Dictionary of name and link to download the dataset.
+    - `__name__`: Name of the dataset. This will create a folder with the name of the dataset in the cache directory.
+    - The original units for the dataset `__energy_unit__` and `__distance_unit__`.
+    - `__force_mask__`: Boolean indicating whether the dataset has forces, or a list of booleans if multiple force labels are present.
+    - `__energy_methods__`: List of the `QmMethod` methods present in the dataset.
+  - `read_raw_entries(self)` -> `List[Dict[str, Any]]`: Preprocess the raw dataset and return a list of dictionaries containing the data. For an overview of the data format, see the data storage page. Each dictionary should have the following keys:
+    - `atomic_inputs`: Atomic numbers, charges and positions of the atoms in the molecule. `numpy.float32`.
+    - `name`: Name of the molecule. `numpy.object_`.
+    - `subset`: Subset of the dataset the molecule belongs to. `numpy.object_`.
+    - `energies`: Energies of the molecule. `numpy.float64`.
+    - `n_atoms`: Number of atoms in the molecule. `numpy.int32`.
+    - `forces`: Forces of the molecule. [Optional] `numpy.float32`.
+  - Add the dataset import to the `openqdc/datasets/<type_of_dataset>/__init__.py` file and to `openqdc/__init__.py`.
+
+### Test the dataset
+
+Try running the OpenQDC CLI pipeline with the dataset you implemented.
+
+Run the following steps in order:
+
+- Fetch the dataset files
+```bash
+openqdc fetch DATASET_NAME
+```
+
+- Preprocess the dataset
+```bash
+openqdc preprocess DATASET_NAME
+```
+
+- Load it in Python and check that the dataset is loaded correctly.
+```python
+from openqdc import DATASET_NAME
+ds = DATASET_NAME()
+```
+
+If the dataset is correctly loaded, you can open a PR to add the dataset to OpenQDC.
+
+- Add the `dataset` label to your PR.
+
+Our team will review your PR and provide feedback if necessary. If everything is correct, your dataset will be added to OpenQDC remote storage.
+
+## OpenQDC Google Form
+
+Alternatively, you can ask the OpenQDC main development team to take care of the dataset upload for you
+by filling out this [Google Form](https://docs.google.com/forms/d/e/1FAIpQLSeh0YHRn-OoqPpUbrL7G-EOu3LtZC24rtQWwbjJaZ-2V8P2vQ/viewform?usp=sf_link).
+
+The OpenQDC team strives to provide high-quality curation and upload,
+so please be patient while the team reviews the dataset and carries out the steps needed to upload it correctly.
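
Note on `docs/dataset_upload.md`: before opening a PR, it can help to sanity-check the dictionaries returned by `read_raw_entries` against the keys and dtypes listed in the guidelines. The sketch below is a hedged illustration only. `check_entry`, `REQUIRED_DTYPES`, and `example_entry` are hypothetical names that are not part of OpenQDC, and the array shapes in the example entry are assumptions.

```python
import numpy as np

# dtype expected for each required key, taken from the guidelines.
REQUIRED_DTYPES = {
    "atomic_inputs": np.float32,
    "name": np.object_,
    "subset": np.object_,
    "energies": np.float64,
    "n_atoms": np.int32,
}


def check_entry(entry):
    """Hypothetical helper: return a list of problems found in one entry."""
    problems = []
    for key, dtype in REQUIRED_DTYPES.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif entry[key].dtype != dtype:
            problems.append(f"{key}: expected {np.dtype(dtype)}, got {entry[key].dtype}")
    # forces are optional, so only check them when present
    if "forces" in entry and entry["forces"].dtype != np.float32:
        problems.append(f"forces: expected float32, got {entry['forces'].dtype}")
    return problems


# One made-up two-atom entry; shapes are assumptions, values are placeholders.
example_entry = {
    "atomic_inputs": np.zeros((2, 5), dtype=np.float32),
    "name": np.array(["example-molecule"], dtype=object),
    "subset": np.array(["example-subset"], dtype=object),
    "energies": np.zeros((1, 1), dtype=np.float64),
    "n_atoms": np.array([2], dtype=np.int32),
}

print(check_entry(example_entry))
```

An empty list means the entry has all required keys with the expected dtypes. The check says nothing about units or array shapes, which still need to be verified against the dataset's source.
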
diff --git a/docs/index.md b/docs/index.md
@@ -61,4 +61,4 @@ Please cite OpenQDC if you use it in your research: [![DOI](zenodo_badge)](zenod
 
 ## Compatibilities
 
-OpenQDC is compatible with Python >= 3.8 and is tested on Linux, MacOS.
+OpenQDC is compatible with Python >= 3.8 and is tested on Linux, MacOS and Windows.
diff --git a/docs/e0s_and_qm.md → docs/normalization_e0s.md b/docs/e0s_and_qm.md → docs/normalization_e0s.md
@@ -18,8 +18,8 @@ OpenQDC provides the computed the isolated atom energies `e0` for each QM method
 
 We provide support of energies through "physical" and "regression" normalization to conserve the size extensivity of chemical systems.
 Through this normalization, OpenQDC provides a way to transform the potential energy into atomization energy by subtracting the isolated atom energies `e0`, a
-physically interpretable and extensivity-conserving normalization method. Alternatively, we pre-335
-compute the average contribution of each atom species to potential energy via linear or ridge336
+physically interpretable and extensivity-conserving normalization method. Alternatively, we pre-
+compute the average contribution of each atom species to potential energy via linear or ridge
 regression, centering the distribution at 0 and providing uncertainty estimation for the computed
 values. Predicted atomic energies can also be scaled to approximate a standard normal distribution.
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -15,7 +15,26 @@ Or if you want to directly import a specific dataset:
 ```python
 from openqdc import Spice
 # Spice dataset with distance unit in angstrom instead of bohr
-dataset = Spice(distance_unit="ang")
+dataset = Spice(distance_unit="ang",
+                array_format="jax",
+)
+dataset[0]  # dict of JAX arrays
+```
+
+Or if you prefer handling `ase.Atoms` objects:
+
+```python
+dataset.get_ase_atoms(0)
+```
+
+## Iterators
+
+OpenQDC provides a simple way to get the data as iterators:
+
+```python
+for data in dataset.as_iter(atoms=True):
+    print(data) # Atoms object
+    break
 ```
 
 ## Lazy loading

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: "Open Quantum Data Commons (openQDC)"
+site_name: "OpenQDC"
 site_description: "I don't know... Something about data and Quantum stuff I guess :D"
 repo_url: "https://github.com/OpenDrugDiscovery/openQDC"
 repo_name: "openQDC"
@@ -10,19 +10,23 @@ use_directory_urls: false
 docs_dir: "docs"
 
 # Fail on warnings to detect issues with types and docstring
-strict: false
+strict: true
 
 nav:
   - Overview: index.md
-  - Usage: usage.md
-  - CLI: cli.md
+  - Usage:
+    - Base usage : usage.md
+    - CLI: cli.md
   - Available Datasets: datasets.md
-  - QM methods: e0s_and_qm.md
+  - QM methods: normalization_e0s.md
+  - Data structure: data_storage.md
   - Tutorials:
     - Really hard example: tutorials/usage.ipynb
   - API:
-    - e0 and QM methods: API/methods.md
-    - e0 regression: API/regressor.md
+    - QM methods: API/methods.md
+    - Normalization regressor: API/regressor.md
+    - Main class: API/basedataset.md
+    - Format loading: API/formats.md
     - Datasets:
       - Potential Energy:
         - Alchemy : API/datasets/alchemy.md
@@ -59,8 +63,9 @@ nav:
         - Splinter : API/datasets/splinter.md
     - Units: API/units.md
     - Utils: API/utils.md
-  - Contribute: contribute.md
-  - Add a dataset: dataset_upload.md
+  - Contribute:
+    - Maintaining: contribute.md
+    - Add a dataset: dataset_upload.md
   - License: licensing.md
 
 theme:
@@ -79,7 +84,6 @@ extra_css:
 
 extra_javascript:
   - javascripts/config.js
-  - https://polyfill.io/v3/polyfill.min.js?features=es6
   - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
   #- https://unpkg.com/mermaid@10.9.0/dist/mermaid.min.js