Merge remote-tracking branch 'origin/icml_push' into docstring-addn
shenoynikhil committed Jul 17, 2024
2 parents 98ebb91 + d291515 commit 863914a
Showing 35 changed files with 835 additions and 367 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
@@ -16,8 +16,8 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
os: ["ubuntu-latest"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
os: ["ubuntu-latest", "macos-latest", "windows-latest"]

runs-on: ${{ matrix.os }}
timeout-minutes: 30
511 changes: 352 additions & 159 deletions LICENSE

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/API/basedataset.md
@@ -0,0 +1 @@
::: openqdc.datasets.base
1 change: 1 addition & 0 deletions docs/API/formats.md
@@ -0,0 +1 @@
::: openqdc.datasets.structure
Binary file added docs/assets/StorageView.png
33 changes: 33 additions & 0 deletions docs/data_storage.md
@@ -0,0 +1,33 @@
## Dataset structure

For a dataset with N geometries, M atoms across all geometries, ne energy labels,
and nf force labels, we use zarr or memory-mapped arrays of various sizes:

- (M, 5) for atomic numbers (1),
charges (1), and positions (3) of individual geometries;

- (N, 2) for the beginning and end indices of
each geometry in the previous array;

- (N, ne) for the energy labels of each geometry, extendable to
store other geometry-level QM properties such as HOMO-LUMO gap;

- (M, nf, 3) for the force labels
of each geometry, extendable to store other atom-level QM properties.
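The layout above can be sketched with plain NumPy arrays. This is only an illustration under stated assumptions: the variable names (`atomic_inputs`, `position_idx_range`, `get_geometry`), the toy coordinates, and the half-open `[start, end)` index convention are hypothetical and not taken from the OpenQDC code.

```python
import numpy as np

# Two toy geometries: water (3 atoms) and H2 (2 atoms), so N = 2 and M = 5.
n_energy, n_force = 2, 1  # ne and nf, chosen arbitrarily for this sketch

# (M, 5): atomic number, charge, x, y, z for every atom of every geometry
atomic_inputs = np.array(
    [
        [8, 0, 0.00, 0.00, 0.00],
        [1, 0, 0.96, 0.00, 0.00],
        [1, 0, -0.24, 0.93, 0.00],
        [1, 0, 0.00, 0.00, 0.00],
        [1, 0, 0.74, 0.00, 0.00],
    ],
    dtype=np.float32,
)
# (N, 2): start and end row of each geometry (assumed half-open: [start, end))
position_idx_range = np.array([[0, 3], [3, 5]], dtype=np.int32)
# (N, ne) energies and (M, nf, 3) forces, zero-filled placeholders
energies = np.zeros((2, n_energy), dtype=np.float64)
forces = np.zeros((5, n_force, 3), dtype=np.float32)


def get_geometry(i):
    """Slice out everything that belongs to geometry i."""
    start, end = position_idx_range[i]
    block = atomic_inputs[start:end]
    return {
        "atomic_numbers": block[:, 0].astype(np.int32),
        "charges": block[:, 1].astype(np.int32),
        "positions": block[:, 2:],
        "energies": energies[i],
        "forces": forces[start:end],
    }


water = get_geometry(0)
print(water["atomic_numbers"], water["positions"].shape)
```

Keeping all atoms in one flat array and indexing it through the (N, 2) table avoids padding geometries to a common size.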


Memory-mapped files give efficient access to data stored on disk or in the cloud without reading
it fully into memory. This enables training on machines with less RAM than the dataset size and
accommodates concurrent reads in multi-GPU training. It also allows very efficient indexing,
batching and iteration.
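As a rough illustration of the idea, the sketch below writes an (M, 5) array to a temporary file and re-opens it with `numpy.memmap`. The file name and array size are made up for the example, and this is not the loading code OpenQDC itself uses.

```python
import os
import tempfile

import numpy as np

n_atoms_total = 1000  # M, arbitrary for this sketch
path = os.path.join(tempfile.mkdtemp(), "atomic_inputs.mmap")

# Write an (M, 5) float32 array to disk once
data = np.arange(n_atoms_total * 5, dtype=np.float32).reshape(n_atoms_total, 5)
data.tofile(path)

# Re-open it lazily: the file is mapped, not loaded
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n_atoms_total, 5))

# Slicing touches only the part of the file backing these rows
chunk = np.asarray(mm[10:13])
print(chunk.shape)
```

Several processes can open the same file in read-only mode (`mode="r"`) at once, which is what makes the approach convenient for concurrent readers.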

![](assets/StorageView.png)


## Formats

We currently support the following formats:

1) Zarr : https://zarr.readthedocs.io/en/stable/index.html

2) Memmap : https://numpy.org/doc/stable/index.html
70 changes: 67 additions & 3 deletions docs/dataset_upload.md
@@ -1,5 +1,69 @@
# TODO
# How to Add a Dataset to OpenQDC

Ask Cristian for now.
Do you think that OpenQDC is missing an important dataset? Do you think your dataset would be a good fit for OpenQDC?
If so, you can contribute by adding your dataset to the OpenQDC repository in one of two ways:

- Open PR
1. Open a PR to add a new dataset
2. Request a new dataset through the Google Form

## OpenQDC PR Guidelines

Implement your dataset in the OpenQDC repository by following the guidelines below:

### Dataset class

- The dataset class should be implemented in the `openqdc/datasets` directory.
- The dataset class should inherit from the `openqdc.datasets.base.BaseDataset` class.
- Add your `dataset.py` file to the `openqdc/datasets/potential/` or `openqdc/datasets/interaction/` directory, depending on the type of energy.
- Implement the following for your dataset:
    - Add the metadata of the dataset:
        - Docstrings for the dataset class. The docstring should give links and references to the dataset, a short description and, if possible, the sampling strategy used to generate the dataset.
        - `__links__`: Dictionary of name and link to download the dataset.
        - `__name__`: Name of the dataset. This will create a folder with the name of the dataset in the cache directory.
        - `__energy_unit__` and `__distance_unit__`: The original units of the dataset.
        - `__force_mask__`: Boolean indicating whether the dataset has forces, or a list of booleans if multiple forces are present.
        - `__energy_methods__`: List of the `QmMethod` methods present in the dataset.
    - `read_raw_entries(self)` -> `List[Dict[str, Any]]`: Preprocess the raw dataset and return a list of dictionaries containing the data. For a better overview of the data format, see the data storage page. Each dictionary should have the following keys:
        - `atomic_inputs`: Atomic inputs of the molecule (atomic numbers, charges, and positions). `numpy.float32`.
        - `name`: Name of the molecule. `numpy.object_`.
        - `subset`: Subset of the dataset the molecule belongs to. `numpy.object_`.
        - `energies`: Energies of the molecule. `numpy.float64`.
        - `n_atoms`: Number of atoms in the molecule. `numpy.int32`.
        - `forces`: Forces of the molecule. [Optional] `numpy.float32`.
- Add the dataset import to the `openqdc/datasets/<type_of_dataset>/__init__.py` file and to `openqdc/__init__.py`.
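To make the expected entry format more concrete, here is a minimal sketch of how one dictionary with the keys listed above could be assembled with NumPy. The helper name `make_entry`, the array shapes, and the example values are assumptions made for illustration; they are not part of the OpenQDC API, and the exact shapes expected by `BaseDataset` should be checked against an existing dataset implementation.

```python
import numpy as np


def make_entry(name, subset, atomic_numbers, charges, positions, energies, forces=None):
    """Hypothetical helper: build one entry with the keys described above."""
    atomic_numbers = np.asarray(atomic_numbers, dtype=np.float32)
    charges = np.asarray(charges, dtype=np.float32)
    positions = np.asarray(positions, dtype=np.float32)
    n_atoms = positions.shape[0]
    entry = {
        # (n_atoms, 5): atomic number, charge, x, y, z
        "atomic_inputs": np.concatenate(
            [atomic_numbers[:, None], charges[:, None], positions], axis=-1
        ),
        "name": np.array([name], dtype=object),
        "subset": np.array([subset], dtype=object),
        # (1, number of energy methods), an assumed layout
        "energies": np.asarray(energies, dtype=np.float64)[None, :],
        "n_atoms": np.array([n_atoms], dtype=np.int32),
    }
    if forces is not None:
        # Passed through unchanged apart from the dtype
        entry["forces"] = np.asarray(forces, dtype=np.float32)
    return entry


# Toy H2 molecule with a single energy label and no forces
entry = make_entry(
    name="H2",
    subset="toy",
    atomic_numbers=[1, 1],
    charges=[0, 0],
    positions=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
    energies=[-1.17],
)
print({key: value.shape for key, value in entry.items()})
```

A `read_raw_entries` implementation would then return a list of such dictionaries, one per geometry.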

### Test the dataset

Try to run the openQDC CLI pipeline with the dataset you implemented.

Run the following commands to fetch, preprocess, and load the dataset:

- Fetch the dataset files
```bash
openqdc fetch DATASET_NAME
```

- Preprocess the dataset
```bash
openqdc preprocess DATASET_NAME
```

- Load it in Python and check that the dataset is loaded correctly.
```python
from openqdc import DATASET_NAME
ds = DATASET_NAME()
```

If the dataset is correctly loaded, you can open a PR to add the dataset to OpenQDC.

- Add the `dataset` label to your PR.

Our team will review your PR and provide feedback if necessary. If everything is correct, your dataset will be added to OpenQDC remote storage.

## OpenQDC Google Form

Alternatively, you can ask the OpenQDC main development team to take care of the dataset upload for you.
You can fill out the Google Form [here](https://docs.google.com/forms/d/e/1FAIpQLSeh0YHRn-OoqPpUbrL7G-EOu3LtZC24rtQWwbjJaZ-2V8P2vQ/viewform?usp=sf_link).

The OpenQDC team strives to provide high-quality curation and upload,
so please be patient while the team reviews the dataset and carries out the steps needed to upload it correctly.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -61,4 +61,4 @@ Please cite OpenQDC if you use it in your research: [![DOI](zenodo_badge)](zenod

## Compatibilities

OpenQDC is compatible with Python >= 3.8 and is tested on Linux, MacOS.
OpenQDC is compatible with Python >= 3.8 and is tested on Linux, macOS and Windows.
4 changes: 2 additions & 2 deletions docs/e0s_and_qm.md → docs/normalization_e0s.md
@@ -18,8 +18,8 @@ OpenQDC provides the computed the isolated atom energies `e0` for each QM method

We support "physical" and "regression" normalization of energies to conserve the size extensivity of chemical systems.
Through this normalization, OpenQDC provides a way to transform the potential energy into atomization energy by subtracting the isolated atom energies `e0`, a
physically interpretable and extensivity-conserving normalization method. Alternatively, we pre-335
compute the average contribution of each atom species to potential energy via linear or ridge336
physically interpretable and extensivity-conserving normalization method. Alternatively, we pre-compute
the average contribution of each atom species to potential energy via linear or ridge
regression, centering the distribution at 0 and providing uncertainty estimation for the computed
values. Predicted atomic energies can also be scaled to approximate a standard normal distribution.
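The two normalization modes can be sketched in a few lines of NumPy. Everything here is an assumption made for illustration: the `e0` values are invented, the function names are hypothetical, and the regression is a plain ridge or least-squares fit without the uncertainty estimation mentioned above. It is not the OpenQDC regressor.

```python
import numpy as np

# Hypothetical isolated atom energies (e0) keyed by atomic number, arbitrary units
e0 = {1: -0.5, 6: -37.8, 8: -75.0}


def atomization_energy(total_energy, atomic_numbers):
    """'Physical' normalization: subtract the sum of isolated atom energies."""
    return total_energy - sum(e0[z] for z in atomic_numbers)


def regress_atom_energies(compositions, energies, ridge=0.0):
    """'Regression' normalization: fit a per-species energy contribution.

    compositions: (n_molecules, n_species) atom counts
    energies:     (n_molecules,) potential energies
    """
    X = np.asarray(compositions, dtype=np.float64)
    y = np.asarray(energies, dtype=np.float64)
    # Ridge normal equations; ridge=0 reduces to ordinary least squares
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


# Toy data: columns are the counts of H, C and O in each molecule
counts = np.array([[2, 0, 1], [4, 1, 0], [0, 1, 2], [2, 0, 0]])
true_coef = np.array([-0.6, -38.0, -75.1])
totals = counts @ true_coef

coef = regress_atom_energies(counts, totals)
residual = totals - counts @ coef  # what remains after removing atomic contributions
water_atomization = atomization_energy(-76.4, [8, 1, 1])
print(coef, water_atomization)
```

Both modes subtract a quantity that is linear in the atom counts, which is why the size extensivity of the energy is preserved.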

21 changes: 20 additions & 1 deletion docs/usage.md
@@ -15,7 +15,26 @@ Or if you want to directly import a specific dataset:
```python
from openqdc import Spice
# Spice dataset with distance unit in angstrom instead of bohr
dataset = Spice(distance_unit="ang")
dataset = Spice(distance_unit="ang",
array_format = "jax"
)
dataset[0] # dict of jax array
```

Or if you prefer handling `ase.Atoms` objects:

```python
dataset.get_ase_atoms(0)
```

## Iterators

OpenQDC provides a simple way to get the data as iterators:

```python
for data in dataset.as_iter(atoms=True):
    print(data)  # Atoms object
    break
```

## Lazy loading
24 changes: 14 additions & 10 deletions mkdocs.yml
@@ -1,4 +1,4 @@
site_name: "Open Quantum Data Commons (openQDC)"
site_name: "OpenQDC"
site_description: "I don't know... Something about data and Quantum stuff I guess :D"
repo_url: "https://github.com/OpenDrugDiscovery/openQDC"
repo_name: "openQDC"
@@ -10,19 +10,23 @@ use_directory_urls: false
docs_dir: "docs"

# Fail on warnings to detect issues with types and docstring
strict: false
strict: true

nav:
- Overview: index.md
- Usage: usage.md
- CLI: cli.md
- Usage:
- Base usage : usage.md
- CLI: cli.md
- Available Datasets: datasets.md
- QM methods: e0s_and_qm.md
- QM methods: normalization_e0s.md
- Data structure: data_storage.md
- Tutorials:
- Really hard example: tutorials/usage.ipynb
- API:
- e0 and QM methods: API/methods.md
- e0 regression: API/regressor.md
- QM methods: API/methods.md
- Normalization regressor: API/regressor.md
- Main class: API/basedataset.md
- Format loading: API/formats.md
- Datasets:
- Potential Energy:
- Alchemy : API/datasets/alchemy.md
@@ -59,8 +63,9 @@ nav:
- Splinter : API/datasets/splinter.md
- Units: API/units.md
- Utils: API/utils.md
- Contribute: contribute.md
- Add a dataset: dataset_upload.md
- Contribute:
- Maintaining: contribute.md
- Add a dataset: dataset_upload.md
- License: licensing.md

theme:
@@ -79,7 +84,6 @@ extra_css:

extra_javascript:
- javascripts/config.js
- https://polyfill.io/v3/polyfill.min.js?features=es6
- https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
#- https://unpkg.com/mermaid@10.9.0/dist/mermaid.min.js
