docs: update dataset & readme docs #104

Merged: 5 commits, Mar 17, 2023
28 changes: 15 additions & 13 deletions CONTRIBUTING.md
@@ -13,19 +13,7 @@ Please make a [GitHub account](https://github.com/) prior to implementing a data

For code and data contributions, we recommend you create a [conda environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html). If you do not have conda already installed on your system, we recommend installing [miniconda](https://docs.conda.io/en/latest/miniconda.html):

```bash
conda env create -f conda.yaml # Creates a conda env
conda activate chemnlp # Activate your conda environment
```

Then, please run

```bash
pre-commit install
```

to install the [pre-commit hooks](https://pre-commit.com/). These will automatically format and lint your code upon every commit.
There might be some warnings, e.g., by `flake8`. If you struggle with them, do not hesitate to contact us.
To create your developer environment, please follow the guidelines in the `Installation and set-up` section of [README.md](README.md).

# Implementing a dataset

@@ -37,6 +25,7 @@ With "implementing" we mean the following:
- Make an issue in this repository that you want to add this dataset (we will label this issue and assign it to you)
- Make a PR that adds in a new folder in `data`
- `meta.yaml` describing the dataset in the form that `transform.py` produces. We will use this later to construct the prompts.
> If your dataset has multiple natural splits (e.g., train, test, validation), you can create a `<split>_meta.yaml` for each.

Collaborator

I guess this is one way, but I think we now handle it differently in different datasets by adding a `split_col`.

Collaborator Author

I think that PR handles the case where datasets have been included in a larger benchmark, but my dataset splits are more about the fact that the dataset itself has natural train/test/validation splits based on its usage alone in other papers?

If your dataset is part of a benchmark (here)

If we are specifying the benchmark split in #98, we might also want to add which benchmark the dataset is part of so we can remove it from benchmarking? Although there'll be some duplicate information, I can't see us handling both cases with `split_col`.

TLDR

Collaborator

Thanks, I don't think they are the same thing. What we're perhaps missing at the moment is recording the benchmark in which the dataset has been used. Perhaps my use of "benchmark" has been confusing, as many of our tabular datasets so far are part of MoleculeNet or TDC, which are kind of "benchmarks" with leaderboards, but we haven't decided yet whether we will also use them to benchmark the models ChemNLP produces (because then we would probably also need to drop the molecules that are in the test set of one of the benchmarks from all other training datasets).

The datasets revised in #98 are all part of TDC, which has been used in papers (hopefully with the train/val/test splits that TDC indicates).

Collaborator Author

Yes, agreed, so maybe changing `split_col` to `benchmark_split` and then adding a `benchmark_name` field could work?

- `transform.py`: Python code that transforms the original dataset (linked in `meta.yaml`) into a form that can be consumed by the loader.
For tabular datasets that will mostly involve: removing/merging duplicated entries, renaming columns, and dropping unused columns.
Try to keep the output your `transform.py` produces as lean as possible (i.e., no columns that will not be used); see the sketch below.
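
For orientation only, here is a minimal sketch of what a tabular `transform.py` could look like; the source URL, column names, and output filename below are placeholders, not project conventions:

```python
import pandas as pd


def get_and_transform_data():
    # Placeholder URL -- link the real source in meta.yaml and download it here.
    df = pd.read_csv("https://example.org/raw_dataset.csv")

    # Remove exact duplicates (merge or resolve conflicting entries as needed).
    df = df.drop_duplicates(subset=["smiles"])

    # Rename columns to the names declared in meta.yaml.
    df = df.rename(columns={"smiles": "SMILES", "activity": "is_active"})

    # Keep the output lean: only columns that will actually be used.
    df = df[["SMILES", "is_active"]]

    df.to_csv("data_clean.csv", index=False)


if __name__ == "__main__":
    get_and_transform_data()
```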
@@ -162,3 +151,16 @@ Our first experiments will be based on [Pythia model](https://github.com/Eleuthe
If you are not familiar with LLM training, have a look at this very good guide: [Large-scale language modeling tutorials with PyTorch from TUNiB](https://nbviewer-org.translate.goog/github/tunib-ai/large-scale-lm-tutorials/blob/main/notebooks/01_introduction.ipynb?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp)

Please have a look for the details in the [corresponding section in our proposal](https://docs.google.com/document/d/1C44EKSJRojm39P2CaxnEq-0FGwDRaknKxJ8lZI6xr5M/edit#heading=h.aww08l8o9tti).

## Hugging Face Hub

We have a preference for using the Hugging Face Hub and processing datasets through the [`datasets`](https://github.com/huggingface/datasets) package when storing larger datasets on the [OpenBioML](https://huggingface.co/OpenBioML) hub, as it offers a lot of nice features, such as:

- Easy multiprocessing parallelism for data cleaning
- Version controlling of the datasets as well as our code
- An easy interface to tokenisation and other aspects of model training
- Reuse of utility functions once we have a consistent data structure.

However, don't feel pressured to use this if you're more comfortable contributing an external dataset in another format. We are primarily thinking of using this functionality for processed, combined datasets which are ready for training.

Feel free to reach out to one of the team and read [this guide](https://huggingface.co/docs/datasets/upload_dataset#share-a-dataset-to-the-hub) for more information.
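
As a rough, non-prescriptive illustration of this workflow, the sketch below loads a cleaned CSV with `datasets`, applies a cleaning function with multiprocessing, and pushes the result to the Hub. The file name, column name, and repository name are placeholders:

```python
from datasets import load_dataset


def clean(example):
    # Placeholder cleaning step on a hypothetical "SMILES" column.
    example["SMILES"] = example["SMILES"].strip()
    return example


# Load the output of a transform.py run (placeholder filename).
dataset = load_dataset("csv", data_files="data_clean.csv")

# num_proc gives easy multiprocessing parallelism for data cleaning.
dataset = dataset.map(clean, num_proc=4)

# Requires being logged in (e.g. via `huggingface-cli login`) with write
# access to the OpenBioML organisation; the repository name is made up.
dataset.push_to_hub("OpenBioML/my-chemnlp-dataset")
```

Each `push_to_hub` call creates a new commit in the dataset repository on the Hub, which is what gives us version control of the data alongside our code.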
9 changes: 9 additions & 0 deletions README.md
@@ -54,6 +54,15 @@ pip install -e "chemnlp[dev]" # to install development dependencies

If extra dependencies are required (e.g., for dataset creation) but are not needed for the main package, please add them to the `dataset_creation` variable in `pyproject.toml` and ensure this is reflected in the `conda.yml` file.

Then, please run

```bash
pre-commit install
```

to install the [pre-commit hooks](https://pre-commit.com/). These will automatically format and lint your code upon every commit.
There might be some warnings, e.g., by `flake8`. If you struggle with them, do not hesitate to contact us.

**Note**

If working on model training, request access to the `wandb` project `chemnlp`
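
Once access has been granted, a minimal (hypothetical) way to check that logging works is to start a throwaway run against the project; the run name and metric below are placeholders:

```python
import wandb  # requires `pip install wandb` and a one-time `wandb login`

# Placeholder run name and metric -- replace with your real training loop.
run = wandb.init(project="chemnlp", name="example-debug-run")
run.log({"train/loss": 0.0})
run.finish()
```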