docs: update dataset & readme docs #104

jackapbutler · 2023-03-14T10:45:44Z

Update installation and dataset creation documentation to reflect discussions from Discord

maw501

Nice.

kjappelbaum · 2023-03-15T22:48:55Z

CONTRIBUTING.md

@@ -37,6 +25,7 @@ With "implementing" we mean the following:
 - Make an issue in this repository that you want to add this dataset (we will label this issue and assign it to you)
 - Make a PR that adds in a new folder in `data`
  - `meta.yaml` describing the dataset in the form that `transform.py` produces. We will use this later to construct the prompts.
+    > If your dataset has multiple natural splits (i.e. train, test, validation) you can create a <split>\_meta.yaml for each.


I guess this is one way, but I think we now handle it differently in several datasets, by adding a split_col.

jackapbutler · 2023-03-16

I think that PR handles the case where datasets have been included in a larger benchmark. My dataset splits are different: the dataset itself has natural train, test and validation splits, based on how it has been used on its own in other papers.

If your dataset is part of a benchmark (here)

If we are specifying the benchmark split in #98, we might also want to record which benchmark the dataset is part of, so that we can remove it from benchmarking. There will be some duplicate information, but I can't see us handling both cases with split_col alone.

TLDR

- split_col in #98 (chore: run pre-commit in CI, rework datasets) indicates the benchmarking split if your dataset is part of a benchmark (but we don't currently record which benchmark).
- <split>_meta.yaml in #104 (docs: update dataset & readme docs) indicates a split within the dataset itself. It handles the case where the dataset alone was used in other papers and we want to maintain that original train/test split.
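To make the distinction in the TLDR concrete, here is a minimal sketch of the two conventions. It is an illustration only: the helper names, the `SPLITS` tuple, the `data/my_dataset` path and the default column name `"split"` are all hypothetical, and it assumes that split_col refers to a column in the tabular data whose values label each row's split.

```python
from pathlib import Path

# Hypothetical split names; the thread only mentions train, test and validation.
SPLITS = ("train", "test", "validation")


def split_meta_paths(dataset_dir):
    """Case 1: one <split>_meta.yaml file per natural split of the dataset."""
    return {split: Path(dataset_dir) / f"{split}_meta.yaml" for split in SPLITS}


def rows_by_split(rows, split_col="split"):
    """Case 2: a single table in which a split column labels every row."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[split_col], []).append(row)
    return grouped


if __name__ == "__main__":
    print(split_meta_paths("data/my_dataset"))
    example_rows = [
        {"SMILES": "CCO", "split": "train"},
        {"SMILES": "CCN", "split": "test"},
    ]
    print(rows_by_split(example_rows))
```

In the first case the split lives in the file layout; in the second it lives in the data, which is why the two mechanisms do not obviously replace each other.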

kjappelbaum · 2023-03-16

Thanks, I don't think they are the same thing. What we are perhaps missing at the moment is a record of the benchmark in which the dataset has been used. My use of "benchmark" may have been confusing: many of our tabular datasets so far are part of MoleculeNet or TDC, which are a kind of "benchmark" with leaderboards, but we haven't decided yet whether we will also use them to benchmark the models ChemNLP produces (because then we would probably also need to drop the molecules that are in the test set of any of those benchmarks from all other training datasets).

The datasets revised in #98 are all part of TDC, which has been used in papers (hopefully with the train/val/test splits that TDC indicated).
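The point about dropping benchmark test molecules from other training datasets could look roughly like the sketch below. This is not taken from the repository: the function name and the `SMILES` key are hypothetical, and molecules are compared by exact string, whereas a real pipeline would presumably canonicalize SMILES first.

```python
def drop_benchmark_test_molecules(training_rows, benchmark_test_smiles, smiles_key="SMILES"):
    """Return the training rows whose molecule is not in the benchmark test set.

    Assumption: molecules are matched by exact string equality on `smiles_key`.
    """
    held_out = set(benchmark_test_smiles)
    return [row for row in training_rows if row[smiles_key] not in held_out]


if __name__ == "__main__":
    training = [{"SMILES": "CCO"}, {"SMILES": "c1ccccc1"}, {"SMILES": "CCN"}]
    test_set = ["c1ccccc1"]
    print(drop_benchmark_test_molecules(training, test_set))
```

Applying such a filter across every training dataset requires knowing which benchmark each test split belongs to, which is the missing piece of metadata discussed here.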

jackapbutler · 2023-03-16

Yes, agreed. Maybe renaming split_col to benchmark_split and adding a benchmark_name field could work?
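As a rough illustration of this proposal, the sketch below reads the two suggested fields from a metadata dictionary (for example, the parsed contents of a `meta.yaml`). The helper name and the example values are hypothetical, and it assumes that benchmark_split holds the name of the column carrying the benchmark's split labels; the thread does not settle these details.

```python
def benchmark_info(meta):
    """Return (benchmark_name, benchmark_split), or None if neither field is set.

    Assumption: both fields are optional, but they only make sense together.
    """
    name = meta.get("benchmark_name")
    split = meta.get("benchmark_split")
    if (name is None) != (split is None):
        raise ValueError("benchmark_name and benchmark_split must be set together")
    if name is None:
        return None
    return (name, split)


if __name__ == "__main__":
    # Hypothetical metadata for a dataset that is part of TDC.
    meta = {"name": "example_dataset", "benchmark_name": "TDC", "benchmark_split": "split"}
    print(benchmark_info(meta))
```

Keeping the benchmark name next to the split column would let later tooling tell which held-out molecules belong to which benchmark.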

kjappelbaum

I only have a comment about the yamls for different splits. Otherwise, great addition, many thanks!

jackapbutler added 5 commits March 14, 2023 10:33

- b29fb57 remove duplicate install notes
- 83d40bb docs: add note about multiple splits
- cee4fd3 add note about HF hub
- c2406b0 docs: add precommit to install guidelines
- e2a7621 update with link

jackapbutler changed the title ~~docs: Update dataset & readme docs~~ docs: update dataset & readme docs Mar 14, 2023

jackapbutler requested review from maw501 and kjappelbaum March 15, 2023 10:46

maw501 approved these changes Mar 15, 2023


kjappelbaum reviewed Mar 15, 2023


kjappelbaum added the Awaiting author contribution label Mar 15, 2023

jackapbutler mentioned this pull request Mar 16, 2023

feat: add chebi 20 to datasets #108

Merged

kjappelbaum mentioned this pull request Mar 16, 2023

Add benchmark field #115

Closed

jackapbutler merged commit cef63c8 into OpenBioML:main Mar 17, 2023

jackapbutler deleted the update-dataset-docs branch March 23, 2023 09:55
