Closes #59 - Add CEI dataset #530

napsternxg · 2022-04-30T05:49:49Z

Fixes #59 - Add CEI dataset

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

If the following information is NOT present in the issue, please populate:

Name: CEI
Description: short description of the dataset (or link to social media or blog post)
Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0173132
Data: https://github.com/sb895/chemical-exposure-information-corpus/archive/refs/heads/master.zip

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
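The `BUILDER_CONFIGS` requirement from the checklist can be sketched roughly as follows. `Config` is a simplified, hypothetical stand-in for `BigBioConfig` (the real class has more fields), `has_required_schemas` is an illustrative helper, and the config names `cei_source` and `cei_bigbio_text` are assumptions rather than values taken from this PR:

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for BigBioConfig with only the fields used here."""

    name: str
    schema: str


# Assumed config names; the actual dataloader may name them differently.
BUILDER_CONFIGS = [
    Config(name="cei_source", schema="source"),
    Config(name="cei_bigbio_text", schema="bigbio_text"),
]


def has_required_schemas(configs):
    """Return True if there is a source config and at least one bigbio config."""
    schemas = {config.schema for config in configs}
    has_source = "source" in schemas
    has_bigbio = any(schema.startswith("bigbio_") for schema in schemas)
    return has_source and has_bigbio


ok = has_required_schemas(BUILDER_CONFIGS)
```

A check like this is only a sanity test for the list itself; the test suite mentioned above remains the authoritative validation.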

- Initial commit to add CEI

mariosaenger · 2024-10-28T10:44:54Z

@phlobo I revised the implementation of this dataset. Please have a look at it.

phlobo · 2024-10-30T20:34:24Z

@mariosaenger

I noticed there are some duplicate labels per document:

{'id': '10022290',
 'document_id': '10022290',
 'text': '...',
 'labels': ['Biomonitoring--exposure biomarker--blood--cord blood',
  'Biomonitoring--exposure biomarker--mothers milk',
  'Biomonitoring--exposure biomarker--blood--cord blood',
  'Biomonitoring--exposure biomarker--mothers milk',
  'Biomonitoring--effect marker--physiological parameter']}

This way, the label statistics don't match the ones reported in the paper: e.g., there are 1467 instances of Biomonitoring--exposure biomarker--urine vs 784 in the paper.

I'm not sure I entirely understand the syntax of the source dataset labels (e.g., https://github.com/sb895/chemical-exposure-information-corpus/blob/master/labels/10022290.txt), but duplicate removal after parsing the labels might already do the trick.
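One way to check the statistics against the paper is to count each label at most once per document. A minimal sketch, assuming examples are dicts with a `labels` list as in the snippet above; `count_labels` is a hypothetical helper, not part of the dataloader:

```python
from collections import Counter


def count_labels(examples, dedupe=True):
    """Count label occurrences across documents.

    With dedupe=True, a label repeated within one document counts once.
    """
    counts = Counter()
    for example in examples:
        labels = example["labels"]
        counts.update(set(labels) if dedupe else labels)
    return counts


# Toy input modeled on document 10022290 from the comment above.
examples = [
    {
        "document_id": "10022290",
        "labels": [
            "Biomonitoring--exposure biomarker--blood--cord blood",
            "Biomonitoring--exposure biomarker--mothers milk",
            "Biomonitoring--exposure biomarker--blood--cord blood",
            "Biomonitoring--exposure biomarker--mothers milk",
            "Biomonitoring--effect marker--physiological parameter",
        ],
    }
]

raw = count_labels(examples, dedupe=False)
unique = count_labels(examples, dedupe=True)
```

Comparing `raw` and `unique` over the full corpus would show whether duplicates alone explain the gap between the observed counts and the ones in the paper.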

phlobo · 2024-10-30T20:34:58Z

bigbio/hub/hub_repos/cei/cei.py

+        text_files = sorted(list(base_dir.glob("./text/*.txt")))
+
+        if self.config.schema == "source":
+            # TODO: yield (key, example) tuples in the original dataset schema


Please remove TODO comments

phlobo · 2024-10-30T20:35:03Z

bigbio/hub/hub_repos/cei/cei.py

+                yield key, example
+
+        elif self.config.schema == "bigbio_text":
+            # TODO: yield (key, example) tuples in the bigbio schema


Please remove TODO comments

phlobo · 2024-10-30T20:35:43Z

bigbio/hub/hub_repos/cei/cei.py

+        with open(label_file, encoding="utf-8") as fp:
+            label_text = fp.read()
+
+        labels = [line.strip(" -") for line in LABEL_REGEX.findall(label_text)]


This results in many duplicate labels. Maybe just wrap it in a set?
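One caveat with a plain `set` is that its iteration order for strings is not stable across runs, so the emitted label order could change. A minimal sketch of an order-preserving alternative using `dict.fromkeys`; `dedupe_labels` is a hypothetical helper, and the `parsed` list stands in for the output of the `LABEL_REGEX` step, whose pattern is not shown here:

```python
def dedupe_labels(labels):
    """Drop repeated labels while keeping first-seen order."""
    return list(dict.fromkeys(labels))


# Stand-in for the list produced by the LABEL_REGEX parsing step.
parsed = [
    "Biomonitoring--exposure biomarker--blood--cord blood",
    "Biomonitoring--exposure biomarker--mothers milk",
    "Biomonitoring--exposure biomarker--blood--cord blood",
    "Biomonitoring--exposure biomarker--mothers milk",
    "Biomonitoring--effect marker--physiological parameter",
]

unique_labels = dedupe_labels(parsed)
```

`sorted(set(labels))` would also give a deterministic result if alphabetical order is acceptable.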

phlobo · 2024-10-30T20:36:35Z

bigbio/hub/hub_repos/cei/cei.py

+_DESCRIPTION = """\
+The Chemical Exposure Information (CEI) Corpus consists of 3661 PubMed publication abstracts manually annotated by \
+experts according to a taxonomy. The taxonomy consists of 32 classes in a hierarchy. Zero or more class labels are \
+assigned to each sentence in the corpus.


The corpus does not really contain "sentences", but I guess the description was copied from the original source...

napsternxg and others added 3 commits April 11, 2022 16:20

Fixes bigscience-workshop#59 - Add CEI dataset

352e1d7

- Initial commit to add CEI

Added info. Need to figure our data parsing.

99507a1

Added working code.

105774e

napsternxg requested review from hakunanatasha, jason-fries, sunnnymskang, ruisi-su, galtay, leonweber, sg-wbi and debajyotidatta as code owners April 30, 2022 05:49

napsternxg mentioned this pull request Apr 30, 2022

Create a dataset loader for CEI #59

Open

sg-wbi changed the title ~~Fixes #59 - Add CEI dataset~~ Closes #59 - Add CEI dataset May 9, 2022

mariosaenger self-assigned this Oct 28, 2024

Mario Sänger added 2 commits October 28, 2024 11:22

Merge branch 'main' into cei

ed12144

refactor: Revise implementation of CEI to hub-style integration

52c208d

mariosaenger requested a review from phlobo October 28, 2024 10:44

phlobo requested changes Oct 30, 2024
