add antibody developability from TDC #99

Open
wants to merge 33 commits into
base: main
Changes from 25 commits
817805b
add antibody developability from TDC
phalem Mar 11, 2023
66b78cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2023
bd7fedd
Merge branch 'OpenBioML:main' into add_antibody_developability
phalem Mar 24, 2023
ad2dd8f
Add files via upload
phalem Mar 24, 2023
94c8df1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
0b955ef
Delete data/SAbDab_Chen directory
phalem Mar 24, 2023
a1c0163
Delete data/TAP directory
phalem Mar 24, 2023
6eb651c
Add files via upload
phalem Mar 24, 2023
caf6bab
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
309bfe6
Add files via upload
phalem Mar 24, 2023
22bff01
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
3b7c7f8
Add files via upload
phalem Mar 24, 2023
6aa3608
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
a52db11
Update meta.yaml
phalem Mar 24, 2023
87dc335
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
40ad396
Update meta.yaml
phalem Mar 24, 2023
d626e3f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
5067c90
Add files via upload
phalem Mar 24, 2023
21e3301
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
c6a05c2
Add files via upload
phalem Mar 25, 2023
94e42ed
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 25, 2023
39744aa
Add files via upload
phalem Mar 26, 2023
937ffb9
Merge branch 'OpenBioML:main' into add_antibody_developability
phalem Mar 26, 2023
d5d5187
Add files via upload
phalem Mar 27, 2023
c583f3b
Add files via upload
phalem Mar 27, 2023
339ff9f
Merge branch 'OpenBioML:main' into add_antibody_developability
phalem Mar 29, 2023
d7a6192
Add benchmark to transform.py and remove for other
phalem Mar 29, 2023
864d880
Merge branch 'main' into add_antibody_developability
MicPie Apr 18, 2023
09f5899
feat: sabdab_chen clean up
MicPie Apr 18, 2023
f141055
feat: tap clean up train split setup
MicPie Apr 19, 2023
9ba3e1c
feat: update new names setup for sabdab_chen
MicPie Apr 27, 2023
2df8eb6
Merge branch 'main' into add_antibody_developability
MicPie Apr 27, 2023
aebcfd7
feat: update new names setup for tap
MicPie Apr 27, 2023
84 changes: 84 additions & 0 deletions data/sabdab_chen/meta.yaml
@@ -0,0 +1,84 @@
name: sabdab_chen
description: |-
  Antibody data from Chen et al., processed from SAbDab.
  From an initial dataset of 3816 antibodies, they retained 2426 antibodies that
  satisfy the following criteria: 1. have both sequence (FASTA) and Protein Data
  Bank (PDB) structure files, 2. contain both a heavy chain and a light chain,
  and 3. have crystal structures with resolution < 3 Angstrom. The developability
  index (DI) label is derived from BIOVIA's pipelines.
targets:
  - id: developability
    description: whether a functional antibody candidate can be developed into a manufacturable
      antibody drug (1) or not (0)
    units: ''
    type: categorical
    names:
      - antibody developability
      - monoclonal antibody
      - functional antibody candidate
      - manufacturable, stable, safe, and effective antibody drug
    uris:
      - https://rb.gy/idkdqp
      - https://rb.gy/b8cx8i
benchmarks:
  - name: TDC
    link: https://tdcommons.ai/
    split_column: split
identifiers:
  - id: antibody_pdb_ID
    type: Other
    names:
      - pdb id
      - Protein Data Bank id
    description: antibody pdb id
  - id: heavy_chain
    type: Other
    names:
      - Fastq
      - gene sequence
    description: antibody heavy chain amino acid sequence in FASTA
  - id: light_chain
    type: Other
    names:
      - Fastq
      - gene sequence
    description: antibody light chain amino acid sequence in FASTA
license: CC BY 4.0
links:
  - url: https://doi.org/10.1101/2020.06.18.159798
    description: corresponding publication
  - url: https://doi.org/10.1093/nar/gkt1043
    description: corresponding publication
  - url: https://www.3ds.com/products-services/biovia/products/data-science/pipeline-pilot/
    description: corresponding tools used
  - url: https://tdcommons.ai/single_pred_tasks/develop/#sabdab-chen-et-al
    description: data source
num_points: 2409
bibtex:
  - |-
    @article{Chen2020,
    doi = {10.1101/2020.06.18.159798},
    url = {https://doi.org/10.1101/2020.06.18.159798},
    year = {2020},
    month = jun,
    publisher = {Cold Spring Harbor Laboratory},
    author = {Xingyao Chen and Thomas Dougherty and
    Chan Hong and Rachel Schibler and Yi Cong Zhao and
    Reza Sadeghi and Naim Matasci and Yi-Chieh Wu and Ian Kerman},
    title = {Predicting Antibody Developability from Sequence
    using Machine Learning}
    }
  - |-
    @article{Dunbar2013,
    doi = {10.1093/nar/gkt1043},
    url = {https://doi.org/10.1093/nar/gkt1043},
    year = {2013},
    month = nov,
    publisher = {Oxford University Press ({OUP})},
    volume = {42},
    number = {D1},
    pages = {D1140--D1146},
    author = {James Dunbar and Konrad Krawczyk and Jinwoo Leem
    and Terry Baker and Angelika Fuchs and Guy Georges and Jiye Shi and
    Charlotte M. Deane},
    title = {SAbDab: the structural antibody database},
    journal = {Nucleic Acids Research}
    }
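BibTeX entries stored as YAML block scalars are easy to truncate without noticing. A minimal sketch of a brace-balance check that could be run over each `bibtex` item is shown here; the helper name `bibtex_braces_balanced` and the sample entries are hypothetical and not part of this PR, and the check deliberately ignores escaped braces such as `\{`.

```python
def bibtex_braces_balanced(entry: str) -> bool:
    """Return True if every '{' in the entry has a matching '}'."""
    depth = 0
    for char in entry:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth < 0:  # closing brace with nothing open
                return False
    return depth == 0


# Placeholder entries for illustration only.
complete = "@article{Key2020,\ntitle = {A title}\n}"
truncated = "@article{Key2020,\ntitle = {A title}"

print(bibtex_braces_balanced(complete))   # True
print(bibtex_braces_balanced(truncated))  # False
```

A check like this only catches unbalanced braces; it does not validate field names or entry types.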
170 changes: 170 additions & 0 deletions data/sabdab_chen/transform.py
@@ -0,0 +1,170 @@
import pandas as pd
import yaml
from tdc.single_pred import Develop


def get_and_transform_data():
    # get raw data
    target_subfolder = "SAbDab_Chen"
    data = Develop(name=target_subfolder)

    # process raw data
    df = data.get_data()
    fields_orig = df.columns.tolist()
    assert fields_orig == ["Antibody_ID", "Antibody", "Y"]

    fn_data_original = "data_original.csv"

    # the parsing below treats each "Antibody" entry as a stringified
    # two-item list: heavy chain first, light chain second
    antibody_list = df.Antibody.tolist()

    def s2l(list_string):
        """Split a stringified list into its stripped, unquoted items."""
        return list(
            map(str.strip, list_string.strip("][").replace("'", "").split(","))
        )

    df["heavy_chain"] = [s2l(x)[0] for x in antibody_list]
    df["light_chain"] = [s2l(x)[1] for x in antibody_list]
    df = df[["Antibody_ID", "heavy_chain", "light_chain", "Y"]]
    df.to_csv(fn_data_original, index=False)

    # load raw data and assert columns
    df = pd.read_csv(fn_data_original, sep=",")
    fields_orig = df.columns.tolist()
    assert fields_orig == ["Antibody_ID", "heavy_chain", "light_chain", "Y"]
    fields_clean = ["antibody_pdb_ID", "heavy_chain", "light_chain", "developability"]
    df.columns = fields_clean
    assert not df.duplicated().sum()

    # save to csv
    fn_data_csv = "data_clean.csv"
    df.to_csv(fn_data_csv, index=False)

    meta = {
        "name": "sabdab_chen",  # unique identifier, we will also use this for directory names
        "description": """Antibody data from Chen et al., processed from SAbDab.
From an initial dataset of 3816 antibodies, they retained 2426 antibodies that
satisfy the following criteria: 1. have both sequence (FASTA) and Protein Data
Bank (PDB) structure files, 2. contain both a heavy chain and a light chain,
and 3. have crystal structures with resolution < 3 Angstrom. The developability
index (DI) label is derived from BIOVIA's pipelines.""",
        "targets": [
            {
                "id": "developability",  # name of the column in a tabular dataset
                "description": "whether a functional antibody candidate can be developed into a manufacturable antibody drug (1) or not (0)",
                "units": "",  # units of the values in this column (leave empty if unitless)
                "type": "categorical",  # can be "categorical", "ordinal", "continuous"
                "names": [  # names for the property (to sample from for building the prompts)
                    "antibody developability",
                    "monoclonal antibody",
                    "functional antibody candidate",
                    "manufacturable, stable, safe, and effective antibody drug",
                ],
                "uris": [
                    "https://rb.gy/idkdqp",
                    "https://rb.gy/b8cx8i",
                ],
            },
        ],
        "benchmarks": [
            {
                "name": "TDC",  # unique benchmark name
                "link": "https://tdcommons.ai/",  # benchmark URL
                "split_column": "split",  # name of the column that contains the split information
            },
        ],
        "identifiers": [
            {
                "id": "antibody_pdb_ID",  # column name
                "type": "Other",  # can be "SMILES", "SELFIES", "IUPAC", "Other"
                "names": [
                    "pdb id",
                    "Protein Data Bank id",
                ],
                "description": "antibody pdb id",  # description (optional, except for "Other")
            },
            {
                "id": "heavy_chain",  # column name
                "type": "Other",  # can be "SMILES", "SELFIES", "IUPAC", "Other"
                "names": [
                    "Fastq",
                    "gene sequence",
                ],
                "description": "antibody heavy chain amino acid sequence in FASTA",  # description (optional, except for "Other")
            },
            {
                "id": "light_chain",  # column name
                "type": "Other",  # can be "SMILES", "SELFIES", "IUPAC", "Other"
                "names": [
                    "Fastq",
                    "gene sequence",
                ],
                "description": "antibody light chain amino acid sequence in FASTA",  # description (optional, except for "Other")
            },
        ],
        "license": "CC BY 4.0",  # license under which the original dataset was published
        "links": [  # list of relevant links (original dataset, other uses, etc.)
            {
                "url": "https://doi.org/10.1101/2020.06.18.159798",
                "description": "corresponding publication",
            },
            {
                "url": "https://doi.org/10.1093/nar/gkt1043",
                "description": "corresponding publication",
            },
            {
                "url": "https://www.3ds.com/products-services/biovia/products/data-science/pipeline-pilot/",
                "description": "corresponding tools used",
            },
            {
                "url": "https://tdcommons.ai/single_pred_tasks/develop/#sabdab-chen-et-al",
                "description": "data source",
            },
        ],
        "num_points": len(df),  # number of datapoints in this dataset
        "bibtex": [
            """@article{Chen2020,
doi = {10.1101/2020.06.18.159798},
url = {https://doi.org/10.1101/2020.06.18.159798},
year = {2020},
month = jun,
publisher = {Cold Spring Harbor Laboratory},
author = {Xingyao Chen and Thomas Dougherty and
Chan Hong and Rachel Schibler and Yi Cong Zhao and
Reza Sadeghi and Naim Matasci and Yi-Chieh Wu and Ian Kerman},
title = {Predicting Antibody Developability from Sequence
using Machine Learning}
}""",
            """@article{Dunbar2013,
doi = {10.1093/nar/gkt1043},
url = {https://doi.org/10.1093/nar/gkt1043},
year = {2013},
month = nov,
publisher = {Oxford University Press ({OUP})},
volume = {42},
number = {D1},
pages = {D1140--D1146},
author = {James Dunbar and Konrad Krawczyk and Jinwoo Leem
and Terry Baker and Angelika Fuchs and Guy Georges and Jiye Shi and
Charlotte M. Deane},
title = {SAbDab: the structural antibody database},
journal = {Nucleic Acids Research}
}""",
        ],
    }

    def str_presenter(dumper, data):
        """configures yaml for dumping multiline strings
        Ref: https://stackoverflow.com/questions/8640959/how-can-i-control-what-scalar-form-pyyaml-uses-for-my-data
        """
        if data.count("\n") > 0:  # check for multiline string
            return dumper.represent_scalar("tag:yaml.org,2002:str", data, style="|")
        return dumper.represent_scalar("tag:yaml.org,2002:str", data)

    yaml.add_representer(str, str_presenter)
    yaml.representer.SafeRepresenter.add_representer(
        str, str_presenter
    )  # to use with safe_dump
    fn_meta = "meta.yaml"
    with open(fn_meta, "w") as f:
        yaml.dump(meta, f, sort_keys=False)

    print(f"Finished processing {meta['name']} dataset!")


if __name__ == "__main__":
    get_and_transform_data()
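The chain-splitting step in this file parses each `Antibody` value by stripping brackets and quotes and splitting on commas. Assuming those values are valid Python list literals (which the strip-and-split logic suggests but the diff does not confirm), `ast.literal_eval` from the standard library might be a sturdier alternative. The helper name `split_chains` and the sequences below are placeholders for illustration, not part of this PR.

```python
import ast


def split_chains(antibody_field: str) -> tuple[str, str]:
    """Parse a stringified [heavy, light] list into its two sequences."""
    chains = ast.literal_eval(antibody_field)  # safely evaluates the literal
    if len(chains) != 2:
        raise ValueError(f"expected two chains, got {len(chains)}")
    heavy, light = (chain.strip() for chain in chains)
    return heavy, light


# Placeholder sequences, not taken from the dataset.
heavy, light = split_chains("['EVQLVESGG', 'DIQMTQSPS']")
print(heavy, light)  # EVQLVESGG DIQMTQSPS
```

Unlike the comma split, this raises an error on malformed input instead of silently returning a wrong item.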
118 changes: 118 additions & 0 deletions data/tap/meta.yaml
@@ -0,0 +1,118 @@
name: tap
description: |-
  Immunogenicity, instability, self-association,
  high viscosity, polyspecificity, or poor expression can all preclude
  an antibody from becoming a therapeutic. Early identification of these
  negative characteristics is essential. Akin to the Lipinski guidelines,
  which measure druglikeness in small molecules,
  Therapeutic Antibody Profiler (TAP) highlights antibodies
  that possess characteristics that are rare/unseen in
  clinical-stage mAb therapeutics.
targets:
  - id: CDR_Length
    description: total length of the complementarity-determining regions (CDRs)
    units: ''
    type: continuous
    names:
      - Antibody Complementarity-determining regions length
      - Therapeutic Antibody Profiler
      - antibody developability
      - monoclonal antibody
    uris:
      - https://rb.gy/s9gv88
      - https://rb.gy/km77hq
      - https://rb.gy/b8cx8i
  - id: PSH
    description: patches of surface hydrophobicity
    units: ''
    type: continuous
    names:
      - antibody patches of surface hydrophobicity
      - Therapeutic Antibody Profiler
      - antibody developability
      - monoclonal antibody
    uris:
      - https://rb.gy/bchhaa
      - https://rb.gy/2irr4l
      - https://rb.gy/b8cx8i
  - id: PPC
    description: patches of positive charge
    units: ''
    type: continuous
    names:
      - patches of positive charge
      - Therapeutic Antibody Profiler
      - antibody developability
      - monoclonal antibody
    uris:
      - https://rb.gy/b8cx8i
  - id: PNC
    description: patches of negative charge
    units: ''
    type: continuous
    names:
      - antibody patches of negative charge
      - Therapeutic Antibody Profiler
      - antibody developability
      - monoclonal antibody
    uris:
      - https://rb.gy/b8cx8i
  - id: SFvCSP
    description: structural Fv charge symmetry parameter
    units: ''
    type: continuous
    names:
      - antibody structural Fv charge symmetry parameter
      - Therapeutic Antibody Profiler
      - antibody developability
      - monoclonal antibody
    uris:
      - https://rb.gy/uxyhc3
      - https://rb.gy/b8cx8i
benchmarks:
  - name: TDC
    link: https://tdcommons.ai/
    split_column: split
identifiers:
  - id: antibody_name
    type: Other
    names:
      - Name of the antibody
      - Name of the antibody drug
      - Name of drug
    description: antibody name
  - id: heavy_chain
    type: Other
    names:
      - Fastq
      - gene sequence
    description: antibody heavy chain amino acid sequence
  - id: light_chain
    type: Other
    names:
      - Fastq
      - gene sequence
    description: antibody light chain amino acid sequence
license: CC BY 4.0
links:
  - url: https://doi.org/10.1073/pnas.1810576116
    description: corresponding publication
  - url: https://tdcommons.ai/single_pred_tasks/develop/#tap
    description: data source
num_points: 241
bibtex:
  - |-
    @article{Raybould2019,
    doi = {10.1073/pnas.1810576116},
    url = {https://doi.org/10.1073/pnas.1810576116},
    year = {2019},
    month = feb,
    publisher = {Proceedings of the National Academy of Sciences},
    volume = {116},
    number = {10},
    pages = {4025--4030},
    author = {Matthew I. J. Raybould and Claire Marks and Konrad Krawczyk
    and Bruck Taddese and Jaroslaw Nowak and Alan P. Lewis and Alexander Bujotzek
    and Jiye Shi and Charlotte M. Deane},
    title = {Five computational developability guidelines for therapeutic antibody profiling},
    journal = {Proceedings of the National Academy of Sciences}
    }
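Because this metadata declares target, identifier, and split columns by name, it may be worth checking that each declared column actually exists in the cleaned table before merging. The following is a minimal sketch of such a check; the helper `missing_columns`, the trimmed `meta` dict, and the `columns` list are illustrative assumptions and do not reflect the real contents of the dataset file.

```python
def missing_columns(meta: dict, columns: list) -> list:
    """List column ids declared in meta that are absent from the table."""
    declared = [entry["id"] for entry in meta.get("targets", [])]
    declared += [entry["id"] for entry in meta.get("identifiers", [])]
    declared += [
        bench["split_column"]
        for bench in meta.get("benchmarks", [])
        if "split_column" in bench
    ]
    return [name for name in declared if name not in columns]


# Trimmed, hypothetical metadata and column list for illustration.
meta = {
    "targets": [{"id": "CDR_Length"}, {"id": "PSH"}],
    "identifiers": [{"id": "antibody_name"}],
    "benchmarks": [{"name": "TDC", "split_column": "split"}],
}
columns = ["antibody_name", "CDR_Length", "PSH"]

print(missing_columns(meta, columns))  # ['split']
```

In practice the `meta` dict would come from loading `meta.yaml` and `columns` from the header of the cleaned CSV.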