
# Data organization

After running the steps that prepare the datasets, train the tokenizers, and train the models, the data directory is organized as follows:

```
.
├── emolecules
│   ├── analysis    < graphical comparative distributions of chemical features >
│   ├── dataset     < step by step intermediary datasets >
│   ├── download    < source data (automatically downloaded) >
│   ├── sampling    < sampled subsets >
│   ├── splitting   < train / valid / test subsets from sampling >
│   │   ├── test.tsv
│   │   ├── train.tsv
│   │   └── valid.tsv
│   └── tokens      < SentencePiece tokenizer models and statistics >
├── metanetx
│   ├── analysis
│   ├── dataset
│   │   ├── 1_reshaped.tsv
│   │   ├── 2_filtered.tsv
│   │   ├── 3_fingerprints.tsv
│   │   └── 4_deduped.tsv
│   ├── download
│   ├── sampling
│   ├── splitting   < 5-fold split subsets >
│   │   ├── test.tsv
│   │   ├── train_fold0.tsv
│   │   ├── train_fold1.tsv
│   │   ├── train_fold2.tsv
│   │   ├── train_fold3.tsv
│   │   ├── train_fold4.tsv
│   │   ├── valid_fold0.tsv
│   │   ├── valid_fold1.tsv
│   │   ├── valid_fold2.tsv
│   │   ├── valid_fold3.tsv
│   │   └── valid_fold4.tsv
│   └── tokens
├── models          < pre-trained and fine-tuned models for SMILES and ECFP tokenization >
└── tokens          < SentencePiece tokenizer models (copied from emolecules) >
    ├── ECFP.model
    ├── ECFP.vocab
    ├── SMILES.model
    └── SMILES.vocab
```
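
As a quick illustration, the sketch below shows one way these artifacts might be consumed once they have been generated. It is not part of the pipeline itself: it assumes the working directory is this data directory, that Python with `pandas` and `sentencepiece` is installed, and that the TSV files contain a header row; column names are not shown here, so adapt the data handling to the actual files.

```python
# Minimal usage sketch (assumptions noted above, not part of the pipeline).
import pandas as pd
import sentencepiece as spm

# Load one cross-validation fold produced for metanetx.
train = pd.read_csv("metanetx/splitting/train_fold0.tsv", sep="\t")
valid = pd.read_csv("metanetx/splitting/valid_fold0.tsv", sep="\t")
print(train.shape, valid.shape)

# Load the SentencePiece tokenizer model copied into tokens/.
sp = spm.SentencePieceProcessor(model_file="tokens/SMILES.model")

# Tokenize an example SMILES string (ethanol) into sub-word pieces.
print(sp.encode("CCO", out_type=str))
```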