This repo holds the source code for the paper Multimodal Performers for Genomic Selection and Crop Yield Prediction, doi: https://doi.org/10.1016/j.atech.2021.100017.
There are a multitude of possible arguments that can be set by the user. However, here are the most important arguments:
This argument sets the type of network to train.
There are 5 different networks to choose from:
- Vanilla CNN: –network 1
- ResNet: –network 2
- Performer: –network 3
- Historical Performer: –network 4
- Multimodal Performer: –network 5
This argument is used to tell the code where the folder containing the different datasets is. It expects a path to a folder containing one, or more, sub-folders each containing a dataset in the form of a lmdb database.
When exploring options, we used several potential ways of generating the
–dataset_dir path/to/datasets/folder
This argument specifies the particular dataset of the potentially many options in the directory specified in the previous arg.
–dataset dataset-name
The combination of the dataset_dir argument and dataset argument will specify the complete path to the dataset used for training the network as such:
Actual Training path = concatenate(dataset_dir, dataset)
Thus this argument should only contain the name of the dataset.
This argument specifies whether the code should expect historical weather data or mean weather data. If set to 0, it expects mean data, if set to 1 it expects historical data.
Historical data is expected to be a tensor of the following shape: [batch_size, history_length, num_recordings], where history_length refers to the number of days recorded and num_recordings refers to the number of different types of recordings included in the data. So for data containing one precipitation and one temperature recording per historical step, the num_recordings is 2.
!NOTE! This code expects the num_recordings for weather data to be 2. If your data is different, you must change it in the code. !NOTE!
–historical_weather 1
This argument specifies the length of the genome (number of SNP recordings included) in the dataset.
–gene_length 13321
This argument specifies the size of each SNP recording. If one records for i.e. additive effects, this should be set to 4 (-1, 0, 1, NA).
–gene_size 4
This code expects the scaling of y values to be done using one of the scalers found in the scikit learn preprocessing library. These scalers can be saved so that the inverse scaling transform can be used. By specifying the inverste_transform a saved scaler will be loaded and its inverse will be used to calculate the metrics on the y values, rather than the scaled ones.
–inverse_transform /path/to/scaler.save
Here’s an example of a terminal call: python main.py –network 2 –inverse_transform data/dataset_dbs/all-row/13112021_log_y_scaler.save –datasets_dir /data/dataset_dbs/all-row –dataset 13112021_mean_recoded_std_z_log_y –historical_weather 0 –gene_length 10679 –gene_size 4
This code base includes a conda environment named “environment.yml”. This environment has been verified to be able to run the code. To install the environment, first make sure you have anaconda installed. Then, run the command “conda env create -f environment.yml” in the command line.
You can now activate the environment by:
- making sure you have activated conda (source /path/to/anaconda/bin/activate)
- activating the environment by running conda activate genomic_selection
Your terminal prompt should change to include something like “(genomic_selection) your_username@machine_hostname” to indicate a successful activation.
You can now run the code using the arguments described above.
Pythorch-lightning does not like the HyperOptArgumentParser introduced in the test-tube package. To fix errors related to this you need to replace the following line in the code:
Starting at line 265 replace:
for k in params.keys():
# convert relevant np scalars to python types first (instead of str)
if isinstance(params[k], (np.bool_, np.integer, np.floating)):
params[k] = params[k].item()
elif type(params[k]) not in [bool, int, float, str, torch.Tensor]:
params[k] = str(params[k])
return params
with:
import types
for k in params.keys():
# convert relevant np scalars to python types first (instead of str)
if isinstance(params[k], (np.bool_, np.integer, np.floating)):
params[k] = params[k].item()
elif type(params[k]) not in [bool, int, float, str, torch.Tensor, types.MethodType]:
params[k] = str(params[k])
return params
If you use this work, please cite it using the following bibtex:
@article{MALOY2021100017,
title = {Multimodal performers for genomic selection and crop yield prediction},
journal = {Smart Agricultural Technology},
volume = {1},
pages = {100017},
year = {2021},
issn = {2772-3755},
doi = {https://doi.org/10.1016/j.atech.2021.100017},
url = {https://www.sciencedirect.com/science/article/pii/S2772375521000174},
author = {Håkon Måløy and Susanne Windju and Stein Bergersen and Muath Alsheikh and Keith L. Downing},
keywords = {Genomic selection, Yield prediction, Deep learning, Attention models, Barley},
abstract = {Working towards optimal crop yields is a crucial step towards securing a stable food supply for the world. To this end, approaches to model and predict crop yields can help speed up research and reduce costs. However, crop yield prediction is very challenging due to the dependencies on factors such as genotype and environmental factors. In this paper we introduce a performer-based deep learning framework for crop yield prediction using single nucleotide polymorphisms and weather data. We compare the proposed models with traditional Bayesian-based methods and traditional neural network architectures on the task of predicting barley yields across 8 different locations in Norway for the years 2017 and 2018. We show that the performer-based models significantly outperform the traditional approaches, achieving an R2 score of 0.820 and a root mean squared error of 69.05, compared to 0.807 and 71.63, and 0.076 and 149.78 for the best traditional neural network and traditional Bayesian approach respectively. Furthermore, we show that visualizing the self-attention maps of a Multimodal Performer network indicates that the model makes meaningful connections between genotype and weather data that can be used by the breeder to inform breeding decisions and shorten breeding cycle length. The performer-based models can also be applied to other types of genomic selection such as salmon breeding for increased Omega-3 fatty acid production or similar animal husbandry applications. The code is available at: https://github.com/haakom/pay-attention-to-genomic-selection.}
}
Copyright (C) 2022 Håkon Måløy
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.