Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNPUpdate Bugs #4

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions .github/workflows/build-sif.yml
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for putting together a yaml and a new singularity def; the one I had is a bit outdated

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgopalan98 I think this all looks great and this is how I would expect to approach the indel issue as well; have you tested to see if the gtc output validates ok w/ what is expected? If so, I can merge the PR

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you also want to be added as a collab let me know. Happy to add you to this repo, although I think you all have your own version of this now but happy to add you nonetheless

Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
name: Build SIF from Def File

on:
push:
workflow_dispatch:

jobs:
build:
runs-on: ubuntu-latest

steps:
- name: Check out code
uses: actions/checkout@v2

- name: Setup Singularity
uses: eWaterCycle/setup-singularity@v7
with:
singularity-version: '3.8.3'

- name: Install debootstrap
run: sudo apt-get update && sudo apt-get install -y debootstrap

- name: Build SIF from Definition File
run: |
singularity build --fakeroot GThaCk.sif GThaCk.def

- name: Upload SIF as Artifact
uses: actions/upload-artifact@v2
with:
name: GThaCk.sif
path: GThaCk.sif
105 changes: 105 additions & 0 deletions GThaCk.def
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
Bootstrap: debootstrap
OSVersion: bionic
MirrorURL: http://us.archive.ubuntu.com/ubuntu/
Include: apt

%help
This container contains software for the CCPM biobank pertaining to biobank data qc, freeze generation, and gtc manipulations and validations. See documentation of packages at: https://github.com/tbrunetti/GThaCk and https://github.com/Illumina/GTCtoVCF

%setup

%post
echo "Installing all container dependencies!"
apt-get -y update
apt-get install -y software-properties-common
apt-get install -y gpgv gpgsm gnupg-l10n gnupg
apt-get -y install git
add-apt-repository universe
apt-get -y install libpcre2-8-0 libcurl4-openssl-dev
apt-get install -y build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev wget
add-apt-repository ppa:deadsnakes/ppa
apt-get -y update
apt-get -y install autoconf
apt-get -y install automake
apt-get -y install libbz2-dev
apt-get -y install liblzma-dev
apt-get -y install libgsl0-dev
apt-get -y install perl
wget https://github.com/Kitware/CMake/releases/download/v3.15.2/cmake-3.15.2.tar.gz
tar -zxvf cmake-3.15.2.tar.gz
cd cmake-3.15.2
./bootstrap
make
make install
apt-get -y install libssl1.1
apt-get -y install libxml2-dev
apt-get -y install python3
apt-get -y install python3-pip
apt-get -y install python3-dev
apt-get -y install 2to3
apt-get -y install python3-lib2to3
apt-get install python3-toolz
python3 -m pip install --upgrade pip
python3 -m pip install cget
python3 -m pip install numpy
python3 -m pip install pandas
python3 -m pip install matplotlib
python3 -m pip install seaborn
python3 -m pip install pysam
python3 -m pip install pyvcf
python3 -m pip install Cython
which python3
ln -s /usr/bin/python3 /usr/bin/python
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
apt-get -y update

cd /opt/
git clone https://github.com/samtools/htslib.git
cd /opt/htslib/
git submodule update --init --recursive
autoheader
autoconf -Wno-syntax
./configure
make
make install

cd /opt/
git clone https://github.com/samtools/samtools.git
cd /opt/samtools/
autoheader
autoconf -Wno-syntax
./configure
make
make install

cd /opt/
git clone https://github.com/samtools/bcftools.git
cd bcftools
autoheader && autoconf && ./configure --enable-libgsl
make
make install
cd /opt/
git clone https://github.com/sgopalan98/GThaCk.git
cd /opt/GThaCk/
git checkout fixing-bug-manipulate-gtc
cd /opt/GThaCk/modules
mv IlluminaBeadArrayFiles.tar.gz /usr/local/lib/python3.6/dist-packages/
cd /usr/local/lib/python3.6/dist-packages/
tar -zxvf IlluminaBeadArrayFiles.tar.gz
cd /opt/
git clone https://github.com/Illumina/GTCtoVCF.git
cd /opt/GTCtoVCF/scripts
./download_reference.sh 37
./download_reference.sh 38

%files

%environment
export NETCDF_INCLUDE=/usr/include
export PATH="${PATH}:/mnt:/opt/GTCtoVCF:/opt/:/opt/GThaCk/:/usr/bin:/bin:/usr/local/sbin/:/usr/local/bin"
export PYTHONPATH="${PYTHONPATH}:/mnt/:/opt/"
export PYTHONPATH="${PYTHONPATH}:/usr/local/bin/:/usr/local/lib/python3.6/dist-packages/:/usr/local/lib/python3.6/:/usr/bin/:/opt/GThaCk/:/opt/GThaCk/modules"

%runscript
exec "$@"
6 changes: 4 additions & 2 deletions gtcFuncs.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,10 @@

class GtcFunctions:

def __init__(self, bpm, gtcDir, outDir):
def __init__(self, bpm, bpm_csv, gtcDir, outDir):
self.logger = logging.getLogger("classObject")
self.bpm = bpm
self.bpm_csv = bpm_csv
self.gtcDir = gtcDir
self.outDir = outDir

Expand Down Expand Up @@ -133,6 +134,7 @@ def query():
parser = argparse.ArgumentParser(description='Functions and methods for gtc files')
parser.add_argument('method', choices=['manipulateGTCs', 'getIntensities', 'sampleInformation', 'createSampleSheet', 'allCombos'])
parser.add_argument('--bpm', required=True, type=str, help='Full path to bead pool manifest file (.bpm); must be same one used to generate gtc')
parser.add_argument('--bpm-csv',required=True, type=str, help='Full path to bead pool manifest file in CSV form (.csv); must be same one used to generate gtc')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea -- I like this esp since Illumina does provide the csv and that also allows us to modify the csv if needed w/o going through the bpm

parser.add_argument('--gtcDir', type=str, default=os.getcwd(), help='Full path to location of directory/folder containing gtc files to process (files must end in .gtc) -- will not recursively go into subdirectories')
parser.add_argument('--outDir', default=os.getcwd(), type=str,help='Full path to directory or folder to output results. If it path does not exist, program will attempt to create it')
parser.add_argument('--updates', default=None, type=str, help='Full path to file containing snps and/or metadata to update')
Expand Down Expand Up @@ -184,7 +186,7 @@ def query():

if args.method == 'manipulateGTCs':
logger.info('method manipulateGTCs selected \n creating new object of class GtcFunctions')
analysisObj = GtcFunctions(args.bpm, args.gtcDir, args.outDir)
analysisObj = GtcFunctions(args.bpm, args.bpm_csv, args.gtcDir, args.outDir)
analysisObj.manipulateUpdate(args.updates, args.overrides)

elif args.method == 'createSampleSheet':
Expand Down
110 changes: 91 additions & 19 deletions modules/manipulateGTC.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ def manipulate_gtc(self):
logger.debug('In method manipulate_gtc()')

bpm=self.bpm
bpm_csv=self.bpm_csv
gtcDir=self.gtcDir
outDir=self.outDir
snpsToUpdate=self.snpUpdateFile
Expand Down Expand Up @@ -73,31 +74,64 @@ def snpUpdate(data, line):
logger = logging.getLogger('snpUpdate')
logger.debug("In sub-method of manipulate_gtc() -- snpUpdate()")

COMPLEMENT_MAP = dict(zip("ABCDGHKMRTVYNID", "TVGHCDMKYABRNID"))

loc = manifest.names.index(line.rstrip().split()[0])
originalSnp = data[1003][loc]
data[1003][loc] = str(line.rstrip().split()[1]).encode()
if ((str(line.rstrip().split()[1])[0] != str(line.rstrip().split()[1])[1]) and (str(line.rstrip().split()[1])[0] != '-')):
manifest_name = manifest.names[loc]
manifest_name_csv = manifest_csv.names[loc]
assert manifest_name == manifest_name_csv


manifestSnpsStr = manifest.snps[loc].decode()
manifestSnps = [manifestSnpsStr[1], manifestSnpsStr[-2]]
if manifest.ref_strands[loc] == RefStrand.Minus:
manifestSnps = [COMPLEMENT_MAP[snp] for snp in manifestSnps]


newSnpsStr = str(line.rstrip().split()[1])
newSnps = [newSnpsStr[0], newSnpsStr[1]]

# Getting genotype byte
if newSnps[0] != newSnps[1] and newSnps[0] != '-':
data[1002][loc] = 2
elif (str(line.rstrip().split()[1])[0] == '-') and (str(line.rstrip().split()[1])[1] == '-'):
elif newSnps[0] == '-' and newSnps[1] == '-':
data[1002][loc] = 0
elif (str(line.rstrip().split()[1])[0] == str(line.rstrip().split()[1])[1]) and (str(line.rstrip().split()[1])[0] in ['A', 'T', 'G', 'C']) and (str(line.rstrip().split()[1])[1] in ['A', 'T', 'G', 'C']):
if manifest.snps[loc].decode().find(str(line.rstrip().split()[1])[0]) != -1:
data[1002][loc] = manifest.snps[loc].decode().find(str(line.rstrip().split()[1])[0])
data[1003][loc] = '--'.encode()
elif newSnps[0] == newSnps[1]:
if newSnps[0] in manifestSnps:
if newSnps[0] == manifestSnps[0]:
data[1002][loc] = 1
else:
data[1002][loc] = 3
else:
logger.warning('WARNING! {} allele possibilities do not match manifest. {}={} and manifest={}. This snp will not be updated.'
.format(line.rstrip().split()[0],
line.rstrip().split()[0], originalSnp,
manifest.snps[loc]))
print(
'WARNING! {} allele possibilities do not match manifest. {}={} and manifest={}. This snp will not be updated.'
.format(line.rstrip().split()[0],
line.rstrip().split()[0], originalSnp,
manifest.snps[loc]))
sys.stdout.flush()
data[1003][loc] = originalSnp
newSnps = [COMPLEMENT_MAP[snp] for snp in newSnps]
if newSnps[0] == manifestSnps[0]:
data[1002][loc] = 1
else:
data[1002][loc] = 3
else:
pass
print("NO CONDITION MET; SOMETHING IS WRONG")

# Getting base call byte
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be "getting genotype class byte"

if data[1002][loc] == 0:
data[1003][loc] = '--'.encode()
else:
if newSnps[0] in ['I', 'D']:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YES! This is the use case I never added or even thought of b/c at the time I don't think we cared about indel and were exclusively looking at SNPs so I never added the case nor did I ever have a use case that I tested it on, but this makes perfect sense.

if data[1002][loc] == 1:
data[1003][loc] = (manifestSnps[0] + manifestSnps[0]).encode()
elif data[1002][loc] == 2:
data[1003][loc] = (manifestSnps[0] + manifestSnps[1]).encode()
else:
data[1003][loc] = (manifestSnps[1] + manifestSnps[1]).encode()
else:
allele_a = manifest_csv.alleles[loc][0]
allele_b = manifest_csv.alleles[loc][1]
if data[1002][loc] == 1:
data[1003][loc] = (allele_a + allele_a).encode()
elif data[1002][loc] == 2:
data[1003][loc] = (allele_a + allele_b).encode()
else:
data[1003][loc] = (allele_b + allele_b).encode()
return data


Expand Down Expand Up @@ -174,6 +208,44 @@ def snpOverride(manifest, overrides):
###################################################################################################################
logger.debug('Preparing to read in bpm file...')
manifest = BeadPoolManifest(bpm)
def get_manifest_csv():
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for internal doc and as a reminder to myself; this code chunk is to support the input csv as the manifest instead of the bpm

import csv

def read_csv(file_path):
with open(file_path) as f:
reader = csv.reader(f)
for i in range(7):
next(reader)
header = next(reader)
for row in reader:
yield dict(zip(header, row))


file_path = self.bpm_csv
# for row in read_csv(file_path):
# print(row.header)
rows = list(read_csv(file_path))
class Manifest:
def __init__(self):
self.names = []
self.alleles = []

manifest = Manifest()
for row in rows:
if 'Name' not in row or 'TopGenomicSeq' not in row:
continue
name = row['Name']
top_genomic_seq = row['TopGenomicSeq']
# get the string in top_genomic_seq which is in the format [alphabet/alphabet] and which will be in any index in the top_genomic_seq
allele = 'NA'
for i in range(len(top_genomic_seq) - 1):
if top_genomic_seq[i] == '[' and top_genomic_seq[i + 4] == ']':
allele = top_genomic_seq[i + 1], top_genomic_seq[i + 3]
break
manifest.names.append(name)
manifest.alleles.append(allele)
return manifest
manifest_csv = get_manifest_csv()
logger.debug('Successfully loaded BPM file')

#######################
Expand Down