Ongoing transition to DB backend (SQLite) #184

Open
wants to merge 51 commits into
base: new_data_store
Changes from 17 commits
Commits
78e923b
DataStore: small fix in SQL query, add to-do comment
hendrikweisser Mar 9, 2021
e5d7621
Eventalign_collapse: add flag for writing to DB (or TSV file), fix er…
hendrikweisser Mar 9, 2021
51a87e4
Merge branch 'new_data_store' of https://github.com/tleonardi/nanocom…
hendrikweisser Mar 9, 2021
d7edde9
better log message in 'DataStore.__init__'
hendrikweisser Mar 10, 2021
ba2da54
Merge branch 'new_data_store' of https://github.com/tleonardi/nanocom…
hendrikweisser Mar 10, 2021
afe005c
coding style (added spaces for readability)
hendrikweisser Mar 11, 2021
7f60da1
store read-level kmer stats in database (needed for whitelisting)
hendrikweisser Mar 11, 2021
da7eb4b
Whitelist: read data from SQLite, filter reads during query
hendrikweisser Mar 12, 2021
2fefe7c
add function to check validity of sample dictionary to 'common.py'
hendrikweisser Mar 16, 2021
8f52c1c
new class 'DatabaseWrapper' for reusable DB interaction code; adapt '…
hendrikweisser Mar 16, 2021
6121376
Whitelist: fix filtering condition for sample subsets
hendrikweisser Mar 17, 2021
989af13
consolidate database code in 'DataStore', remove 'DatabaseWrapper' (n…
hendrikweisser Mar 17, 2021
6dbbdb2
DataStore: small fixes, move 'DBCreateMode' (enum) to top level
hendrikweisser Mar 23, 2021
031a995
Eventalign_collapse: small fix in 'DataStore' call
hendrikweisser Mar 23, 2021
9cc0806
Whitelist: updated 'DataStore' call
hendrikweisser Mar 23, 2021
a1cb349
SampComp: get data from SQLite DB; some refactoring
hendrikweisser Mar 23, 2021
fd0d446
TxComp: update to changes in 'SampComp'; some refactoring
hendrikweisser Mar 23, 2021
69563e0
refactor DataStore, create child classes 'DS_EventAlign' and 'DS_Samp…
hendrikweisser Mar 29, 2021
b1ad899
use 'DataStore_EventAlign' in 'Eventalign_collapse' and 'Whitelist'
hendrikweisser Mar 29, 2021
b8de4e9
TxComp: simplify 'txCompare' results data structure (remove 'lowCov' …
hendrikweisser Mar 29, 2021
7fae6d8
SampComp: small logging fix and update of 'txCompare' results
hendrikweisser Mar 29, 2021
9e96e8a
DataStore: add functions to store sample information and whitelisted …
hendrikweisser Mar 31, 2021
63429d5
SampComp: store whitelisted reads in SQLite DB
hendrikweisser Mar 31, 2021
edf0081
TxComp: use 'ST' as abbrev. for (Student) t-test
hendrikweisser Apr 6, 2021
536ab64
DataStore: fix DB init (tables creation), rework DB schema for SampCo…
hendrikweisser Apr 6, 2021
24881a4
SampComp: remove unused parameters, write output to SQLite
hendrikweisser Apr 6, 2021
fa0ec95
TxComp: coding style - added whitespace
hendrikweisser Jul 16, 2021
112e50d
DataStore: split 'gmm_results' SQL table into two; improve error logg…
hendrikweisser Jul 16, 2021
6d00f2e
add new class 'PostProcess' for data export etc. (work in progress)
hendrikweisser Jul 16, 2021
d8324c7
SampComp/TxComp/DataStore: limit to one univariate and one GMM-based …
hendrikweisser Jul 21, 2021
a7e9269
DataStore: improve definition of SQL tables
hendrikweisser Jul 21, 2021
5b35d30
TxComp: cosmetic changes
hendrikweisser Jul 21, 2021
833cc2e
DataStore/SampComp: add DB columns for adj. p-values, adapt DB schema…
hendrikweisser Jul 21, 2021
2a35003
TxComp: combine collection of functions into class 'TxComp'
hendrikweisser Jul 21, 2021
3ff8038
DataStore: small bug fixes (add 'self' for method calls)
hendrikweisser Jul 21, 2021
6270cdc
SampComp: use new 'TxComp' class, simplify parameter handling
hendrikweisser Jul 21, 2021
bd3499b
TxComp/DataStore: bug fixes (use of 'sequence_context')
hendrikweisser Jul 21, 2021
39c5009
SampComp: add multiple testing correction, remove 'shelve' export
hendrikweisser Jul 23, 2021
c6faa5e
PostProcess: implement 'save_report' for SQLite data
hendrikweisser Jul 23, 2021
bc3e8ba
PostProcess: remove 'save_shift_stats' (now included in 'save_report')
hendrikweisser Jul 23, 2021
479b443
Eventalign_collapse: remove TSV output option, simplify parameters
hendrikweisser Aug 12, 2021
543653d
SampComp: remove irrelevant data from output queue tuple (thanks Tomm…
hendrikweisser Aug 12, 2021
1721746
common: update function to build dict. with sample information
hendrikweisser Aug 16, 2021
fc3f0bb
DataStore: add to-do comment
hendrikweisser Aug 16, 2021
3219601
main: update command line options
hendrikweisser Aug 16, 2021
e7ef11e
main: fix PostProcess (TSV export) usage
hendrikweisser Aug 17, 2021
f6d82ee
main: update CLI documentation (minimal examples), make report genera…
hendrikweisser Aug 17, 2021
75193d7
Eventalign_collapse: small optimization (input file reading)
hendrikweisser Aug 25, 2021
d967101
SuperParser: add spaces (coding style)
hendrikweisser Aug 25, 2021
5980637
Whitelist: add to-do comment
hendrikweisser Aug 26, 2021
314bf1f
DataStore: reduce file size of 'eventalign_collapse' output DB (by ab…
hendrikweisser Aug 26, 2021
127 changes: 86 additions & 41 deletions nanocompore/DataStore.py
@@ -1,16 +1,25 @@
# -*- coding: utf-8 -*-

from collections import *
from enum import Enum
import datetime
import os
import sqlite3 as lite
import sqlite3
import contextlib

# Third party
from loguru import logger
import nanocompore as pkg
from nanocompore.common import NanocomporeError


class DBCreateMode(Enum):
"""Options for handling (non-) existence of the SQLite database file"""
MUST_EXIST = "r" # open for reading, error if file doesn't exist
CREATE_MAYBE = "a" # use an existing database, otherwise create one
OVERWRITE = "w" # always create a new database, overwrite if it exists


class DataStore(object):
""" Init analysis and check args"""
"""Store Nanocompore data in an SQLite database"""

create_reads_query = ("CREATE TABLE IF NOT EXISTS reads ("
"id INTEGER NOT NULL PRIMARY KEY,"
@@ -22,9 +31,13 @@ class DataStore(object):
"numevents INT NOT NULL,"
"numsignals INT NOT NULL,"
"dwelltime REAL NOT NULL,"
"kmers INT NOT NULL,"
"missing_kmers INT NOT NULL,"
"NNNNN_kmers INT NOT NULL,"
"mismatch_kmers INT NOT NULL,"
"valid_kmers INT NOT NULL,"
"FOREIGN KEY(sampleid) REFERENCES samples(id)"
"FOREIGN KEY(transcriptid) REFERENCES transcripts(id),"
"UNIQUE(id, name)"
"FOREIGN KEY(transcriptid) REFERENCES transcripts(id)"
")"
)

@@ -44,67 +57,69 @@ class DataStore(object):
"FOREIGN KEY(readid) REFERENCES reads(id)"
")"
)
# TODO: 'sequence' is stored redundantly - move it to a separate table
# TODO: encode 'status' as int to save space (foreign key referencing a table with all possible statuses)

create_samples_query = ("CREATE TABLE IF NOT EXISTS samples ("
"id INTEGER NOT NULL PRIMARY KEY,"
"name VARCHAR NOT NULL UNIQUE"
")"
)

# TODO: add 'condition' column

create_transcripts_query = ("CREATE TABLE IF NOT EXISTS transcripts ("
"id INTEGER NOT NULL PRIMARY KEY,"
"name VARCHAR NOT NULL UNIQUE"
")"
)



def __init__(self, db_path:str):
self.__db_path=db_path
db_is_new = not os.path.exists(self.__db_path)
logger.debug(f"DB file doesn't exist: {db_is_new}")
if db_is_new: self.__init_db()
def __init__(self,
db_path:str,
create_mode=DBCreateMode.MUST_EXIST):
self.__db_path = db_path
self.__create_mode = create_mode
self.__connection = None
self.__cursor = None

def __enter__(self):
self.__open_db_connection()
return self

def __exit__(self,exc_type, exc_value, traceback):
self.__connection.commit()
self.__close_db_connection()

def __open_db_connection(self):
if self.__create_mode == DBCreateMode.MUST_EXIST and not os.path.exists(self.__db_path):
raise NanocomporeError(f"Database file '{self.__db_path}' does not exist")
if self.__create_mode == DBCreateMode.OVERWRITE:
with contextlib.suppress(FileNotFoundError): # file may not exist
os.remove(self.__db_path)
logger.debug(f"Removed existing database file '{self.__db_path}'")
try:
logger.debug("Connecting to DB")
self.__connection = lite.connect(self.__db_path);
logger.debug("Connecting to database")
self.__connection = sqlite3.connect(self.__db_path)
self.__connection.row_factory = sqlite3.Row
self.__cursor = self.__connection.cursor()
except:
logger.error("Error connecting to database")
raise
if self.__create_mode == DBCreateMode.OVERWRITE or \
(self.__create_mode == DBCreateMode.CREATE_MAYBE and not os.path.exists(self.__db_path)):
self.__init_db()
return self

def __close_db_connection(self):
def __exit__(self, exc_type, exc_value, traceback):
if self.__connection:
logger.debug("Closing connection to DB")
logger.debug("Closing database connection")
self.__connection.commit()
self.__connection.close()
self.__connection = None
self.__cursor = None

def __init_db(self):
logger.debug("Setting up DB tables")
self.__open_db_connection()
logger.debug("Setting up database tables")
try:
self.__cursor.execute(self.create_reads_query)
self.__cursor.execute(self.create_kmers_query)
self.__cursor.execute(self.create_samples_query)
self.__cursor.execute(self.create_transcripts_query)
self.__connection.commit()
except:
self.__close_db_connection()
logger.error("Error creating tables")
logger.error("Error creating database tables")
raise
self.__connection.commit()
self.__close_db_connection()

def store_read(self, read):
"""
@@ -116,13 +131,14 @@ def store_read(self, read):
"""
tx_id = self.get_transcript_id_by_name(read.ref_id, create_if_not_exists=True)
sample_id = self.get_sample_id_by_name(read.sample_name, create_if_not_exists=True)
values = (read.read_id, sample_id, tx_id, read.ref_start, read.ref_end,
read.n_events, read.n_signals, read.dwell_time) + tuple(read.kmers_status.values())
try:
self.__cursor.execute("INSERT INTO reads VALUES(NULL, ?, ?, ?, ?, ?, ?, ?, ?)",
(read.read_id, sample_id, tx_id, read.ref_start, read.ref_end,
read.n_events, read.n_signals, read.dwell_time))
self.__cursor.execute("INSERT INTO reads VALUES(NULL" + ", ?" * len(values) + ")",
values)
read_id = self.__cursor.lastrowid
except Exception:
logger.error("Error inserting read into DB")
logger.error("Error inserting read into database")
raise Exception

for kmer in read.kmer_l:
@@ -144,7 +160,7 @@ def __store_kmer(self, kmer, read_id):
res["num_signals"], res["status"], res["dwell_time"],
res["NNNNN_dwell_time"], res["mismatch_dwell_time"], res["median"], res["mad"]))
except Exception:
logger.error("Error inserting kmer into DB")
logger.error("Error inserting kmer into database")
raise Exception

def get_transcript_id_by_name(self, tx_name, create_if_not_exists=False):
@@ -161,7 +177,7 @@ def get_transcript_id_by_name(self, tx_name, create_if_not_exists=False):
try:
self.__cursor.execute(query)
except Exception:
logger.error("There was an error while inserting a new transcript in the DB")
logger.error("Error while inserting transcript into the database")
raise Exception

query = f"SELECT id from transcripts WHERE name = '{tx_name}'"
@@ -170,7 +186,7 @@ def get_transcript_id_by_name(self, tx_name, create_if_not_exists=False):
record = self.__cursor.fetchone()
self.__connection.commit()
except Exception:
logger.error("There was an error while selecting the transcript_id from the DB")
logger.error("Error while selecting transcript ID from the database")
raise Exception
if record is not None:
return record[0]
@@ -191,7 +207,7 @@ def get_sample_id_by_name(self, sample_name, create_if_not_exists=False):
try:
self.__cursor.execute(query)
except Exception:
logger.error("There was an error while inserting a new sample in the DB")
logger.error("Error while inserting sample into the database")
raise Exception

query = f"SELECT id from samples WHERE name = '{sample_name}'"
@@ -200,9 +216,38 @@ def get_sample_id_by_name(self, sample_name, create_if_not_exists=False):
record = self.__cursor.fetchone()
self.__connection.commit()
except Exception:
logger.error("There was an error while selecting the sample_id from the DB")
logger.error("Error while selecting sample ID from the database")
raise Exception
if record is not None:
return record[0]
else:
return None

@property
def cursor(self):
return self.__cursor

def get_samples(self, sample_dict=None):
if not self.__connection:
raise NanocomporeError("Database connection not yet opened")
expected_samples = []
if sample_dict: # query only relevant samples
for samples in sample_dict.values():
expected_samples += samples
if not expected_samples:
raise NanocomporeError("No sample names in 'sample_dict'")
where = " WHERE name IN ('%s')" % "', '".join(expected_samples)
else:
where = ""
db_samples = {}
try:
self.__cursor.execute("SELECT * FROM samples" + where)
for row in self.__cursor:
db_samples[row["id"]] = row["name"]
except Exception:
logger.error("Error reading sample names from database")
raise Exception
for sample in expected_samples: # check that requested samples are in DB
if sample not in db_samples.values():
raise NanocomporeError(f"Sample '{sample}' not present in database")
return db_samples
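
For readers following the diff, here is a minimal, self-contained sketch of the pattern these changes introduce: a context manager that opens the SQLite file according to a create mode, plus a sample lookup. It is not the PR's code. `MiniStore`, `CreateMode` and `add_sample` are hypothetical stand-ins, only the `samples` table is modelled, and `FileNotFoundError` replaces `NanocomporeError` so the example runs without Nanocompore installed. The sketch also makes two assumptions that differ from the diff: it checks for the file before connecting (on the assumption that connecting may itself create the file), and it binds sample names through `?` placeholders instead of building the `IN (...)` list by string formatting.

```python
import contextlib
import os
import sqlite3
import tempfile
from enum import Enum


class CreateMode(Enum):  # hypothetical stand-in for 'DBCreateMode'
    MUST_EXIST = "r"     # open for reading, error if file doesn't exist
    CREATE_MAYBE = "a"   # use an existing database, otherwise create one
    OVERWRITE = "w"      # always create a new database


class MiniStore:
    """Hypothetical, simplified stand-in for 'DataStore' (samples table only)."""

    create_samples_query = ("CREATE TABLE IF NOT EXISTS samples ("
                            "id INTEGER NOT NULL PRIMARY KEY,"
                            "name VARCHAR NOT NULL UNIQUE)")

    def __init__(self, db_path, create_mode=CreateMode.MUST_EXIST):
        self._db_path = db_path
        self._create_mode = create_mode
        self._connection = None

    def __enter__(self):
        # Assumption: test for the file *before* connecting,
        # because connecting may create it.
        exists = os.path.exists(self._db_path)
        if self._create_mode == CreateMode.MUST_EXIST and not exists:
            raise FileNotFoundError(f"Database file '{self._db_path}' does not exist")
        if self._create_mode == CreateMode.OVERWRITE:
            with contextlib.suppress(FileNotFoundError):  # file may not exist
                os.remove(self._db_path)
        self._connection = sqlite3.connect(self._db_path)
        self._connection.row_factory = sqlite3.Row
        if self._create_mode != CreateMode.MUST_EXIST:
            # 'IF NOT EXISTS' keeps this harmless for an existing database
            self._connection.execute(self.create_samples_query)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self._connection:
            self._connection.commit()
            self._connection.close()
            self._connection = None

    def add_sample(self, name):
        self._connection.execute("INSERT INTO samples(name) VALUES (?)", (name,))

    def get_samples(self, names=None):
        """Return {id: name}; names are bound via '?' placeholders."""
        query, params = "SELECT id, name FROM samples", ()
        if names:
            placeholders = ", ".join("?" * len(names))
            query += f" WHERE name IN ({placeholders})"
            params = tuple(names)
        return {row["id"]: row["name"]
                for row in self._connection.execute(query, params)}


# Usage: create a fresh database, then reopen it in the default mode.
db_file = os.path.join(tempfile.mkdtemp(), "demo.db")
with MiniStore(db_file, CreateMode.OVERWRITE) as store:
    store.add_sample("WT_1")
    store.add_sample("KO_1")
with MiniStore(db_file) as store:  # default: MUST_EXIST
    selected = store.get_samples(["KO_1"])
```

Bound parameters keep the query valid when a sample name contains a quote character. That case may never arise with Nanocompore's sample names, so treat it as a design option and not a required change.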