[MRG] remove redundant ACGT-checking code. #1590

Merged: 55 commits, Jun 2, 2017.

Commits
83c755a remove unneeded check_and_normalize_read calls (ctb, Jan 24, 2017)
7d46f07 remove unneeded check_and_process_read code (ctb, Jan 24, 2017)
db51c34 Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali… (ctb, Feb 14, 2017)
ffe5076 Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali… (ctb, Feb 14, 2017)
b9b708d remove check_and_normalize_read calls from abundance_distribution fun… (ctb, Feb 14, 2017)
976cf30 Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali… (ctb, Feb 19, 2017)
55f36e1 remove outdated comment ref merge (ctb, Feb 19, 2017)
4cbe6bd fix another comment (ctb, Feb 19, 2017)
a34ae08 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 19, 2017)
26180cd update consume_seqfile to use cleaned_seq (ctb, Feb 19, 2017)
fde5960 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 19, 2017)
8029d19 update for add'l tests (ctb, Feb 19, 2017)
19e014a Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 20, 2017)
1b5c03f abundance foo (ctb, Feb 20, 2017)
f91ebee Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 20, 2017)
0aa103c made output_partitioned_file use cleaned_seq (ctb, Feb 20, 2017)
e687229 switch to using kmer iterator in output_partitioned_file (ctb, Feb 20, 2017)
fe3d9cb Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 20, 2017)
0f1bf48 Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali… (ctb, Feb 25, 2017)
bfeebef remove check_and_normalize_read from trim_on_stoptags (ctb, Feb 25, 2017)
efd2bb3 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 25, 2017)
6d97388 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 25, 2017)
0cdd034 split tests out to test_sequence_validation (ctb, Feb 25, 2017)
f63a4d8 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 25, 2017)
c7123da Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Feb 25, 2017)
cad8213 add exceptions (ctb, Feb 25, 2017)
c269436 Merge remote-tracking branch 'origin/master' into remove/is_valid_dna (ctb, Mar 19, 2017)
990dc37 fix setup.py for offline operation (ctb, Mar 19, 2017)
85a730e Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
264dc29 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
416f94b remove check_and_normalize_read from hashgraph (ctb, Mar 19, 2017)
cec6243 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
7375dbd fix problems revealed by larger ksize (ctb, Mar 19, 2017)
12aa4d8 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
06342dc updated read cleaning in labelhash & tests (ctb, Mar 19, 2017)
8b34213 removed check_and_normalize_read, yay (ctb, Mar 19, 2017)
b17f386 uncomment pytest-runner req (ctb, Mar 19, 2017)
3164408 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
8b846a1 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
f708a76 switch to countingtype (ctb, Mar 19, 2017)
b2d0838 switch to countingtype x 2 (ctb, Mar 19, 2017)
10ce9bd update trim functions on bad dna to be properly undefined (ctb, Mar 19, 2017)
98cd756 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
8a683c8 Merge branch 'fix/consume_partitioned_err' into remove/is_valid_dna (ctb, Mar 19, 2017)
fc63010 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
01c3b44 Merge branch 'fix/consume_partitioned_err' into remove/is_valid_dna (ctb, Mar 19, 2017)
cb3d6cc Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
9a5080b add partition ID to new 'bad' sequence (ctb, Mar 19, 2017)
3f6a673 adjust tests for new 'bad dna' sequence (ctb, Mar 19, 2017)
d4eb8a5 Merge branch 'remove/is_valid_dna_tests' into remove/is_valid_dna (ctb, Mar 19, 2017)
cf8ebc0 Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali… (ctb, Mar 20, 2017)
122258c Merge branch 'master' into remove/is_valid_dna (betatim, Mar 22, 2017)
c031499 Merge branch 'master' of github.com:dib-lab/khmer into remove/is_vali… (ctb, Jun 2, 2017)
4b3d4ed update Changelog (ctb, Jun 2, 2017)
1a54b45 grr typo (ctb, Jun 2, 2017)
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -7,12 +7,15 @@ The khmer project's command line scripts adhere to
under semantic versioning, but will be in future versions of khmer.

## [Unreleased]

### Added
- Cython wrapper for liboxli.
- Cython containers for parsing, assembly, and hashing.
- Header install for liboxli.

### Changed
- Non-ACTG handling significantly changed so that only bulk-loading functions
"clean" sequences of non-DNA characters. See #1590 for details.
- Split CPython wrapper file into per-class files under src/khmer and include/khmer.
- Moved liboxli headers to include/oxli and implementations to src/oxli.
- Removed CPython assembler wrappers.
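The CHANGELOG entry above captures the new contract: the per-call ACGT checks inside the table classes are gone, and bulk-loading code instead works from a cleaned copy of each read (read.set_clean_seq() followed by read.cleaned_seq in the diffs below). The standalone sketch that follows is not khmer code; clean_sequence() is a hypothetical helper, and substituting 'A' for non-DNA characters is an assumed policy used only to make the example concrete. Trim and error-finding functions, by contrast, no longer clean or validate at all; see the note after the hashtable.cc diff.

// Illustrative sketch only; the real cleanup lives in Read::set_clean_seq().
// clean_sequence() and the 'A' substitution are assumptions for this example,
// not the library's actual rule.
#include <iostream>
#include <string>
#include <cctype>

std::string clean_sequence(std::string seq)
{
    for (char &c : seq) {
        c = std::toupper(static_cast<unsigned char>(c));   // normalize case
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            c = 'A';                                        // assumed placeholder
        }
    }
    return seq;
}

int main()
{
    // Bulk loaders clean once up front; every consumer then uses the cleaned copy.
    std::cout << clean_sequence("acgtNNNacgt") << "\n";     // prints ACGTAAAACGT
    return 0;
}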
7 changes: 0 additions & 7 deletions include/oxli/hashtable.hh
@@ -260,13 +260,6 @@ public:
// count every k-mer in the string.
unsigned int consume_string(const std::string &s);

// checks each read for non-ACGT characters
bool check_and_normalize_read(std::string &read) const;

// check each read for non-ACGT characters, and then consume it.
unsigned int check_and_process_read(std::string &read,
bool &is_valid);

// Count every k-mer in a file containing nucleotide sequences.
template<typename SeqIO>
void consume_seqfile(
38 changes: 16 additions & 22 deletions src/oxli/hashgraph.cc
@@ -309,13 +309,12 @@ void Hashgraph::consume_seqfile_and_tag(
break;
}

if (check_and_normalize_read( read.sequence )) {
unsigned long long this_n_consumed = 0;
consume_sequence_and_tag( read.sequence, this_n_consumed );
read.set_clean_seq();
unsigned long long this_n_consumed = 0;
consume_sequence_and_tag(read.cleaned_seq, this_n_consumed);

__sync_add_and_fetch( &n_consumed, this_n_consumed );
__sync_add_and_fetch( &total_reads, 1 );
}
__sync_add_and_fetch(&n_consumed, this_n_consumed);
__sync_add_and_fetch(&total_reads, 1);
} // while reads left for parser

}
@@ -373,21 +372,20 @@ void Hashgraph::consume_partitioned_fasta(
} catch (NoMoreReadsAvailable &exc) {
break;
}
seq = read.sequence;
read.set_clean_seq();
seq = read.cleaned_seq;

if (check_and_normalize_read(seq)) {
// First, figure out what the partition is (if non-zero), and save that.
PartitionID p = _parse_partition_id(read.name);
// First, figure out what the partition is (if non-zero), and save that.
PartitionID p = _parse_partition_id(read.name);

// Then consume the sequence
n_consumed += consume_string(seq); // @CTB why are we doing this?
// Then consume the sequence
n_consumed += consume_string(seq); // @CTB why are we doing this?

// Next, compute the tag & set the partition, if nonzero
HashIntoType kmer = hash_dna(seq.c_str());
all_tags.insert(kmer);
if (p > 0) {
partition->set_partition_id(kmer, p);
}
// Next, compute the tag & set the partition, if nonzero
HashIntoType kmer = hash_dna(seq.c_str());
all_tags.insert(kmer);
if (p > 0) {
partition->set_partition_id(kmer, p);
}

// reset the sequence info, increment read number
@@ -467,10 +465,6 @@ unsigned int Hashgraph::kmer_degree(const char * kmer_s)

size_t Hashgraph::trim_on_stoptags(std::string seq) const
{
if (!check_and_normalize_read(seq)) {
return 0;
}

KmerIterator kmers(seq.c_str(), _ksize);

size_t i = _ksize - 2;
81 changes: 11 additions & 70 deletions src/oxli/hashtable.cc
@@ -57,51 +57,6 @@ using namespace oxli;
using namespace oxli:: read_parsers;


//
// check_and_process_read: checks for non-ACGT characters before consuming
//

unsigned int Hashtable::check_and_process_read(std::string &read,
bool &is_valid)
{
is_valid = check_and_normalize_read(read);

if (!is_valid) {
return 0;
}

return consume_string(read);
}

//
// check_and_normalize_read: checks for non-ACGT characters
// converts lowercase characters to uppercase one
// Note: Usually it is desirable to keep checks and mutations separate.
// However, in the interests of efficiency (we are potentially working
// with TB of data), a check and mutation have been placed inside the
// same loop. Potentially trillions fewer fetches from memory would
// seem to be a worthwhile goal.
//

bool Hashtable::check_and_normalize_read(std::string &read) const
{
bool rc = true;

if (read.length() < _ksize) {
return false;
}

for (unsigned int i = 0; i < read.length(); i++) {
read[ i ] &= 0xdf; // toupper - knock out the "lowercase bit"
if (!is_valid_dna( read[ i ] )) {
rc = false;
break;
}
}

return rc;
}

//
// consume_seqfile: consume a file of reads
//
@@ -136,8 +91,8 @@ void Hashtable::consume_seqfile(
break;
}

unsigned int this_n_consumed =
check_and_process_read(read.sequence, is_valid);
read.set_clean_seq();
unsigned int this_n_consumed = consume_string(read.cleaned_seq);

__sync_add_and_fetch( &n_consumed, this_n_consumed );
__sync_add_and_fetch( &total_reads, 1 );
@@ -350,20 +305,19 @@ uint64_t * Hashtable::abundance_distribution(
} catch (NoMoreReadsAvailable &exc) {
break;
}
seq = read.sequence;
read.set_clean_seq();
seq = read.cleaned_seq;

if (check_and_normalize_read(seq)) {
KmerHashIteratorPtr kmers = new_kmer_iterator(seq);
KmerHashIteratorPtr kmers = new_kmer_iterator(seq);

while(!kmers->done()) {
HashIntoType kmer = kmers->next();
while(!kmers->done()) {
HashIntoType kmer = kmers->next();

if (!tracking->get_count(kmer)) {
tracking->count(kmer);
if (!tracking->get_count(kmer)) {
tracking->count(kmer);

BoundedCounterType n = get_count(kmer);
dist[n]++;
}
BoundedCounterType n = get_count(kmer);
dist[n]++;
}

name.clear();
@@ -387,10 +341,6 @@ unsigned long Hashtable::trim_on_abundance(
BoundedCounterType min_abund)
const
{
if (!check_and_normalize_read(seq)) {
return 0;
}

KmerHashIteratorPtr kmers = new_kmer_iterator(seq);

HashIntoType kmer;
@@ -422,12 +372,7 @@ unsigned long Hashtable::trim_below_abundance(
BoundedCounterType max_abund)
const
{
if (!check_and_normalize_read(seq)) {
return 0;
}

KmerHashIteratorPtr kmers = new_kmer_iterator(seq);

HashIntoType kmer;

if (kmers->done()) {
@@ -458,10 +403,6 @@ std::vector<unsigned int> Hashtable::find_spectral_error_positions(
const
{
std::vector<unsigned int> posns;
if (!check_and_normalize_read(seq)) {
throw oxli_exception("invalid read");
}

KmerHashIteratorPtr kmers = new_kmer_iterator(seq);

HashIntoType kmer = kmers->next();
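Taken together, the hashtable.cc changes above remove every internal guard: trim_on_abundance, trim_below_abundance, and find_spectral_error_positions now assume clean input, and their behavior on non-ACGT sequence is deliberately left undefined (commit 10ce9bd). A caller that cannot guarantee clean reads should check before trimming. The sketch below is a hypothetical caller-side guard, not oxli API; looks_like_clean_dna() simply mirrors what the removed check_and_normalize_read used to enforce (uppercase A/C/G/T only, at least one full k-mer of sequence).

// Hypothetical caller-side guard, not part of oxli; mirrors the removed
// check_and_normalize_read: uppercase A/C/G/T only, length >= ksize.
#include <iostream>
#include <string>

bool looks_like_clean_dna(const std::string &seq, size_t ksize)
{
    if (seq.length() < ksize) {
        return false;                       // too short to hold a single k-mer
    }
    for (char c : seq) {
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            return false;                   // lowercase or non-DNA character
        }
    }
    return true;
}

int main()
{
    const size_t k = 21;
    const std::string seq = "ACGTNACGTACGTACGTACGTACGT";
    if (!looks_like_clean_dna(seq, k)) {
        // After this PR the trim_* functions assume clean input,
        // so rejecting (or cleaning) the read is the caller's job.
        std::cout << "skipping unclean or too-short read\n";
    }
    return 0;
}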
48 changes: 24 additions & 24 deletions src/oxli/labelhash.cc
@@ -106,21 +106,21 @@ void LabelHash::consume_seqfile_and_tag_with_labels(
break;
}

if (graph->check_and_normalize_read( read.sequence )) {
// TODO: make threadsafe!
unsigned long long this_n_consumed = 0;
consume_sequence_and_tag_with_labels( read.sequence,
this_n_consumed,
the_label );
the_label++;
read.set_clean_seq();

// TODO: make threadsafe!
unsigned long long this_n_consumed = 0;
consume_sequence_and_tag_with_labels( read.cleaned_seq,
this_n_consumed,
the_label );
the_label++;

#if (0) // Note: Used with callback - currently disabled.
n_consumed_LOCAL = __sync_add_and_fetch( &n_consumed, this_n_consumed );
n_consumed_LOCAL = __sync_add_and_fetch( &n_consumed, this_n_consumed );
#else
__sync_add_and_fetch( &n_consumed, this_n_consumed );
__sync_add_and_fetch( &n_consumed, this_n_consumed );
#endif
__sync_add_and_fetch( &total_reads, 1 );
}
__sync_add_and_fetch( &total_reads, 1 );

// TODO: Figure out alternative to callback into Python VM
// Cannot use in multi-threaded operation.
@@ -169,19 +169,19 @@ void LabelHash::consume_partitioned_fasta_and_tag_with_labels(
PartitionID p;
while(!parser->is_complete()) {
read = parser->get_next_read();
seq = read.sequence;

if (graph->check_and_normalize_read(seq)) {
// First, figure out what the partition is (if non-zero), and
// save that.
printdbg(parsing partition id)
p = _parse_partition_id(read.name);
printdbg(consuming sequence and tagging)
consume_sequence_and_tag_with_labels( seq,
n_consumed,
p );
printdbg(back in consume_partitioned)
}

read.set_clean_seq();
seq = read.cleaned_seq;

// First, figure out what the partition is (if non-zero), and
// save that.
printdbg(parsing partition id)
p = _parse_partition_id(read.name);
printdbg(consuming sequence and tagging)
consume_sequence_and_tag_with_labels( seq,
n_consumed,
p );
printdbg(back in consume_partitioned)

// reset the sequence info, increment read number
total_reads++;
92 changes: 45 additions & 47 deletions src/oxli/subset.cc
@@ -147,61 +147,59 @@ size_t SubsetPartition::output_partitioned_file(
break;
}

seq = read.sequence;

if (_ht->check_and_normalize_read(seq)) {
const char * kmer_s = seq.c_str();

bool found_tag = false;
for (unsigned int i = 0; i < seq.length() - ksize + 1; i++) {
kmer = _ht->hash_dna(kmer_s + i);

// is this a known tag?
if (set_contains(partition_map, kmer)) {
found_tag = true;
break;
}
read.set_clean_seq();
seq = read.cleaned_seq;

bool found_tag = false;
KmerHashIteratorPtr kmers = _ht->new_kmer_iterator(read.cleaned_seq);
while (!kmers->done()) {
kmer = kmers->next();

// is this a known tag?
if (set_contains(partition_map, kmer)) {
found_tag = true;
break;
}
}

// all sequences should have at least one tag in them.
// assert(found_tag); @CTB currently breaks tests. give fn flag
// to disable.
// all sequences should have at least one tag in them.
// assert(found_tag); @CTB currently breaks tests. give fn flag
// to disable.

PartitionID partition_id = 0;
if (found_tag) {
PartitionID * partition_p = partition_map[kmer];
if (partition_p == NULL ) {
partition_id = 0;
n_singletons++;
} else {
partition_id = *partition_p;
partitions.insert(partition_id);
}
PartitionID partition_id = 0;
if (found_tag) {
PartitionID * partition_p = partition_map[kmer];
if (partition_p == NULL ) {
partition_id = 0;
n_singletons++;
} else {
partition_id = *partition_p;
partitions.insert(partition_id);
}
}

if (partition_id > 0 || output_unassigned) {
if (read.quality.length()) { // FASTQ
outfile << "@" << read.name << "\t" << partition_id
<< "\n";
outfile << seq << "\n+\n";
outfile << read.quality << "\n";
} else { // FASTA
outfile << ">" << read.name << "\t" << partition_id;
outfile << "\n" << seq << "\n";
}
if (partition_id > 0 || output_unassigned) {
if (read.quality.length()) { // FASTQ
outfile << "@" << read.name << "\t" << partition_id
<< "\n";
outfile << seq << "\n+\n";
outfile << read.quality << "\n";
} else { // FASTA
outfile << ">" << read.name << "\t" << partition_id;
outfile << "\n" << seq << "\n";
}
}

total_reads++;
total_reads++;

// run callback, if specified
if (total_reads % CALLBACK_PERIOD == 0 && callback) {
try {
callback("output_partitions", callback_data,
total_reads, reads_kept);
} catch (...) {
outfile.close();
throw;
}
// run callback, if specified
if (total_reads % CALLBACK_PERIOD == 0 && callback) {
try {
callback("output_partitions", callback_data,
total_reads, reads_kept);
} catch (...) {
outfile.close();
throw;
}
}
}
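One design note on the subset.cc change above: output_partitioned_file now walks k-mers through the hashtable's KmerHashIteratorPtr instead of hashing each substring position by hand, which keeps k-mer enumeration and hashing policy in one place. The standalone sketch below imitates only the done()/next() loop shape; KmerStringIterator is a toy stand-in that yields raw k-mer strings, not oxli's iterator or its hash values.

// Toy illustration of the done()/next() iterator idiom adopted above.
// KmerStringIterator is a stand-in, not oxli's KmerHashIterator.
#include <iostream>
#include <string>

class KmerStringIterator
{
    const std::string &seq_;
    size_t ksize_;
    size_t pos_ = 0;
public:
    KmerStringIterator(const std::string &seq, size_t ksize)
        : seq_(seq), ksize_(ksize) {}

    bool done() const { return pos_ + ksize_ > seq_.length(); }

    std::string next() { return seq_.substr(pos_++, ksize_); }   // k-mer at current offset
};

int main()
{
    const std::string seq = "ACGTACGGT";
    KmerStringIterator kmers(seq, 4);

    // Same loop shape as the tag search in output_partitioned_file:
    // pull k-mers until exhausted, breaking early when one matches.
    while (!kmers.done()) {
        std::string kmer = kmers.next();
        std::cout << kmer << "\n";
        if (kmer == "TACG") {
            break;                          // e.g. "found a known tag"
        }
    }
    return 0;
}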