Esearch input #47

Merged: 9 commits, Jun 5, 2024
8 changes: 8 additions & 0 deletions bin/buildKraken1.sh
@@ -22,11 +22,15 @@ cp -rv $TAXDIR $DB/taxonomy

# Make --add-to-library more efficient with
# concatenated fasta files
+export nl=$'\n'
find $SRC -name '*.fasta.gz' | \
xargs -n 100 -P 1 bash -c '
for i in "$@"; do
gzip -cd $i
done > $tmpfile
+echo -ne "ADDING to library:\n "
+zgrep "^>" $tmpfile | sed "s/^>//" | tr "$nl" " "
+echo "^^ contents of $tmpfile ^^"
kraken-build --db $DB --add-to-library $tmpfile
'

@@ -35,3 +39,7 @@ kraken-build --db $DB --build --threads 1
# Reduce the size of the database
kraken-build --db $DB --clean


+if [ ! -e "$sharedir/kalamari-kraken1" ]; then
+ln -sv kalamari-kraken "$sharedir/kalamari-kraken1"
+fi
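
Reviewer note: the `echo`/`zgrep` lines added in this file print the sequence IDs of each batch before `kraken-build --add-to-library` runs. A minimal, self-contained sketch of that listing step follows. The two tiny FASTA files, the `workdir` variable, and the sequence names are made up for illustration, and plain `grep` is swapped in for `zgrep` because the batch file is already decompressed.

```shell
#!/bin/bash
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two made-up FASTA files standing in for the real *.fasta.gz inputs.
printf '>seqA\nACGT\n' | gzip -c > "$workdir/a.fasta.gz"
printf '>seqB\nTTGA\n' | gzip -c > "$workdir/b.fasta.gz"

# Decompress the batch into one file, as the loop in the diff does.
tmpfile="$workdir/batch.fasta"
for i in "$workdir"/*.fasta.gz; do
  gzip -cd "$i"
done > "$tmpfile"

# Keep header lines, drop the leading ">", and join the IDs with spaces.
headers=$(grep '^>' "$tmpfile" | sed 's/^>//' | tr '\n' ' ')
echo "ADDING to library: $headers"
```

Quoting `"$i"` and `"$tmpfile"` here is a choice of this sketch, not something the PR does; it only matters when paths contain spaces.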
26 changes: 12 additions & 14 deletions bin/downloadKalamari.pl
@@ -11,7 +11,7 @@
use IO::Compress::Gzip;
use version 0.77;

-our $VERSION = version->parse("5.6.0");
+our $VERSION = version->parse("5.7.0");

use threads;

@@ -167,27 +167,25 @@ sub downloadEntries{
my $numEntries = scalar(@$entries);
my @acc = map{$$_{nuccoreAcc}} @$entries;
logmsg "Downloading ".scalar(@acc)." accessions";
-my $queryArg = join("[accession] OR ", sort(@acc))."[accession]";
my $dir = tempdir("download.XXXXXX", DIR=>$$settings{tempdir});

+# Make the input file for efetch
+my $inputAcc = "$dir/input.acc";
+open(my $fh, ">", $inputAcc) or die "ERROR: could not write to $inputAcc: $!";
+print $fh join("\n", @acc)."\n";
+close $fh;
+
# Accessions that had errors
my @err;

-# Get the esearch xml in place for at least one downstream query
-my $esearchXml = "$dir/esearch.xml";
-my $esearchCmd = "esearch -db nuccore -query '$queryArg' > $esearchXml";
-command($esearchCmd);
+# Get started on the comprehensive assembly file
+my $outfile = "$dir/all.fasta";
+logmsg "Downloading all accessions to $outfile using input accessions in $inputAcc";
+command("efetch -db nuccore -input $inputAcc -format fasta > $dir/all.fasta");
if($?){
-die "ERROR running: $esearchCmd: $!";
+die "ERROR: could not download all accessions";
}

-# Get started on the assembly file
-my $outfile = "$dir/all.fasta";
-
-# Main query: efetch
-my $efetchCmd = "cat $esearchXml | efetch -format fasta > $outfile";
-system($efetchCmd);
-
my $seqsWithVersion = readSeqs($outfile);
my $seqs = {};
while(my($acc, $seq) = each(%$seqsWithVersion)){
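
Reviewer note: the hunk above replaces the `[accession] OR ...` query string with a file of accessions passed to `efetch -input`. A rough shell sketch of the same idea follows. The helper `write_accessions` and the opt-in `RUN_EFETCH` variable are hypothetical (neither is in the PR), the accessions are taken from src/chromosomes.tsv, and the `efetch` call only runs when EDirect is installed and `RUN_EFETCH=1` is set.

```shell
#!/bin/bash
set -eu

# Hypothetical helper: write one accession per line, mirroring
# print $fh join("\n", @acc)."\n" in the Perl code.
write_accessions() {
  local outfile=$1
  shift
  printf '%s\n' "$@" > "$outfile"
}

dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
write_accessions "$dir/input.acc" NC_006347 NC_004663 CP000524

# Same command line as in the diff; skipped unless explicitly requested,
# since it needs EDirect and network access.
if [ "${RUN_EFETCH:-0}" = "1" ] && command -v efetch >/dev/null 2>&1; then
  efetch -db nuccore -input "$dir/input.acc" -format fasta > "$dir/all.fasta"
fi
```

Passing a file keeps the accession list out of the command line, so the command stays the same length however many accessions are requested.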
3 changes: 3 additions & 0 deletions bin/filterTaxonomy.sh
@@ -2,6 +2,9 @@

set -eu

+# Check for dependencies
+which taxonkit
+
thisdir=$(dirname $0)
thisfile=$(basename $0)
KALAMARI_VER=$(downloadKalamari.pl --version)
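
Reviewer note: the added `which taxonkit` line relies on `set -e` to stop the script when the tool is missing. A minimal sketch of the same check with an explicit error message is below, assuming a hypothetical helper named `require_cmd` that is not part of the PR.

```shell
#!/bin/bash
set -eu

# Hypothetical helper (not in the PR): return non-zero with a readable
# message when a required executable is not on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "ERROR: required dependency not found in PATH: $1" >&2
    return 1
  fi
}

# Example with a tool that is present on any POSIX system.
require_cmd sh
```

In filterTaxonomy.sh the call would be `require_cmd taxonkit`; under `set -e` a failure still aborts the script, as `which` does now.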
10 changes: 7 additions & 3 deletions src/chromosomes.tsv
@@ -30,12 +30,13 @@ Bacteroides fragilis NC_006347 817 816
Bacteroides thetaiotaomicron NC_004663 818 816
Bartonella bacilliformis CP000524 774 773
Betacoronavirus coronavirus MT233526 2697049 694009
+Bifidobacterium adolenscentis CP028341 1680 1678
Bifidobacterium bifidum NC_014638 1681 1678
Bifidobacterium longum NC_004307 216816 1678
Bordetella bronchiseptica NC_019382 518 517
Borreliella burgdorferi NC_001318 139 64895
Bos taurus KC153975 9913 9903
-Brachybacterium faecium NC_013172 43669 43668
+Brachybacterium faecium CP001643 43669 43668
Bradyrhizobium diazoefficiens NC_004463 1355477 374
Brassica oleracea NC_016118 3712 3705
Buchnera aphidicola NC_002528 9 32199
@@ -67,7 +68,8 @@ Chlamydia trachomatis NC_000117 813 810
Chlamydomonas reinhardtii AF008237 3055 3052
Chlorobaculum tepidum NC_002932 1097 256319
Chloroflexus aurantiacus NC_010175 1108 1107
-Citrobacter freundii CP007557 1333848 546
+Citrobacter freundii CP007557 546 544
+Citrobacter amalonaticus CP014070 35703 544
Clavibacter michiganensis_sepedonicus AM849034 31964 28447
Clostridioides difficile NC_009089 1496 1870884
Clostridium acetobutylicum NC_003030 1488 1485
@@ -77,6 +79,7 @@ Clostridium botulinum groupI NC_010723 9000005 1491
Clostridium botulinum groupII NC_010516 9000004 1491
Clostridium butyricum CP013239 1492 1485
Clostridium perfringens NC_008262 1502 1485
+Corynebacterium diphtheriae CP091095 1717 1716
Corynebacterium glutamicum NC_003450 1718 1716
Corynebacterium urealyticum AM942444 43771 1716
Coxiella burnetii NC_002971 777 776
@@ -137,6 +140,7 @@ Helianthus annuus MG770607 4232 4231
Helicobacter pylori AE000511 210 209
Heliobacterium modesticaldum CP000930 35701 2697
Homo sapiens NC_012920 9606 9605
+Humulus lupulus NC_086845 3486 3484
Ketogulonicigenium vulgare NC_017384 92945 92944
Klebsiella aerogenes NC_015663 548 570
Klebsiella pneumoniae NC_016845 573 570
@@ -202,7 +206,7 @@ Salinibacter ruber NC_007677 146919 146918
Salmonella bongori FR877557 54736 590
Salmonella enterica IIa CP053411 9000010 28901
Salmonella enterica IIb LR134141 9000011 28901
-Salmonella enterica IIIa UGXG01000002 9000014 28901
+Salmonella enterica IIIa CP000880 9000014 28901
Salmonella enterica IIIb CP053583 9000015 28901
Salmonella enterica I AE006468 59201 28901
Salmonella enterica IV CP053579 59205 28901
4 changes: 2 additions & 2 deletions src/plasmids.tsv
@@ -2964,7 +2964,7 @@ Rickettsia CP015014 780 33988
Rickettsia CP010970 780 33988
Onion yellows phytoplasma AB480166 100379 85620
Onion yellows phytoplasma AB479509 100379 85620
-'Brassica napus' phytoplasma HQ637382 469009 85620
+Brassica napus phytoplasma HQ637382 469009 85620
Candidatus Phytoplasma FJ905104 33926 2146
Candidatus Phytoplasma KF801472 33926 2146
Onion yellows phytoplasma AB479513 100379 2146
@@ -2986,7 +2986,7 @@ Paulownia witches'-broom phytoplasma EF426472 39647 85620
Paulownia witches'-broom phytoplasma EF426473 39647 85620
Periwinkle little leaf phytoplasma JN835187 137854 85635
Rice orange leaf phytoplasma KY086101 146897 85635
-'Catharanthus roseus' aster yellows phytoplasma CP035950 1193712 85620
+Catharanthus roseus aster yellows phytoplasma CP035950 1193712 85620
Bacillus thuringiensis CP016196 1428 85620
Bacillus sp. BS98 CP043831 2608254 185979
Bacillus CP009595 1386 185979