New version discussion: input/output formats #63

hyattpd · 2019-10-01T15:11:04Z

Discussion of input/output in new version.

What formats would you like to see supported?

Current proposal:

Sequence formats: FASTA

Does anyone want or need FASTQ support? No one's ever requested it. Prodigal currently has a crappy Genbank/EMBL sequence parser, but I'm not really sure this is a feature that should be supported (FASTA seems fine).

Compression formats: gz, bz2

Any others? .xz?

Allow standard input?:

How important is it to allow standard input? Would people be disappointed if they had to specify files?

Gene coordinates: GFF3, GTF, pseudo-Genbank

Any other formats people would like to see supported?

Translations/mRNAs: FASTA

Any other formats needed here?

Gene IDs: the IDs will be zero-padded and sortable, as per a proposal from a previous issue (i.e. 00001, 00002, ... 10000 ...).
There will be a "convert" subcommand to convert the various output formats (though don't expect it to have a full understanding of GFF3, GTF, etc... it will expect Prodigal output.)

tseemann · 2019-10-01T23:01:28Z

For input, I think FASTA is sufficient. I've never used the Genbank input option.
I think .gz compression is sufficient, and important for huge metagenome files.
STDIN support is nice, but given that you will probably need to copy the input to an uncompressed temp file on disk anyway (maybe in 2-bit format), it is really just a convenience. /dev/stdin can be used in place of a file name anyway.
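As a rough illustration of the gz and stdin points above, here is a minimal Python sketch (not Prodigal's actual code) that sniffs the gzip magic bytes so the same reader handles plain files, .gz files, and standard input. The function names open_sequence_stream and read_fasta, and the use of "-" to mean stdin, are assumptions made for this example.

```python
import gzip
import io
import sys

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_sequence_stream(path):
    """Return a text stream for a plain or gzip-compressed file.

    Hypothetical convention: a path of "-" means standard input.
    """
    raw = sys.stdin.buffer if path == "-" else open(path, "rb")
    # peek() looks at buffered bytes without consuming them, so this
    # also works on a non-seekable pipe.
    head = raw.peek(2)[:2]
    if head == GZIP_MAGIC:
        return io.TextIOWrapper(gzip.GzipFile(fileobj=raw), encoding="utf-8")
    return io.TextIOWrapper(raw, encoding="utf-8")


def read_fasta(stream):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in stream:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Detecting compression from the content rather than the file extension means piped input needs no extra flag.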
Output coordinates should support GFF3 and maybe GTF (GFF 2.5). Genbank would be OK, I guess. Please add an option to include the original contigs in the GFF3 in the concluding '##FASTA' block.
Output sequences: yes, people will want .ffn (CDS) and .faa (protein).
Gene IDs should support a printf-style format, e.g. --idfmt "ECOLI_%05d". This allows the user to completely control the output style. Maybe add a %c to support the "contig ID" and %s for strand, e.g. MRSA-%c_%s_%04d => MRSA-contig123_+_1233
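To make the printf-style ID idea concrete, here is a small Python sketch of how such a template could be expanded. It is only an illustration of the proposal: the function name format_gene_id is hypothetical, and the handling of %c (contig ID), %s (strand), and %d with optional zero-padding follows the suggestion above rather than any existing Prodigal option.

```python
import re

# Matches %d with an optional width such as %05d, or one of %c, %s, %%.
_TOKEN = re.compile(r"%(?:(0?\d*)d|([cs%]))")


def format_gene_id(template, number, contig="", strand="+"):
    """Expand a printf-like gene ID template (hypothetical syntax)."""

    def expand(match):
        width, letter = match.group(1), match.group(2)
        if letter == "c":
            return contig
        if letter == "s":
            return strand
        if letter == "%":
            return "%"
        # Remaining case is %d: reuse Python's own integer formatting.
        return ("%" + width + "d") % number

    return _TOKEN.sub(expand, template)


print(format_gene_id("ECOLI_%05d", 42))
print(format_gene_id("MRSA-%c_%s_%04d", 1233, contig="contig123", strand="+"))
```

With a fixed zero-padded width, a plain lexical sort of the IDs matches their numeric order, which covers the sortable-ID proposal from the first post as well.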

tseemann · 2019-10-01T23:12:31Z

Support for masked FASTA would be useful. It is used by BLAST+ and many other tools. Lowercase bases will be ignored. Could they be treated as N in Prodigal 3.0?
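For readers unfamiliar with soft-masking, the following Python sketch shows one way the "treat lowercase as N" idea could work. The helper names mask_lowercase and masked_intervals are made up for this example, and nothing here reflects how Prodigal itself would implement it.

```python
import re

_LOWER_RUN = re.compile(r"[a-z]+")


def mask_lowercase(seq):
    """Replace every soft-masked (lowercase) base with N."""
    return _LOWER_RUN.sub(lambda m: "N" * len(m.group()), seq)


def masked_intervals(seq):
    """Return 0-based, end-exclusive (start, end) spans of lowercase runs."""
    return [(m.start(), m.end()) for m in _LOWER_RUN.finditer(seq)]


print(mask_lowercase("ACGTacgtNNGGtt"))
print(masked_intervals("ACGTacgtNNGGtt"))
```

Keeping the interval list alongside the masked sequence would let a gene finder skip masked regions without losing the original coordinates.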

hyattpd · 2019-10-01T23:42:00Z

Yeah, this is essential, especially for doing eukaryotic gene prediction.

tseemann · 2019-10-02T00:00:19Z

Is Prodigal 3.0 (prok) the same as radigal 1.0 (euk)?

hyattpd · 2019-10-02T01:45:14Z

I'm leaning towards just calling the whole thing radigal.

oschwengers · 2019-10-02T07:19:23Z

As Torsten said:

FASTA input
gz support
*.ffn and *.faa output for subsequent processing.

As we're very often parsing Prodigal's output in our pipelines instead of merely passing it over to 3rd party executables: a very simple tab separated format including the most important information would be nice, e.g.:
gene id, contig, start, stop, strand, partial?, shifted?, nuc seq, aa seq

I know GFF3 is very close, but either the sequence is not included, so the .ffn/.faa files need to be parsed as well, or, if sequences are included for multi-contig files, the format gets more complex than it has to be (my personal opinion). This way one would have everything in one place in a simple, straightforward manner, at least for proks.
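A minimal Python sketch of the tab-separated layout described above, using the nine columns listed. The GeneRecord class, the column names, and the sample gene are all invented for illustration; the real output format was still under discussion.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class GeneRecord:
    # Column order follows the list proposed in the comment above.
    gene_id: str
    contig: str
    start: int
    stop: int
    strand: str
    partial: bool
    shifted: bool
    nuc_seq: str
    aa_seq: str


def write_tsv(records, handle):
    """Write a header row plus one tab-separated row per gene."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([f.name for f in fields(GeneRecord)])
    for rec in records:
        writer.writerow(astuple(rec))


# Made-up example gene, not real Prodigal output.
buf = io.StringIO()
write_tsv(
    [GeneRecord("00001", "contig1", 1, 9, "+", False, False, "ATGAAATAA", "MK*")],
    buf,
)
print(buf.getvalue(), end="")
```

One row per gene keeps downstream parsing to a single split on tabs, which is the convenience being asked for.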

Another idea would be to have everything Prodigal/Radigal can provide in a well structured JSON format. This way you have everything in place in a machine readable format that is well supported by every modern language.
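One possible shape for such a JSON document is sketched below in Python. The layout (a top-level "contigs" list, each entry holding its "genes") and the field names are assumptions chosen for the example; no schema was agreed in this thread.

```python
import json


def genes_to_json(contig_genes):
    """Serialize {contig_id: [gene dicts]} into a nested JSON document.

    The "contigs"/"genes" layout is hypothetical.
    """
    doc = {
        "contigs": [
            {"id": contig, "genes": genes}
            for contig, genes in contig_genes.items()
        ]
    }
    return json.dumps(doc, indent=2)


# Made-up example data.
print(genes_to_json({
    "contig1": [{"id": "00001", "start": 1, "stop": 9, "strand": "+"}],
}))
```

Nesting genes under their contig avoids repeating the contig ID on every gene, at the cost of one extra level when iterating.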

tseemann · 2019-10-03T00:31:59Z

I partially agree with the TSV/TAB output but I would hope GFF or BED could be used so bedtools and samtools will work with it.

+1 for JSON format!

oschwengers · 2019-10-03T12:16:06Z

What about using the "simple tab" for stdout and everything else as optional parameters?
Idea:
prodigal3 --input <genome.fasta> [--output ] [--prefix ] --json --gff3 --bed...
The output location would then contain <prefix.json>, <prefix.gff>, <prefix.bed>, etc.

This would be simple, flexible and straightforward. Of course, the simple tab format could also be a non-standard option, e.g. --tsv <prefix.tsv>
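The command-line idea above could be prototyped roughly as follows with Python's argparse. This is only a sketch of the proposal: the program name prodigal3 and the flags come from the comment, while the defaults (current directory for --output, input file stem for --prefix) and the helper names are assumptions.

```python
import argparse
from pathlib import Path

# Flag name -> file suffix, following the <prefix.json>, <prefix.gff>,
# <prefix.bed> naming suggested above.
FORMAT_SUFFIXES = {"json": ".json", "gff3": ".gff", "bed": ".bed"}


def build_parser():
    parser = argparse.ArgumentParser(prog="prodigal3")
    parser.add_argument("--input", required=True, help="genome FASTA file")
    parser.add_argument("--output", default=".", help="output directory (assumed default)")
    parser.add_argument("--prefix", default=None, help="output file prefix")
    for name in FORMAT_SUFFIXES:
        parser.add_argument(f"--{name}", action="store_true")
    return parser


def planned_outputs(args):
    """List the files that the selected format flags would produce."""
    prefix = args.prefix or Path(args.input).stem
    outdir = Path(args.output)
    return [
        outdir / (prefix + suffix)
        for name, suffix in FORMAT_SUFFIXES.items()
        if getattr(args, name)
    ]


args = build_parser().parse_args(
    ["--input", "genome.fasta", "--output", "out", "--prefix", "ecoli", "--json", "--gff3"]
)
print([path.name for path in planned_outputs(args)])
```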

ayixon · 2019-12-07T21:33:16Z

Can I feed Prodigal multiple genomes at once? Something like $ prodigal -i *.faa? How can I write the output to individual files?

tseemann · 2019-12-10T23:17:13Z

.faa files are usually reserved for peptide sequences.

I don't think prodigal takes multiple input files at once.
You can either concatenate first: cat *.faa > everything.fasta
or try bash process substitution: prodigal .... <(cat *.faa)
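For the second half of the question (separate output per genome), a simple alternative is to run Prodigal once per input file. The Python sketch below assumes the Prodigal 2.x command-line flags -i (input), -o (output), -a (protein translations) and -f (output format); the helper names, the .fna input extension and the output naming scheme are made up for this example.

```python
import subprocess
from pathlib import Path


def prodigal_command(fasta, outdir):
    """Build one Prodigal invocation with outputs named after the input."""
    stem = Path(fasta).stem
    outdir = Path(outdir)
    return [
        "prodigal",
        "-i", str(fasta),
        "-f", "gff",
        "-o", str(outdir / f"{stem}.gff"),
        "-a", str(outdir / f"{stem}.faa"),
    ]


def run_all(fastas, outdir):
    """Run Prodigal separately on each genome (requires prodigal on PATH)."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    for fasta in fastas:
        subprocess.run(prodigal_command(fasta, outdir), check=True)


# Example call, assuming nucleotide FASTA files in ./genomes:
# run_all(sorted(Path("genomes").glob("*.fna")), "prodigal_out")
```

Unlike concatenating everything first, this keeps each genome's predictions in its own files.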

evanroyrees mentioned this issue Mar 23, 2020

Resolving issues #16, #17, #18, #21 and update to Autometa API and Logger KwanLab/Autometa#25

Merged
