Missing data: what should be in the pheno file

Missing data needs to be handled with care. kmds should be given all sequenced samples. seer can be run on any list of samples, and its phenotype can differ from the one used for kmds, as long as the original unfiltered kmer count files are used as the seer input.

In summary, let X be the set of samples with assembly data and Y_i the set of samples with a value for phenotype i. The samples then fall into three groups:

A: Samples with only assembly data (X \ Y_i)
B: Samples with both assembly data and a phenotype i (X ∩ Y_i)
C: Samples with only a phenotype i (Y_i \ X)
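
As a minimal sketch, the three groups can be written as set operations in Python. The sample names are hypothetical and only serve to illustrate the definitions above:

```python
# Hypothetical sample names, for illustration only.
X = {"s1", "s2", "s3", "s4"}   # samples with assembly data
Y_i = {"s3", "s4", "s5"}       # samples with a value for phenotype i

A = X - Y_i   # assembly data only
B = X & Y_i   # both assembly data and phenotype i
C = Y_i - X   # phenotype i only

print(sorted(A), sorted(B), sorted(C))
```

Note that A and B together make up X, while B and C together make up Y_i.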

kmds needs, in its .pheno file, precisely all assemblies that have been counted (A ∪ B = X)

seer can have any mix of B and C (i.e. any subset of Y_i) in the .pheno file, though only the samples in B will be used for the inference (samples in C are set to kmer presence = 0 by default, and samples in A are ignored). The phenotype values need not be the same as those used for kmds.
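
The two rules above can be checked before running either tool. The sketch below works on plain lists of sample names, however you choose to extract them from your files. The helper names `check_kmds_samples` and `seer_informative` are hypothetical and are not part of seer or kmds:

```python
def check_kmds_samples(pheno_samples, counted_samples):
    """Compare a kmds .pheno sample list against the counted assemblies.

    Returns (missing, extra): counted samples absent from the .pheno list,
    and .pheno samples that were never counted. Both should be empty.
    """
    pheno, counted = set(pheno_samples), set(counted_samples)
    return counted - pheno, pheno - counted


def seer_informative(pheno_samples, counted_samples):
    """Return the seer .pheno samples that also have assembly data (group B)."""
    return set(pheno_samples) & set(counted_samples)


# Hypothetical sample names, for illustration only.
counted = ["s1", "s2", "s3", "s4"]
missing, extra = check_kmds_samples(["s1", "s2", "s3"], counted)
used = seer_informative(["s3", "s4", "s5"], counted)
print(sorted(missing), sorted(extra), sorted(used))
```

Here `s4` is missing from the kmds list, and `s5` has a phenotype but no assembly, so it falls in group C and does not contribute to the inference.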
