Missing data: what should be in the pheno file
John Lees edited this page Feb 19, 2016
·
1 revision
Missing data needs to be handled with care. All sequenced samples should be included for kmds. In contrast, seer may use any list of samples, and the phenotype can be different from that used for kmds (as long as the original unfiltered kmer count files are used as the seer input).
In summary, let X be the set of samples with assembly data and Yi the set of samples with phenotype i. The samples then fall into the following partitions:
- A: Samples with only assembly data (X \ Yi)
- B: Samples with both assembly data and a phenotype i (X ∩ Yi)
- C: Samples with only a phenotype i (Yi \ X)
kmds needs, in its .pheno file, precisely all assemblies that have been counted (A ∪ B = X)
seer can have any mix of B and C (i.e. any subset of Yi) in the .pheno file, though only those samples in B will be used for the inference. Samples in C are set to kmer presence = 0 by default, and samples in A are ignored. The value of the phenotype need not be the same as used for kmds.
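As a rough illustration of the partitions above, the following Python sketch splits sample names into A, B and C using set operations. The function name `partition_samples` and the sample names are hypothetical, and the sketch only works out which samples belong where; it does not read or write .pheno files or call kmds or seer.

```python
def partition_samples(assembled, phenotyped):
    """Split sample names into the partitions A, B and C described above."""
    x = set(assembled)      # X: samples with assembly data
    y_i = set(phenotyped)   # Yi: samples with phenotype i
    return {
        "A": x - y_i,       # only assembly data (X \ Yi)
        "B": x & y_i,       # assembly data and phenotype i
        "C": y_i - x,       # only phenotype i (Yi \ X)
    }


# Hypothetical sample names and phenotype values, for illustration only
assembled = ["s1", "s2", "s3", "s4"]
phenotype_i = {"s3": 1, "s4": 0, "s5": 1}

parts = partition_samples(assembled, phenotype_i)

# kmds needs every counted assembly in its .pheno file: A and B together
kmds_samples = sorted(parts["A"] | parts["B"])

# seer only uses samples in B for the inference
seer_used = sorted(parts["B"])
```

Checking that `kmds_samples` matches the list of counted assemblies before running kmds is a cheap way to catch samples that were dropped by mistake.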