Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No filtering was done! Exiting... #2

Open
mictadlo opened this issue Jun 11, 2021 · 5 comments
Open

No filtering was done! Exiting... #2

mictadlo opened this issue Jun 11, 2021 · 5 comments

Comments

@mictadlo
Copy link

Hi,
Thank you for sharing your cafe pipeline. However, I ran into the following problems:

> awk -F'\t' '{print "(null)\t"$0}' Orthogroups.GeneCount.tsv > tmp.tsv
> sed -i.bak 's|(null)|Desc|g' tmp.tsv

> head tmp.tsv
Desc	Orthogroup	Atha	Cann	Csat	Mgut	Natt	Nlab	Nobt	Noto	Nqld	Nsyl	Ntab	Ntom	Ptri	Slyc	Smel	Stub	Vvin	Total
Desc	OG0000000	0	0	0	0	0	3	0	0	982	0	0	1	0	0	0	986
Desc	OG0000001	0	9	0	0	5	23	8	59	3	31	267	14	0	6	113	540

> python2.7 ~/scripts/cafe/cafetutorial_clade_and_size_filter.py -i tmp.tsv -s -o cafe.input.tsv
No filtering was done! Exiting...

> conda activate cafe
> cafe
# load -i tmp.tsv reports/log_run1.txt
Failed to load file

What did I miss?

Thank you in advance

@harish0201
Copy link
Owner

Ah! It's been some time that I've looked at this, so apologies for the confusion ;)

In the tmp.tsv, as per CAFE's format, the first column should be the (null) with the header being Desc. So what you've shared should look like:

head tmp.tsv
Desc	Orthogroup	Atha	Cann	Csat	Mgut	Natt	Nlab	Nobt	Noto	Nqld	Nsyl	Ntab	Ntom	Ptri	Slyc	Smel	Stub	Vvin	Total
(null)	OG0000000	0	0	0	0	0	3	0	0	982	0	0	1	0	0	0	986
(null)	OG0000001	0	9	0	0	5	23	8	59	3	31	267	14	0	6	113	540

IIRC, this is what the awk portion actually does, only the "first column header" should be the "Desc".

Once this is done, you can run the cafetutorial_clade_and_size_filter.py script that CAFE has.

In any case, what it does is, it's going to remove the OGs/Groups if any of the species have more than 100 proteins in them.

Once this is done, you'd need to create a rooted tree. I recommend using Timetree.org or other age approaches (ape package in R is a good started)

Unfortunately, the way I've arranged the CAFE thing is essentially a guideline as to how I do it, rather than a complete pipeline because I kept running into issues :(

I might automate it some day, hopefully, but the use case for me is fairly too low to justify the time sunk into it.

@mictadlo
Copy link
Author

Thank you. The new file looks like this:

> head tmp.tsv
Desc	Orthogroup	Atha	Cann	Csat	Mgut	Natt	Nlab	Nobt	Noto	Nqld	Nsyl	Ntab	Ntom	Ptri	Slyc	Smel	Stub	Vvin	Total
(null)	OG0000000	0	0	0	0	0	3	0	0	982	0	0	1	0	0	0	0	0	986
(null)	OG0000001	0	9	0	0	5	23	8	59	3	31	267	14	0	6	113	2	0	540
(null)	OG0000002	0	128	3	20	21	21	8	42	21	36	70	37	3	25	38	39	22	534
(null)	OG0000003	1	66	12	20	29	17	12	42	21	30	60	21	5	19	36	84	42	517

The age (28 MYA) has been provided to me by a collaborator. Unfortunately, I got a new error:

> make_ultrametric.py -r 28 SpeciesTree_rooted.txt
> head SpeciesTree_rooted.txt.ultrametric.tre
(((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819);

python2.7 ~/scripts/cafe/cafetutorial_clade_and_size_filter.py -i tmp.tsv -s -o cafe.input.tsv
> cafe
# load -i cafe.input.tsv
-----------------------------------------------------------
Family information: cafe.input.tsv
Log: stdout
The number of families is 35104
Root Family size : 1 ~ 124
Family size : 0 ~ 149
P-value: 0.01
Num of Threads: 1
Num of Random: 1000
# tree (((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819);
(((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819)
No species 'total' was found in the tree

What did I miss?

Thank you in advance,

Michal

@harish0201
Copy link
Owner

Hey!

You have to remove the "Total" column, that won't be needed. The error specifies that :)

@mictadlo
Copy link
Author

Thank you,, I have forgotten to do awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > mod.tsv. Unfortunately, I got now Tree must be binary:

> awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > mod.tsv
> python2.7 ~/scripts/cafe/cafetutorial_clade_and_size_filter.py -i mod.tsv -s -o cafe.input.tsv  
>  cafe
# load -i cafe.input.tsv
-----------------------------------------------------------
Family information: cafe.input.tsv
Log: stdout
The number of families is 27493
Root Family size : 1 ~ 124
Family size : 0 ~ 149
P-value: 0.01
Num of Threads: 1
Num of Random: 1000
# tree (((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819);
(((Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727):3.92728,(Mgut:23.5518,((Cann:4.94278,((Stub:1.91372,Slyc:1.91372):1.99847,Smel:3.91219):1.0306):2.54854,((Natt:2.59231,((Nsyl:1.44399,Ntab:1.44399):0.826096,(Nlab:0.508663,Nqld:0.508663):1.76142):0.322218):0.530061,((Ntom:1.26249,Noto:1.26249):1.11482,Nobt:2.37732):0.745047):4.36896):16.0605):4.44819)
# lambda -s -t ((1,1)1,(1,1):1,1)
Tree must be binary

What did I miss?

Thank you in advance,

Michal

@harish0201
Copy link
Owner

harish0201 commented Jun 16, 2021

The lambda tree that you've pasted should look like the tree that you've generated (Ptri.....)

or rather taking a small portion of your ultrametric tree: (Ptri:21.5258,(Csat:19.4873,Atha:19.4873):2.03847):2.54693,Vvin:24.0727)

the lambda tree would look like: (1,(2,2)),1) assuming Csat is your species of interest. My recommendation is go through the CAFE tutorial once though, because it covers this in more detail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants