This repository contains data files and scripts to reproduce the analyses and results presented in our paper.
All relevant scripts listed below can be found within the protein/scripts folder.
- collect_coding_seqs2/run_ccs2_function.sh - BLAST wrapper to extract orthologous protein-coding sequences from genomes. To run the BLAST search with our scripts, you will need to download the genomes listed in the table protein/data/TRNP1_source_genomes.csv
- processingL_final.R - process our own primate sequence assemblies from targeted re-sequencing
- collect_coding_seqs2/Ferret_transcr_assembly_steps.sh and collect_coding_seqs2/Ferret_cutContig_makeFa.R - process the re-sequenced ferret sequence
- collect_coding_seqs.R - gather the orthologous TRNP1 protein-coding sequences from all included sources, intersect them with the available trait data, and save the sequences and traits for the downstream analyses
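The intersection of sequences with trait data can be illustrated with a small shell sketch. It is not taken from collect_coding_seqs.R. It assumes a trait table in CSV format with a header row and the species name in the first column, and FASTA headers that consist of ">" followed by that same species name. The function name and file layout are hypothetical.

```shell
# Keep only the FASTA records whose header matches a species in the trait table.
# Assumptions (illustrative, not the repository's actual file layout):
#   $1 = trait CSV with a header row, species name in column 1
#   $2 = FASTA file whose headers are ">" followed by the species name
filter_fasta_by_traits() {
    awk -F',' '
        # First file (trait table): remember every species, skipping the header row.
        NR == FNR { if (FNR > 1) keep[$1] = 1; next }
        # Second file (FASTA): on each header, decide whether to print this record.
        /^>/      { name = substr($0, 2); printing = (name in keep) }
        printing  { print }
    ' "$1" "$2"
}
```

A call such as `filter_fasta_by_traits traits.csv sequences.fa > sequences_with_traits.fa` (file names are placeholders) would then write only the records for species that have trait data.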
Multiple Alignments with PRANK (v150803)
- align_with_prank.sh - protein-coding sequence alignment
PAML (v4.8)
First, run the PAML site models as described in the README in the PAML folder.
- select_sign_sites_PAML_M8.R - pull out the identified sites under positive selection
COEVOL (v1.4)
- run_coevol.sh - wrapper to run Coevol; finish_coevol.sh - wrapper to stop Coevol and generate summaries
- summarize_coevol_output_TRNP1.R - access the estimated omega, correlations and posterior probabilities
Scripts for the evolutionary analysis of control proteins can be found in a separate folder, other_protein_alignments, which has its own README.
- proliferation_analysis.R - gather proliferation assay data, estimate proliferation rates using logistic regression, infer association with brain size and GI using PGLS
All relevant scripts listed below can be found within the regulation/scripts folder.
- MPRA/MPRA_sequences.R - identify and collect orthologous TRNP1 CRE sequences across mammals from our sequenced data as well as published genomes
- MPRA/MPRA_oligolib_construction.R - MPRA design: using a sliding window, construct enhancer tiles from the orthologous CRE sequences collected by the previous script, to be tested in the MPRA assay
- MPRA/preprocessing - MPRA count pre-processing: extract reporter gene expression counts for each included enhancer tile. This folder contains a README with further details
- MPRA/collect_MPRA_fastas.R - separate and save the relevant sequences from each of the 7 CRE regions, align using MAFFT (v7.407)
- MPRA/MPRA_analysis.R - filter and summarize CRE activities. Plug into PGLS and compare to brain mass and gyrification
- MPRA/combine_dnds_intron.R - combine TRNP1 protein evolution rates inferred using Coevol with the intron activity across catarrhines within the same model
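The sliding-window tiling mentioned for MPRA/MPRA_oligolib_construction.R can be sketched in a few lines of shell. This is only an illustration of the general idea, not the repository's implementation. The function name, the tile length and the step size are hypothetical parameters, and the actual script may treat sequence ends differently.

```shell
# Cut one sequence into overlapping tiles with a sliding window.
#   $1 = sequence, $2 = tile length, $3 = step size (all illustrative)
# Prints the 1-based start, the end, and the tile sequence, tab-separated.
# A trailing remainder shorter than a full tile is dropped in this sketch.
make_tiles() {
    printf '%s\n' "$1" | awk -v len="$2" -v step="$3" '{
        for (start = 1; start + len - 1 <= length($0); start += step)
            printf "%d\t%d\t%s\n", start, start + len - 1, substr($0, start, len)
    }'
}
```

For example, `make_tiles ACGTACGTAC 4 3` yields three tiles that start at positions 1, 4 and 7, each overlapping its neighbour by one base.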
- TFs/download_motifs_JASPAR2020.R - download PWMs and motif clustering from JASPAR 2020, transform PWMs for Cluster-Buster
- TFs/MPRNAseq_NPC.yaml - zUMIs (v2.5.4) yaml file for mapping RNA-seq reads from NPCs. Input raw data for this processing can be accessed under E-MTAB-9951
- TFs/TF_expression_analysis.R - find the expressed transcription factors in our NPCs (from bulk RNA-seq data). Run Cluster-Buster (Jun 13 2019) on the intron sequences, including only the PWMs of the expressed TFs, to identify overrepresented motifs
- TFs/PGLS_motifs.R - investigate the association of binding scores with intron CRE activity and GI among the 22 most abundant motifs on the intron sequence using PGLS
Tree construction: regulation/scripts/MPRA/tree_construction.R
Throughout the workflow, we use the job scheduling system slurm (v0.4.3).
Primer sequences for the resequencing of putative Trnp1 cis-regulatory elements as well as for the MPRA can be found in oligo_sequences/. For more information on the different tables, please see the README at oligo_sequences/README