-
Write a script that does the following with Python_06.txt:
- Open and read the contents.
- Uppercase each line.
- Print each line to STDOUT.
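One way to approach the three steps above is sketched here. It assumes Python_06.txt is in the current working directory; the function name `print_uppercase` is only illustrative and not part of the exercise.

```python
def print_uppercase(path):
    """Read a text file line by line and print each line in uppercase."""
    with open(path, "r") as handle:
        for line in handle:
            # Strip the trailing newline because print() adds its own.
            print(line.rstrip("\n").upper())
```

Calling `print_uppercase("Python_06.txt")` would then send the uppercased lines to STDOUT.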
-
Modify the script in the previous problem to write the contents to a new file called "Python_06_uc.txt".
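A possible sketch of the file-writing variant follows. The function name `write_uppercase` and its two parameters are assumptions made for illustration; opening both files in a single `with` statement closes them automatically.

```python
def write_uppercase(in_path, out_path):
    """Copy in_path to out_path with every line converted to uppercase."""
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            # upper() leaves the newline untouched, so it is written as-is.
            dst.write(line.upper())
```

For this problem the call would be `write_uppercase("Python_06.txt", "Python_06_uc.txt")`.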
-
Open and print the reverse complement of each sequence in Python_06.seq.txt. Each line has the following format:
seqName\tsequence\n
Make sure to print the output in FASTA format, including the sequence name and a note in the description that this is the reverse complement. Print to STDOUT and capture the output into a file with a command-line redirect '>'.
- Remember, it is always a good idea to start with a test set for which you know the correct output.
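The sketch below shows one way to build the reverse complement and print it as FASTA. It assumes every non-empty line holds exactly two tab-separated fields and that the sequences contain only A, C, G, T, or N; the function names and the wording of the description are illustrative choices.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N, either case)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    # Complement every base, then reverse the string with a slice.
    return seq.translate(table)[::-1]


def print_revcomp_fasta(path):
    """Print each 'name<TAB>sequence' line as a FASTA record of its reverse complement."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            name, seq = line.split("\t")
            print(f">{name} reverse complement")
            print(reverse_complement(seq))
```

A small known case makes a handy test set: `reverse_complement("ATGC")` should give `"GCAT"`.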
-
Open the FASTQ file Python_06.fastq and go through each line of the file. Count the number of lines and the number of characters per line. Have your program report the:
- total number of lines
- total number of characters
- average line length
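A minimal sketch of the three totals follows. It assumes that the newline at the end of each line is not counted as a character, which the problem leaves open; the function name `line_stats` is illustrative.

```python
def line_stats(path):
    """Return (number of lines, number of characters, average line length)."""
    n_lines = 0
    n_chars = 0
    with open(path) as handle:
        for line in handle:
            n_lines += 1
            # Assumption: the trailing newline is not counted as a character.
            n_chars += len(line.rstrip("\n"))
    # Guard against an empty file to avoid dividing by zero.
    average = n_chars / n_lines if n_lines else 0.0
    return n_lines, n_chars, average
```

The result can be unpacked and reported, for example with `lines, chars, avg = line_stats("Python_06.fastq")`.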
-
You are going to generate a couple of gene lists that are saved in files, add their contents to sets, and compare them.
Generate Gene Lists:
Get all genes:
- Go to Ensembl Biomart.
- In dropdown box, select "Ensembl Genes 94"
- In dropdown box, select "Alpaca Genes"
- On the left, click Attributes
- Expand GENE:
- Deselect "transcript stable ID".
- Click Results (top left)
- Export all results to "File" "TSV" --> GO
- Rename the file to "alpaca_all_genes.tsv"
In the same Ensembl window, follow the steps below to get genes that have been labeled with Gene Ontology term "stem cell proliferation". For extra information on stem cell proliferation, check out stem cell proliferation
- Click "Filters"
- Under "Gene Ontology", check "GO Term Name" and enter "stem cell proliferation"
- Click Results (top left)
- Export all results to "File" "TSV" --> GO
- Rename the file to "alpaca_stemcellproliferation_genes.tsv"
In the same Ensembl window, follow the steps below to get genes that have been labeled with Gene Ontology term "pigmentation". For extra information on pigmentation, check out pigmentation
- Click "Filters"
- Under "Gene Ontology", check "GO Term Name" and enter "pigmentation"
- Click Results (top left)
- Export all results to "File" "TSV" --> GO
- Rename the file to "alpaca_pigmentation_genes.tsv"
Open each of the three files and add the geneIDs to a Set. One Set per file.
A. Find all the genes that are not stem cell proliferation genes.
B. Find all genes that are both stem cell proliferation genes and pigment genes.
Note: Make sure NOT to add the header to your set.
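The sketch below shows one way to load a gene list into a set while skipping the header, followed by the two set operations. It assumes the gene ID is in the first tab-separated column and that the first line is the header; the IDs in the demonstration sets are made-up placeholders, not real Ensembl data.

```python
def read_gene_ids(path):
    """Return the set of gene IDs in the first column of a TSV, skipping the header."""
    genes = set()
    with open(path) as handle:
        next(handle, None)  # discard the header line (no error on an empty file)
        for line in handle:
            line = line.rstrip("\n")
            if line:
                genes.add(line.split("\t")[0])
    return genes


# Placeholder IDs to show the set operations; real sets would come from
# read_gene_ids("alpaca_all_genes.tsv") and the other two files.
all_genes = {"geneA", "geneB", "geneC", "geneD"}
stem = {"geneA", "geneB"}
pigment = {"geneB", "geneC"}

not_stem = all_genes - stem   # A: difference, genes absent from the stem cell set
both = stem & pigment         # B: intersection, genes present in both sets
```

The `-` and `&` operators are equivalent to the `difference()` and `intersection()` set methods.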
Now, let's do it again with transcription factors.
- Go back to your Ensembl Biomart window
- Deselect the "GO Term Name"
- Select "GO Term Accession"
- Enter these two accession IDs, which in most organisms will cover all the transcription factors
- GO:0006355 is "regulation of transcription, DNA-dependent".
- GO:0003677 is "DNA binding"
- Click Results (top left)
- Export all results to "File" "TSV" --> GO
- Rename the file to "alpaca_transcriptionFactors.tsv"
Open these two files: 1) the transcription factor gene list file and 2) the stem cell proliferation gene list file. Add each to a Set, one Set per file.
A. Find all the stem cell proliferation genes that are also transcription factors.
Now do the same on the command line with the comm command. You might need to sort each file first.
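Because comm compares sorted files line by line, it can help to write each gene set to a file with one ID per line. The sketch below is one way to do that from Python; the function name `write_sorted_ids` is illustrative. Python's `sorted()` orders strings by code point, which is assumed here to match the ordering comm expects for plain uppercase-and-digit IDs (running the shell tools with `LC_ALL=C` makes that ordering explicit).

```python
def write_sorted_ids(gene_ids, out_path):
    """Write gene IDs one per line in sorted order, ready for line-based comparison."""
    with open(out_path, "w") as handle:
        for gene_id in sorted(gene_ids):
            handle.write(gene_id + "\n")
```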
- get the raw file Python_06.seq.txt
- in a script, open this file
- iterate over each line in this file (seqName\tsequence\n)
- for each sequence:
- calculate and store the count of each unique nucleotide character in a dictionary
- report the name, total of each nucleotide count, and the GC content
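The steps above can be sketched as follows. The sketch assumes two tab-separated fields per line and defines GC content as the fraction of G and C among all characters in the sequence; the function names and the layout of the report line are illustrative choices.

```python
def nucleotide_counts(seq):
    """Count each distinct character of seq in a dictionary."""
    counts = {}
    for nt in seq:
        counts[nt] = counts.get(nt, 0) + 1
    return counts


def gc_content(counts):
    """Return the fraction of G and C among all counted characters."""
    total = sum(counts.values())
    gc = counts.get("G", 0) + counts.get("C", 0)
    return gc / total if total else 0.0


def report_composition(path):
    """Print name, per-nucleotide counts, and GC content for each line of path."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            name, seq = line.split("\t")
            counts = nucleotide_counts(seq.upper())
            summary = ",".join(f"{nt}:{counts[nt]}" for nt in sorted(counts))
            print(f"{name}\t{summary}\t{gc_content(counts):.2f}")
```

Using `counts.get(nt, 0)` avoids a separate `if` test for characters that have not been seen yet.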
Extra: Now that you know how to open a file and iterate over each line, you can write your first FASTA parser.
- use file I/O, if statements, and dictionaries to write your first FASTA parser. Some other useful functions and methods are find, split, and string concatenation.
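A minimal sketch of such a parser is shown below. It assumes every header line starts with '>' followed by a name, and it uses the first whitespace-delimited word as the dictionary key; the function name `parse_fasta` is illustrative.

```python
def parse_fasta(path):
    """Return a dict mapping each sequence ID to its full sequence."""
    sequences = {}
    seq_id = None
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Header line: keep the first word after '>' as the ID.
                seq_id = line[1:].split()[0]
                sequences[seq_id] = ""
            elif seq_id is not None and line:
                # Sequence line: concatenate onto the current record.
                sequences[seq_id] += line
    return sequences
```

Sequences that span several lines are joined into one string per ID, so multi-line records are handled without any extra state.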