This program uses genome sequencing data for Synechococcus elongatus from GenBank to devise a computational measure of expected intraspecies genomic similarity.
I used an element-wise comparison of two genome sequences as the similarity measure. For two random genomes of 2940 nucleotides each, this measure has the following probability distribution (probability vs. similarity):
I then analyzed 100 sampled genomes of the bacterium, which gives 4950 pairwise comparisons, and generated the following histograms of the sampling distribution (frequency vs. similarity and frequency vs. z-score, respectively):
These results suggest that a z-score of approximately 91.380 can be expected as a measure of intraspecies genomic similarity. Possible avenues of further research include more sophisticated similarity measures (e.g. cosine similarity, pairwise nucleotide comparison, weighted comparisons) and more realistic constraints on which genomes count as viable.
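As a rough illustration of the pipeline described above, here is a minimal Python sketch. It is not the program's actual code. It assumes that similarity is the count of matching positions between two equal-length, already aligned sequences, that a "random genome" draws each nucleotide independently and uniformly from A, C, G, T (so the null distribution is Binomial(n, 0.25)), and that z-scores are measured against that null distribution. The function names `similarity`, `null_log_pmf`, and `pairwise_z_scores` are hypothetical.

```python
from itertools import combinations
from math import exp, lgamma, log, sqrt


def similarity(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b))


def null_log_pmf(n: int, k: int, p: float = 0.25) -> float:
    """Log-probability of exactly k matches between two random length-n sequences.

    Assumption: each position matches independently with probability p,
    i.e. the match count is Binomial(n, p). Working in log space avoids
    floating-point underflow for genome-scale n.
    """
    log_binom = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_binom + k * log(p) + (n - k) * log(1 - p)


def pairwise_z_scores(genomes: list[str], p: float = 0.25) -> list[float]:
    """Z-score of every pairwise similarity relative to the assumed null model."""
    n = len(genomes[0])
    mean = n * p                   # expected matches under the null
    std = sqrt(n * p * (1 - p))    # standard deviation under the null
    return [
        (similarity(a, b) - mean) / std
        for a, b in combinations(genomes, 2)
    ]


if __name__ == "__main__":
    # Toy sequences for illustration only; real input would come from GenBank.
    toy = ["ACGT", "ACGA", "TCGA"]
    print(pairwise_z_scores(toy))
    # Probability of exactly 735 matches out of 2940 under the null model.
    print(exp(null_log_pmf(2940, 735)))
```

With 100 input sequences, `combinations` yields 100 · 99 / 2 = 4950 pairs, which matches the number of comparisons reported above. Under the assumed null model for n = 2940, the expected match count is 735 with a standard deviation of about 23.5.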