prepareGenome #38

Closed
nservant opened this issue Jun 15, 2020 · 6 comments

@nservant

Hi,
When I run the SNPsplit_genome_preparation script on the complete mouse genome (base chromosomes plus all scaffolds/fixes) with --no_nmasking, the full_sequence output contains only the base chromosomes.

My reference genome comes from:

 ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz

>>grep ">" Mus_musculus.GRCm38.dna.toplevel.fa 
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
>2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
>3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
>4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
>5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
>6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
>7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
>8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
>9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
>10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
>11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
>12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
>13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
>14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
>15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
>16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
>17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
>18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
>19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
>X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
>Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
>MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
>CHR_MG171_PATCH dna:chromosome chromosome:GRCm38:CHR_MG171_PATCH:1:151834685:1 PATCH_FIX
>CHR_MG4222_MG3908_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4222_MG3908_PATCH:1:94987243:1 PATCH_FIX
>CHR_MG51_PATCH dna:chromosome chromosome:GRCm38:CHR_MG51_PATCH:1:156507375:1 PATCH_FIX
>CHR_MG3496_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3496_PATCH:1:195440828:1 PATCH_FIX
>CHR_MG4200_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4200_PATCH:1:94983374:1 PATCH_FIX
>CHR_MG4243_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4243_PATCH:1:156484188:1 PATCH_FIX
>CHR_MG4209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4209_PATCH:1:91793962:1 PATCH_FIX
>CHR_MG74_PATCH dna:chromosome chromosome:GRCm38:CHR_MG74_PATCH:1:104052134:1 PATCH_FIX
>CHR_MG4310_MG4311_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4310_MG4311_PATCH:1:156656003:1 PATCH_FIX
>CHR_MG4249_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4249_PATCH:1:61433356:1 PATCH_FIX
>CHR_MG3833_MG4220_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3833_MG4220_PATCH:1:98208654:1 PATCH_FIX
>CHR_MG3231_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3231_PATCH:1:171029545:1 PATCH_FIX
>CHR_MG4151_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4151_PATCH:1:145439975:1 PATCH_FIX
>CHR_MG104_PATCH dna:chromosome chromosome:GRCm38:CHR_MG104_PATCH:1:170913546:1 PATCH_FIX
>CHR_MMCHR1_CHORI29_IDD5_1 dna:chromosome chromosome:GRCm38:CHR_MMCHR1_CHORI29_IDD5_1:1:195506435:1 PATCH_NOVEL
>CHR_MG3700_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3700_PATCH:1:90658154:1 PATCH_FIX
>CHR_MG3530_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3530_PATCH:1:130695022:1 PATCH_FIX
>CHR_MG4261_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4261_PATCH:1:103906836:1 PATCH_FIX
>CHR_MG3251_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3251_PATCH:1:145419646:1 PATCH_FIX
>CHR_MG3562_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3562_PATCH:1:156470354:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG4 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG4:1:122190308:1 PATCH_NOVEL
>CHR_WSB_EIJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG2:1:122242168:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG2:1:122246885:1 PATCH_NOVEL
>CHR_MG3648_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3648_PATCH:1:104165524:1 PATCH_FIX
>CHR_MG3618_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3618_PATCH:1:156477076:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG5 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG5:1:122035401:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG3:1:122032376:1 PATCH_NOVEL
>CHR_MG4136_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4136_PATCH:1:156508116:1 PATCH_FIX
>CHR_MG4138_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4138_PATCH:1:130620757:1 PATCH_FIX
>CHR_MG3835_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3835_PATCH:1:90835696:1 PATCH_FIX
>CHR_MG89_PATCH dna:chromosome chromosome:GRCm38:CHR_MG89_PATCH:1:159939961:1 PATCH_FIX
>CHR_MG4213_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4213_PATCH:1:91736668:1 PATCH_FIX
>CHR_MG3829_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3829_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG209_PATCH:1:94987270:1 PATCH_FIX
>CHR_WSB_EIJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG3:1:122041104:1 PATCH_NOVEL
>CHR_MG4308_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4308_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG3609_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3609_PATCH:1:156568640:1 PATCH_FIX
>CHR_MG4180_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4180_PATCH:1:120129530:1 PATCH_FIX
>CHR_MG3686_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3686_PATCH:1:170897390:1 PATCH_FIX
>CHR_MG65_PATCH dna:chromosome chromosome:GRCm38:CHR_MG65_PATCH:1:61442615:1 PATCH_FIX
>CHR_MG3627_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3627_PATCH:1:124903046:1 PATCH_FIX
>CHR_MG3999_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3999_PATCH:1:195424274:1 PATCH_FIX
>CHR_MG3699_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3699_PATCH:1:90657263:1 PATCH_FIX

Command line:

SNPsplit_genome_preparation --strain CAST_EiJ --reference_genome genome/ --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --no_nmasking

Output:

>>grep ">" CAST_EiJ_maternal_genome.fa 
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>1
>2
>3
>4
>5
>6
>7
>8
>9
>MT
>X
>Y
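
The comparison of the two `grep ">"` listings above can also be automated. The following is only a minimal sketch, not part of SNPsplit: the helper names `fasta_names` and `missing_sequences` are hypothetical, and it assumes the sequence name is the first whitespace-delimited word of each header line.

```python
def fasta_names(lines):
    """Return the sequence names (first word after '>') in file order."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' header with no name
                names.append(fields[0])
    return names


def missing_sequences(reference_lines, output_lines):
    """Names present in the reference FASTA but absent from the output FASTA."""
    produced = set(fasta_names(output_lines))
    return [name for name in fasta_names(reference_lines) if name not in produced]


# Tiny in-memory stand-ins for the two FASTA files (illustrative data only).
reference = [
    ">1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF",
    "ACGT",
    ">MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF",
    "ACGT",
    ">CHR_MG171_PATCH dna:chromosome chromosome:GRCm38:CHR_MG171_PATCH:1:151834685:1 PATCH_FIX",
    "ACGT",
]
output = [">1", "ACGT", ">MT", "ACGT"]

print(missing_sequences(reference, output))  # ['CHR_MG171_PATCH']
```

With real data, open file handles can be passed instead of the lists, since iterating over a text file yields its lines.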

I think it would be good to export all chromosomes, even if they have no SNPs.
From Ensembl: Fix patches: provide improved sequence for known assembly errors. These patches will be incorporated into the primary assembly in the next major assembly release. They are coloured green in the Chromosome summary page and Region in detail page. They are improvements on the primary assembly and should be used preferentially over the primary assembly.

Thanks @FelixKrueger !
Nicolas

@FelixKrueger
Owner

Hi Nicolas,

I have now tried to change the behaviour to print out all chromosomes, even if they were not covered by SNPs. Could you give it a whirl and see if it appears to do what you wanted? Addressed here: 9a81c16
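
The intended behaviour (every reference sequence is written out, whether or not any SNP falls on it) can be sketched as follows. This is an illustration only and is not taken from the SNPsplit source: the function `build_strain_genome`, the dictionary layout, and the 1-based `(position, alt_base)` tuples are all assumptions made for the example.

```python
def build_strain_genome(reference, snps_by_chrom):
    """Apply SNPs per sequence; sequences without SNPs are copied unchanged.

    reference:     dict mapping sequence name -> sequence string
    snps_by_chrom: dict mapping sequence name -> list of (1-based pos, alt base)
    """
    result = {}
    for name, seq in reference.items():
        bases = list(seq)
        # .get(..., []) means a sequence with no SNP entries is still emitted.
        for pos, alt in snps_by_chrom.get(name, []):
            bases[pos - 1] = alt
        result[name] = "".join(bases)
    return result


# Hypothetical toy data: one chromosome with SNPs, one patch without any.
reference = {"1": "ACGTACGT", "CHR_MG171_PATCH": "TTTTGGGG"}
snps = {"1": [(2, "T"), (5, "G")]}

genome = build_strain_genome(reference, snps)
print(genome["1"])                # ATGTGCGT
print(genome["CHR_MG171_PATCH"])  # TTTTGGGG (unchanged, but still present)
```

The key point is that the loop runs over the reference sequences rather than over the sequences that appear in the SNP data, so nothing is silently dropped.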

@nservant
Author

nservant commented Jul 9, 2020

Hi @FelixKrueger,
I ran the new version.

SNPsplit_genome_preparation --strain CAST_EiJ --reference_genome genome --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf

Two things:

  • In the log, only the base chromosomes are still listed under "Using the following chromosomes (...)"
  • In the summary I see "0 Ns were newly introduced into the N-masked genome for strain CAST_EiJ in total"

Is this expected?
Otherwise, I do have all chromosomes as expected in the results folder.
Cheers

@FelixKrueger
Owner

Hmm, the N-masking seems to work fine if you specify --full_sequence as well. I'll take another look tomorrow.

@FelixKrueger
Owner

Right, it was ... - a scoping issue. It should work now; could you clone the dev version and try again? Addressed here: 0e4431e.

@nservant
Author

Yes. Much better now!

Summary
20668547 Ns were newly introduced into the N-masked genome for strain CAST_EiJ in total
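
A figure like this can be sanity-checked independently by comparing the N-masked sequence against the reference. The sketch below is a guess at what "newly introduced" means (positions that are N in the masked genome but not in the reference); whether SNPsplit counts it exactly this way is an assumption, and `count_new_ns` is a hypothetical helper.

```python
def count_new_ns(reference_seq, masked_seq):
    """Count positions that are N in masked_seq but were not N in reference_seq."""
    if len(reference_seq) != len(masked_seq):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for ref_base, masked_base in zip(reference_seq, masked_seq)
        if masked_base in "Nn" and ref_base not in "Nn"
    )


# Toy example: the N at index 4 already exists in the reference,
# so only the Ns at indices 1 and 7 count as new.
print(count_new_ns("ACGTNACGT", "ANGTNACNT"))  # 2
```

Summing this over all sequences of the two genomes gives a total that can be compared with the summary line.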

@FelixKrueger
Owner

Awesome, I'll leave this open for a few more days to give you some time to test. It will then find its way into the next release.
