Creating .hic and .assembly for editing in juicebox #4
Comments
Hi Amanda, Now
You can run Best, |
I should have also mentioned that,
Chenxi |
Hi Chenxi and Amanda, I have successfully produced an editable combination of .hic and .assembly files for two different vertebrate genomes (using the updated juicer_pre with the -a option, etc.). They work well in Juicebox, and I can see the contigs/scaffolds within each super-scaffold. I made some edits in one of them (fixing a few within-scaffold translocations and inversions). Then I saved the .review.assembly file. However, I would like to generate a new FASTA from it, and this is hard without the merged_nodups.txt file. Should I be doing something differently? Ideally, I would like to generate the .assembly, .agp and .hic for every round of curation, documenting each decision taken. Sorry, I am a neophyte curator and I have loads of related questions that I should probably post somewhere else. But here are a few:
Thanks for the help, |
Hi Fernando, I have never used Juicebox curation mode, so I do not know the answers to the questions related to it. Sorry about that. But I will try to answer the other questions.
2. Remappings for curation against yahs_scaffolds_final.fa should include supplemental and secondary alignments. Once they are done can i just get the binary file, .assembly and .hic? how?
From my understanding, you should not map against the scaffolds if you need an editable .hic file. The Juicebox assembly mode always requires the Hi-C reads mapped to contigs. I guess the best way to do this is to generate an AGP file for curated scaffolds and use it as input to
3. Could we use yahs on a different assembly (e.g. salsa2) to simply generate a .assembly and .hic for juicebox?
Yes and no. As long as you have an AGP file, you should be fine to run
scaffold_5 34451147 37999945 7 W ptg000087l_1_1 1 3548799 +
scaffold_16 19587586 21092835 5 W ptg000087l_1_2 1 1505250 -
You see the split contig ended up with two new contigs
scaffold_5 34451147 37999945 7 W ptg000087l_1 1 3548799 +
scaffold_16 19587586 21092835 5 W ptg000087l_1 3548800 5054049 -
Best, |
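The AGP lines quoted above can be sanity-checked before they are fed to any downstream tool. The following is a minimal sketch, not part of YaHS: the function name check_agp_spans is hypothetical, and it only assumes the standard AGP column order (object, start, end, part number, type, component, component start, component end, orientation). It reports component (W) lines whose span on the scaffold differs from the span on the contig.

```shell
#!/usr/bin/env bash
# Hypothetical helper: flag AGP "W" lines where the scaffold span
# (col3 - col2 + 1) differs from the contig span (col8 - col7 + 1).
# Exits non-zero if at least one mismatch is found.
check_agp_spans() {
    awk '$5 == "W" {
        obj_len = $3 - $2 + 1
        cmp_len = $8 - $7 + 1
        if (obj_len != cmp_len) {
            printf "MISMATCH line %d: %s (%.0f) vs %s (%.0f)\n", NR, $1, obj_len, $6, cmp_len
            bad = 1
        }
    }
    END { exit bad }' "$1"
}

# Demo on the two rewritten lines quoted in the comment above.
agp=$(mktemp)
printf '%s\n' \
    'scaffold_5 34451147 37999945 7 W ptg000087l_1 1 3548799 +' \
    'scaffold_16 19587586 21092835 5 W ptg000087l_1 3548800 5054049 -' > "$agp"
check_agp_spans "$agp" && echo "all component spans consistent"
rm -f "$agp"
```

Both quoted lines pass this check, since 37999945 - 34451147 + 1 = 3548799 and 21092835 - 19587586 + 1 = 5054049 - 3548800 + 1 = 1505250.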
These directions to use the underlying contigs work great, Chenxi. Thanks for the enhancement and additional tips - I really appreciate it! |
Hi, I am consistently getting this error for a mammalian genome (2.5-3 Gb) when I try to generate the .hic file using juicer_tools:
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
This is a Juicer error that at some point seemed to be solvable with an upgrade (aidenlab/juicer#231 (comment)). I realized I was using the latest version (Juicer Tools Version 1.22.01). Just in case, this morning I created a new Juicer install: I basically copied the previous Juicer installation and replaced juicer_tools.jar in the Juicer scripts folder with a link to v1.9.9.
wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar
Same error. The contacts should be there, because the resulting genome assembly is extremely good: aligning it to a chromosome-level assembly of a closely related species, I see very few rearrangements and very similar chromosome (super-scaffold) sizes. I would like to inspect it and verify a couple of inversions. Do you have any idea how to fix this problem for Juicebox compatibility? I know it goes a bit beyond yahs support, but it would be necessary to curate and publish these genomes. Perhaps it is due to the way I am running yahs: I am not using run_yahs.sh exactly, but built my own .sh, placing the commands step by step (see below). However, for two other genomes (1-1.6 Gb) I got editable .hic and .assembly files. Thanks,
PS. I have not found time yet to get a .fasta from the .review. I want to try mapping all reads to the input assembly with Juicer and see if the mnd file works. I'll update here as soon as I can.
run_yash_fernando.sh
01. Load YAHS & run assembly with BAM file
export PATH=/scratch/project/devel/aateam/src/yahs_2021_12_21:$PATH; mkdir -p out; yahs assembly.fa mapped.PT.name_sorted.bam -o out/yahs
02. Generate .assembly HiC contact maps for juicer
cd out;
Generate .assembly for Juicebox using yahs' juicer_pre (the 2021-12-20 release accepts the -a option)
Generate a new .hic file once the length of the total assembly is known
|
Hello Fernando, Sorry for the late reply. I was not very active during the break. I guess your problem here is Best, |
Thanks Chenxi, I wasn't aware of the 2Gb limit for Juicebox (I guess it is just to allow visualization). Sorry, because you had already written all of this clearly in scripts/run_yahs.sh. It is a bit annoying to keep track of the original coordinates in the scaled view, but now it works and I can make edits on genomes larger than 2Gb! In my case the scaled size is about half the assembly length (2424933882 bp), thus PRE_C_SIZE: assembly 1212463441 - does this make sense to you?
I am a bit confused about a small detail. So far I was passing contigs.fai to juicer_pre and then running juicer_tools as stated above (#4 (comment)) without using chrom.sizes at all...
java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441")
But I noticed you also pass a file with the scaled chrom sizes instead of the total PRE_C_SIZE assembly. Why is this? Do these two approaches have different implications?
approach 1 directly feeds the total scaled length (what I did now):
approach 2 is what you wrote at the end of run_yahs.sh:
this is to generate input file for juicer_tools - assembly (JBAT) mode (-a)
./juicer_pre -a -o ${outdir}/${out}_JBAT ${outdir}/${out}.bin ${outdir}/${out}_scaffolds_final.agp ${contigs}.fai 2>${outdir}/tmp_juicer_pre_JBAT.log
Thanks again, |
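For keeping track of original coordinates in the scaled view, a small conversion helper may be enough. This is only a sketch under a stated assumption: that the scaled view divides every coordinate by a single integer scale factor (2 in the case described above, 1 for assemblies under the limit). The function name scaled_to_original is hypothetical, and the result is only accurate to within one scale factor of base pairs because the remainder is lost when scaling down.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a coordinate seen in the scaled JBAT view back
# to an approximate coordinate in the unscaled assembly.
# ASSUMPTION: scaling is a plain integer division by one scale factor.
scaled_to_original() {
    local scaled=$1 scale=$2
    echo $(( scaled * scale ))
}

# Example: a position picked in the scaled view, with scale factor 2.
scaled_to_original 600000000 2
```

The scale factor itself should be taken from the juicer_pre log rather than guessed.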
Yes, it makes sense. For the other question, Best, |
I just noticed the way you did it, asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) Did you also generate the asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) Chenxi |
Hi Chenxi, Thanks for your quick reply. You're right, approach 1 does not make sense the way I wrote it. I copied the lines wrong. What I exactly did was: So I am just passing the total scaled length to juicer_tools.jar. For another genome I previously stored the scaled size in $asmlen (see below) and passed it the same way with the echo. I was not using the .chrom.sizes at all.
To store asmlen: asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3)
My question is whether I should pass all the scaled scaffold lengths (.chrom.sizes) or whether it works well using <(echo "assembly $asmlen") instead. That is what I understood here #4 (comment) when you wrote:
java -Xmx36G -jar ${juicer_tools} pre test.txt test.hic <(echo "assembly 183277074")
Does it work equally well when passing the total scaled length? Thanks, |
Yes, the total scaled size should be used. Chenxi |
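The exchange above boils down to reading the scaled total from the juicer_pre log and handing it to juicer_tools as a one-line chrom.sizes. A minimal sketch of the extraction step follows; the function name get_asmlen is hypothetical, and the log line format ("PRE_C_SIZE: assembly <length>") is taken from the commands quoted in this thread. The juicer_tools call is left as a comment because it needs the real input files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the scaled assembly length recorded in a
# juicer_pre log, i.e. the third field of the "PRE_C_SIZE: assembly" line.
get_asmlen() {
    awk '/PRE_C_SIZE: assembly/ { print $3; exit }' "$1"
}

# Demo with a stand-in log file holding the line reported in the thread.
log=$(mktemp)
echo "PRE_C_SIZE: assembly 1212463441" > "$log"
asmlen=$(get_asmlen "$log")
echo "assembly $asmlen"
rm -f "$log"

# The value is then passed as in the commands quoted above, e.g.:
# java -Xmx36G -jar ${juicer_tools} pre test.txt test.hic <(echo "assembly $asmlen")
```

Using awk with an early exit avoids the cat | grep | cut chain and guards against the pattern appearing more than once in the log.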
Great, I wasn't understanding well the run_yahs.sh command. Thanks for clarifying, |
Hi @c-zhou, |
Hi @schellt @Astahlke @gitcruz, In the latest commits 63baff3 and c854cb7, I tried to make YaHS more friendly to manual editing with Juicebox JBAT. Changes include,
Please check the Manual curation with Juicebox (JBAT) section in the README for detailed information. I should also point out that this version is not fully compatible with previous versions. However, if your total genome assembly size is smaller than 2Gb (where the scale factor is 1), the review assembly file generated by JBAT is all good. You only need to rerun I only did some simple tests; if you find any problems or have any further questions, please let me know. Best, |
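Whether an assembly falls under the 2Gb limit mentioned above can be checked from the contig index before building any maps. The sketch below assumes a standard samtools .fai file (sequence name in column 1, length in column 2) and assumes the limit is 2147483647 bp, the largest signed 32-bit integer; the thread itself only says "2Gb". The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: sum sequence lengths from a .fai index and test
# the total against an ASSUMED limit of 2^31 - 1 bp.
fai_total_length() {
    awk '{ sum += $2 } END { printf "%.0f\n", sum }' "$1"
}

needs_scaling() {
    [ "$(fai_total_length "$1")" -gt 2147483647 ]
}

# Demo with a stand-in index of two contigs.
fai=$(mktemp)
printf 'ctg1\t1500000000\t6\t60\t61\nctg2\t1000000000\t1525000012\t60\t61\n' > "$fai"
echo "total length: $(fai_total_length "$fai")"
if needs_scaling "$fai"; then
    echo "above the assumed limit: expect a scale factor greater than 1"
fi
rm -f "$fai"
```

The total is printed with %.0f rather than %d because some awk implementations clamp %d at 32 bits.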
Hi @c-zhou , |
Hi, I am trying to generate the Hi-C contact map to be visualised in JBAT.
However, when trying to go ahead with the following command: I obtain this error:
I tried to run the command both using juicer_tools.1.9.9_jcuda.0.8.jar and the last version of juicer_tools.jar, and the error is always the same. I really appreciate any help! Thank you in advance. Best regards, Lia |
Hello Lia, Thanks for using YaHS. In this command Best, |
I will close this issue for now. Feel free to reopen it whenever needed. |
Hi @LiaOb21, I also remember having the error below:
Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
at juicebox.tools.HiCTools.main(HiCTools.java:96)
This is more or less what worked for me to get rid of it:
1. Make sure you install Juicer version 1.6 (or any other version recommended by Chenxi).
2. Then, inside the Juicer-for-Yahs/SLURM/scripts directory, remove juicer_tools.jar.
3. Install the recommended version inside Juicer-for-Yahs/SLURM/scripts by running: wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar
4. Then link it as juicer_tools.jar inside that directory by running: ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar
5. Yahs will look for juicer_tools.jar, so before running Yahs add that directory to the PATH: export PATH=/path-to-Juicer-for-Yahs/SLURM/scripts/:$PATH
Run Yahs again. Hope it helps, Fernando |
Hi Fernando, Thank you so much for your help. I resolved it by following the suggestion of @c-zhou about the chromosome size file. Thank you anyway. Cheers, Lia |
Cool! Fernando |
Hi @c-zhou!! This thread helped me a lot with creating the files for manual curation. However, after manual curation, I'm still having issues producing a final assembly, for a reason that is not directly related to YaHS or the Juicer version distributed with it, and I want to know if there's a workaround in the post-curation steps. Overall, the assembly looks good, but I have a lot of sequences that are outside of the contact map (see image). I ran the assembly with Hifiasm in Hi-C integrated mode, and then ran purge_dups, manually setting the cutoffs. After that I mapped the Hi-C reads to the assemblies and ran YaHS. I'm no expert in the matter, so I would like to know if there is a way to remove those sequences outside the contact map (sequences after approximately the 1.5Gb position) after manual curation and produce a new Hi-C contact map without them, without having to rerun the whole assembly pipeline. Thanks for your insights on this situation, and thanks for YaHS as well, awesome program!!! Mateo |
Is there any way to get the post-review step done without the Hi-C mapping? I use HapHiC, which is another scaffolder, so I don't have the merged_nodups.txt, but I have reviewed my assembly and I have a Hi-C mapping file. Or should I rerun juicer pre? |
Hi Mateo @malvaradol, Sorry for the delayed reply. You can directly edit the AGP file if you want to remove some sequences from your assembly. For example, if you only want to keep the first 1000 scaffolds, you can find the last line of scaffold_1000 in your AGP file and use head to keep only the rows up to that line, i.e., something like head -2000 OLD.agp >NEW.agp. Once you get the new AGP file, you can generate a FASTA file with agp_to_fasta and make new HiC maps as above.
A more general approach is to make a list of the scaffolds you want to keep, e.g., scaffold_1, scaffold_2, ..., one scaffold per line, in the file SCF.list. Then you could run something like awk 'NR==FNR{s[$1]=1; next}{if(s[$1]) print}' SCF.list OLD.agp >NEW.agp
Regarding your HiC map, I am not sure what happened to those small fragments. They could be contamination or very repetitive sequences that did not align well. Best, Chenxi |
Also, for people who get the error about one particular fragment having a 0 value: you need to change the 0 to 1, so that the contig/debris/fragment/entry in the AGP will be 1 base long and not 0. This makes the script that generates the FASTA from the AGP work well. |
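The scaffold-list approach described above can be wrapped in a small function and tried on toy data before touching a real AGP. This is a sketch, not part of YaHS: the function name filter_agp is hypothetical, and the file names follow the SCF.list / OLD.agp / NEW.agp placeholders used in the comment. It keeps every AGP row (components and gaps alike) whose first column appears in the list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only AGP rows whose scaffold name (column 1)
# is listed, one name per line, in the first file.
# Note: the list file must not be empty, or awk would treat the AGP
# itself as the list.
filter_agp() {
    awk 'NR == FNR { keep[$1] = 1; next } ($1 in keep)' "$1" "$2"
}

# Demo with stand-in files.
list=$(mktemp); old=$(mktemp)
printf 'scaffold_1\nscaffold_2\n' > "$list"
printf '%s\n' \
    'scaffold_1 1 100 1 W ctg1 1 100 +' \
    'scaffold_2 1 50 1 W ctg2 1 50 -' \
    'scaffold_3 1 10 1 W ctg3 1 10 +' > "$old"
filter_agp "$list" "$old"
rm -f "$list" "$old"
```

Using `$1 in keep` rather than `s[$1]` avoids creating empty array entries for scaffolds that are not in the list, which matters little here but keeps memory flat on large AGP files.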
Thanks for the comment @Isoris. Regarding your HiC map question above, you need to rerun juicer pre (with the HiC alignment file and AGP file) and HiC map building (juicer_tools pre) whenever your AGP/FASTA file changes. I am not very familiar with HapHiC. What is the format of the HiC mapping file? If it is a BAM or BED, you should be able to seamlessly run juicer pre with it. Chenxi |
Hello C-Zhou/Yahs,
Yes, I have finally got it to work. Thank you. |
https://github.com/phasegenomics/juicebox_scripts |
Hi @Isoris, As I used to run into a lot of problems producing the .assembly file with Juicer, I normally run the mappings and preprocessing with the Dovetail pipeline (for both Omni-C and Hi-C), then produce the contact map with PretextMap and visualize and curate the assembly with PretextView, following the recommendations of the Sanger curation team. PretextView is pretty lightweight, but the resolution is limited to a fixed number of texels, so the larger the genome, the larger the texel size. You can also use HiGlass for increased resolution, but PretextView works very well for most cases. Best |
Hi Chenxi,
It looks like yahs will really speed up our scaffolding efforts - so far the scaffolded fastas are looking great. Awesome work! However, I'm having trouble creating the correct input files for our manual curation phase, editing the scaffolds in Juicebox. Our main goal is to relate the underlying contigs (especially when we use the yahs flag --no-contig-ec) to the assembled scaffolds and hic map.
Using your provided juicebox_pre program and the Juicebox juicer pre, the resulting .hic and .assembly files are not correctly editable in Juicebox. I think SALSA users are having a similar issue: marbl/SALSA#154
I think you can only create a draft assembly for editing with run-assembly-visualizer.sh (https://github.com/aidenlab/3d-dna/blob/master/visualize/run-asm-visualizer.sh).
Normally our workflow looks like this. makeAgpFromFasta and agp2assembly.py are 3d-dna scripts. Matlock is provided by Phase Genomics - similar concept to your juicebox_pre to convert the alignments to alignments_sorted.txt.
It seems like I should be able to substitute the alignments_sorted.txt for phasehic.sorted.links.txt in our workflow, but
alignments_sorted.txt is missing some columns. Maybe we just need to figure out how to fill these columns?
Not sure if this is a very clear question. In short, can you provide any guidance on creating a .assembly and .hic file for the Juicebox run-assembly-visualizer.sh tool?
Thank you!
Amanda