larger genome size than expected #2354

Closed
yaradua opened this issue Dec 13, 2024 · 10 comments
Comments

@yaradua

yaradua commented Dec 13, 2024

Hello, thank you once again for developing canu.

I am trying to assemble a plant genome with 6.13% heterozygosity and high repeat content. The estimated genome size is 1.486 Gb, but Canu 2.2 gives me an assembly of about 9 Gb.

my code is: ./canu -p myassembly -d canu_assembly maxThreads=32 genomeSize=1.5g -pacbio-hifi /nfs_fs/nfs4/Samaila/project/GingerGenome/SH/HiFi_Hic/HIFI_DATA/ShHIFI.fasta.gz

I have included the assembly report below. I would appreciate your help improving the assembly with Canu 2.2.

[TRIMMING/READS]
--
-- In sequence store './myassembly.seqStore':
--   Found 5990887 reads.
--   Found 101140799396 bases (67.42 times coverage).
--    Histogram of corrected reads:
--    
--    G=101140799396                     sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        22861    386971  10114084412  ||       1353-2570          730|-
--    00020        20318    859057  20228161291  ||       2571-3788         1526|-
--    00030        18780   1377918  30342240563  ||       3789-5006         1720|-
--    00040        17634   1934290  40456328953  ||       5007-6224         2086|-
--    00050        16688   2524251  50570404013  ||       6225-7442         3896|-
--    00060        15868   3146055  60684487549  ||       7443-8660         7127|-
--    00070        15132   3798939  70798574464  ||       8661-9878        11375|-
--    00080        14439   4483202  80912649622  ||       9879-11096       18774|--
--    00090        13701   5201620  91026723632  ||      11097-12314       73590|----
--    00100         1353   5990886 101140799396  ||      12315-13532      526603|-----------------------------
--    001.000x             5990887 101140799396  ||      13533-14750     1174451|---------------------------------------------------------------
--                                               ||      14751-15968     1104842|------------------------------------------------------------
--                                               ||      15969-17186      866956|-----------------------------------------------
--                                               ||      17187-18404      655268|------------------------------------
--                                               ||      18405-19622      476288|--------------------------
--                                               ||      19623-20840      335812|-------------------
--                                               ||      20841-22058      231915|-------------
--                                               ||      22059-23276      158130|---------
--                                               ||      23277-24494      107265|------
--                                               ||      24495-25712       72706|----
--                                               ||      25713-26930       49630|---
--                                               ||      26931-28148       33638|--
--                                               ||      28149-29366       23243|--
--                                               ||      29367-30584       16131|-
--                                               ||      30585-31802       11215|-
--                                               ||      31803-33020        7702|-
--                                               ||      33021-34238        5323|-
--                                               ||      34239-35456        3826|-
--                                               ||      35457-36674        2682|-
--                                               ||      36675-37892        1936|-
--                                               ||      37893-39110        1322|-
--                                               ||      39111-40328         897|-
--                                               ||      40329-41546         675|-
--                                               ||      41547-42764         474|-
--                                               ||      42765-43982         346|-
--                                               ||      43983-45200         242|-
--                                               ||      45201-46418         178|-
--                                               ||      46419-47636         117|-
--                                               ||      47637-48854          87|-
--                                               ||      48855-50072          61|-
--                                               ||      50073-51290          30|-
--                                               ||      51291-52508          24|-
--                                               ||      52509-53726          16|-
--                                               ||      53727-54944           8|-
--                                               ||      54945-56162           5|-
--                                               ||      56163-57380          10|-
--                                               ||      57381-58598           3|-
--                                               ||      58599-59816           2|-
--                                               ||      59817-61034           2|-
--                                               ||      61035-62252           2|-
--

[UNITIGGING/READS]
--
-- In sequence store './myassembly.seqStore':
--   Found 5990887 reads.
--   Found 101140799396 bases (67.42 times coverage).
--    Histogram of corrected-trimmed reads:
--    
--    G=101140799396                     sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        22861    386971  10114084412  ||       1353-2570          730|-
--    00020        20318    859057  20228161291  ||       2571-3788         1526|-
--    00030        18780   1377918  30342240563  ||       3789-5006         1720|-
--    00040        17634   1934290  40456328953  ||       5007-6224         2086|-
--    00050        16688   2524251  50570404013  ||       6225-7442         3896|-
--    00060        15868   3146055  60684487549  ||       7443-8660         7127|-
--    00070        15132   3798939  70798574464  ||       8661-9878        11375|-
--    00080        14439   4483202  80912649622  ||       9879-11096       18774|--
--    00090        13701   5201620  91026723632  ||      11097-12314       73590|----
--    00100         1353   5990886 101140799396  ||      12315-13532      526603|-----------------------------
--    001.000x             5990887 101140799396  ||      13533-14750     1174451|---------------------------------------------------------------
--                                               ||      14751-15968     1104842|------------------------------------------------------------
--                                               ||      15969-17186      866956|-----------------------------------------------
--                                               ||      17187-18404      655268|------------------------------------
--                                               ||      18405-19622      476288|--------------------------
--                                               ||      19623-20840      335812|-------------------
--                                               ||      20841-22058      231915|-------------
--                                               ||      22059-23276      158130|---------
--                                               ||      23277-24494      107265|------
--                                               ||      24495-25712       72706|----
--                                               ||      25713-26930       49630|---
--                                               ||      26931-28148       33638|--
--                                               ||      28149-29366       23243|--
--                                               ||      29367-30584       16131|-
--                                               ||      30585-31802       11215|-
--                                               ||      31803-33020        7702|-
--                                               ||      33021-34238        5323|-
--                                               ||      34239-35456        3826|-
--                                               ||      35457-36674        2682|-
--                                               ||      36675-37892        1936|-
--                                               ||      37893-39110        1322|-
--                                               ||      39111-40328         897|-
--                                               ||      40329-41546         675|-
--                                               ||      41547-42764         474|-
--                                               ||      42765-43982         346|-
--                                               ||      43983-45200         242|-
--                                               ||      45201-46418         178|-
--                                               ||      46419-47636         117|-
--                                               ||      47637-48854          87|-
--                                               ||      48855-50072          61|-
--                                               ||      50073-51290          30|-
--                                               ||      51291-52508          24|-
--                                               ||      52509-53726          16|-
--                                               ||      53727-54944           8|-
--                                               ||      54945-56162           5|-
--                                               ||      56163-57380          10|-
--                                               ||      57381-58598           3|-
--                                               ||      58599-59816           2|-
--                                               ||      59817-61034           2|-
--                                               ||      61035-62252           2|-
--

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2  30985040 *****                                                                  0.0196 0.0009
--       3-     4  35159661 ******                                                                 0.0301 0.0016
--       5-     7 124939887 ************************                                               0.0592 0.0045
--       8-    11 352810878 *******************************************************************    0.1673 0.0216
--      12-    16 363621464 ********************************************************************** 0.4034 0.0764
--      17-    22 192445958 *************************************                                  0.6012 0.1402
--      23-    29 153248349 *****************************                                          0.7125 0.1900
--      30-    37  93776641 ******************                                                     0.8033 0.2435
--      38-    46  55696136 **********                                                             0.8576 0.2842
--      47-    56  36902701 *******                                                                0.8908 0.3153
--      57-    67  25825067 ****                                                                   0.9131 0.3409
--      68-    79  19016997 ***                                                                    0.9288 0.3625
--      80-    92  15287538 **                                                                     0.9405 0.3816
--      93-   106  12944872 **                                                                     0.9500 0.3998
--     107-   121  10748866 **                                                                     0.9580 0.4176
--     122-   137   8358883 *                                                                      0.9647 0.4344
--     138-   154   6400843 *                                                                      0.9699 0.4493
--     155-   172   5035525                                                                        0.9739 0.4621
--     173-   191   4142123                                                                        0.9770 0.4734
--     192-   211   3515147                                                                        0.9796 0.4839
--     212-   232   3007409                                                                        0.9818 0.4937
--     233-   254   2555652                                                                        0.9837 0.5029
--     255-   277   2173946                                                                        0.9853 0.5115
--     278-   301   1869497                                                                        0.9866 0.5195
--     302-   326   1629696                                                                        0.9878 0.5270
--     327-   352   1431970                                                                        0.9888 0.5341
--     353-   379   1262322                                                                        0.9897 0.5409
--     380-   407   1115819                                                                        0.9905 0.5473
--     408-   436    996369                                                                        0.9912 0.5534
--     437-   466    891564                                                                        0.9919 0.5592
--     467-   497    801131                                                                        0.9924 0.5648
--     498-   529    723265                                                                        0.9929 0.5702
--     530-   562    653385                                                                        0.9934 0.5754
--     563-   596    593581                                                                        0.9938 0.5803
--     597-   631    542547                                                                        0.9942 0.5851
--     632-   667    494890                                                                        0.9945 0.5898
--     668-   704    453332                                                                        0.9948 0.5942
--     705-   742    415049                                                                        0.9951 0.5985
--     743-   781    382287                                                                        0.9954 0.6027
--     782-   821    354076                                                                        0.9956 0.6068
--
--           0 (max occurrences)
-- 71723456436 (total mers, non-unique)
--  1579781536 (distinct mers, non-unique)
--           0 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing      12167    0.20    14317.81 +- 4191.70       1787.55 +- 1716.78    (bad trimming)
--   middle-hump           440    0.01    15402.44 +- 3521.88       4431.25 +- 2975.52    (bad trimming)
--   no-5-prime          11084    0.19    12292.86 +- 2901.45       3107.01 +- 3302.55    (bad trimming)
--   no-3-prime           9806    0.16    12278.88 +- 2891.69       3385.36 +- 3389.08    (bad trimming)
--   
--   low-coverage      5512956   92.02    12021.97 +- 2543.36         11.17 +- 3.55       (easy to assemble, potential for lower quality consensus)
--   unique              59323    0.99    11806.22 +- 2457.09         79.16 +- 22.32      (easy to assemble, perfect, yay)
--   repeat-cont         17523    0.29    11350.45 +- 2110.61        660.66 +- 407.88     (potential for consensus errors, no impact on assembly)
--   repeat-dove           195    0.00    18287.39 +- 3572.68        543.63 +- 382.87     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat        133098    2.22    12476.06 +- 2847.39       2838.21 +- 3005.27    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   124718    2.08    11222.25 +- 1772.57                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove    52833    0.88    14046.03 +- 2990.18                             (will end contigs, potential to misassemble)
--   uniq-anchor          3883    0.06    12804.70 +- 2847.98       3858.92 +- 3207.55    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/ERROR RATES]
--  
--  ERROR RATES
--  -----------
--                                                   --------threshold------
--  8196645                      fraction error      fraction        percent
--  samples                              (1e-5)         error          error
--                   --------------------------      --------       --------
--  command line (-eg)                           ->     30.00        0.0300%
--  command line (-ef)                           ->  -----.--      ---.----%
--  command line (-eM)                           ->     30.00        0.0300%
--  mean + std.dev       0.24 +-  12 *     1.90  ->     22.99        0.0230%
--  median + mad         0.00 +-  12 *     0.00  ->      0.00        0.0000%
--  90th percentile                              ->      1.00        0.0010%  (enabled)
--  
--  BEST EDGE FILTERING
--  -------------------
--  At graph threshold 0.0300%, reads:
--    available to have edges:      2395296
--    with at least one edge:       2329624
--  
--  At max threshold 0.0300%, reads:  (not computed)
--    available to have edges:            0
--    with at least one edge:             0
--  
--  At tight threshold 0.0010%, reads with:
--    both edges below error threshold:   2199104  (80.00% minReadsBest threshold = 1863699)
--    one  edge  above error threshold:    103022
--    both edges above error threshold:     27498
--    at least one edge:                  2329624
--  
--  At loose threshold 0.0230%, reads with:
--    both edges below error threshold:   2317432  (80.00% minReadsBest threshold = 1863699)
--    one  edge  above error threshold:     11531
--    both edges above error threshold:       661
--    at least one edge:                  2329624
--  
--  
--  INITIAL EDGES
--  -------- ----------------------------------------
--   3497122 reads are contained
--    182780 reads have no best edges (singleton)
--     21556 reads have only one best edge (spur) 
--              19409 are mutual best
--   2289429 reads have two best edges 
--              52603 have one mutual best edge
--            2235261 have two mutual best edges
--  
--  
--  FINAL EDGES
--  -------- ----------------------------------------
--   3497122 reads are contained
--    190265 reads have no best edges (singleton)
--     22744 reads have only one best edge (spur) 
--              22158 are mutual best
--   2280756 reads have two best edges 
--              44840 have one mutual best edge
--            2234448 have two mutual best edges
--  
--  
--  EDGE FILTERING
--  -------- ------------------------------------------
--         0 reads are ignored
--    114772 reads have a gap in overlap coverage
--      3295 reads have lopsided best edges

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      7555 sequences, total length 6289394072 bp (including 673 repeats of total length 13549827 bp).
--   bubbles:      18809 sequences, total length 399959529 bp.
--   unassembled:  219038 sequences, total length 2696939692 bp.
--
-- Contig sizes based on genome size 1.5gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     5700997            23   150666105
--     20     4666252            53   302431210
--     30     4029873            88   451060474
--     40     3707051           127   601753057
--     50     3408609           169   750572686
--     60     3129766           215   900722834
--     70     2935664           265  1052078652
--     80     2774274           317  1200258931
--     90     2611927           373  1350357714
--    100     2460897           432  1500370510
--    110     2373506           494  1650033297
--    120     2265816           559  1800711360
--    130     2175333           627  1951628512
--    140     2066184           697  2100311876
--    150     1974436           772  2251760547
--    160     1899508           849  2400642330
--    170     1808305           930  2550355898
--    180     1730959          1015  2700827053
--    190     1648945          1104  2850842659
--    200     1577439          1197  3000902279
--    210     1514977          1294  3150857230
--    220     1440486          1396  3301336781
--    230     1374876          1502  3450576922
--    240     1311302          1614  3600461736
--    250     1246874          1732  3751070648
--    260     1172909          1856  3900960437
--    270     1109757          1987  4050627974
--    280     1050229          2126  4200397701
--    290      996614          2273  4350879349
--    300      939015          2427  4500103201
--    310      876998          2593  4650754507
--    320      814659          2771  4800733168
--    330      753575          2962  4950284319
--    340      700246          3168  5100083347
--    350      642873          3392  5250205385
--    360      577111          3639  5400057395
--    370      514593          3914  5550115804
--    380      444445          4227  5700197630
--    390      372618          4592  5850029397
--    400      284261          5052  6000169662
--    410      178233          5703  6150151905
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      7555 sequences, total length 8813787638 bp (including 673 repeats of total length 18996063 bp).
--   bubbles:      18809 sequences, total length 560388451 bp.
--   unassembled:  219038 sequences, total length 3802335204 bp.
--
-- Contig sizes based on genome size 1.5gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     8395979            16   154116565
--     20     7183343            35   301193335
--     30     6251122            58   455673683
--     40     5737586            83   603890384
--     50     5414886           110   754086273
--     60     5062647           138   900074311
--     70     4779999           169  1052235503
--     80     4503264           201  1200519992
--     90     4256679           236  1353302003
--    100     4080296           272  1503294642
--    110     3923577           309  1651056424
--    120     3728924           348  1800389340
--    130     3612279           389  1950707207
--    140     3466379           432  2103019103
--    150     3351644           476  2252751933
--    160     3254933           521  2401848183
--    170     3154371           568  2552511256
--    180     3065179           616  2701966579
--    190     2977235           665  2850015851
--    200     2857322           717  3001589782
--    210     2772154           770  3150700561
--    220     2687792           825  3300743080
--    230     2599076           882  3451799810
--    240     2525262           940  3600070001
--    250     2441648          1001  3751640275
--    260     2352888          1063  3900366982
--    270     2283847          1128  4050967771
--    280     2216184          1195  4201732596
--    290     2145670          1263  4350101787
--    300     2081436          1334  4500140808
--    310     2007967          1408  4651181374
--    320     1944536          1484  4801498667
--    330     1871322          1562  4950036332
--    340     1817802          1644  5101204087
--    350     1745156          1728  5250388970
--    360     1677609          1816  5401049003
--    370     1609777          1907  5550414854
--    380     1545693          2002  5700385721
--    390     1482338          2102  5851440022
--    400     1434987          2204  6000070210
--    410     1380342          2311  6150442908
--    420     1321087          2422  6300242331
--    430     1254387          2539  6450881562
--    440     1190090          2661  6600060730
--    450     1132622          2791  6750730797
--    460     1072135          2927  6900365153
--    470     1017195          3071  7050591304
--    480      958471          3223  7200822303
--    490      903777          3384  7350627689
--    500      834552          3557  7500587008
--    510      777413          3743  7650237688
--    520      712116          3945  7800308912
--    530      640264          4167  7950422871
--    540      575754          4414  8100262274
--    550      489472          4695  8250318165
--    560      403503          5031  8400224815
--    570      304994          5456  8550275505
--    580      175381          6088  8700106052
--
@skoren
Member

skoren commented Dec 13, 2024

What's the ploidy of this plant? Is the 1.5gb the single haploid size, before accounting for ploidy? I ask because, based on the overlap coverage and the k-mer plot in the report, the coverage you have is between 12-16x, which implies 7-9gb and is in line with the assembly. It's expected that HiFi data will separate and assemble all the haplotypes and generate an assembly larger than the haploid genome size (e.g. for human the asm is 6gb, not 3gb). You can see this in the FAQ: https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help.
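As a rough sanity check of the coverage argument above, the sequence span implied by a depth of coverage is simply total read bases divided by that depth. The sketch below uses the base count from the Canu report; the function name is illustrative and not part of Canu.

```python
# Rough genome-size check: total read bases / depth of coverage.
TOTAL_BASES = 101_140_799_396  # "Found 101140799396 bases" in the Canu report


def implied_size_gb(total_bases: int, coverage: float) -> float:
    """Return the sequence span (in Gb) implied by a given depth of coverage."""
    return total_bases / coverage / 1e9


# 67.42x is the coverage Canu reports for genomeSize=1.5g;
# 16x and 12x bracket the per-copy coverage read off the report.
for cov in (67.42, 16, 12):
    print(f"{cov:>6}x -> {implied_size_gb(TOTAL_BASES, cov):.2f} Gb")
```

At 12-16x the same reads imply roughly 6.3 to 8.4 Gb of assembled sequence, the same order as the 7-9gb quoted above and far from 1.5 Gb.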

If you haven't yet, I'd suggest running genomescope2 on the k-mer histogram for your genome to get its estimate of size and ploidy. You can then rely on a tool like purge_dups to remove the alt loci in the assembly.

@yaradua
Author

yaradua commented Dec 13, 2024

Dear skoren,

Thank you very much for your response. According to genomescope2, the plant is diploid, and its genome size is 1.486 GB.

@skoren
Member

skoren commented Dec 13, 2024

Is that measured from Illumina data or from the same HiFi input? And this is all from one plant tissue/sample, right? You wouldn't expect population variability (as you would with a collection of gametes)?

@yaradua
Author

yaradua commented Dec 13, 2024

I ran genomescope2 on both the Illumina and the HiFi reads: Illumina gives 1.486GB while HiFi gives 1.5GB, both with high heterozygosity of 6.13%. Please, I need your input on this assembly; I have spent a lot of time on it. Thanks

@skoren
Member

skoren commented Dec 13, 2024

I don't see anything in the assembly indicating that there is an issue. The logs are consistent with a much larger genome, so I'm not sure what you can do to change it. Can you share the k-mer histogram file from Canu's run in the unitigging/0-mercounts folder?

Generally, collapsing haplotypes doesn't work on HiFi data. There are a few suggestions in the FAQ for trimming the data, but I don't think they will make any difference here. I think you have multiple haplotypes in the assembly. You can confirm this using Busco or similar core gene counts and, assuming it is due to haplotype separation, your best bet is likely to rely on purge_dups as I initially suggested.
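A minimal sketch of the core-gene check suggested above, assuming the one-line summary format that BUSCO prints in its short summary (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`). The example line is HYPOTHETICAL, not a result from this assembly, and the 0.5 cut-off is an arbitrary illustration rather than a BUSCO or purge_dups recommendation.

```python
import re

# HYPOTHETICAL summary line, for illustration only (not from this assembly).
EXAMPLE = "C:97.0%[S:6.0%,D:91.0%],F:1.0%,M:2.0%,n:1614"

PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line: str) -> dict:
    """Parse a BUSCO-style one-line summary into percentages plus gene count."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError(f"not a BUSCO-style summary line: {line!r}")
    stats = {k: float(v) for k, v in m.groupdict().items() if k != "n"}
    stats["n"] = int(m["n"])
    return stats


def looks_haplotype_duplicated(stats: dict, min_dup_share: float = 0.5) -> bool:
    """True when at least min_dup_share of complete genes are duplicated.

    min_dup_share is an assumed, arbitrary threshold for this sketch.
    """
    return stats["C"] > 0 and stats["D"] / stats["C"] >= min_dup_share


stats = parse_busco(EXAMPLE)
print(stats, looks_haplotype_duplicated(stats))
```

A high complete score in which most genes are duplicated is the pattern expected when both haplotypes are assembled separately; a mostly single-copy result would point elsewhere.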

@yaradua
Author

yaradua commented Dec 13, 2024

[Image: transformed_linear_plot]

@skoren
Member

skoren commented Dec 13, 2024

Ah, that fit looks quite bogus: it's classifying the main peak as error k-mers, and the peak it has identified for the full model doesn't actually exist in the data. So I wouldn't believe that result at all. Have you tried increasing the ploidy to see if you get a better fit?
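One way to locate the main peak independently of the model fit is to take the binned 22-mer counts from the [UNITIGGING/MERS] report and divide each by its bin width, since the bins grow wider at higher occurrence counts. This is only a sketch on the first few bins of that report; the helper names are illustrative.

```python
# Binned 22-mer counts copied from the [UNITIGGING/MERS] report:
# (lowest occurrence, highest occurrence, number of distinct k-mers)
BINS = [
    (2, 2, 30_985_040),
    (3, 4, 35_159_661),
    (5, 7, 124_939_887),
    (8, 11, 352_810_878),
    (12, 16, 363_621_464),
    (17, 22, 192_445_958),
    (23, 29, 153_248_349),
    (30, 37, 93_776_641),
]


def density(lo: int, hi: int, n: int) -> float:
    """Average number of distinct k-mers per occurrence value in the bin."""
    return n / (hi - lo + 1)


def peak_bin(bins):
    """Return the bin with the highest width-normalized k-mer density."""
    return max(bins, key=lambda b: density(*b))


lo, hi, n = peak_bin(BINS)
print(f"densest bin: {lo}-{hi} occurrences, "
      f"~{density(lo, hi, n) / 1e6:.1f}M distinct k-mers per occurrence value")
```

On these numbers the densest bin is 8-11 occurrences, even though the 12-16 bin has the larger raw count, which agrees with the k-cov of 9 used later in this thread.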

@yaradua
Author

yaradua commented Dec 13, 2024

[Attachment: myassembly.ms22.histogram.txt]

Thank you for your time. You are right that the assembler is not identifying the right peaks. I ran hifiasm with several tweaks; the result is always about 7.5Gb, and hifiasm does not identify the correct homozygous and heterozygous coverage. Please find the k-mer counts attached, and please advise me.

Thank you.

@skoren
Member

skoren commented Dec 13, 2024

Thanks, the issue is not the assemblers but the initial genomescope result. The peaks it is identifying and fitting the model to are clearly incorrect, so your assumption of a 1.5gb genome size is incorrect as well. I manually adjusted the k-cov (last parameter) to 9, which is consistent with the peak above, and get the following:
[Image: linear_plot]
This has a much better fit than yours above, but I suspect it is still under-estimating the genome size, based on the shift of the black model from the true peak.

I think both canu and hifiasm are correct here: the genome is over 3gb and diploid, so you get a 7gb+ assembly. That also means you have much lower coverage than you expected for the genome, which likely results in a less continuous assembly. I suspect the core genes will show that almost all of them are complete but duplicated, and you can remove one haplotype with purge_dups. There is nothing on the assembly side to change.
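The "much lower coverage than you expected" point can be made concrete by dividing the same read set by each candidate span. In the sketch below, 1.5 Gb is the genomeSize passed to Canu, while 3.5 Gb and 7.0 Gb are illustrative stand-ins for "over 3gb" and "7gb+", not measured values.

```python
TOTAL_BASES = 101_140_799_396  # total read bases from the Canu report

# Candidate spans in bp. Only the first comes from the command line;
# the other two are assumed round numbers for illustration.
SCENARIOS = {
    "1.5 Gb (genomeSize given to Canu)": 1.5e9,
    "3.5 Gb (illustrative haploid size)": 3.5e9,
    "7.0 Gb (both haplotypes assembled)": 7.0e9,
}


def depth(total_bases: int, span_bp: float) -> float:
    """Mean depth of coverage over a sequence span of span_bp bases."""
    return total_bases / span_bp


for label, span in SCENARIOS.items():
    print(f"{label:<38} {depth(TOTAL_BASES, span):5.1f}x")
```

Under the two-haplotype scenario each assembled copy gets about 14x, which sits inside the 12-16x range read off the report earlier in the thread, instead of the 67x implied by a 1.5 Gb genome.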

@skoren
Member

skoren commented Jan 3, 2025

Idle. The issue appears to be a genome that is truly much larger than the original, incorrect genomescope estimate.

skoren closed this as completed Jan 3, 2025