Merging BIGSIs
You can merge two BIGSIs into one. To do so, you need the config files for both BIGSIs (see https://github.com/iqbal-lab-org/BIGSI/wiki/Constructing-a-BIGSI for an example).
bigsi merge bigsi1.yaml bigsi2.yaml
After merging, all the samples in bigsi2 will appear in bigsi1.
The above method is slow, and each run can only merge two BIGSIs into one. If you want to build a BIGSI for a large number of samples (for example, 100k samples), this will take a long time. Also, as described in https://github.com/iqbal-lab-org/BIGSI/wiki/Constructing-a-BIGSI, constructing a BIGSI requires one file per sample that contains the bloom filters for that sample.
We have extended the construction process to make it more time- and space-efficient, and hence more scalable. For example, it is possible to construct a BIGSI for 100k samples in less than 12 hours using less than 200MB of memory.
Through this new method, starting with a file for each sample that contains the bloom filters, you can merge these bloom filters into bloom matrices. Then you can construct a BIGSI from multiple merged bloom matrices.
The bloom filters for each sample should already have been produced. If not, please refer to https://github.com/iqbal-lab-org/BIGSI/wiki/Constructing-a-BIGSI for how to construct the bloom filters for each sample.
- Decide how many samples you want to merge per run. A few hundred samples per run is ideal.
- Prepare the input files, one per run. Each input file should contain two columns, separated by a tab. The first column is the absolute path to each bloom file. The second column is the corresponding sample name.
- Merge blooms
bigsi merge_blooms --from_file merge.bloom.1-300.in --out_file merged.sample.1-300.bloom --num_rows 28000000
- Repeat steps 2 and 3 (preparing an input file and merging the blooms) for the remaining samples.
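The batching in the steps above can be scripted. The following is a minimal sketch in Python and is not part of BIGSI: the helper name `write_batch_inputs` is hypothetical, and the file-naming pattern is simply borrowed from the example command above. It writes one two-column, tab-separated input file per batch and returns the matching `bigsi merge_blooms` command lines as strings; it does not run them.

```python
import shlex
from pathlib import Path


def write_batch_inputs(samples, batch_size, out_dir, num_rows):
    """Write one input file per batch and return the matching command lines.

    `samples` is a list of (bloom_path, sample_name) pairs.
    The helper name and file naming are illustrative assumptions, not BIGSI API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        first, last = start + 1, start + len(batch)
        in_file = out_dir / f"merge.bloom.{first}-{last}.in"
        out_file = out_dir / f"merged.sample.{first}-{last}.bloom"
        with open(in_file, "w") as handle:
            for bloom_path, sample_name in batch:
                # Column 1: absolute path to the bloom file; column 2: sample name.
                absolute = Path(bloom_path).resolve()
                handle.write(f"{absolute}\t{sample_name}\n")
        # Use the same num_rows for every batch so the matrices can be combined later.
        commands.append(
            "bigsi merge_blooms "
            f"--from_file {shlex.quote(str(in_file))} "
            f"--out_file {shlex.quote(str(out_file))} "
            f"--num_rows {num_rows}"
        )
    return commands
```

For example, calling `write_batch_inputs(samples, 300, "inputs", 28000000)` on 100k samples would write 334 input files (333 full batches of 300 plus one of 100). The returned command lines can then be run one by one or submitted to a job scheduler.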
The steps above produce a set of merged bloom matrices. Each bloom matrix contains the bloom filters for the samples listed in its input file. You can build a single BIGSI index directly from these merged bloom matrices (they don't need to contain the same number of samples, but they do need to have the same number of rows).
- Prepare an input file. The input file should contain two columns, separated by a tab. The first column is the absolute path to each merged bloom matrix. The second column is the corresponding sample names, separated by commas. For example:
merged.sample.1-300.bloom sample1,sample2,...sample300
merged.sample.301-600.bloom sample301,sample302,...sample600
- Prepare a config file. The config file should contain information as described in https://github.com/iqbal-lab-org/BIGSI/wiki/Constructing-a-BIGSI
- Insert the merged bloom matrices into the BIGSI index
bigsi large_build --from_file merged.bloom.all.in --config sample.all.yaml
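The input file for `bigsi large_build` can be generated in the same way. This is a minimal sketch under stated assumptions: the helper name `write_large_build_input` is hypothetical, and the sample names for each matrix are assumed to be supplied in the same order as in the input file that produced that matrix.

```python
from pathlib import Path


def write_large_build_input(merged, out_path):
    """Write the two-column input file listing merged bloom matrices.

    `merged` is a list of (merged_bloom_path, [sample names]) pairs.
    The helper name is an illustrative assumption, not BIGSI API.
    """
    lines = []
    for bloom_path, sample_names in merged:
        if not sample_names:
            raise ValueError(f"no sample names given for {bloom_path}")
        # Column 1: absolute path to the merged bloom matrix.
        absolute = Path(bloom_path).resolve()
        # Column 2: the sample names in that matrix, separated by commas.
        joined = ",".join(sample_names)
        lines.append(f"{absolute}\t{joined}")
    Path(out_path).write_text("\n".join(lines) + "\n")
    return len(lines)
```

The resulting file can then be passed to `--from_file`, together with the config file, as in the command above.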