chore: prepare for v0.4.0 #157

Merged
merged 23 commits into from
Jan 2, 2025
e8d4007
Merge pull request #120 from kircherlab/master
visze Sep 19, 2024
e647309
chore: Update development (#128)
visze Oct 17, 2024
b112d9c
feat!: igvf outputs (#129)
visze Oct 28, 2024
2d80faf
chore!: supporting only snakemake >=8.24.1 (#130)
visze Oct 28, 2024
1e9157b
refactor!: No min max length for bbmap. default mapq is 30. (#131)
visze Oct 28, 2024
e08c38a
feat!: outlier removal (#132)
visze Nov 5, 2024
2cf2ccf
Merge branch 'master' into development
visze Nov 5, 2024
58c55c3
edit config
visze Nov 5, 2024
163ccdd
Update conventional-prs.yml
visze Nov 5, 2024
64e8d9f
Merge branch 'master' of https://github.com/kircherlab/MPRAsnakeflow …
visze Nov 5, 2024
02076d9
chore: update development (#142)
visze Nov 20, 2024
f1944b8
Merge branches 'development' and 'master' of https://github.com/kirch…
visze Dec 5, 2024
ccbacee
feat: one dna or RNA count file across multiple replicates (#144)
visze Dec 6, 2024
22ce4d7
fix: plot per bc counts correlation when replicates are more than 3 (…
visze Dec 9, 2024
3c91a6a
feat: add label file to count basic configuration (#147)
bioinformaticsguy Dec 9, 2024
f315aab
feat: strand sensitive option (#146)
visze Dec 9, 2024
fd40828
docs: documentation update
visze Dec 9, 2024
9dc0f6f
docs: formatting
visze Dec 9, 2024
e317560
chore: get latest fix (#151)
visze Dec 17, 2024
56b2254
feat: allowing only FW reads with a UMI (#152)
visze Dec 17, 2024
5ed1ef9
feat: add performance tweaks for resource optimization in workflow ru…
visze Dec 18, 2024
854d293
update release-please github actions
visze Jan 2, 2025
153aaca
chore: master to development (#156)
visze Jan 2, 2025
Original file line number Diff line number Diff line change
@@ -8,8 +8,10 @@
"release-type": "simple",
"bump-minor-pre-major": true,
"bump-patch-for-minor-pre-major": true,
"draft": true,
"prerelease": true
"draft": false,
"prerelease": true,
"tag-prefix": "v",
"include-component-in-tag": false
}
}
}
5 changes: 5 additions & 0 deletions .github/workflows/release-please.yml
@@ -14,3 +14,8 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: googleapis/release-please-action@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          target-branch: ${{ github.ref_name }}
          config-file: .github/release-please-config.json
          manifest-file: .release-please-manifest.json
2 changes: 1 addition & 1 deletion .release-please-manifest.json
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
{
".": "0.3.0"
".": "0.3.1"
}
7 changes: 7 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
# Changelog

## [0.3.1](https://github.com/kircherlab/MPRAsnakeflow/compare/MPRAsnakeflow-v0.3.0...MPRAsnakeflow-v0.3.1) (2024-12-17)


### Bug Fixes

* Wrong experiment count plots in QC report ([#149](https://github.com/kircherlab/MPRAsnakeflow/issues/149)) ([d2be468](https://github.com/kircherlab/MPRAsnakeflow/commit/d2be46891650ff9aaab61f750a4b3bc3b65e3e88))

## [0.3.0](https://github.com/kircherlab/MPRAsnakeflow/compare/MPRAsnakeflow-v0.2.0...MPRAsnakeflow-v0.3.0) (2024-11-20)


4 changes: 4 additions & 0 deletions docs/assignment.rst
Original file line number Diff line number Diff line change
@@ -59,6 +59,10 @@ Example of an assignment file using exact matches and read 1 with BC, linker and
.. literalinclude:: ../config/example_assignment_exact_linker.yaml
:language: yaml


If you want to use the strand-sensitive option (e.g. when testing enhancers in both directions), add the following to the config file: :code:`strand_sensitive: {enable: true}`. Otherwise, MPRAsnakeflow will raise an error because it cannot handle the same sequence in both sense and antisense direction. The underlying problem is that the mappers do not consider the strand, so they always report such reads as ambiguous due to multiple matches.
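
The following Python sketch illustrates why flanking adapters resolve the ambiguity. It is not MPRAsnakeflow's implementation: the oligo sequence is hypothetical, the helper names are made up for illustration, and the adapters are the defaults documented for :code:`forward_adapter` and :code:`reverse_adapter` in the config reference.

```python
# Default adapters as documented for the strand_sensitive config option.
FORWARD_ADAPTER = "AGGACCGGATCAACT"
REVERSE_ADAPTER = "TCGGTTCACGCAATG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def add_adapters(seq: str) -> str:
    """Flank a sequence with the forward (5') and reverse (3') adapter."""
    return FORWARD_ADAPTER + seq + REVERSE_ADAPTER


# Hypothetical oligo that is tested in both directions.
sense = "ACGTTGCAAGGCTA"
antisense = reverse_complement(sense)

# Without adapters, the reverse strand of one oligo equals the other oligo,
# so a strand-agnostic mapper sees two equally good matches.
ambiguous_without = reverse_complement(antisense) == sense

# With adapters, the two references differ on both strands.
tagged_sense = add_adapters(sense)
tagged_antisense = add_adapters(antisense)
ambiguous_with = tagged_sense in (
    tagged_antisense,
    reverse_complement(tagged_antisense),
)

print(ambiguous_without, ambiguous_with)
```

The adapters only help because the forward adapter is not the reverse complement of the reverse adapter. Custom adapters should keep that property.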


snakemake
============================

27 changes: 27 additions & 0 deletions docs/cluster.rst
Original file line number Diff line number Diff line change
@@ -22,6 +22,33 @@ Having 30 cores and 10GB of memory.

snakemake --sdm conda --configfile config/config.yaml -c 30 --resources mem_mb=10000 --workflow-profile profiles/default

Performance tweaks: Running specific rules with different resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some of the rules will benefit from multithreading or more memory. This can be specified within your profile, your workflow profile, or on the command line using :code:`--set-resources RULE_NAME:RESOURCE_NAME=VALUE` or :code:`--set-threads RULE_NAME=VALUE`. Before changing resources, make sure that the rule is actually needed by doing a dry run that lists only the executed rules: :code:`snakemake -n --quiet rules`.

Possible rules to tweak:

:Assignment:

:assignment_hybridFWRead_get_reads_by_cutadapt:
Only needed when using the linker option in the config. You can add more threads using :code:`--set-threads assignment_hybridFWRead_get_reads_by_cutadapt=4`. Default is always 1 thread.

:assignment_mapping_bbmap:
Only needed when using bbmap for mapping. Memory and threads can be optimized e.g. via :code:`--set-threads assignment_mapping_bbmap=30 --set-resources assignment_mapping_bbmap:mem_mb=10000`. Default is 1 thread and 4GB memory, but we recommend using 30 threads and 10GB if available.

:assignment_mapping_bwa:
Only needed when using bwa for mapping. Memory and threads can be optimized e.g. via :code:`--set-threads assignment_mapping_bwa=30 --set-resources assignment_mapping_bwa:mem_mb=10000`. Default is 1 thread, but we recommend using 30 threads and 10GB if available.

:assignment_collectBCs:
Threads can be optimized e.g. via :code:`--set-threads assignment_collectBCs=30`. Default is 1 thread, but we recommend using 30 threads if available.

:Experiment:

:counts_onlyFW_raw_counts_by_cutadapt:
Only needed when you have only FW reads and use the adapter option. Threads can be optimized e.g. via :code:`--set-threads counts_onlyFW_raw_counts_by_cutadapt=30`. Default is 1 thread.
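
As a rough sketch of how the per-rule options above combine into one call, the Python helper below assembles the :code:`--set-threads` and :code:`--set-resources` arguments from two dictionaries and appends them to the example command from this page. The helper name :code:`tuning_arguments` is hypothetical, and the script only prints the command instead of running snakemake.

```python
import shlex


def tuning_arguments(threads, resources):
    """Build --set-threads / --set-resources arguments for snakemake.

    threads:   {rule_name: thread_count}
    resources: {rule_name: {resource_name: value}}
    """
    args = []
    if threads:
        args.append("--set-threads")
        args.extend(f"{rule}={count}" for rule, count in threads.items())
    if resources:
        args.append("--set-resources")
        args.extend(
            f"{rule}:{name}={value}"
            for rule, rule_resources in resources.items()
            for name, value in rule_resources.items()
        )
    return args


# Base command as used earlier on this page, extended with per-rule tweaks.
command = [
    "snakemake", "--sdm", "conda",
    "--configfile", "config/config.yaml",
    "-c", "30",
    "--resources", "mem_mb=10000",
    "--workflow-profile", "profiles/default",
] + tuning_arguments(
    threads={"assignment_mapping_bbmap": 30, "assignment_collectBCs": 30},
    resources={"assignment_mapping_bbmap": {"mem_mb": 10000}},
)

# Print the command for inspection; nothing is executed here.
print(shlex.join(command))
```

Keeping the tweaks in dictionaries makes it easy to drop entries for rules that the dry run shows are not executed.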


Running on an HPC using SLURM
-----------------------------

26 changes: 18 additions & 8 deletions docs/config.rst
Original file line number Diff line number Diff line change
@@ -4,11 +4,11 @@
Config File
=====================

The config file is a yaml file that contains the configuration. Different runs can be configured. We recommend using one config file per MPRA experiment or MPRA project. But in theory, many different experiments can be configured in only one file. It is divided into :code:`version` (version of MPRAsnakeflow used), :code:`assignments` (assigment workflow), and :code:`experiments` (count workflow). This is a full example file with default configurations. :download:`config/example_config.yaml <../config/example_config.yaml>`.
The config file is a yaml file that contains the configuration. Different runs can be configured. We recommend using one config file per MPRA experiment or MPRA project. But in theory, many different experiments can be configured in only one file. It is divided into :code:`version` (version of MPRAsnakeflow used), :code:`assignments` (assignment workflow), and :code:`experiments` (count workflow). This is a full example file with default configurations. :download:`config/example_config.yaml <../config/example_config.yaml>`.

.. literalinclude:: ../config/example_config.yaml
:language: yaml
:linenos:
:language: yaml
:linenos:


Note that the config file is controlled by json schema. This means that the config file is validated against the schema. If the config file is not valid, the program will exit with an error message. The schema is located in :download:`workflow/schemas/config.schema.yaml <../workflow/schemas/config.schema.yaml>`.
@@ -17,15 +17,15 @@ Note that the config file is controlled by json schema. This means that the conf
Version settings
----------------

Set the version of the of MPRAsnakeflow this configuration is used. This is important for future updates. The version is used to check if the config file is compatible with the current version of the workflow. If the version is not the same the workflow will exit with an error message.
Set the version of MPRAsnakeflow that this configuration is written for. This is important for future updates. The version is used to check whether the config file is compatible with the current version of the workflow. If the versions are not compatible, the workflow will exit with an error message.

.. literalinclude:: ../workflow/schemas/config.schema.yaml
:language: yaml
:start-after: start_version
:end-before: start_assignments
:language: yaml
:start-after: start_version
:end-before: start_assignments

:version:
A a string like "0.2.0" or "1.2". When major version "0" is used the minor version should fit with MPRAsnakeflow, e.g. "0.2.0" is compatible with MPRAsnakeflow 0.2.0. as well as 0.2.1 or 0.2.2. When major version greater 0 used then the major version have to fith with MPRAsnakeflow. E.g. config of "1.2.1" fits also with MPRAsnakeflow 1.7 or 1.0.
A string like "0.2.0" or "1.2". When major version "0" is used, the minor version should match MPRAsnakeflow, e.g. "0.2.0" is compatible with MPRAsnakeflow 0.2.0 as well as 0.2.1 or 0.2.2. When a major version greater than 0 is used, only the major version has to match MPRAsnakeflow, e.g. a config with "1.2.1" also fits MPRAsnakeflow 1.7 or 1.0.
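
The compatibility rule described for :code:`version` can be sketched in a few lines of Python. This is an illustration of the rule as documented, not the check that MPRAsnakeflow actually runs, and the function names are hypothetical.

```python
def major_minor(version: str) -> tuple:
    """Split a version string like '0.2.0' or '1.2' into (major, minor)."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) < 2:
        parts.append(0)  # treat a bare "1" as "1.0"
    return parts[0], parts[1]


def config_is_compatible(config_version: str, workflow_version: str) -> bool:
    """Apply the documented rule: for major version 0 the minor version
    must match as well, otherwise only the major version has to match."""
    config_major, config_minor = major_minor(config_version)
    workflow_major, workflow_minor = major_minor(workflow_version)
    if config_major != workflow_major:
        return False
    if config_major == 0:
        return config_minor == workflow_minor
    return True


for config, workflow in [("0.2.0", "0.2.1"), ("0.2.0", "0.3.0"), ("1.2.1", "1.7")]:
    print(config, workflow, config_is_compatible(config, workflow))
```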

--------------------
Assignment workflow
@@ -95,6 +95,16 @@ For each assignment you want to process you have to give him a name like :code:`
(Optional) Using a simple dictionary to find identical sequences. This is faster but uses only the whole (or center part depending on start/length) of the design file. Cannot find substrings as part of any sequence. Set to false for more correct, but slower, search. Default :code:`true`.
:sequence_collitions:
(Optional) Check if there are identical sequences in the design file. Default :code:`true`.
:strand_sensitive:
(Optional) If enabled, the reads are mapped to the oligos in a strand-sensitive way by adding unique adapters to both ends of the oligo reference as well as to the reads in the FASTQ files. MPRAsnakeflow is then able to distinguish between sense and antisense. By default this option is not enabled.

:enable:
(Optional) If set to :code:`true` the strand-sensitive mapping is enabled. Default is :code:`false`.
:forward_adapter:
(Optional) Adapter sequence added 5' of the oligo. Default is :code:`AGGACCGGATCAACT`.
:reverse_adapter:
(Optional) Adapter sequence added 3' of the oligo. Default is :code:`TCGGTTCACGCAATG`.


:configs:
After mapping the reads to the design file and extracting the barcodes per oligo, the configuration (using different names) can be used to generate multiple filtering and configuration settings of the final mapping oligo to barcode. Use `<your_config_name>: {}` to use the default values for the keys. Each configuration is a dictionary with the following keys:
3 changes: 3 additions & 0 deletions docs/experiment.rst
Original file line number Diff line number Diff line change
@@ -31,6 +31,9 @@ We allow different flavours of experiment files because sometimes no UMI exists
* :code:`Condition,Replicate,DNA_BC_F,RNA_BC_F`


It is possible to use only one count file per condition across replicates (DNA or RNA, although this usually only makes sense for DNA), e.g. if you expect the same number of inserts/transfections across replicates. If you use the same files for :code:`DNA` or :code:`RNA` across replicates, MPRAsnakeflow will only run the first replicate and reuse its counts for all other replicates.
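
To make the shared-file layout concrete, the Python sketch below parses a small experiment file in the :code:`Condition,Replicate,DNA_BC_F,RNA_BC_F` flavour and reports which replicates share a count file. The condition and file names are hypothetical, and the grouping logic only illustrates the idea; how MPRAsnakeflow detects shared files internally is not shown here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical experiment file: one DNA file shared by all three replicates.
EXPERIMENT_CSV = """Condition,Replicate,DNA_BC_F,RNA_BC_F
HepG2,1,dna_shared.fastq.gz,rna_rep1.fastq.gz
HepG2,2,dna_shared.fastq.gz,rna_rep2.fastq.gz
HepG2,3,dna_shared.fastq.gz,rna_rep3.fastq.gz
"""


def shared_count_files(csv_text: str, column: str) -> dict:
    """Map (condition, file) to the replicates that use that same file.

    Only files used by more than one replicate are returned.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["Condition"], row[column])].append(row["Replicate"])
    return {key: reps for key, reps in groups.items() if len(reps) > 1}


print(shared_count_files(EXPERIMENT_CSV, "DNA_BC_F"))
print(shared_count_files(EXPERIMENT_CSV, "RNA_BC_F"))
```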


Assignment File or configuration
--------------------------------
Tab separated gzipped file with barcode mapped to sequence. Can be generated using the :ref:`Assignment` workflow. Config file must be configured similar to this:
25 changes: 11 additions & 14 deletions docs/faq.rst
Original file line number Diff line number Diff line change
@@ -7,32 +7,29 @@ Frequently Asked Questions
If you have more questions, please open a ticket on `github <https://github.com/kircherlab/MPRAsnakeflow/issues>`_.


Is it possible to differntiate beteween sense and antisense?
No! Or not directly. The reason why we are not able to do this is that reads will map to both sequence strands equally. Then assignment of the barcode becomes ambigous and is discarded. But when dsigning oligos you can add short sequence fragment on the start and on the end of the sequence that ar edifferent sense and antisense. These sequences should not be trimmed away during demultiplexing and have to be in the design file. For the lentiMPRA dsign we have 15bp adpaters on both ends for integration of the sequence. They can be used for that purpose.
Is it possible to differentiate between sense and antisense?
Usually not, because reads map to both sequence strands equally. The assignment of the barcode then becomes ambiguous and the barcode is discarded. As a workaround, MPRAsnakeflow can add unique adapter sequences to both ends of the oligos, in the reference FASTA as well as in the FASTQ reads. With these adapters, all mapping strategies should be able to differentiate between sense and antisense. To enable this, set :code:`strand_sensitive: {enable: true}` in the config.

The design/reference file check faild, why?
The design/reference file check failed, why?
The design file has to have:
* Unique headers. Each sequence has to have a unique sequence/id strating from :code:`>` to the first whitespace or newline.
* No special characters within the headers. This is because mapping tools create a reference dictionary and cannot handle all characters. In addition most databases (like SRA) have their restricted character set for the header.
* Unique sequences. They have to be different. Otherwise mapper place the read to both IDs and the barcode get ambigous and is discarded. Wenn you allow min/max start/lengths for sequences (e.g. in BWA mapping) be aware that the smalles substring has to be unqiue across all other (sub) sequences.


* Unique headers. Each sequence has to have a unique sequence/id starting from :code:`>` to the first whitespace or newline.
* No special characters within the headers. This is because mapping tools create a reference dictionary and cannot handle all characters. In addition, most databases (like SRA) have their restricted character set for the header.
* Unique sequences. They have to be different in both sense and antisense direction. Otherwise, the mapper assigns the read to both IDs, and the barcode becomes ambiguous and is discarded. When you allow min/max start/lengths for sequences (e.g. in BWA mapping), be aware that the smallest substring has to be unique across all other (sub)sequences. If you have antisense collisions and want to keep the strand sensitivity, enable the option :code:`strand_sensitive: {enable: true}` in the config file (see the previous question).
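
The header and sequence requirements above can be checked before running the workflow with a small script. The Python sketch below is only an approximation of such a check, not MPRAsnakeflow's own design check: it covers duplicate headers and sense/antisense collisions of full-length sequences, and it skips the special-character rule because the allowed character set is not specified here. All names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_fasta(text: str) -> list:
    """Return (id, sequence) pairs; the id ends at the first whitespace."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def check_design(records: list) -> list:
    """Report duplicate headers and sense/antisense sequence collisions."""
    problems, seen_ids, seen_seqs = [], set(), {}
    for seq_id, seq in records:
        if seq_id in seen_ids:
            problems.append(f"duplicate header: {seq_id}")
        seen_ids.add(seq_id)
        if seq in seen_seqs and seen_seqs[seq] != seq_id:
            problems.append(f"{seq_id} matches {seen_seqs[seq]} in sense direction")
        elif reverse_complement(seq) in seen_seqs:
            other = seen_seqs[reverse_complement(seq)]
            if other != seq_id:
                problems.append(f"{seq_id} matches {other} in antisense direction")
        seen_seqs.setdefault(seq, seq_id)
    return problems


# Hypothetical design file with one antisense collision and one repeated header.
DESIGN = """>oligo1 first test sequence
AAACCCGG
>oligo2
CCGGGTTT
>oligo3
GATTACAG
>oligo3
TTTTGGCA
"""

for problem in check_design(parse_fasta(DESIGN)):
    print(problem)
```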

MPRAsnakeflow is not able to create a Conda environment
If you get a message like::
If you get a message like:

Caused by: json.decoder.JSONDecodeError: Extra data: line 1 column 2785 (char 2784)#

Try to do the following steps ::
Try to do the following steps:

rm -r .snakemake/metadata .snakemake/incomplete

Afterwards try MPRAsnakeflow again. If the above error still occurs, rerun after deleting the entire ``.snakemake`` folder.


Afterwards try MPRAsnakeflow again. If the above error still occurs, rerun after deleting the entire :code:`.snakemake` folder.

Can I use STARR-seq with MPRAsnakeflow?
No! Not yet ;-)


The pipeline is giving an error **"BUG: Out of jobs ready to be started, but not all files built yet."** and won't run. How can I fix this?
Please update snakemake, as this error is highly likely to have occured from snakemake internal issues.
Please update snakemake, as this error is highly likely to have occurred from snakemake internal issues.
1 change: 1 addition & 0 deletions resources/count_basic/config.yml
Original file line number Diff line number Diff line change
@@ -12,6 +12,7 @@ experiments:
type: file
assignment_file: SRR10800986_barcodes_to_coords.tsv.gz
design_file: design.fa
label_file: labels.tsv
configs:
default: {}
outlierZscore: