README.Rmd

---
title: 'HistonePTM <img src="man/figures/logo.png" align="right" width="240" height="277"/>'
#author: Hassan Hijazi
date: "`r format(Sys.time(), '%d %B %Y')`"
output: 
  github_document:
    toc: true
    toc_depth: 2
---

<!-- README.md is generated from README.Rmd. Please edit that file -->


```{r, include = FALSE, include_graphics = TRUE }
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/logo.png",
  out.width = "100%"
)

```


<!-- badges: start -->

<!-- badges: end -->

## Overview

The goal of `histonePTM` is to make histone PTM analysis less tedious by offering a whole workflow analysis or functions that help build a workflow based on whatever software you are using.

Not only this, other functions allow retreiving data from the internet, manipulate `mgf` files, visualize results in addition to some quality control assessments.

Some functions rely heavily on other functions from well-established packages.

## Installation

You can install the development version of histonePTM from [GitHub](https://github.com/) with:

``` r
# install.packages("remotes")
remotes::install_github("HijaziHassan/histonePTM") # you only have to run the code once to install it on 
                                                  #your hard disk. After that use `library(histonePTM)`.
```
## Contributing

Any contribution is very welcomed. The first version is more adapted to `Proline` software output. But it tried to generalize each function to be generic and very flexible to be applicable for other software outputs. 

## Getting help

If you encouter any bug, a problem, a weired behavior, or have a feature request, please open an
[issue](https://github.com/HijaziHassan/histonePTM/issues).

If you would like to discuss questions related to histone analysis using mass spectrometry, please open a discussion here [discussion](https://github.com/HijaziHassan/histonePTM/discussions).

## Workflows
### Analysis of DDA results using `analyzeHistone()`

If you are using [`Proline`](https://www.profiproteomics.fr/proline/)
 software to validate identifications (IDs) resulted from search engines such as `Mascot`, the function `analyzeHistone()` can:

 * **Isolate** histone peptides based on user-defined histone protein(s).
 * **Normalize** intensities to the total area or intensity within peptide families or total filttered peptides.
 * **Abbreviate** histone peptides.
 * **Rename** PTMs strings into [Proforma](https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00771), [Brno nomenclature](https://www.nature.com/articles/nsmb0205-110), or other more simplistic representation.
 * **Calculate** mean, standard deviation, and coefficient of variation for each ID in each condition.
 * **Remove** and **store** duplications.
 * **Mark** with (**\***) and **store** IDs where the software assigns the same peak apex (i.e. same intensities) to isobaric positional isomers (i.e. K18acK23un and K18unK23ac) which nearly co-elute. 
 * **Filter** unwanted and/or IDs that are -more often than not- false positives (e.g. H3 K37mod).
 * **Filter** some IDs if they are not quantified in a user-defined number of samples. 
 
 This results in 3 Excel file:
 
 * **File 1**: Containing the raw data with several sheets. Sheet 1 contains the raw data of isolated histone peptides without any transformation. The rest of the sheets are filtered data from the original data in the first sheet . E.g. Only N-terminally labeled peptides.
 
 * **File(s) 2**: Peptide-centric. An excel file per histone protein with each sheet containing IDs from the same peptide.
 
 * **File 3**: PTM-centric. An excel file summarizing PTMs with each sheet containing IDs with specific PTM.
 
 All this with flexibility to:
 
 - choose only to analyze (and output results) of user-defined histone protein (e.g. only `H3`).
 - filter IDs with cut-off threshold of missing values.
 - output **File(s) 2** with either removing all unlabeled `me1`, `K37mod` (for H3K27R40 peptide) or both.
 - group **File(s) 2** into one file or save each protein results in a separate file. 
 
**Pre-requisites**

* `Proline` excel output file containing the sheets:

 + `Best PSM from protein sets` which includes IDs and their intensities in each sample. This assumes that IDs with multiple charge states are already summed using post-processing functionality inside `Proline`.
 
 + `Search settings and infos` which includes information about `RAW` files' names and their corresponding search result files' names.
 
* An excel file containing at least three columns:

  + `SampleName`: custom samples names
  + `file`: names of `RAW` files.
  + `Condition`: concentration, WT vs disease ...
other recognized optional columns: `BioReplicate`, and/or `TechReplicate` depending on the experimental design.

For further detailed of this fucntion and other use `?` behind the function name without paraenthesis in R console to get the full documentation (i.e. `?analyzeHistone`).
```{r analyze DDA, warning=FALSE, message=FALSE}
library(histonePTM)

# analyzeHistone(analysisfile, # file name
#                 metafile, #metafile name
#                 hist_prot= c('All','H3', 'H4', 'H2A', 'H2B'), #choose one these options
#                 labeling = c("PA", "TMA", "PIC_PA", "none") # allow reversing labeling when renaming PTMs
#                 NA_threshold, #numeric #optional
#                 norm_method = c('peptide_family', 'peptide_total'),
#                 extra_filter = c("none", 'no_me1', "K37un", "no_me1_K37un"), #optional
#                 output_result= c('single', 'multiple'), #optional
               

```
Some functions used to build-up this workflow among others are shown below:

# 1. PTMs

Rename PTM strings from `Proline` or `Skyline` to have a shorthanded representation.

### 1.1 `ptm_beautify()`

#### Proline

```{r beautify Proline PTM}

#PTM from Proline export, from 'modifications' column of sheet 'Best PSM from protein sets'.
PTM_Proline <- 'Propionyl (Any N-term); Propionyl (K1); Butyryl (K10); Butyryl (K11)'

ptm_beautify(PTM_Proline, lookup = histptm_lookup, software = 'Proline', residue = 'keep')

 
ptm_beautify(PTM_Proline, lookup = histptm_lookup, software = 'Proline', residue = 'remove')


```

### Skyline

Skyline PTMs are enclosed between square brackets (e.g. [+28.0313]) and sometimes they are rounded (e.g [+28]). We don't support rounded numbers since some PTMs like [Ac] and [3Me] are rounded to the same number: +42. Use instead: 'Peptide Modified Sequence Monoisotopic Masses' column. Modified peptides in the 'isolation list' output file ('Comment' column) from Skyline always contains monoisotopic masses of PTMs as well.

```{r beautify Skyline PTM}
PTM_Skyline <- "K[+124.05243]SVPSTGGVK[+56.026215]K[+56.026215]PHR"
 

ptm_beautify(PTM_Skyline, lookup = shorthistptm_mass, software = 'Skyline', residue = 'keep')

ptm_beautify(PTM_Skyline, lookup = shorthistptm_mass, software = 'Skyline', residue = 'remove')


```

### 1.2 `misc_clearLabeling`

Remove the chemical labeling like `propionyl` (PA) or `TMA` which are not biologically relevant.

```{r Clear chemical labeling}
misc_clearLabeling("prNt-cr-pr-pr", labeling = "PA")
```

### 1.3 `ptm_toProForma()`

Convert PTM string to <a href="https://www.psidev.info/proforma">ProForma</a> ProForma (Proteoform and Peptidoform Notation)

```{r Convert PTMed peptide to ProForma}
histonePTM::ptm_toProForma(seq = "KSAPATGGVKKPHR",
                mod = "Propionyl (Any N-term); Lactyl (K1); Dimethyl (K10); Propionyl (K11)")

ptm_toProForma(seq = "KSAPATGGVKKPHR",
               mod = "TMAyl_correct (Any N-term); Butyryl (K1); Trimethyl (K10); Propionyl (K11)")

ptm_toProForma(  seq = "KQLATKVAR",
                 mod = "Propionyl (Any N-term); Propionyl (K1); Propionyl (K6)")


```

### 1.4 `ptm_labelingAssessment()`

Lysine derivatization can go rogue and can label other residues such as S, T, and Y. When using propionic anhydride, this is called ' Overpropionylation'. Hydroxylamine is used to remove this adventitous labeling, so-called "reverse propionylation'. This function help for a quick visual review to see if overpropionylation is limited or enormous.

This for sure assumes that the database search results was run with `Propionyl (STY)` or any other labeling modification as varaible modification.