diff --git a/NEWS.md b/NEWS.md index 951366cc..113ca104 100644 --- a/NEWS.md +++ b/NEWS.md @@ -128,6 +128,7 @@ * The default batchsize of `fetch_pdb()` was changed to 100 (from 200). This was done since more information is retrieved now, which slows to function down and is slightly improved when batch sizes are smaller. * `try_query()` now only retries to retrieve information once if the returned message was "Timeout was reached". In addition, a `timeout` and `accept` argument have been added. * The UniProt database has changed its API, therefore column names have changed as well as the format of data. We adjusted the `fetch_uniprot()` and `fetch_uniprot_proteome()` function accordingly. Please be aware that some columns names might have changed and your code might throw error messages if you did not adjust it accordingly. +* Some typo fixes. Thank you Steffi! # protti 0.3.1 diff --git a/R/barcode_plot.R b/R/barcode_plot.R index fac5a987..a28670cf 100644 --- a/R/barcode_plot.R +++ b/R/barcode_plot.R @@ -123,7 +123,7 @@ barcode_plot <- function(data, size = 0.7 ) + { - if (is.numeric(dplyr::pull(data, {{ colouring }}))){ + if (is.numeric(dplyr::pull(data, {{ colouring }}))) { ggplot2::scale_fill_gradientn(colours = fill_colour_gradient) } else { ggplot2::scale_fill_manual(values = c( diff --git a/R/calculate_diff_abundance.R b/R/calculate_diff_abundance.R index a329f222..d7edd88d 100644 --- a/R/calculate_diff_abundance.R +++ b/R/calculate_diff_abundance.R @@ -3,38 +3,38 @@ #' `r lifecycle::badge('deprecated')` #' This function was deprecated due to its name changing to `calculate_diff_abundance()`. #' -#' @return A data frame that contains differential abundances (\code{diff}), p-values (\code{pval}) -#' and adjusted p-values (\code{adj_pval}) for each protein, peptide or precursor (depending on -#' the \code{grouping} variable) and the associated treatment/reference pair. Depending on the +#' @return A data frame that contains differential abundances (`diff`), p-values (`pval`) +#' and adjusted p-values (`adj_pval`) for each protein, peptide or precursor (depending on +#' the `grouping` variable) and the associated treatment/reference pair. Depending on the #' method the data frame contains additional columns: #' -#' * "t-test": The \code{std_error} column contains the standard error of the differential -#' abundances. \code{n_obs} contains the number of observations for the specific protein, peptide -#' or precursor (depending on the \code{grouping} variable) and the associated treatment/reference pair. +#' * "t-test": The `std_error` column contains the standard error of the differential +#' abundances. `n_obs` contains the number of observations for the specific protein, peptide +#' or precursor (depending on the `grouping` variable) and the associated treatment/reference pair. #' * "t-test_mean_sd": Columns labeled as control refer to the second condition of the -#' comparison pairs. Treated refers to the first condition. \code{mean_control} and \code{mean_treated} -#' columns contain the means for the reference and treatment condition, respectively. \code{sd_control} -#' and \code{sd_treated} columns contain the standard deviations for the reference and treatment -#' condition, respectively. \code{n_control} and \code{n_treated} columns contain the numbers of -#' samples for the reference and treatment condition, respectively. The \code{std_error} column -#' contains the standard error of the differential abundances. \code{t_statistic} contains the +#' comparison pairs. Treated refers to the first condition. `mean_control` and `mean_treated` +#' columns contain the means for the reference and treatment condition, respectively. `sd_control` +#' and `sd_treated` columns contain the standard deviations for the reference and treatment +#' condition, respectively. `n_control` and `n_treated` columns contain the numbers of +#' samples for the reference and treatment condition, respectively. The `std_error` column +#' contains the standard error of the differential abundances. `t_statistic` contains the #' t_statistic for the t-test. -#' * "moderated_t-test": \code{CI_2.5} and \code{CI_97.5} contain the 2.5% and 97.5% -#' confidence interval borders for differential abundances. \code{avg_abundance} contains average -#' abundances for treatment/reference pairs (mean of the two group means). \code{t_statistic} -#' contains the t_statistic for the t-test. \code{B} The B-statistic is the log-odds that the -#' protein, peptide or precursor (depending on \code{grouping}) has a differential abundance +#' * "moderated_t-test": `CI_2.5` and `CI_97.5` contain the 2.5% and 97.5% +#' confidence interval borders for differential abundances. `avg_abundance` contains average +#' abundances for treatment/reference pairs (mean of the two group means). `t_statistic` +#' contains the t_statistic for the t-test. `B` The B-statistic is the log-odds that the +#' protein, peptide or precursor (depending on `grouping`) has a differential abundance #' between the two groups. Suppose B=1.5. The odds of differential abundance is exp(1.5)=4.48, i.e, #' about four and a half to one. The probability that there is a differential abundance is #' 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this group is differentially #' abundant. A B-statistic of zero corresponds to a 50-50 chance that the group is differentially -#' abundant.\code{n_obs} contains the number of observations for the specific protein, peptide or -#' precursor (depending on the \code{grouping} variable) and the associated treatment/reference pair. -#' * "proDA": The \code{std_error} column contains the standard error of the differential -#' abundances. \code{avg_abundance} contains average abundances for treatment/reference pairs -#' (mean of the two group means). \code{t_statistic} contains the t_statistic for the t-test. -#' \code{n_obs} contains the number of observations for the specific protein, peptide or precursor -#' (depending on the \code{grouping} variable) and the associated treatment/reference pair. +#' abundant.`n_obs` contains the number of observations for the specific protein, peptide or +#' precursor (depending on the `grouping` variable) and the associated treatment/reference pair. +#' * "proDA": The `std_error` column contains the standard error of the differential +#' abundances. `avg_abundance` contains average abundances for treatment/reference pairs +#' (mean of the two group means). `t_statistic` contains the t_statistic for the t-test. +#' `n_obs` contains the number of observations for the specific protein, peptide or precursor +#' (depending on the `grouping` variable) and the associated treatment/reference pair. #' #' @keywords internal #' @export @@ -55,101 +55,101 @@ diff_abundance <- #' with protein, peptide or precursor data. Different methods for statistical testing are available. #' #' @param data a data frame containing at least the input variables that are required for the -#' selected method. Ideally the output of \code{assign_missingness} or \code{impute} is used. -#' @param sample a character column in the \code{data} data frame that contains the sample name. -#' Is not required if \code{method = "t-test_mean_sd"}. -#' @param condition a character or numeric column in the \code{data} data frame that contains the +#' selected method. Ideally the output of `assign_missingness` or `impute` is used. +#' @param sample a character column in the `data` data frame that contains the sample name. +#' Is not required if `method = "t-test_mean_sd"`. +#' @param condition a character or numeric column in the `data` data frame that contains the #' conditions. -#' @param grouping a character column in the \code{data} data frame that contains precursor, +#' @param grouping a character column in the `data` data frame that contains precursor, #' peptide or protein identifiers. -#' @param intensity_log2 a numeric column in the \code{data} data frame that contains intensity +#' @param intensity_log2 a numeric column in the `data` data frame that contains intensity #' values. The intensity values need to be log2 transformed. Is not required if -#' \code{method = "t-test_mean_sd"}. -#' @param missingness a character column in the \code{data} data frame that contains missingness -#' information. Can be obtained by calling \code{assign_missingness()}. Is not required if -#' \code{method = "t-test_mean_sd"}. The type of missingness assigned to a comparison does not have -#' any influence on the statistical test. However, if \code{filter_NA_missingness = TRUE} and -#' \code{method = "proDA"}, then comparisons with missingness \code{NA} are filtered out prior +#' `method = "t-test_mean_sd"`. +#' @param missingness a character column in the `data` data frame that contains missingness +#' information. Can be obtained by calling `assign_missingness()`. Is not required if +#' `method = "t-test_mean_sd"`. The type of missingness assigned to a comparison does not have +#' any influence on the statistical test. However, if `filter_NA_missingness = TRUE` and +#' `method = "proDA"`, then comparisons with missingness `NA` are filtered out prior #' to p-value adjustment. -#' @param comparison a character column in the \code{data} data frame that contains information of -#' treatment/reference condition pairs. Can be obtained by calling \code{assign_missingness}. +#' @param comparison a character column in the `data` data frame that contains information of +#' treatment/reference condition pairs. Can be obtained by calling `assign_missingness`. #' Comparisons need to be in the form condition1_vs_condition2, meaning two compared conditions are -#' separated by \code{"_vs_"}. This column determines for which condition pairs differential -#' abundances are calculated. Is not required if \code{method = "t-test_mean_sd"}, in that case +#' separated by `"_vs_"`. This column determines for which condition pairs differential +#' abundances are calculated. Is not required if `method = "t-test_mean_sd"`, in that case #' please provide a reference condition with the ref_condition argument. -#' @param mean a numeric column in the \code{data} data frame that contains mean values for two -#' conditions. Is only required if \code{method = "t-test_mean_sd"}. -#' @param sd a numeric column in the \code{data} data frame that contains standard deviations for -#' two conditions. Is only required if \code{method = "t-test_mean_sd"}. -#' @param n_samples a numeric column in the \code{data} data frame that contains the number of -#' samples per condition for two conditions. Is only required if \code{method = "t-test_mean_sd"}. +#' @param mean a numeric column in the `data` data frame that contains mean values for two +#' conditions. Is only required if `method = "t-test_mean_sd"`. +#' @param sd a numeric column in the `data` data frame that contains standard deviations for +#' two conditions. Is only required if `method = "t-test_mean_sd"`. +#' @param n_samples a numeric column in the `data` data frame that contains the number of +#' samples per condition for two conditions. Is only required if `method = "t-test_mean_sd"`. #' @param ref_condition optional, character value providing the condition that is used as a -#' reference for differential abundance calculation. Only required for \code{method = "t-test_mean_sd"}. +#' reference for differential abundance calculation. Only required for `method = "t-test_mean_sd"`. #' Instead of providing one reference condition, "all" can be supplied, which will create all -#' pairwise condition pairs. By default \code{ref_condition = "all"}. -#' @param filter_NA_missingness a logical value, default is \code{TRUE}. For all methods except -#' \code{"t-test_mean_sd"} missingness information has to be provided. This information can be -#' for example obtained by calling \code{assign_missingness()}. If a reference/treatment pair has -#' too few samples to be considered robust based on user defined cutoffs, it is annotated with \code{NA} -#' as missingness by the \code{assign_missingness()} function. If this argument is \code{TRUE}, -#' these \code{NA} reference/treatment pairs are filtered out. For \code{method = "proDA"} this +#' pairwise condition pairs. By default `ref_condition = "all"`. +#' @param filter_NA_missingness a logical value, default is `TRUE`. For all methods except +#' `"t-test_mean_sd"` missingness information has to be provided. This information can be +#' for example obtained by calling `assign_missingness()`. If a reference/treatment pair has +#' too few samples to be considered robust based on user defined cutoffs, it is annotated with `NA` +#' as missingness by the `assign_missingness()` function. If this argument is `TRUE`, +#' these `NA` reference/treatment pairs are filtered out. For `method = "proDA"` this #' is done before the p-value adjustment. #' @param method a character value, specifies the method used for statistical hypothesis testing. -#' Methods include Welch test (\code{"t-test"}), a Welch test on means, standard deviations and -#' number of replicates ("\code{"t-test_mean_sd"}) and a moderated t-test based on the \code{limma} -#' package (\code{"moderated_t-test"}). More information on the moderated t-test can be found in -#' the \code{limma} documentation. Furthermore, the \code{proDA} package specific method (\code{"proDA"}) +#' Methods include Welch test (`"t-test"`), a Welch test on means, standard deviations and +#' number of replicates (`"t-test_mean_sd"`) and a moderated t-test based on the `limma` +#' package (`"moderated_t-test"`). More information on the moderated t-test can be found in +#' the `limma` documentation. Furthermore, the `proDA` package specific method (`"proDA"`) #' can be used to infer means across samples based on a probabilistic dropout model. This #' eliminates the need for data imputation since missing values are inferred from the model. More -#' information can be found in the \code{proDA} documentation. We do not recommend using the -#' \code{moderated_t-test} or \code{proDA} method if the data was filtered for low CVs or imputation -#' was performed. Default is \code{method = "moderated_t-test"}. +#' information can be found in the `proDA` documentation. We do not recommend using the +#' `moderated_t-test` or `proDA` method if the data was filtered for low CVs or imputation +#' was performed. Default is `method = "moderated_t-test"`. #' @param p_adj_method a character value, specifies the p-value correction method. Possible #' methods are c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default -#' method is \code{"BH"}. +#' method is `"BH"`. #' @param retain_columns a vector indicating if certain columns should be retained from the input -#' data frame. Default is not retaining additional columns \code{retain_columns = NULL}. Specific +#' data frame. Default is not retaining additional columns `retain_columns = NULL`. Specific #' columns can be retained by providing their names (not in quotations marks, just like other #' column names, but in a vector). Please note that if you retain columns that have multiple #' rows per grouped variable there will be duplicated rows in the output. #' -#' @return A data frame that contains differential abundances (\code{diff}), p-values (\code{pval}) -#' and adjusted p-values (\code{adj_pval}) for each protein, peptide or precursor (depending on -#' the \code{grouping} variable) and the associated treatment/reference pair. Depending on the +#' @return A data frame that contains differential abundances (`diff`), p-values (`pval`) +#' and adjusted p-values (`adj_pval`) for each protein, peptide or precursor (depending on +#' the `grouping` variable) and the associated treatment/reference pair. Depending on the #' method the data frame contains additional columns: #' -#' * "t-test": The \code{std_error} column contains the standard error of the differential -#' abundances. \code{n_obs} contains the number of observations for the specific protein, peptide -#' or precursor (depending on the \code{grouping} variable) and the associated treatment/reference pair. +#' * "t-test": The `std_error` column contains the standard error of the differential +#' abundances. `n_obs` contains the number of observations for the specific protein, peptide +#' or precursor (depending on the `grouping` variable) and the associated treatment/reference pair. #' * "t-test_mean_sd": Columns labeled as control refer to the second condition of the -#' comparison pairs. Treated refers to the first condition. \code{mean_control} and \code{mean_treated} -#' columns contain the means for the reference and treatment condition, respectively. \code{sd_control} -#' and \code{sd_treated} columns contain the standard deviations for the reference and treatment -#' condition, respectively. \code{n_control} and \code{n_treated} columns contain the numbers of -#' samples for the reference and treatment condition, respectively. The \code{std_error} column -#' contains the standard error of the differential abundances. \code{t_statistic} contains the +#' comparison pairs. Treated refers to the first condition. `mean_control` and `mean_treated` +#' columns contain the means for the reference and treatment condition, respectively. `sd_control` +#' and `sd_treated` columns contain the standard deviations for the reference and treatment +#' condition, respectively. `n_control` and `n_treated` columns contain the numbers of +#' samples for the reference and treatment condition, respectively. The `std_error` column +#' contains the standard error of the differential abundances. `t_statistic` contains the #' t_statistic for the t-test. -#' * "moderated_t-test": \code{CI_2.5} and \code{CI_97.5} contain the 2.5% and 97.5% -#' confidence interval borders for differential abundances. \code{avg_abundance} contains average -#' abundances for treatment/reference pairs (mean of the two group means). \code{t_statistic} -#' contains the t_statistic for the t-test. \code{B} The B-statistic is the log-odds that the -#' protein, peptide or precursor (depending on \code{grouping}) has a differential abundance +#' * "moderated_t-test": `CI_2.5` and `CI_97.5` contain the 2.5% and 97.5% +#' confidence interval borders for differential abundances. `avg_abundance` contains average +#' abundances for treatment/reference pairs (mean of the two group means). `t_statistic` +#' contains the t_statistic for the t-test. `B` The B-statistic is the log-odds that the +#' protein, peptide or precursor (depending on `grouping`) has a differential abundance #' between the two groups. Suppose B=1.5. The odds of differential abundance is exp(1.5)=4.48, i.e, #' about four and a half to one. The probability that there is a differential abundance is #' 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this group is differentially #' abundant. A B-statistic of zero corresponds to a 50-50 chance that the group is differentially -#' abundant.\code{n_obs} contains the number of observations for the specific protein, peptide or -#' precursor (depending on the \code{grouping} variable) and the associated treatment/reference pair. -#' * "proDA": The \code{std_error} column contains the standard error of the differential -#' abundances. \code{avg_abundance} contains average abundances for treatment/reference pairs -#' (mean of the two group means). \code{t_statistic} contains the t_statistic for the t-test. -#' \code{n_obs} contains the number of observations for the specific protein, peptide or precursor -#' (depending on the \code{grouping} variable) and the associated treatment/reference pair. +#' abundant.`n_obs` contains the number of observations for the specific protein, peptide or +#' precursor (depending on the `grouping` variable) and the associated treatment/reference pair. +#' * "proDA": The `std_error` column contains the standard error of the differential +#' abundances. `avg_abundance` contains average abundances for treatment/reference pairs +#' (mean of the two group means). `t_statistic` contains the t_statistic for the t-test. +#' `n_obs` contains the number of observations for the specific protein, peptide or precursor +#' (depending on the `grouping` variable) and the associated treatment/reference pair. #' -#' For all methods execept \code{"proDA"}, the p-value adjustment is performed only on the -#' proportion of data that contains a p-value that is not \code{NA}. For \code{"proDA"} the -#' p-value adjustment is either performed on the complete dataset (\code{filter_NA_missingness = TRUE}) -#' or on the subset of the dataset with missingness that is not \code{NA} (\code{filter_NA_missingness = FALSE}). +#' For all methods execept `"proDA"`, the p-value adjustment is performed only on the +#' proportion of data that contains a p-value that is not `NA`. For `"proDA"` the +#' p-value adjustment is either performed on the complete dataset (`filter_NA_missingness = TRUE`) +#' or on the subset of the dataset with missingness that is not `NA` (`filter_NA_missingness = FALSE`). #' @import dplyr #' @import tidyr #' @importFrom rlang .data enquo ensym as_name as_label expr := !! @@ -762,7 +762,8 @@ missingness type is assigned.\n The created comparisons are: \n", prefix = "\n", purrr::map_dfr(~ dplyr::mutate(.x, adj_pval = p.adjust(.data$pval, method = p_adj_method))) %>% dplyr::select(-"n_obs", -"n_approx") %>% dplyr::rename({{ grouping }} := "name", - std_error = "se") %>% + std_error = "se" + ) %>% dplyr::left_join(proDA_missingness, by = c(rlang::as_name(rlang::enquo(grouping)), "comparison")) message("DONE", appendLF = TRUE) diff --git a/R/calculate_go_enrichment.R b/R/calculate_go_enrichment.R index 87fea458..8db67eaa 100644 --- a/R/calculate_go_enrichment.R +++ b/R/calculate_go_enrichment.R @@ -235,7 +235,7 @@ calculate_go_enrichment <- function(data, if (!(plot_style %in% c("barplot", "heatmap"))) stop("Invalid plot_style. Available styles: barplot, heatmap") - if(length(barplot_fill_colour) < 2) stop('Please provide at least two colours to "barplot_fill_colour"!') + if (length(barplot_fill_colour) < 2) stop('Please provide at least two colours to "barplot_fill_colour"!') if (!stringr::str_detect(plot_cutoff, pattern = "^(pval|adj_pval) (top\\d+|\\d+(\\.\\d+)?)$")) { stop("Invalid format for plot_cutoff. Please provide the type (pval or adj_pval) followed by @@ -520,7 +520,7 @@ if you used the right organism ID.", prefix = "\n", initial = "")) { if (!missing(group)) { if (y_axis_free) { - if(facet_n_col == 1){ + if (facet_n_col == 1) { ggforce::facet_col(rlang::new_formula(NULL, rlang::enquo(group)), scales = "free_y", space = "free") } else { ggplot2::facet_wrap(rlang::new_formula(NULL, rlang::enquo(group)), scales = "free_y", ncol = facet_n_col) diff --git a/R/drc_4p_plot.R b/R/drc_4p_plot.R index 26de6f9f..910e24f3 100644 --- a/R/drc_4p_plot.R +++ b/R/drc_4p_plot.R @@ -173,7 +173,7 @@ drc_4p_plot <- function(data, ) + ggplot2::geom_point(size = 2, col = "#5680C1") + { - if(nrow(input_curve_plot) != 1) { + if (nrow(input_curve_plot) != 1) { ggplot2::geom_ribbon( data = input_curve_plot, ggplot2::aes( @@ -188,7 +188,7 @@ drc_4p_plot <- function(data, } } + { - if(nrow(input_curve_plot) != 1) { + if (nrow(input_curve_plot) != 1) { ggplot2::geom_line( data = input_curve_plot, ggplot2::aes( @@ -286,7 +286,7 @@ drc_4p_plot <- function(data, ggplot2::ggplot(data = x, ggplot2::aes(x = {{ dose }}, y = {{ response }})) + ggplot2::geom_point(size = 2, col = "#5680C1") + { - if(nrow(y) != 1) { + if (nrow(y) != 1) { ggplot2::geom_ribbon( data = y, ggplot2::aes( @@ -301,7 +301,7 @@ drc_4p_plot <- function(data, } } + { - if(nrow(y) != 1) { + if (nrow(y) != 1) { ggplot2::geom_line( data = y, ggplot2::aes( diff --git a/R/fetch_chebi.R b/R/fetch_chebi.R index 5b0bf205..c79a2f53 100644 --- a/R/fetch_chebi.R +++ b/R/fetch_chebi.R @@ -207,8 +207,10 @@ fetch_chebi <- function(relation = FALSE, stars = c(3), timeout = 60) { chebi_names_clean <- chebi_names %>% dplyr::distinct(.data$COMPOUND_ID, .data$NAME, .data$TYPE) %>% - dplyr::rename(ID = "COMPOUND_ID", - TYPE_NAME = "TYPE") %>% + dplyr::rename( + ID = "COMPOUND_ID", + TYPE_NAME = "TYPE" + ) %>% dplyr::bind_rows(chebi_compounds_names_clean) chebi <- chebi_compounds_clean %>% diff --git a/R/fetch_pdb.R b/R/fetch_pdb.R index 005593c1..5718f1b7 100644 --- a/R/fetch_pdb.R +++ b/R/fetch_pdb.R @@ -897,7 +897,7 @@ fetch_pdb <- function(pdb_ids, batchsize = 100, show_progress = TRUE) { combined <- polymer_entities %>% dplyr::full_join(nonpolymer_entities, by = c("pdb_ids", "auth_asym_ids"), relationship = "many-to-many") %>% - dplyr::left_join(rcsb_binding_affinity, by = "pdb_ids", relationship = "many-to-many" ) %>% + dplyr::left_join(rcsb_binding_affinity, by = "pdb_ids", relationship = "many-to-many") %>% dplyr::left_join(additional_info, by = "pdb_ids") %>% dplyr::left_join(crystal_growth_info, by = "pdb_ids") %>% dplyr::left_join(nmr_info, by = "pdb_ids") %>% diff --git a/R/fit_drc_4p.R b/R/fit_drc_4p.R index 67046f59..9451b27e 100644 --- a/R/fit_drc_4p.R +++ b/R/fit_drc_4p.R @@ -321,19 +321,23 @@ fit_drc_4p <- function(data, list(unique(.data$n_vector_add_rev))[[1]][2] )$pval <= 0.1) %>% dplyr::ungroup() %>% - tidyr::replace_na(list("pval_vector" = TRUE, - "pval_vector_rev" = TRUE, - "pval_vector_add" = TRUE, - "pval_vector_add_rev" = TRUE, - "mean_vector" = 0, - "mean_vector_rev" = 0, - "mean_vector_add" = 0, - "mean_vector_add_rev" = 0)) %>% + tidyr::replace_na(list( + "pval_vector" = TRUE, + "pval_vector_rev" = TRUE, + "pval_vector_add" = TRUE, + "pval_vector_add_rev" = TRUE, + "mean_vector" = 0, + "mean_vector_rev" = 0, + "mean_vector_add" = 0, + "mean_vector_add_rev" = 0 + )) %>% dplyr::group_by({{ grouping }}) %>% - dplyr::mutate(mean_vector = min(.data$mean_vector) == .data$mean_vector, - mean_vector_rev = min(.data$mean_vector_rev) == .data$mean_vector_rev, - mean_vector_add = min(.data$mean_vector_add) == .data$mean_vector_add, - mean_vector_add_rev = min(.data$mean_vector_add_rev) == .data$mean_vector_add_rev) %>% + dplyr::mutate( + mean_vector = min(.data$mean_vector) == .data$mean_vector, + mean_vector_rev = min(.data$mean_vector_rev) == .data$mean_vector_rev, + mean_vector_add = min(.data$mean_vector_add) == .data$mean_vector_add, + mean_vector_add_rev = min(.data$mean_vector_add_rev) == .data$mean_vector_add_rev + ) %>% dplyr::mutate(dose_MNAR = ifelse((all((.data$lower_vector & .data$enough_replicates == FALSE & .data$mean_vector) | (!.data$lower_vector & .data$enough_replicates == TRUE & !.data$mean_vector)) & .data$pval_vector) | diff --git a/R/normalise.R b/R/normalise.R index d829a623..12dcd2ae 100644 --- a/R/normalise.R +++ b/R/normalise.R @@ -60,7 +60,7 @@ normalise <- if (!(method %in% c("median"))) { stop("Invalid method. Available methods: median") } - + if (method == "median") { median_normalised <- data %>% dplyr::distinct() %>% diff --git a/R/qc_charge_states.R b/R/qc_charge_states.R index 89c9f42c..5b9c8251 100644 --- a/R/qc_charge_states.R +++ b/R/qc_charge_states.R @@ -130,9 +130,11 @@ qc_charge_states <- if (interactive == FALSE) { ggplot2::geom_text( data = label_positions, - aes(y = label_y, - label = round(.data$charge_per, digits = 1)), - vjust = 1.5 + aes( + y = label_y, + label = round(.data$charge_per, digits = 1) + ), + vjust = 1.5 ) } } + @@ -192,9 +194,11 @@ qc_charge_states <- if (interactive == FALSE) { ggplot2::geom_text( data = label_positions, - aes(y = label_y, - label = round(.data$charge_per, digits = 1)), - vjust = 1.5 + aes( + y = label_y, + label = round(.data$charge_per, digits = 1) + ), + vjust = 1.5 ) } } + diff --git a/R/qc_data_completeness.R b/R/qc_data_completeness.R index 9a197cde..9b721b75 100644 --- a/R/qc_data_completeness.R +++ b/R/qc_data_completeness.R @@ -78,7 +78,7 @@ qc_data_completeness <- function(data, dplyr::group_by({{ sample }}) %>% dplyr::summarise(completeness = sum(!is.na({{ intensity }})) / dplyr::n() * 100, .groups = "drop") %>% dplyr::mutate({{ digestion }} := .y) - ) %>% + ) %>% filter(.data$completeness > 0) } else { result <- data %>% diff --git a/R/qc_missed_cleavages.R b/R/qc_missed_cleavages.R index 26be22ac..f707a89b 100644 --- a/R/qc_missed_cleavages.R +++ b/R/qc_missed_cleavages.R @@ -136,9 +136,11 @@ intensities or set remove_na_intensities to FALSE", if (interactive == FALSE) { ggplot2::geom_text( data = label_positions, - aes(y = .data$label_y, - label = round(.data$mc_percent, digits = 1)), - vjust = 1.5 + aes( + y = .data$label_y, + label = round(.data$mc_percent, digits = 1) + ), + vjust = 1.5 ) } } + @@ -206,9 +208,11 @@ intensities or set remove_na_intensities to FALSE", if (interactive == FALSE) { ggplot2::geom_text( data = label_positions, - aes(y = .data$label_y, - label = round(.data$mc_percent, digits = 1)), - vjust = 1.5 + aes( + y = .data$label_y, + label = round(.data$mc_percent, digits = 1) + ), + vjust = 1.5 ) } } + diff --git a/R/qc_peptide_type.R b/R/qc_peptide_type.R index c6f55ea2..9f97cd89 100644 --- a/R/qc_peptide_type.R +++ b/R/qc_peptide_type.R @@ -126,9 +126,11 @@ qc_peptide_type <- function(data, ggplot2::geom_col(col = "black", size = 1) + ggplot2::geom_text( data = label_positions, - aes(y = label_y, - label = round(.data$peptide_type_percent, digits = 1)), - vjust = 1.5 + aes( + y = label_y, + label = round(.data$peptide_type_percent, digits = 1) + ), + vjust = 1.5 ) + ggplot2::labs( title = "Peptide types per .raw file", @@ -221,9 +223,11 @@ qc_peptide_type <- function(data, ggplot2::geom_col(col = "black", size = 1) + ggplot2::geom_text( data = label_positions, - aes(y = label_y, - label = round(.data$peptide_type_percent, digits = 1)), - vjust = 1.5 + aes( + y = label_y, + label = round(.data$peptide_type_percent, digits = 1) + ), + vjust = 1.5 ) + ggplot2::labs( title = "Peptide type intensity per .raw file", diff --git a/R/qc_proteome_coverage.R b/R/qc_proteome_coverage.R index cde30477..80774ab5 100644 --- a/R/qc_proteome_coverage.R +++ b/R/qc_proteome_coverage.R @@ -66,7 +66,7 @@ qc_proteome_coverage <- function(data, proteome <- fetch_uniprot_proteome(organism_id, reviewed = reviewed) - if(is(proteome, "character")){ + if (is(proteome, "character")) { # UniProt information could not be fetched. message("UniProt information could not be fetched") return(NULL) diff --git a/R/qc_sequence_coverage.R b/R/qc_sequence_coverage.R index 48b5b9ce..4cd60e48 100644 --- a/R/qc_sequence_coverage.R +++ b/R/qc_sequence_coverage.R @@ -47,7 +47,6 @@ qc_sequence_coverage <- function(data, coverage, sample = NULL, interactive = FALSE) { - # Validate inputs if (!all(c(rlang::as_name(rlang::enquo(protein_identifier)), rlang::as_name(rlang::enquo(coverage))) %in% colnames(data))) { stop("Column names for protein_identifier and coverage must exist in the dataset.") @@ -76,10 +75,12 @@ qc_sequence_coverage <- function(data, boundary = 0, size = 1 ) + - ggplot2::geom_vline(data = result %>% dplyr::distinct(.data$median_coverage, {{ sample }}), - mapping = aes(xintercept = .data$median_coverage), - linewidth = 1, - linetype = "dashed") + + ggplot2::geom_vline( + data = result %>% dplyr::distinct(.data$median_coverage, {{ sample }}), + mapping = aes(xintercept = .data$median_coverage), + linewidth = 1, + linetype = "dashed" + ) + ggplot2::labs( title = "Protein coverage distribution", x = "Coverage [%]", diff --git a/man/barcode_plot.Rd b/man/barcode_plot.Rd index 0d6a5f97..0efbbfd3 100644 --- a/man/barcode_plot.Rd +++ b/man/barcode_plot.Rd @@ -22,7 +22,7 @@ barcode_plot( \arguments{ \item{data}{a data frame containing differential abundance, start and end peptide or precursor positions and protein length.} -\item{start_position}{a numeric olumn in the data frame containing the start positions for each peptide or precursor.} +\item{start_position}{a numeric column in the data frame containing the start positions for each peptide or precursor.} \item{end_position}{a numeric column in the data frame containing the end positions for each peptide or precursor.} diff --git a/man/calculate_diff_abundance.Rd b/man/calculate_diff_abundance.Rd index 019d34e1..6489a3a7 100644 --- a/man/calculate_diff_abundance.Rd +++ b/man/calculate_diff_abundance.Rd @@ -77,7 +77,7 @@ is done before the p-value adjustment.} \item{method}{a character value, specifies the method used for statistical hypothesis testing. Methods include Welch test (\code{"t-test"}), a Welch test on means, standard deviations and -number of replicates ("\code{"t-test_mean_sd"}) and a moderated t-test based on the \code{limma} +number of replicates (\code{"t-test_mean_sd"}) and a moderated t-test based on the \code{limma} package (\code{"moderated_t-test"}). More information on the moderated t-test can be found in the \code{limma} documentation. Furthermore, the \code{proDA} package specific method (\code{"proDA"}) can be used to infer means across samples based on a probabilistic dropout model. This diff --git a/man/qc_missed_cleavages.Rd b/man/qc_missed_cleavages.Rd index 98499536..d1cf4fe3 100644 --- a/man/qc_missed_cleavages.Rd +++ b/man/qc_missed_cleavages.Rd @@ -57,6 +57,8 @@ default settings remove grouping variables without quantitative information (int These will not be used for the calculation of missed cleavage percentages. } \examples{ +library(dplyr) + set.seed(123) # Makes example reproducible # Create example data @@ -66,7 +68,8 @@ data <- create_synthetic_data( n_replicates = 3, n_conditions = 2, method = "effect_random" -) +) \%>\% + mutate(intensity_non_log2 = 2^peptide_intensity_missing) # Calculate missed cleavage percentages qc_missed_cleavages( @@ -74,7 +77,7 @@ qc_missed_cleavages( sample = sample, grouping = peptide, missed_cleavages = n_missed_cleavage, - intensity = peptide_intensity_missing, + intensity = intensity_non_log2, method = "intensity", plot = FALSE ) @@ -85,7 +88,7 @@ qc_missed_cleavages( sample = sample, grouping = peptide, missed_cleavages = n_missed_cleavage, - intensity = peptide_intensity_missing, + intensity = intensity_non_log2, method = "intensity", plot = TRUE ) diff --git a/tests/testthat/test-workflow.R b/tests/testthat/test-workflow.R index 0f0345fd..d73b419d 100644 --- a/tests/testthat/test-workflow.R +++ b/tests/testthat/test-workflow.R @@ -333,26 +333,27 @@ test_that("calculate_diff_abundance works", { test_that("correct_lip_for_abundance works", { data <- rapamycin_10uM - diff_lip = data %>% - dplyr::mutate(fg_intensity_log2 = log2(fg_quantity)) %>% - assign_missingness( - sample = r_file_name, - condition = r_condition, - intensity = fg_intensity_log2, - grouping = eg_precursor_id, - ref_condition = "control", - retain_columns = "pg_protein_accessions") %>% - calculate_diff_abundance( - sample = r_file_name, - condition = r_condition, - grouping = eg_precursor_id, - intensity_log2 = fg_intensity_log2, - comparison = comparison, - method = "t-test", - retain_columns = "pg_protein_accessions" - ) + diff_lip <- data %>% + dplyr::mutate(fg_intensity_log2 = log2(fg_quantity)) %>% + assign_missingness( + sample = r_file_name, + condition = r_condition, + intensity = fg_intensity_log2, + grouping = eg_precursor_id, + ref_condition = "control", + retain_columns = "pg_protein_accessions" + ) %>% + calculate_diff_abundance( + sample = r_file_name, + condition = r_condition, + grouping = eg_precursor_id, + intensity_log2 = fg_intensity_log2, + comparison = comparison, + method = "t-test", + retain_columns = "pg_protein_accessions" + ) - diff_trp = data %>% + diff_trp <- data %>% dplyr::group_by(pg_protein_accessions, r_file_name) %>% dplyr::mutate(pg_quantity = sum(fg_quantity)) %>% dplyr::distinct( @@ -366,16 +367,19 @@ test_that("correct_lip_for_abundance works", { sample = r_file_name, condition = r_condition, intensity = pg_intensity_log2, - grouping = pg_protein_accessions, - ref_condition = "control") %>% - calculate_diff_abundance(sample = r_file_name, - condition = r_condition, - grouping = pg_protein_accessions, - intensity_log2 = pg_intensity_log2, - comparison = comparison, - method = "t-test") - - corrected_satterthwaite = correct_lip_for_abundance( + grouping = pg_protein_accessions, + ref_condition = "control" + ) %>% + calculate_diff_abundance( + sample = r_file_name, + condition = r_condition, + grouping = pg_protein_accessions, + intensity_log2 = pg_intensity_log2, + comparison = comparison, + method = "t-test" + ) + + corrected_satterthwaite <- correct_lip_for_abundance( lip_data = diff_lip, trp_data = diff_trp, protein_id = pg_protein_accessions, @@ -384,7 +388,7 @@ test_that("correct_lip_for_abundance works", { method = "satterthwaite" ) - corrected_no_df_approximation = correct_lip_for_abundance( + corrected_no_df_approximation <- correct_lip_for_abundance( lip_data = diff_lip, trp_data = diff_trp, protein_id = pg_protein_accessions, @@ -394,14 +398,14 @@ test_that("correct_lip_for_abundance works", { ) expect_is(corrected_satterthwaite, "data.frame") - expect_equal(corrected_satterthwaite$adj_diff[1], 2.474938, tolerance=1e-3) - expect_equal(corrected_satterthwaite$adj_std_error[1], 0.1531189, tolerance=1e-3) - expect_equal(corrected_satterthwaite$adj_pval[1], 1.561211e-05, tolerance=1e-3) - expect_equal(corrected_satterthwaite$df[1], 10.13124, tolerance=1e-3) + expect_equal(corrected_satterthwaite$adj_diff[1], 2.474938, tolerance = 1e-3) + expect_equal(corrected_satterthwaite$adj_std_error[1], 0.1531189, tolerance = 1e-3) + expect_equal(corrected_satterthwaite$adj_pval[1], 1.561211e-05, tolerance = 1e-3) + expect_equal(corrected_satterthwaite$df[1], 10.13124, tolerance = 1e-3) expect_is(corrected_no_df_approximation, "data.frame") - expect_equal(corrected_no_df_approximation$adj_diff[1], 2.474938, tolerance=1e-3) - expect_equal(corrected_no_df_approximation$adj_std_error[1], 0.1531189, tolerance=1e-3) + expect_equal(corrected_no_df_approximation$adj_diff[1], 2.474938, tolerance = 1e-3) + expect_equal(corrected_no_df_approximation$adj_std_error[1], 0.1531189, tolerance = 1e-3) expect_equal(corrected_no_df_approximation$df[1], 6) }) @@ -788,4 +792,3 @@ test_that("calculate_aa_scores works", { expect_equal(nrow(aa_fingerprint), 45) expect_equal(ncol(aa_fingerprint), 3) }) - diff --git a/vignettes/quality_control_workflow.Rmd b/vignettes/quality_control_workflow.Rmd index e9ed87e4..e50a6d85 100644 --- a/vignettes/quality_control_workflow.Rmd +++ b/vignettes/quality_control_workflow.Rmd @@ -111,7 +111,7 @@ $$CV = \frac{standard ~ deviation}{mean} * 100$$ Since the CV function works only with raw values, we will backtransform our log2 transformed data and create a new column called `raw_intensity` from the `peptide_intensity_missing` column which contains peptide intensities. To create the new column, we are using the function `mutate()` from the R package `dplyr`. We also make use of the pipe operator `%>%` included in the R package `magrittr` which takes the output of the preceding function and supplies it as the first argument of the following function. Using `%>%` makes code easier to read and follow. -The "combined" group of CVs contains CVs across all samples and not only across the replicates of a certain condition. Ideally the combined CVs should always be the group with the highest CV. This would indicates that CVs within conditions are low and the treatment had and effect that causes an increase in overall CVs. If the treatment did not have an effect or only a very small one then the combined CV could be as high as the CVs of individual conditions. If an individual condition has a higher CV than the combined CV then there was potentially a problem with one or multiple of the samples. You can check if there are suspicious samples behaving differently from the rest in the other quality control metrics. If there is a consistent different behaviour these samples should potentially be excluded for analysis. Sometimes a condition can also just be noisy because of the treatment in this condition that causes a larger variation. Therefore comparing individual CVs with the combined CV should only be done with caution. +The "combined" group of CVs contains CVs across all samples and not only across the replicates of a certain condition. Ideally the combined CVs should always be the group with the highest CV. This would indicate that CVs within conditions are low and the treatment had and effect that causes an increase in overall CVs. If the treatment did not have an effect or only a very small one then the combined CV could be as high as the CVs of individual conditions. If an individual condition has a higher CV than the combined CV then there was potentially a problem with one or multiple of the samples. You can check if there are suspicious samples behaving differently from the rest in the other quality control metrics. If there is a consistent different behaviour these samples should potentially be excluded for analysis. Sometimes a condition can also just be noisy because of the treatment in this condition that causes a larger variation. Therefore comparing individual CVs with the combined CV should only be done with caution. ```{r qc_cvs, eval = test_protti, fig.width = 6, fig.height = 4, fig.align = "center"} input <- data %>% @@ -206,7 +206,7 @@ qc_peptide_type( The function `qc_intensity_distribution()` plots all precursor, peptide or protein intensities for each sample as a boxplot if method "boxplot" is selected. This is helpful for quick assessment of any major sample losses or measurement issues. -Rrun intensities can also be assessed by plotting the median run intensities as a line plot. This helps you quickly assess if there are any trends in your data. The function to use for this analysis is `qc_median_intensities()`. +Run intensities can also be assessed by plotting the median run intensities as a line plot. This helps you quickly assess if there are any trends in your data. The function to use for this analysis is `qc_median_intensities()`. ```{r qc_intensity_distribution_boxplot, eval = test_protti, fig.width = 6, fig.height = 4, fig.align = "center"} qc_intensity_distribution(