Revert "Fix fetch uniprot bug" #240

Merged
merged 1 commit into from
Mar 27, 2024
9 changes: 3 additions & 6 deletions DESCRIPTION
@@ -1,6 +1,6 @@
Package: protti
Title: Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools
Version: 0.7.0.9000
Version: 0.7.0
Authors@R:
c(person(given = "Jan-Philipp",
family = "Quast",
@@ -43,7 +43,7 @@ Imports:
methods,
R.utils,
stats
RoxygenNote: 7.3.1
RoxygenNote: 7.2.3
Suggests:
testthat,
covr,
@@ -64,10 +64,7 @@ Suggests:
igraph,
stringi,
STRINGdb,
iq,
scales,
farver,
ggforce
iq
Depends:
R (>= 4.0)
URL: https://github.com/jpquast/protti, https://jpquast.github.io/protti/
56 changes: 0 additions & 56 deletions NEWS.md
@@ -1,58 +1,3 @@
# protti 0.7.0.9000

## New features

* `calculate_treatment_enrichment()` received additional arguments.
* `fill_colours`: a character value that can be used to provide custom colours to the plot.
* `fill_by_group`: a logical value that specifies if the bars in the plot should be filled according to group.
* `facet_n_col`: specifies the number of columns in the facet plot if a `group` column was provided.
* `calculate_go_enrichment()` got additional arguments.
* `facet_n_col`: determines the number of columns the faceted plot should have if a group column is provided.
* `plot_title`: specifies the title of the plot.
* `min_n_detected_proteins_in_process`: argument for plotting that specifies the minimum number of detected proteins a GO term needs in order to be displayed.
* `enrichment_type`: specifies what kind of enrichment should be calculated. It can be "all", "enriched" or "deenriched". This argument affects how `fisher.test()` calculates the enrichment. A two-sided test will be used for "all", while a one-sided test in the specific direction will be used for "enriched" or "deenriched".
* `barplot_fill_colour`: specifies the colours used to fill the bars in the barplot. Always needs at least two values: one for deenriched, the other for enriched.
* `plot_style`: We added a new plot type to the function. The standard plot is still the default and is called "barplot", while the new plot type is "heatmap". The heatmap plot is especially useful for comparing GO enrichments of multiple groups.
* `heatmap_fill_colour`: specifies the colours used for the colour gradient of heatmap plots.
* `heatmap_fill_colour_rev`: a logical value that specifies if the colour gradient should be reversed.
* `plot_cutoff`: is now more flexible. You can provide any number with the "top" cutoff. E.g. "top10", "top5".
* `barcode_plot()` received additional arguments.
* `facet_n_col`: determines the number of columns of the faceted plot.
* `fill_colour_gradient`: specifies the colours used for the colour gradient if the `colouring` column is continuous.
* `fill_colour_discrete`: specifies the colours used for the fill colours if the `colouring` column is discrete.
* Added `mako_colours` to the package, which contains 256 colours of the "mako" colour gradient.
* `drc_4p_plot()` received additional arguments.
* `facet_title_size`: determines the size of the facet titles.
* `export_height`: determines the output height of an exported plot in inches.
* `export_width`: determines the output width of an exported plot in inches.
* `fit_drc_4p()` and `parallel_fit_drc_4p()` have been updated in the latest version of **protti**, leading to slight adjustments in their computational results compared to previous versions.
* We added new arguments:
* `anova_cutoff` lets you define the ANOVA adjusted p-value cutoff (default 0.05).
* `n_replicate_completeness` replaces `replicate_completeness`. Now we encourage you to provide a discrete number of minimal replicates instead of a fraction that is multiplied with the total number of replicates. This is particularly important to ensure that thresholds between different datasets and data completeness levels are reproducible.
* `n_condition_completeness` replaces `condition_completeness`. Same as above, we encourage you to provide the minimal number of conditions that need to meet the replicate completeness criteria as a number instead of a fraction.
* `complete_doses` is a new optional argument that should be provided if the dataset is small and potentially incomplete. This ensures that no matter if any doses are missing from the provided data or not, the MNAR of the curve is calculated correctly. We would recommend always providing it to ensure proper reproducibility.
* Curves that were previously annotated in the `dose_MNAR` column are now part of the hits. To get back to the old output you can just exclude them again from the ranked results.
* The major change to the function is that now all provided features (e.g. peptides) are also part of the output no matter if a curve was fit or not. To get back to the original output you can remove all features without a fit, but please note that statistics such as the ANOVA p-value adjustment were computed on the complete dataset and might need to be readjusted by running the p-value adjustment again.
* Another major change to the function was the way the `filter` argument works. This argument controls if significance statistics should be annotated in the data.
* `"pre"`: This previously filtered curves by the completeness as well as the ANOVA adjusted p-value prior to fitting curves. Now it only filters by completeness. This also allows it to be an option for the `parallel_fit_drc_4p()` function.
* `"post"`: Is still the default value and still just annotates the data without any filtering.
* In general, we now recommend using `"pre"` to remove features with too few complete concentrations, which are usually not trustworthy, from the data before p-value adjustment and curve fitting. This increases confidence that features without a dose-response behavior are true negatives. It is better to exclude features with too few values because they are potentially false negatives.
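
The `enrichment_type` entry above describes a two-sided test for "all" and a one-sided test for "enriched" or "deenriched". The following is a minimal sketch of that idea using base R's `fisher.test()`. The counts are made up, and the mapping from enrichment type to the `alternative` argument is an assumption about how such a switch could be wired, not protti's actual code.

```r
# Hypothetical counts for a single GO term (not taken from any real dataset).
contingency <- matrix(
  c(
    12, 8,   # significant proteins: in term, not in term
    30, 150  # non-significant proteins: in term, not in term
  ),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    significant = c("yes", "no"),
    in_term = c("yes", "no")
  )
)

# Assumed mapping from enrichment type to the `alternative` argument.
alternative_for <- c(
  all = "two.sided",
  enriched = "greater",
  deenriched = "less"
)

# One p-value per enrichment type; names are carried over from the mapping.
p_values <- vapply(
  alternative_for,
  function(alt) stats::fisher.test(contingency, alternative = alt)$p.value,
  numeric(1)
)

print(p_values)
```

With these counts the term is over-represented among the significant proteins, so the "enriched" test gives a small p-value and the "deenriched" test a large one.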

## Bug fixes

* `normalise()` now correctly works with grouped data. Previously it would only correctly work with ungrouped data frames. Now you can group the data to calculate group specific normalisations. If you want to compute a global normalisation for the dataset, you need to ungroup the data before using the function as usual. This fixes issue #209.
* `qc_sequence_coverage()` now correctly displays medians in faceted plot. This fixes issue #202 and #213.
* `fit_drc_4p()` and `parallel_fit_drc_4p()` now correctly calculate the ANOVA p-value. Previously the number of observations for each concentration was not provided correctly.
* `fetch_uniprot()` now correctly retrieves information if an input ID was also part of a non-conform input ID combination. When e.g. `c("P02545", "P02545;P20700")` was provided, previously the `"P02545"` accession was dropped from the `input_id` column even though it is also present on its own and not only in combination with `"P20700"`. The new output now contains 3 rows, one for each ID, with `"P02545"` having one row with the `input_id` `"P02545"` and one with the `input_id` `"P02545;P20700"`. This also means that the `input_id` column now always contains the provided input IDs and not only if they were non-conform input ID combinations.
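
The row layout described for `fetch_uniprot()` can be illustrated without contacting UniProt. This sketch only rebuilds the expected pairing of accessions and `input_id` values in base R. The variable names `input_ids` and `id_map` are hypothetical, and the code is not the package's implementation.

```r
# Input as in the example above: one plain ID and one combined, non-conform ID.
input_ids <- c("P02545", "P02545;P20700")

# Split each input on ";" to get the individual accessions it contains.
parts <- strsplit(input_ids, ";", fixed = TRUE)

# Pair every accession with the input ID it came from.
id_map <- data.frame(
  accession = unlist(parts),
  input_id = rep(input_ids, lengths(parts)),
  stringsAsFactors = FALSE
)

print(id_map)
```

The result has three rows: `"P02545"` appears twice, once for each input ID that contains it, and `"P20700"` appears once.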

## Additional Changes

* For `fit_drc_4p()` and `parallel_fit_drc_4p()` the arguments `replicate_completeness` and `condition_completeness` are now deprecated. Please use `n_replicate_completeness` and `n_condition_completeness` instead.
* Improved label positions of `qc_charge_states()`, `qc_peptide_type()` and `qc_missed_cleavages()`. Also made appearance more uniform between methods `"count"` and `"intensity"`.
* `fetch_uniprot()` now returns nothing instead of a partial output if some of the requested batches could not be retrieved due to database issues (e.g. timeout because of too many requests). This addresses issue #203, which requests this change, because the warning message regarding the partial output can be easily missed and users might wrongfully assume that all information was retrieved successfully from UniProt.
* `find_peptide()` now preserves the groups of the original data. This does not affect any of the calculations.
* `calculate_sequence_coverage()` now works on grouped data.

# protti 0.7.0

## New features
@@ -131,7 +76,6 @@
* The default batchsize of `fetch_pdb()` was changed to 100 (from 200). This was done since more information is retrieved now, which slows the function down; performance is slightly improved when batch sizes are smaller.
* `try_query()` now only retries to retrieve information once if the returned message was "Timeout was reached". In addition, a `timeout` and `accept` argument have been added.
* The UniProt database has changed its API, therefore column names have changed as well as the format of data. We adjusted the `fetch_uniprot()` and `fetch_uniprot_proteome()` functions accordingly. Please be aware that some column names might have changed and your code might throw error messages if you did not adjust it accordingly.
* Some typo fixes. Thank you Steffi!

# protti 0.3.1

49 changes: 17 additions & 32 deletions R/barcode_plot.R
@@ -3,25 +3,19 @@
#' Plots a "barcode plot" - a vertical line for each identified peptide. Peptides can be colored based on an additional variable. Also differential
#' abundance can be displayed.
#'
#' @param data a data frame containing differential abundance, start and end peptide or precursor positions and protein length.
#' @param start_position a numeric column in the data frame containing the start positions for each peptide or precursor.
#' @param end_position a numeric column in the data frame containing the end positions for each peptide or precursor.
#' @param protein_length a numeric column in the data frame containing the length of the protein.
#' @param coverage optional, numeric column in the data frame containing coverage in percent. Will appear in the title of the barcode if provided.
#' @param colouring optional, column in the data frame containing information by which peptide or precursors should
#' @param data Data frame containing differential abundance, start and end peptide or precursor positions and protein length.
#' @param start_position Column in the data frame containing the start positions for each peptide or precursor.
#' @param end_position Column in the data frame containing the end positions for each peptide or precursor.
#' @param protein_length Column in the data frame containing the length of the protein.
#' @param coverage Optional, column in the data frame containing coverage in percent. Will appear in the title of the barcode if provided.
#' @param colouring Optional argument, column in the data frame containing information by which peptide or precursors should
#' be colored.
#' @param fill_colour_gradient a vector that contains colours that should be used to create a colour gradient
#' for the barcode plot bars if the `colouring` argument is continuous. Default is `mako_colours`.
#' @param fill_colour_discrete a vector that contains colours that should be used to fill the barcode plot bars
#' if the `colouring` argument is discrete. Default is `protti_colours`.
#' @param protein_id optional, column in the data frame containing protein identifiers. Required if only one protein
#' @param protein_id Optional argument, column in the data frame containing protein identifiers. Required if only one protein
#' should be plotted and the data frame contains only information for this protein.
#' @param facet optional, column in the data frame containing information by which data should be faceted. This can be
#' @param facet Optional argument, column in the data frame containing information by which data should be faceted. This can be
#' protein identifiers. Only 20 proteins are plotted at a time, the rest is ignored. If more should be plotted, a mapper over a
#' subsetted data frame should be created.
#' @param facet_n_col a numeric value that specifies the number of columns the faceted plot should have
#' if a column name is provided to group. The default is 4.
#' @param cutoffs optional argument specifying the log2 fold change and significance cutoffs used for highlighting peptides.
#' @param cutoffs Optional argument specifying the log2 fold change and significance cutoffs used for highlighting peptides.
#' If this argument is provided colouring information will be overwritten with peptides that fulfill this condition.
#' The cutoff should be provided in a vector of the form c(diff = 2, pval = 0.05). The name of the cutoff should reflect the
#' column name that contains this information (log2 fold changes, p-values or adjusted p-values).
@@ -59,11 +53,8 @@ barcode_plot <- function(data,
protein_length,
coverage = NULL,
colouring = NULL,
fill_colour_gradient = protti::mako_colours,
fill_colour_discrete = c("#999999", protti::protti_colours),
protein_id = NULL,
facet = NULL,
facet_n_col = 4,
cutoffs = NULL) {
# Check if there is more than one protein even though protein_id was specified.
if (!missing(protein_id)) {
@@ -92,7 +83,7 @@ barcode_plot <- function(data,
fc <- cutoffs[1]
sig <- cutoffs[2]

colouring <- sym("Change")
colouring <- sym("change")

data <- data %>%
dplyr::mutate({{ colouring }} := ifelse(((!!ensym(fc_name) >= fc | !!ensym(fc_name) <= -fc) & !!ensym(sig_name) <= sig), "Changed", "Unchanged")) %>%
@@ -102,13 +93,12 @@ barcode_plot <- function(data,
# Add coverage to protein ID name if present.
if (!missing(coverage) & !missing(facet)) {
data <- data %>%
dplyr::mutate({{ facet }} := paste0({{ facet }}, " (", round({{ coverage }}, digits = 1), "%)"))
mutate({{ facet }} := paste0({{ facet }}, " (", round({{ coverage }}, digits = 1), "%)"))
}
if (!missing(coverage) & !missing(protein_id)) {
data <- data %>%
dplyr::mutate({{ protein_id }} := paste0({{ protein_id }}, " (", round({{ coverage }}, digits = 1), "%)"))
mutate({{ protein_id }} := paste0({{ protein_id }}, " (", round({{ coverage }}, digits = 1), "%)"))
}

# Create plot
data %>%
ggplot2::ggplot() +
@@ -122,22 +112,17 @@ barcode_plot <- function(data,
),
size = 0.7
) +
{
if (is.numeric(dplyr::pull(data, {{ colouring }}))) {
ggplot2::scale_fill_gradientn(colours = fill_colour_gradient)
} else {
ggplot2::scale_fill_manual(values = c(
fill_colour_discrete
))
}
} +
ggplot2::scale_fill_manual(values = c(
"#999999", "#5680C1", "#B96DAD", "#64CACA", "#81ABE9", "#F6B8D1", "#99F1E4", "#9AD1FF", "#548BDF", "#A55098", "#3EB6B6",
"#87AEE8", "#CA91C1", "#A4E0E0", "#1D4F9A", "#D7ACD2", "#49C1C1"
)) +
ggplot2::scale_x_continuous(limits = c(0, 100), expand = c(0, 0)) +
ggplot2::scale_y_continuous(limits = NULL, expand = c(0, 0)) +
ggplot2::labs(x = "Protein Sequence", title = {
if (!missing(protein_id)) unique(dplyr::pull(data, {{ protein_id }}))
}) +
{
if (!missing(facet)) ggplot2::facet_wrap(rlang::new_formula(NULL, rlang::enquo(facet)), ncol = facet_n_col)
if (!missing(facet)) ggplot2::facet_wrap(rlang::new_formula(NULL, rlang::enquo(facet)))
} +
ggplot2::theme(
plot.title = ggplot2::element_text(size = 20),