-
Notifications
You must be signed in to change notification settings - Fork 9
/
template_QC_report.Rmd
1278 lines (1078 loc) · 58.3 KB
/
template_QC_report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: 'RAVED: Quality Control for Gene Expression Microarray Data -- GSE4917'
author: 'Mengyuan Kan (mengykan@upenn.edu)'
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: TRUE
depth: 3
editor_options:
chunk_output_type: console
---
***
This report shows the QC steps for gene expression microarry data from GEO study, including:
* GEO data download
* Phenotype file preparation
* Quality control metrics measurement and outlier detection
Mannually change the variables for GEO ID (geo_id), data directory (datadir) and result directory (resdir).
```{r var, eval=T, echo=T}
# GEO id
geo_id="GSE4917"
# directory stores GEO data
datadir="data"
# directory stores generated files
resdir="results"
```
**Note that** three variables, platform (platform), geo_GPL (GPL id for analysis if the samples in the study were scanned on multiple platforms) and normdata (whether the expression matrix is normalized), need to be **manually** re-defined in the following steps after look into the datasets. A shortname_func function is suggested to be updated.
```{r var2, eval=T, echo=T}
platform="Affymetrix"
geo_GPL=""
normdata=FALSE
# The shortname_func function shortens the sample name shown in the plots. To start, define shortname_func <- function(x){x}
shortname_func <- function(x){gsub("^(.*)\\.(cel|CEL).gz","\\1",x)}
```
Install the prerequisite R packages if they do not exist
* GEOquery
* oligo
* affy (Affymetrix microarray-specific QC analysis)
* viridis (heatmap color)
* ggplot2
* gplots (heatmap2 plot)
* Hmisc (compute hoeffd (Hoeffding's D statistics) for MA metrics)
* devtools (compute pca)
* dplyr
* pander
```{r check_version, eval=T, echo=F}
rversion <- as.numeric(paste0(R.Version()$major, ".", gsub("(\\d)\\..*", "\\1", R.Version()$minor)))
```
```{r pkginstall_func, eval=T, echo=F}
pkginstall_func <- function(pkgs, rversion, Bioconductor=FALSE) {
if (Bioconductor) {
if (rversion<3.6) {
source("http://bioconductor.org/biocLite.R")
sapply(biocLite, pkgs)
} else {
sapply(pkgs, BiocManager::install)
}
} else {sapply(pkgs, install.packages)}
}
```
```{r pkg, eval=T, echo=F, message=F, warning=F, results="hide"}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
all.pkgs <- installed.packages()[,"Package"]
bioc.pkgs <- c("GEOquery", "ArrayExpress", "preprocessCore", "oligo", "affy", "viridis", "preprocessCore") # Bioconductor packages
rcran.pkgs <- c("knitr", "DT", "ggplot2", "gplots", "Hmisc", "devtools", "dplyr", "pander")
if (rversion<3.6) {bioc.pkgs <- c(bioc.pkgs, "viridis")} else {rcran.pkgs <- c(rcran.pkgs, "viridis")}
miss.bioc.pkgs <- bioc.pkgs[!which(bioc.pkgs%in%all.pkgs)]
miss.rcran.pkgs <- rcran.pkgs[!which(rcran.pkgs%in%all.pkgs)]
if (length(miss.bioc.pkgs)>0) {pkginstall_func(pkgs=miss.bioc.pkgs, rversion = rversion, Bioconductor=T)}
if (length(miss.rcran.pkgs)>0) {pkginstall_func(pkgs=miss.rcran.pkgs, rversion = rversion, Bioconductor=F)}
```
Load the necessary libraries. Load affy and dplyr packages later since they will mask other functions.
```{r lib, eval=T, echo=F, message=F, warning=F}
if (grepl("^GSE", geo_id)) {
library(GEOquery); geo=TRUE; arrayexpr=FALSE} else if (grepl("^E-", geo_id)) {
library(ArrayExpress); geo=FALSE; arrayexpr=TRUE
}
library(knitr)
library(oligo)
library(viridis) # heatmap colour
library(ggplot2)
library(gplots) # heatmap.2 plot
library(Hmisc) # compute hoeffd (Hoeffding's D statistics) for MA plot
library(devtools) # compuate PCs
library(preprocessCore) # quantile normalization
library(pander)
```
## GEO/ArrayExpress Data Download and Phenotype Preparation
```{r, eval=geo, echo=F}
knitr::asis_output("### GEO Dataset Download")
```
```{r, eval=geo, echo=F}
knitr::asis_output("Download GEO series matrix files if available.")
```
```{r geo_download, eval=geo, echo=F, message=F, warning=F}
# For GEO data
if (geo) {
# check if GEO matrix file exists
geo_fn <- list.files(path=datadir)[grepl(geo_id,list.files(path=datadir))&grepl("matrix.txt.gz$",list.files(path=datadir))] # check if GEO matrix file exists
if (length(geo_fn)==0) { # GEO matrix file is not downloaded
gselms <- getGEO(geo_id, destdir=datadir, GSEMatrix = TRUE) # dowanload matrix file
if (length(gselms)>1) { # multiple platform
gpls=sapply(gselms,annotation)
cat("This study was performed in multiple platforms:\n")
cat(unname(gpls),"\n")
cat("Samples from same platform shoud be analyzed together. Assign a platform to the variable geo_GPL in the session coding.\n")
if (geo_GPL=="") {stop("This study has multiple platforms. Please assign the platform to the variable geo_GPL in the session coding.")} else {cat("Use platform", geo_GPL, "\n"); idx=which(grepl(geo_GPL,gpls))}
} else {idx=1}
gse <- gselms[[idx]]
} else if (length(geo_fn)==1) { # GEO matrix file is alreadly downloaded and only has one platform
gse <- getGEO(filename=paste0(datadir,"/",geo_fn),GSEMatrix = TRUE)
} else { # GEO matrix file is alreadly downloaded and has multiple platforms
cat("This study was performed in multiple platforms:\n")
cat(geo_fn)
cat("Samples from same platform shoud be analyzed together. Assign a platform to the variable geo_GPL in the session coding.\n")
if (geo_GPL=="") {stop("This study has multiple platforms. Please assign the platform to the variable geo_GPL in the session coding.")} else {cat("Use platform", geo_GPL, "\n"); geo_fn <- geo_fn[grep(geo_GPL,geo_fn)];gse <- getGEO(filename=paste0(datadir,"/",geo_fn),GSEMatrix = TRUE)}
}
}
```
```{r, eval=arrayexpr, echo=F}
knitr::asis_output("### ArrayExpress Dataset Download")
```
```{r, eval=arrayexpr, echo=F}
knitr::asis_output("Download ArrayExpress dataset if available.")
```
```{r arrayexpr_download, eval=arrayexpr, echo=F, message=F, warning=F, message="hide"}
# For ArrayExpress data
if (arrayexpr) {
if (!file.exists(paste0(datadir,"/",geo_id,"/data"))) {dir.create(paste0(datadir,"/",geo_id,"/data"), recursive = T)}
gse <- ArrayExpress(geo_id, path=paste0(datadir,"/",geo_id,"/data"), save=T)
}
```
Show expression dataset features
```{r gse, eval=T, echo=F}
gse
```
**Manual inspection:** Re-define the platform variable "platform" (i.e. "Affymetrix", "Agilent", "Illumina").
### Modify raw phenotype information
Obtain raw phenotype information from the GEO dataset and generated a summary of all the phenotypic variables for overview.
For continuous variables, show the summary table. For categorical variables, only show the first five levels of variables.
Generate a variable, suppldata (whether supplementary data are available), based on whether the column supplementary_file is none.
```{r phenosumm_raw, eval=T, echo=F, results="asis"}
pheno.raw <- pData(phenoData(gse))
for (x in names(pheno.raw)) {
vec=pheno.raw[,x]
if (!is.numeric(vec)) {
# create an empty data frame to save any tables with non-native characters
tb_nonnative=data.frame()
vec <- factor(vec)
if (nlevels(vec)>5) {res=table(droplevels(vec[vec%in%levels(vec)[1:5]]))} else {res=table(vec)}
res=data.frame(res)
names(res) <- c(x,"counts")
}
if (is.numeric(vec)){res=summary(vec)}
# Try if any existing non-native characters cause pander error
if (class(try(pandoc.table(res, justify='left',split.tables=Inf, caption=x), silent = T))=="try-error") {tb_nonnative=res}
if (nrow(tb_nonnative)>1){print(res)}
}
```
Automatically define the variable "suppldata" (i.e. TRUE or FALSE) showing whether supplementary data is available. Note that samples with supplementary file column equals "None" are excluded from analysis.
```{r suppldata_define_geo, eval=T, echo=F}
# For GEO data
if (geo) {
if (all(c("supplementary_file","supplementary_file.1")%in%names(pheno.raw))) {
if (all(grepl("cel|CEL",pheno.raw$supplementary_file.1))) {pheno.raw$supplementary_file=pheno.raw$supplementary_file.1}
}
# check if there is missing supplementary_file
if (all(pheno.raw$supplementary_file=="NONE")) {
suppldata=FALSE
} else if (any(pheno.raw$supplementary_file=="NONE")) {
sample_nosuppl=as.character(pheno.raw$geo_accession[pheno.raw$supplementary_file=="NONE"])
pheno.raw=pheno.raw[which(pheno.raw$supplementary_file!="NONE"),]
cat("Not all samples have defined supplementary file path:\n")
cat(paste(sample_nosuppl,collapse=", "),"\n")
cat("These samples are excluded from analysis\n")
suppldata=TRUE
} else {suppldata=TRUE}
}
```
```{r suppldata_define_arrayexpr, eval=arrayexpr, echo=F}
# For ArrayExpress data
if (arrayexpr) {
# check if there is missing supplementary_file
if (all(table(pheno.raw$Array.Data.File)==1)) {
suppldata=TRUE
} else {suppldata=FALSE}
# create a pseudo geo_accession for ArrayExpress
pheno.raw$geo_accession=pheno.raw$Array.Data.File
}
```
Defined suppldata variable
```{r suppldata_show, eval=T, echo=F}
cat("suppldata =",as.character(suppldata))
```
#### Phenotype information modification
**This step requires mannual inspection.**
These raw phenotypic variables are not informative (e.g. description, characteristics_ch1 and source_name_ch1) and not created in a consice way. Select useful phenotype variables and manually modify them using a standard format (e.g. GEO_ID, Donor, Disease, Treatment, Age, Gender).
These columns are required: GEO_ID, Donor, Disease, Treatment. Column naming is rigid for these columns, because pipeline scripts will recognize these name strings, but the column order can be changed.
```{r pheno_new, eval=T, echo=T, message=F, warning=F, results="hide"}
library(dplyr)
cols <- c("title","geo_accession","source_name_ch1")
pheno <- pheno.raw %>%
dplyr::select(cols) %>%
dplyr::mutate(GEO_ID=geo_accession) %>%
dplyr::mutate(Donor=gsub("^.*biological (.*)$","\\1",title)) %>%
dplyr::mutate(Sample=paste(geo_accession,Donor,sep="_")) %>%
dplyr::mutate(Tissue="MCF10A-Myc") %>%
dplyr::mutate(treatment_time=gsub("^.*for (\\d+.*)$","\\1",source_name_ch1)) %>%
dplyr::mutate(treatment_time=gsub(" ","",treatment_time)) %>%
dplyr::mutate(treatment_drug=ifelse(grepl("dexamethasone",source_name_ch1),"dex","control")) %>%
dplyr::mutate(Treatment=paste(treatment_drug,treatment_time,sep="_")) %>%
dplyr::mutate(Disease="nonasthma") %>%
dplyr::mutate(Age="35") %>%
dplyr::mutate_if(is.character,as.factor) %>%
dplyr::select(-one_of(cols)) # remove original columns
detach("package:dplyr")
```
### Raw intensity data download
Download supplementary raw data files if available. Analysis using raw intensity data is only for data from Affymetrix platform. For other platforms, the expression matrix is derive from ExpressionSet object (gse object from GEOquery), but the batch (scan date) information is obtained from the supplementary files and used for differential expression analysis.
```{r suppl_download, eval=T, echo=F}
# The suppdownload_func function downloads the supplimentary raw data files from GEO and extract the zip file
suppldownload_func <- function() {
getGEOSuppFiles(geo_id,baseDir=datadir) #download GEO files
untar(paste0(datadir,"/",geo_id,"/",geo_id,"_RAW.tar"), exdir=paste0(datadir,"/",geo_id,"/data")) # extract the zip file
}
# If supplementary data is available, download supplimentary raw data files
if (suppldata) {
# The sampall_func function obtains the supplementary filenames of all samples of interest
sampall_func <- function() {
if (geo) {basename(as.character(pheno.raw$supplementary_file))}
else if (arrayexpr) {{basename(as.character(pheno.raw$Array.Data.File))}}
}
# The existall_func function check whether all supplementary files in GEO phenotype exist in the data directory
existall_func <- function() {
raw_fn=list.files(path=paste0(datadir,"/",geo_id,"/data"))
# check if all supplementary_file name from GEO phenotype are within the downloaded folder
return(all(sapply(sampall_func(),function(x)x%in%raw_fn)))
}
# The rawall_func function obtains all files in the data directory with full path
rawall_func <- function() {
raw_fn=list.files(path=paste0(datadir,"/",geo_id,"/data"))
# obtain supplementary data with path
paste0(datadir,"/",geo_id,"/data/",raw_fn[which(raw_fn%in%sampall_func())]) # only select cel files from datadir
}
# Download raw data files (e.g. .cel) if available.
# Check whether the supplementary files already exist. Otherwise download from GEO
samp_exist=existall_func()
if (!samp_exist) {
suppldownload_func()
}
samp_exist=existall_func() # updated the existing samples
if (!samp_exist) {stop("The .cel files obtained from GEO do not include all the samples of interest")}
}
```
For data from Affymetrix platform, the raw.data object is generated from importing supplementary raw data files (usually .cel files) using R oligo package.
```{r primeview_pkg, eval=T, echo=F, message=F, warning=F, results="hide"}
# check if the platform is Affymetrix primeview array.
GPL_ID=annotation(gse)
if (GPL_ID=="GPL15207") { # install pd.primeview.hs.entrezg if it does not exist
if (!"pd.primeview.hs.entrezg"%in%installed.packages()[,1]) {install.packages("http://mbni.org/customcdf/22.0.0/entrezg.download/pd.primeview.hs.entrezg_22.0.0.tar.gz",repos = NULL)}
cat("This platform is GPL15207 Affymetrix PrimeView.\n")
cat("Since it lacks corresponding annotation dataset, for differential expression (DE) analysis, use GEO expression matrix instead of raw intensity data.\n")
cat("set 'usesuppl=FALSE' and inspect whether gene expression matrix is normalized for 'normdata' setting\n")
cat("Still use raw intensity data for QC\n")
}
```
```{r rawdata_affy, eval=T, echo=F, message=F, warning=F, results="hide"}
if (platform=="Affymetrix"&suppldata) {
# Read in the raw data and generate an object "raw.data" under the ExpressionFeatureSet (oligo class).
raw.files=rawall_func()
if (GPL_ID=="GPL15207") {raw.data <- read.celfiles(raw.files,pkgname="pd.primeview.hs.entrezg")}
else {raw.data <- read.celfiles(raw.files)}
}
```
For data from other platforms or without raw data files (supplementary data), the raw.data object is derived from GEO expression matrix.
```{r rawdata_agil, eval=T, echo=F}
if (!(platform=="Affymetrix"&suppldata)) {
raw.data=gse
}
```
Assign phenotype data from gse object to raw data object.
```{r pData_assign, eval=T, echo=F}
# assign phenotype data to raw expression data
if (suppldata) {pheno <- pheno[order(pheno$GEO_ID),]} # if using supplementary data, order pheno object by GEO_ID, because the files are loaded by this order
pData(raw.data) <- pheno
row.names(pData(raw.data)) <- sampleNames(protocolData(raw.data))
# Check if the sample names derived from expression data match those in phenotype file
rowname <- gsub("^(.*).(cel|CEL|txt|Txt).gz","\\1",row.names(pData(raw.data)))
matching <- mapply(grepl, x=rowname, pattern=as.character(pData(raw.data)$GEO_ID))
if (FALSE %in% matching) {stop("The sample names derived from expression data do not match those in phenotype file. Please check!")}
```
Retrieve scan date information from raw.data object for batch effect adjustment. For Affymetrix data, scan date information is imported by oligo with raw data files. For Agilent platform, scan date information is derived from the raw data files (usually in the 3rd line of a .txt file) for batch effect adjustment.
```{r scandate, eval=T, echo=F, message=F, warning=F, results="hide"}
if (platform=="Affymetrix"&suppldata) {
# As the scan date and scan time are usually joined by "T" or a white space, use both pattern to split the date with time
pheno$ScanDate_Group <- sapply(strsplit(as.character(protocolData(raw.data)$dates), "T| "), function(x) {x[[1]]})
} else if (platform=="Agilent"&suppldata) {
dates <- sapply(rawall_func(),function(x){
lines=read.table(x,nrows=3,header=F,sep="\t")
date=lines[3,which(lines[2,]=="Scan_Date")] # grep Scan_Date index in 2nd line, and obtain scan date from 3rd line
})
pheno$ScanDate_Group <- sapply(strsplit(as.character(dates), "T| "), function(x) {x[[1]]}) # split "06-16-2011 11:08:06" and use the first part
}
pheno$ScanDate_Group <- as.factor(pheno$ScanDate_Group)
pData(raw.data) <- pheno # update pData
row.names(pData(raw.data)) <- sampleNames(protocolData(raw.data))
```
Show the summary of phenotype variables and the sample size for different groups
```{r pheno_check_raw, eval=T, warning=F,results="asis",echo=F}
# show the first five rows
pandoc.table(head(pData(raw.data),5), split.tables=Inf,caption="Show the first 5 rows of the modified phenotype file")
# show the groups of interest
avail_group=c("Tissue","Disease","Treatment")[c("Tissue","Disease","Treatment")%in%names(pheno)]
res=as.data.frame(table(pheno[,avail_group]))
names(res) <- c(avail_group,"Count")
pandoc.table(res[which(res$Count>0),], split.tables=Inf, caption="Sample size in different tissue and disease/treatment groups")
# show samples in different batch
if ("ScanDate_Group"%in%names(pheno)) {
res=as.data.frame(table(droplevels(pheno[,"ScanDate_Group"])))
names(res) <- c("ScanDate_Group","Count")
pandoc.table(res, split.tables=Inf, caption="Sample size in different batch")
} else {cat("No scan date information.")}
```
Assign colors to scan date or disease/treatment if scan date is not available.
```{r usrdefine_utility, eval=T, echo=F}
# assign colours to Scan Date for plots
colours=c("#1B9E77", "#D95F02", "#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D", "#666666", "#8DD3C7", "#FFFFB3", "#BEBADA", "#FB8072", "#80B1D3", "#FDB462", "#B3DE69", "#FCCDE5", "#D9D9D9", "#BC80BD", "#CCEBC5", "#FFED6F") # first 8 colour names derived from Dark2, and last 12 names from Set3
vars=c("ScanDate_Group","Treatment","Disease")
leveluse=sapply(vars,function(x){if(x%in%names(pheno)){nlevel=nlevels(pheno[,x])}else{nlevel=0};nlevel>1})
if (any(leveluse)){varuse=names(which(leveluse)[1])}else{stop("None of the following variables scan date/Disease/Treatment has >1 level. Check the dataset!")} # varible assigned color in plot
# assign colour to corresponding variable (scan date if available otherwise disease/treatment)
i=nlevels(pheno[,varuse])
colour_list <- colours[1:i]
names(colour_list) <- levels(pheno[,varuse])
```
If gene expression matrix data is used, check if they are normalized/log-transformed. After **manual inspection**, assign a logistic variable "normdata" (whether needs log2 transformation/normalization or not for QC). If normdata is FALSE, we generate boxplots for log2-transformed and Quantile-normalization of log2-transformed data. Note that if the data are normalized, it is not likely to detect the outliers based on the intensity metrices.
If negative/zero intensity values are present, convert them to NAs.
```{r infinite_convert, eval=T, echo=F, fig.height=10, fig.width=12}
# check if any negative/zero intensity value in the expression data
if (any(apply(exprs(raw.data),2,function(x){min(x,na.rm=T)})<=0)) {
cat("Negative or zero intensity values are observed. Convert them to NA.\n")
}
exprs(raw.data)=apply(exprs(raw.data),2,function(x){replace(x,which(x<=0),NA)})
```
```{r check_norm, eval=T, echo=F, warning=F}
if (!suppldata|platform!="Affymetrix"|GPL_ID=="GPL15207") {
if (GPL_ID=="GPL15207") {
cat("Since platform GPL15207 Affymetrix PrimeView lacks corresponding annotation dataset, use the gse expression matrix instead.")
}
boxplot(exprs(gse),col=colour_list,main="Probe Intensity matrix of raw data",xaxt="n") # note gse here are non-coverted expression values
legend("topright",legend=names(colour_list),fill=colour_list,cex=0.8)
if (!normdata) {
boxplot(log2(exprs(raw.data)),col=colour_list,main="Probe Intensity matrix of log2 data",xaxt="n")
legend("topright",legend=names(colour_list),fill=colour_list,cex=0.8)
boxplot(normalize.quantiles(log2(exprs(raw.data))),col=colour_list,main="Probe Intensity matrix of qnormed log2 data",xaxt="n")
legend("topright",legend=names(colour_list),fill=colour_list,cex=0.8)
}
}
```
**Manual inspection:** Re-define the variable "normdata" (i.e. TRUE or FALSE) showing whether the expression data is normalized or not.
```{r pheno_withoutQC, eval=T, warning=F, echo=F}
if (geo_GPL=="") {pheno_fn_withoutQC=paste0(resdir,"/",geo_id,"_Phenotype_withoutQC.txt"); pheno_fn_withQC=paste0(resdir,"/",geo_id,"_Phenotype_withQC.txt")} else
{pheno_fn_withoutQC=paste0(resdir,"/",geo_id,"_",geo_GPL,"_Phenotype_withoutQC.txt");pheno_fn_withQC=paste0(resdir,"/",geo_id,"_",geo_GPL,"_Phenotype_withQC.txt")}
write.table(pheno,pheno_fn_withoutQC,col.names=T,row.names=F,sep="\t",quote=F)
```
## Quality Control for Microarray Data
The major QC steps and scoring methods for outliers were adapted from [arrayQualityMetrics](https://bioconductor.org/packages/release/bioc/html/arrayQualityMetrics.html). The threshold to determine an outlier used in arrayQualityMetrics is the boxplot's upper whisker, i.e. values beyond 1.5 times the interquartile range, which is also applied to our pipeline. The following QC metrics are included in a routine analysis. The QC metrics used for outlier detection are marked with an asterisk.
* Boxplots and density plots for raw probe intensities*
* RNA degradation plots
* Density plots for perfect match (PM) and mismatch (MM) probe
* MA plots*
* Spatial plots*
* Boxplots for the normalized unscaled standard error (NUSE)*
* Boxplots for the relative log expression (RLE)*
* Heatmap and dendrogram for distance between arrays*
* Principal component analysis (PCA) plots
All the above steps can be processed in data from Affymetrix gene expression array. For data from other platforms, metrics for "raw"" proble intensities, MA plots, heatmap for array distance and PCAs can be processed.
Use the prepared phenotype file.
```{r pheno_readin, eval=T, echo=F}
pheno <- read.table(pheno_fn_withoutQC, header=T, sep="\t")
```
### Raw Probe Intensity Boxplots and Density Histograms
The log2-transformed/normalized intensity distributions of all samples (arrays) are expected to have the similar scale (i.e. the similar positions and widths of the boxes). Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ka) between log-intensity distribution for one array and the pooled array data, where an array with a Ka beyond the upper whisker is designated as an outlier.
```{r raw_intensity_utility, eval=T, echo=F, warning=F}
# The subsamp function randomly selects 20000 probes
subsamp <- function(x,num=20000, seed=123) {
set.seed(seed)
subsample=num # if number of probes are >20000, randomly select 20000 probes for plot or compute
if (nrow(x)>subsample) {
ss = sample(nrow(x), subsample)
Mss = x[ss,,drop=FALSE]
} else {
ss = TRUE
Mss = x
}
Mss
}
# The outlier_KS_func function computes KS statistics for outlier detection
outlier_KS_func = function(exprs) { # matrix (row: probe intensities/RLE values etc., col: array (e.g. sample))
fx = ecdf(as.vector(exprs)) # get empirical cumulative distribution function of the data
KS=suppressWarnings(apply(exprs, 2, function(v)ks.test(v, y = fx, alternative="two.sided")$statistic))
stats = stats::fivenum(KS, na.rm = TRUE) # Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum)
iqr = diff(stats[c(2, 4)]) # lagged difference between the lower-hinge and upper-hinge
coef = 1.5
th = (stats[4] + coef * iqr)
list(threshold = th, stats=KS, outlier = which(KS > th))
}
# The boxplot_func function generates boxplots for raw data metrics (e.g. probe intensities, RLE, NUSE)
boxplot_func <- function(Mss,outlier,ylab) {
# use * to mark the outliers in boxplot
array_name <- shortname_func(colnames(Mss))
outlier <- shortname_func(outlier)
array_name[array_name%in%outlier] <- paste0("*",outlier)
# boxplot raw intensity by array
ylim = quantile(Mss, probs = c(0.01, 0.99), na.rm=TRUE) # create range of y-axsis
# create data frame for plot
df <- data.frame(
sample_id=rep(colnames(Mss),each=nrow(Mss)),
values=as.numeric(Mss),
scandate=rep(pData(raw.data)[,varuse],each=nrow(Mss)) # for color
)
cols <- colour_list
ggplot(df, aes(sample_id,values,fill=scandate)) + geom_boxplot(outlier.colour=NA) +
coord_flip() + theme_bw() +
ylim(ylim) +
scale_x_discrete(labels=array_name) +
ylab(ylab) +
scale_fill_manual(varuse,values=cols) +
theme(axis.title.y=element_blank())
}
# The densplot_func function plots density curve for raw data metrics (e.g. probe intensity)
densplot_func <- function(Mss) {
# create data frame for plot
df <- data.frame(
sample_id=rep(colnames(Mss),each=nrow(Mss)),
values=as.numeric(Mss),
scandate=rep(pData(raw.data)[,varuse],each=nrow(Mss)) # for color
)
cols <- colour_list
ggplot(df,aes(x=values,colour=scandate)) + geom_line(aes(group=sample_id),stat="density") +
theme_bw() +
xlab("Raw Probe Intensities") +
ylab("Density") +
scale_color_manual(varuse,values=cols)
}
# The raw_intensity_func function outputs raw probe intensity metrics
raw_intensity_func <- function() {
if (!normdata) {Mss=log2(subsamp(exprs(raw.data))) # use log2 transformed raw probe intensity
} else {
Mss=subsamp(exprs(raw.data))
}
outlier_res=outlier_KS_func(Mss)
outlier=names(outlier_res$outlier)
boxplot=boxplot_func(Mss=Mss, outlier=outlier, ylab="Raw Probe Intensities")
densplot=densplot_func(Mss=Mss)
return(list(outlier=outlier,boxplot=boxplot,densplot=densplot))
}
```
1. Outlier detection for log2 raw probe intensity/normalized intensity
Compute the Kolmogorov-Smirnov statistic Ka between each array's (i.e. sample) values (i.e. log2 transformed raw probe intensity values) and the
pooled, overall distribution of the values.
```{r raw_intensity_outlier, eval=T, echo=F, warning=F}
res_intensity=raw_intensity_func()
outlier_intensity = res_intensity$outlier
cat(length(outlier_intensity), "outlier(s) are detected in the raw intensity metrics.\n")
if (length(outlier_intensity)>0) {cat("They are: ", shortname_func(outlier_intensity))}
```
2. Boxplots for log2 raw probe intensity
```{r raw_intensity_boxplot, eval=T, echo=F, message=F, warning=F, fig.height=8, fig.width=6}
res_intensity$boxplot
```
3. Density curves for log2 raw probe intensity
The intensity curves of all samples (arrays) are expected to have the similar shapes and ranges. Samples with deviated curves are likely to have problematic experiments. For example, high levels of background will shift an array's distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.
```{r raw_intensity_densplot, eval=T, echo=F, message=F, warning=F}
res_intensity$densplot
```
### RNA Digestion
Overall RNA quality can be assessed by RNA degradation plots. In the gene expression array, each probe is represented by a probe set. Each probe set is 11-20 probes (pairs of oligos). This plot shows the average intensity of each probe across all probe sets, ordered from the 5' to the 3' end. It is expected that probe intensities are lower at the 5' end of a probe set when compared to the 3' end as RNA degradation starts from the 5' end of a molecule. RNA which is too degraded will shows a very high slope from 5' to 3'. Thus, the standardized slope of the RNA degradation plot serves as quantitative indicator of the RNA degradation.
This step requires R package affy that outputs the probes in each probe set matrix ordered from 5' to 3', while this function is not implemented in the oligo package.
```{r RNAdegaffy_utility, eval=T, echo=F}
# The RNAdegaffy_func function compute mean PM intensities in each probe position following 5' to 3' order. Adopted from the AffyRNAdeg function in the affy package but include a step that randomly selects 20,000 probe sets.
RNAdegaffy_func <- function(data){ # input a list of probe set matrix with rows as probe ids and columns as samples
{
names <- colnames(data[[1]])
probe.set.size <- function(x) {
size <- dim(x)[1]
return(size)
}
max.num <- sapply(data, probe.set.size) # get the number of probes in each probe set
tab <- (table(max.num)) # summarize the frequencies of probe numbers in probe sets
ord <- order(-as.numeric(tab)) # order the frequency from large to small
K <- as.numeric(names(tab))[ord[1]] # K is the number of probes appearing in most probe sets
data <- data[max.num == K] # select data of probe sets only have K number of probes
}
subsample=20000
if (length(data)>subsample) { # randomly select 10000 probe sets
set.seed(12345)
ss = sample(length(data),subsample)
data = data[ss,drop=FALSE]
}
N <- length(data) # number of probe sets
n <- dim(data[[1]])[2] # number of samples
# create two matrices: number of samples * number of probes representing a probe set
mns <- matrix(nrow = n, ncol = K) # create matrix for mean values
sds <- mns # create matrix for sds values
get.row <- function(x, i = 1) {return(x[i, ])} # function to get each row (i.e. probe id, i) from one probe set x (i.e. probe list[[x]])
rowstack <- function(x, i = 1) {return(t(sapply(x, get.row, i)))} # function to combine the rows obtained using get.row (pms across samples by probe sets) to get a table (row: samples column: probe sets) and transpose the table (row: probe sets, column: samples)
for (i in 1:K) { # get probe id (position) from 1 to K from each probe set
data.stack <- rowstack(data, i) # get the probe pm values in a specific probe position across all samples from each probe set (rows are samples and columns are probe sets)
if(dim(data[[1]])[2]==1) data.stack <- t(data.stack)
mns[, i] <- colMeans(data.stack) # get the mean values at one probe position across all probe sets
sds[, i] <- apply(data.stack, 2, sd) # get the sd values at one probe position across all probe sets
}
mns.orig <- mns # store the original mns data matrix
mn <- mns[, 1] # select values in the first probe position
mns <- sweep(mns, 1, mn) # adjust for the intensity at the first probe position
mns <- mns/(sds/sqrt(N)) # adjust for standard error
lm.stats <- function(x) {
index <- 0:(length(x) - 1)
ans <- summary(lm(x ~ index))$coefficients[2, c(1, 4)] # use linear model fit the relationship between intensity and probe position
return(ans)
}
stats <- apply(mns, 1, lm.stats)
answer <- list(N, names, mns.orig, sds/sqrt(N), stats[1,], stats[2, ])
names(answer) <- c("N", "sample.names", "means.by.number","ses", "slope", "pvalue")
return(answer)
}
# The RNAdeg_func function generates RNA degradation plots
# return a logical variable whether this array type can be read by affy.
RNAdeg_func <- function() {
# 1. Read in raw data as an AffyBatch object
library(affy) # for Affymetrix microarray-specific QC analysis
raw.data.affy <- read.affybatch(rawall_func(),compress=T)
# 2. Obtain a list of probe sets with a matrix of oligos (probes) by samples as an input and compute statistics of the mean PM intensities from 5' to 3' probe positions.
PM_list <- affy::pm(raw.data.affy,LIST=T)
PM_list <- lapply(PM_list,log2)
raw.data.rnadeg <- RNAdegaffy_func(PM_list) # Compute mean PM intensity for probes following 5' to 3' order.
# 3. Plot 5' to 3' mean PM intensity
status.cols <- unlist(lapply(pData(raw.data)[,varuse],function(x)colour_list[x])) # colour list to corresponding scan date list
plotAffyRNAdeg(raw.data.rnadeg,cols=status.cols)
legend("topleft",legend=names(colour_list),fill=colour_list,cex=0.6)
detach("package:affy", unload=TRUE) # detach the affy package
}
```
```{r affy_exclude, eval=T, echo=F}
# arrays not in affy package
affy_exclude = c('hta.2.0', 'pd.hugene.2.0.st')
```
```{r RNAdeg_plot, eval=T, echo=F, message=F, warning=F}
if (platform=="Affymetrix"&suppldata) {
if (any(sapply(affy_exclude, function(y){grepl(y, annotation(raw.data))}))) {cat("The affy package is not designed for this array type.\n")} else {RNAdeg_func()}
}
```
### Distribution of Perfect Match (PM) and Mismatch (MM)
There are two paired probe types: perfect match (PM) and of mismatched (MM) probes. A PM probe matches a strand of cDNA, while the corresponding MM probe differs from the PM by a change in the central nucleotide. A probe set is called present if the intensity value of PM is significantly larger than MM. However, the Affymetrix approach is under attack because between 15%-30% of the MM are greater than the PM. For some newer arrays, MM probes are not used. If the number of PMs is not equal to that of MMs, this might be a PM-only array.
If both PM and MM are present, the density curves of log2 PM and MM intensities are generated, where MM probes are expected to have smaller log2-intensity at the peak than PM probes due to their nonspecific hybridization.
```{r PMMM_utility, eval=T, echo=F}
PMMM_func <- function() {
if (class(raw.data)=="GeneFeatureSet") {
message("GeneFeatureSet does not have MM probes. No plots will be generated.")
} else {
PM <- log2(pm(raw.data))
MM <- log2(mm(raw.data))
if(nrow(PM) != nrow(MM)) {
message("This might be a PM-only array. No plots will be generated.")
rm(PM,MM)
} else {
subsample=20000
if(nrow(PM)>subsample) { # randomly select 20000 probe sets
sel = sample(nrow(PM), subsample)
sPM = PM[sel, ]
sMM = MM[sel, ]
} else {
sPM = PM
sMM = MM
}
rm(PM,MM)
df <- data.frame(
values=c(as.numeric(sPM),as.numeric(sMM)),
types=c(rep("PM",each=length(as.numeric(sPM))),rep("MM",each=length(as.numeric(sMM))))
)
cols=colours[1:2]
ggplot(df,aes(x=values,colour=types)) + geom_line(aes(group=types),stat="density") +
theme_bw() +
xlab("Intensity") +
ylab("Density") +
scale_color_manual(values=cols) +
theme(legend.title=element_blank())
}
}
}
```
```{r PMMM_plot, eval=T, echo=F}
if (platform=="Affymetrix"&suppldata&GPL_ID!="GPL15207") {PMMM_func()}
```
### MA Plots
MA plots allow pairwise comparison of the log-intensity of each array to a reference array and identification of intensity-dependent biases.
The y-axis of the MA-plot shows the log-ratio intensity of one array to the reference median array, which is called M (minus). M = log2(I1)-log2(I2) (I1: the intensity of the array studied; I2: the median intensity across arrays)
The x-axis indicates the average log-intensity of both arrays, which is called A (add). A = 1/2\*(log2(I1)+log2(I2))
It is expected that the probe levels do not differ systematically within a group of replicates, so that the MA-plot is centered on the y-axis (y=0 or M=0) from low to high intensities.
```{r MA_utility, eval=T, echo=F}
# The MAcal_func function computes M and A matrices, while use the intensity of 20000 randomly selected probes
MAcal_func <- function(x) { # matrix (row: probe intensities, col: array (samples)
medArray = rowMedians(x, na.rm=TRUE)
M = x - medArray
A = (x + medArray)/2
subsample=20000
if(nrow(M)>subsample) {
set.seed(12345)
sel = sample(nrow(M), subsample)
sM = M[sel, ]
sA = A[sel, ]
} else {
sM = M
sA = A
}
list(M=sM,A=sA) # return a list with M and A data matrices
}
# The outlier_MA_func function computes the Hoeffding's statistic (Da) statistics for outlier detection
outlier_MA_func <- function(exprs) { # list with M and A data matrices
M=exprs$M
A=exprs$A
Dstats = sapply(1:ncol(M), function(x){hoeffd(A[,x], M[,x])$D[1,2]})
names(Dstats) <- colnames(M)
Dthresh = 0.15
list(threshold=Dthresh, stats=Dstats, outlier = which(Dstats > Dthresh))
}
# The MAplot_func function plots samples with the first 4 highest and lowest values of Da. The value of Da for each sample is shown in the panel headings. Outliers marked with * have Da values >0.15.
MAplot_func <- function(sMA, outlier_res) {
# select arrays with top 4 highest and lowest Da
stats_order <- order(outlier_res$stats)
column_sel <- stats_order[c(1:4,(ncol(sMA$M)-3):ncol(sMA$M))]
stats_sel <- round(outlier_res$stats[column_sel],2)
scandate_sel <- pData(raw.data)[,varuse][column_sel]
M_sel <- sMA$M[,column_sel]
A_sel <- sMA$A[,column_sel]
# use * to mark the outliers
array_name <- shortname_func(colnames(M_sel))
outlier_MA=shortname_func(names(outlier_res$outlier))
array_name[array_name%in%outlier_MA] <- paste0("*",array_name[array_name%in%outlier_MA])
array_name <- paste0(array_name," (D=",stats_sel,")") # add D statistics to corresponding samples
# create data frame for plot
df <- data.frame(
sample_id=rep(array_name,each=nrow(M_sel)),
scandate=rep(scandate_sel,each=nrow(M_sel)),
M=as.numeric(M_sel),
A=as.numeric(A_sel)
)
# MA plots
ggplot(df,aes(x=A,y=M,color=scandate)) + geom_point(alpha=0.1) + theme_bw() +
scale_color_manual(varuse,values=colour_list) +
facet_wrap(~sample_id,ncol=2)
}
# The MA_func function outputs MA metrics
MA_func <- function(){
# Compute M-A metrics
if (!normdata) {
sMA <- MAcal_func(log2(exprs(raw.data)))
} else {
sMA <- MAcal_func(exprs(raw.data))
}
outlier_res <- outlier_MA_func(exprs=sMA)
outlier <- names(outlier_res$outlier)
plot <- MAplot_func(sMA=sMA, outlier_res=outlier_res)
return(list(outlier=outlier, plot=plot))
}
```
1. Outlier detection for MA plots
Outlier detection is applied by computing a Hoeffding's statistic (Da) on the joint distribution of A and M for each array, where an array with a Da >0.15 is designated as an outlier.
```{r MA_outlier, eval=T, echo=F, message=F, warning=F}
res_MA=MA_func()
outlier_MA=res_MA$outlier
cat(length(outlier_MA), "outliers are detected in the MA metrics.\n")
if (length(outlier_MA)>0) {cat("They are: ", shortname_func(outlier_MA))}
```
2. MA plots
MA plots of the samples with the 4 highest and lowest Hoeffding's statistics.
```{r MA_plot, eval=T, echo=F, message=F, warning=F, fig.height=10, fig.width=7}
res_MA$plot
```
### Spatial Distribution
Spatial plots show an artificial colored image of an array's spatial distribution of intensities that indicate spatial variation in an array. Log-intensities of probes are plotted by their corresponding spatial x and y-coordinate in the array and are expected to be uniformly distributed if the array data has good quality. The rank scale is applied for plotting as it has the potential to amplify patterns that are small in amplitude but systematic within an array.
The affy package is required to obtain the AffyBatch object that contains information of spatial x- and y-coordinate, while this function is not implemented in the oligo package.
```{r spatial_utility, eval=T, echo=F}
# The affyspatial_func function computes spatial x- and y-coordinate using raw data object classed as AffyBatch
affyspatial_func <- function() {
library(affy)
raw.data.affy <- read.affybatch(rawall_func(),compress=T)
maxc = ncol(raw.data.affy) # number of probes in x-coordinate
maxr = nrow(raw.data.affy) # number of probes in y-coordinate
sx = rep(seq_len(maxc), each = maxr) ## spatial x-coordinate
sy = rep(seq_len(maxr), maxc) ## spatial y-coordinate
M = log2(affy::exprs(raw.data.affy))
detach("package:affy", unload=TRUE)
numArrays = dim(M)[2]
return(list(M=M,numArrays=numArrays,sx=sx,sy=sy))
}
# The outlier_spatial_func function computes the Fourier coefficients for outlier detection
outlier_spatial_func <- function(affy_spatial_list) {
sx=affy_spatial_list$sx # spatial x-coordinate
sy=affy_spatial_list$sy # spatial y-coordinate
M=affy_spatial_list$M
numArrays=affy_spatial_list$numArrays
maxx = max(sx, na.rm=TRUE)
maxy = max(sy, na.rm=TRUE)
stat_spatial = numeric(numArrays)
for(a in seq_len(numArrays)) {
mat = matrix(NA_real_, nrow=maxy, ncol=maxx)
mat[cbind(sy, sx) ] = M[, a]
pg = fft(mat) ## periodogram, computes the discrete fourier transform
npg = Re(pg*Conj(pg))
npg[1,1] = 0 ## drop the constant component
stat_spatial[a] = sqrt(sum(npg[1:4, 1:4]) / sum(npg)) # low frequency power
}
names(stat_spatial)=colnames(M)
stats = stats::fivenum(stat_spatial, na.rm = TRUE) # Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum)
iqr = diff(stats[c(2, 4)]) # lagged difference between the lower-hinge and upper-hinge
coef = 1.5
th = (stats[4] + coef * iqr)
list(threshold = th, stats=stat_spatial, outlier = which(stat_spatial > th))
}
# The spatplot_func function plots samples with the first 4 highest and lowest values of Fa. The value of Fa for each sample is shown in the panel headings.
spatplot_func <- function(raw.data.spatial,outlier_res) {
# select arrays with top 4 highest and lowest Da
stats_order <- order(outlier_res$stats)
column_sel <- stats_order[c(1:4,(ncol(raw.data.spatial$M)-3):ncol(raw.data.spatial$M))]
stats_sel <- round(outlier_res$stats[column_sel],2)
M_sel <- raw.data.spatial$M[,column_sel]
# apply rank to expression data
M_sel = apply(M_sel, 2, rank)
# use * to mark the outliers
array_name=shortname_func(colnames(M_sel))
outlier_spatial=shortname_func(names(outlier_res$outlier))
array_name[array_name%in%outlier_spatial] <- paste0("*",array_name[array_name%in%outlier_spatial])
array_name <- paste0(array_name," (F=",stats_sel,")") # add F statistics to corresponding samples
# create variables for plot
df <- data.frame(
sample_id=rep(array_name,each=nrow(M_sel)),
M=as.numeric(M_sel),
row=rep(raw.data.spatial$sy,ncol(M_sel)),
column=rep(raw.data.spatial$sx,ncol(M_sel))
)
# spatial distribution plots
ggplot(df,aes(x=row,y=column,fill=M)) + geom_tile() +
theme_bw() +
xlab("Raw Probe Intensiry in X") + ylab("Raw Probe Intensiry in Y") +
scale_fill_gradientn(name="Ranked Intensity",colours=viridis(256,option="B")) +
facet_wrap(~sample_id,ncol=2)
}
# The spatial_func function outputs spatial metrics
spatial_func <- function() {
raw.data.spatial=affyspatial_func()
outlier_res=outlier_spatial_func(affy_spatial_list=raw.data.spatial)
outlier=names(outlier_res$outlier)
plot=spatplot_func(raw.data.spatial=raw.data.spatial,outlier_res=outlier_res)
return(list(outlier=outlier,plot=plot))
}
```
1. Outlier detection for spatial plots
Outlier detection is applied by computing a sum of the absolute values of low frequency Fourier coefficients (Fa) across all probe sets for each array, where an array with a Fa beyond the upper whisker is designated as an outlier.
```{r spatial_outlier, eval=T, echo=F, message=F, warning=F}
if (platform=="Affymetrix"&suppldata) {
if (any(sapply(affy_exclude, function(y){grepl(y, annotation(raw.data))}))) {cat("The affy package is not designed for this array type.\n")} else {
res_spatial=spatial_func()
outlier_spatial=res_spatial$outlier
cat(length(outlier_spatial), "outlier(s) are detected in the spatial metrics.\n")
if (length(outlier_spatial)>0) {cat("They are: ", shortname_func(outlier_spatial))}
}
}
```
2. Spatial distribution plots
Spatial distribution plots of samples with the 4 highest and lowest values of Fa. The value of Fa for each sample is shown in the panel headings. Outliers marked with * have Fa values of large scale spatial structures.
```{r spatial_plot, eval=T, echo=F, fig.height=10, fig.width=7}
if (platform=="Affymetrix"&suppldata) {
if (any(sapply(affy_exclude, function(y){grepl(y, annotation(raw.data))}))) {cat("The affy package is not designed for this array type.\n")} else {
res_spatial$plot
}
}
```
### Relative Log Expression (RLE) Distribution
The normalized unscaled standard error (NUSE) and relative log expression (RLE) boxplots indicate probe set homogeneity in one array, where the metrics are derived from a fitted probe level model by the fitProbeLevelModel function (oligo). The RLE plots represent the distribution of the ratio between the log-intensity of a probe set and the median log-intensity of the corresponding probe set across all arrays, expected to be centered near 0, as a log scale is applied. Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ra) between RLE distribution for one array and the pooled array data, where an array with a Ra beyond the upper whisker is designated as an outlier
```{r RLE_utility, eval=T, echo=F, message=F, warning=F}
# The fitPLM_func function generates RLE and NUSE matrices
fitPLM_func <- function(raw.data) {
exprs(raw.data) <- log2(exprs(raw.data)) # The fitProbeLevelModel function needs ExpressionFeatureSet object as an input. Assign log transformed expression values to a new object
fitPLM <- fitProbeLevelModel(raw.data)
# RLE
M_RLE <- RLE(fitPLM, type="values") # generate RLE matrix
Mss_RLE <- subsamp(M_RLE,seed=1234) # use the subsamp function to reduce RLE data with randomly selected 20000 probes
# NUSE
M_NUSE <- NUSE(fitPLM, type="values") # generate NUSE matrix
Mss_NUSE <- subsamp(M_NUSE, seed=1234) # use the subsamp function to reduce RLE data with randomly selected 20000 probes
return(list(Mss_RLE=Mss_RLE,Mss_NUSE=Mss_NUSE))
}
# The RLE_func function outputs RLE metrics
RLE_func <- function(Mss) {
outlier_res=outlier_KS_func(Mss) # compute KS statistics to detect outliers
outlier=names(outlier_res$outlier)
plot=boxplot_func(Mss,outlier,"RLE")
return(list(outlier=outlier,plot=plot))
}
```
1. Outlier detection for RLE
Compute the Kolmogorov-Smirnov statistic Ra between each array's (i.e. sample) values (i.e. relative log expression values) and the pooled, overall distribution of the values. Detect outliers that are deviated from the threshold.
```{r RLE_outlier, eval=T, echo=F, message=F, warning=F}
if (platform=="Affymetrix"&suppldata&GPL_ID!="GPL15207") {
fitPLM=fitPLM_func(raw.data)
res_RLE=RLE_func(Mss=fitPLM$Mss_RLE)
outlier_RLE=res_RLE$outlier
cat(length(outlier_RLE), "outlier(s) are detected in RLE metrics.\n")
if (length(outlier_RLE)>0) {cat("They are: ", shortname_func(outlier_RLE))}
}
```
2. Boxplot for RLE
Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.
```{r RLE_plot, eval=T, echo=F, message=F, warning=F, fig.height=8, fig.width=6}
if (platform=="Affymetrix"&suppldata&GPL_ID!="GPL15207") {res_RLE$plot}
```
### Normalized Unscaled Standard Error (NUSE) Outlier Detection and Plots
The NUSE plots show the distribution of normalized standard error estimates, expected to be centered near 1. Outlier detection is applied by computing an upper hinge (Na) across all probe sets for each array, where an array with a Na beyond the upper whisker is designated as an outlier.
```{r NUSE_utility, eval=T, echo=F, message=F, warning=F}
# The function outlier_upperquartile_func function computes upper 75% quantile for outlier detection
outlier_upperquartile_func <- function(exprs) { # matrix (row: NUSE values, col: array (e.g. sample))
upperquartile = apply(exprs, 2, quantile, na.rm=TRUE, probs=0.75)
stats = stats::fivenum(upperquartile, na.rm = TRUE) # Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum)
iqr = diff(stats[c(2, 4)]) # lagged difference between the lower-hinge and upper-hinge
coef = 1.5
th = (stats[4] + coef * iqr)
list(threshold = th, outlier = which(upperquartile > th))
}
# The NUSE_func function outputs the NUSE metrics
NUSE_func <- function(Mss) {
outlier_res = outlier_upperquartile_func(Mss)
outlier=names(outlier_res$outlier)
boxplot=boxplot_func(Mss,outlier,"NUSE")
return(list(outlier=outlier,boxplot=boxplot))
}
```
1. Outlier detection for NUSE
Compute 75% quantile Na of each array's NUSE values Detect outliers that have larger Na deviated from the threshold.
```{r NUSE_outlier, eval=T, echo=F, message=F, warning=F, results="asis"}
if (platform=="Affymetrix"&suppldata&GPL_ID!="GPL15207") {