Title: | Basic Functions to Investigate Metabolomics Data Matrices |
---|---|
Description: | A set of functions to investigate raw data from (metabol)omics experiments intended to be used on a raw data matrix, i.e. following peak picking and signal deconvolution. Functions can be used to normalize data, detect biomarkers and perform sample classification. A detailed description of best practice usage may be found in the publication <doi:10.1007/978-1-4939-7819-9_20>. |
Authors: | Jan Lisec [aut, cre] |
Maintainer: | Jan Lisec <[email protected]> |
License: | GPL-3 |
Version: | 1.4.5 |
Built: | 2024-10-27 05:22:21 UTC |
Source: | https://github.com/janlisec/metabolomicsbasics |
AdjustSymbols
will generate plotting character and color vectors based
on experimental factors.#'
AdjustSymbols(cols = NULL, pchs = NULL, colorset = NULL, symbolset = NULL)
AdjustSymbols(cols = NULL, pchs = NULL, colorset = NULL, symbolset = NULL)
cols |
Factor (color output) or numeric (grey-scale output) vector or NULL (omitted). |
pchs |
Factor vector or NULL (omitted). |
colorset |
Color definitions for the factor levels of 'cols' (can be omitted to use default values). |
symbolset |
Plotting character definitions for the factor levels of 'pchs' (can be omitted to use default values). |
Using a fixed color and symbol scheme indicating samples from different groups throughout all figures of a analysis workflow is a reasonable decision. This function allows to specify both and attach it to a sample table for further use.
Either a vector (if one parameter of 'cols' and 'pchs' remains NULL), a data frame with columns 'cols' and 'pchs' (if both are provided and of equal length) or a list of length 2 (if both are provided and of different length). Will be used by several plotting functions of the package internally.
# return color vector x <- gl(6, 3) y <- as.numeric(x) plot(y, bg = AdjustSymbols(cols = x), pch = 21, cex = 2) plot(y, bg = AdjustSymbols(cols = y), pch = 21, cex = 2) plot(y, bg = AdjustSymbols(cols = x, colorset = 1:6), pch = 21, cex = 2) plot(y, pch = AdjustSymbols(pchs = x), cex = 2) plot(y, bg = 2, pch = AdjustSymbols(pchs = x, symbolset = 1:6), cex = 2) # load data and plot using provided color scheme raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam head(sam) plot(y = raw[, 1], x = as.numeric(sam$GT), pch = sam$pchs, bg = sam$cols) # change colors to greyscale head(AdjustSymbols(cols = sam$GT, pchs = sam$Origin)) tmp.set <- grDevices::rainbow(length(levels(sam$GT))) head(AdjustSymbols(cols = sam$GT, pchs = sam$Batch, colorset = tmp.set)) plot(raw[, 1] ~ sam$GT, col = unique_labels(sam = sam, g = "GT")[, "cols"]) sam$cols <- AdjustSymbols(cols = as.numeric(sam$GT)) plot(raw[, 1] ~ sam$GT, col = unique_labels(sam = sam, g = "GT")[, "cols"]) #'
# return color vector x <- gl(6, 3) y <- as.numeric(x) plot(y, bg = AdjustSymbols(cols = x), pch = 21, cex = 2) plot(y, bg = AdjustSymbols(cols = y), pch = 21, cex = 2) plot(y, bg = AdjustSymbols(cols = x, colorset = 1:6), pch = 21, cex = 2) plot(y, pch = AdjustSymbols(pchs = x), cex = 2) plot(y, bg = 2, pch = AdjustSymbols(pchs = x, symbolset = 1:6), cex = 2) # load data and plot using provided color scheme raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam head(sam) plot(y = raw[, 1], x = as.numeric(sam$GT), pch = sam$pchs, bg = sam$cols) # change colors to greyscale head(AdjustSymbols(cols = sam$GT, pchs = sam$Origin)) tmp.set <- grDevices::rainbow(length(levels(sam$GT))) head(AdjustSymbols(cols = sam$GT, pchs = sam$Batch, colorset = tmp.set)) plot(raw[, 1] ~ sam$GT, col = unique_labels(sam = sam, g = "GT")[, "cols"]) sam$cols <- AdjustSymbols(cols = as.numeric(sam$GT)) plot(raw[, 1] ~ sam$GT, col = unique_labels(sam = sam, g = "GT")[, "cols"]) #'
CheckForOutliers
will evaluate a numeric vector and check
if outliers within groups based on group .
CheckForOutliers( x = NULL, group = NULL, n_sd = 3, method = c("idx", "logical", "dist") )
CheckForOutliers( x = NULL, group = NULL, n_sd = 3, method = c("idx", "logical", "dist") )
x |
Numeric vector. |
group |
Factor vector of length(x). |
n_sd |
Cutoff for outliers in E being mean(group)+-n_sd*sd(group) where group values are calculated without the outlier candidate. |
method |
Different variants of the result value. See details. |
The numeric will be split by groups and each value will be evaluated
with respect to its distance to the group mean (calculated out of the other
values in the group). Distance here means the number of standard deviations
the value is off the group mean. With different choices of method
the output can be switched from the calculated fold-distances to a boolean
of length(x) or and Index vector giving the outliers directly (see examples).
Depending on the selected method. See details.
set.seed(0) x <- runif(10) x[1] <- 2 group <- gl(2, 5) CheckForOutliers(x, group, method = "dist") CheckForOutliers(x, group, method = "logical") CheckForOutliers(x, group, method = "idx") graphics::par(mfrow = c(1, 2)) bg <- c(3, 2)[1 + CheckForOutliers(x, group, method = "logical")] graphics::plot(x = as.numeric(group), y = x, pch = 21, cex = 3, bg = bg, main = "n_sd=3", las = 1, xlim = c(0.5, 2.5)) bg <- c(3, 2)[1 + CheckForOutliers(x, group, n_sd = 4, method = "logical")] graphics::plot(x = as.numeric(group), y = x, pch = 21, cex = 3, bg = bg, main = "n_sd=4", las = 1, xlim = c(0.5, 2.5)) graphics::par(mfrow = c(1, 1)) # load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # no missing data in this matrix all(is.finite(raw)) # check for outliers (computing n-fold sd distance from group mean) tmp <- apply(raw, 2, CheckForOutliers, group = sam$GT, method = "dist") # plot a histogram of the observed distances graphics::hist(tmp, breaks = seq(0, ceiling(max(tmp))), main = "n*SD from mean", xlab = "n") # Calculate the amount of values exceeding five-sigma and compare with a standard gaussian table(tmp > 5) round(100 * sum(tmp > 5) / length(tmp), 2) gauss <- CheckForOutliers(x = rnorm(prod(dim(raw))), method = "dist") sapply(1:5, function(i) { data.frame("obs" = sum(tmp > i), "gauss" = sum(gauss > i)) }) # compare a PCA w/wo outliers RestrictedPCA( dat = raw, sam = sam, use.sam = sam$GT %in% c("Mo17", "B73"), group.col = "GT", fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch", medsd = TRUE ) raw_filt <- raw raw_filt[tmp > 3] <- NA RestrictedPCA( dat = raw_filt, sam = sam, use.sam = sam$GT %in% c("Mo17", "B73"), group.col = "GT", fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch", medsd = TRUE )
set.seed(0) x <- runif(10) x[1] <- 2 group <- gl(2, 5) CheckForOutliers(x, group, method = "dist") CheckForOutliers(x, group, method = "logical") CheckForOutliers(x, group, method = "idx") graphics::par(mfrow = c(1, 2)) bg <- c(3, 2)[1 + CheckForOutliers(x, group, method = "logical")] graphics::plot(x = as.numeric(group), y = x, pch = 21, cex = 3, bg = bg, main = "n_sd=3", las = 1, xlim = c(0.5, 2.5)) bg <- c(3, 2)[1 + CheckForOutliers(x, group, n_sd = 4, method = "logical")] graphics::plot(x = as.numeric(group), y = x, pch = 21, cex = 3, bg = bg, main = "n_sd=4", las = 1, xlim = c(0.5, 2.5)) graphics::par(mfrow = c(1, 1)) # load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # no missing data in this matrix all(is.finite(raw)) # check for outliers (computing n-fold sd distance from group mean) tmp <- apply(raw, 2, CheckForOutliers, group = sam$GT, method = "dist") # plot a histogram of the observed distances graphics::hist(tmp, breaks = seq(0, ceiling(max(tmp))), main = "n*SD from mean", xlab = "n") # Calculate the amount of values exceeding five-sigma and compare with a standard gaussian table(tmp > 5) round(100 * sum(tmp > 5) / length(tmp), 2) gauss <- CheckForOutliers(x = rnorm(prod(dim(raw))), method = "dist") sapply(1:5, function(i) { data.frame("obs" = sum(tmp > i), "gauss" = sum(gauss > i)) }) # compare a PCA w/wo outliers RestrictedPCA( dat = raw, sam = sam, use.sam = sam$GT %in% c("Mo17", "B73"), group.col = "GT", fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch", medsd = TRUE ) raw_filt <- raw raw_filt[tmp > 3] <- NA RestrictedPCA( dat = raw_filt, sam = sam, use.sam = sam$GT %in% c("Mo17", "B73"), group.col = "GT", fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch", medsd = TRUE )
ClassificationCV
will perform a classification using SVM's
and/or Decision Trees including cross validation on a data set according to
a provided grouping vector.
ClassificationCV( d = NULL, g = NULL, n = 1, k = 1, rand = F, method = c("svm", "C50", "rpart", "ropls"), method.control = list(), silent = FALSE )
ClassificationCV( d = NULL, g = NULL, n = 1, k = 1, rand = F, method = c("svm", "C50", "rpart", "ropls"), method.control = list(), silent = FALSE )
d |
Data matrix or data.frame with named rows (samples) and columns (traits). |
g |
Group-vector, factor. |
n |
Replicates of classifications. |
k |
Number of folds per replicate. |
rand |
Randomize Group-vector (and apply according n and k to this randomization). |
method |
Currently |
method.control |
A list of parameters, forwarded to the respective classification function. |
silent |
Logical. Set TRUE to suppress progress bar and warnings. |
This function allows to demonstrate the functionality of different
classification tools with respect to building classifiers for metabolomics data.
Check the examples in ClassificationWrapper
for automatic
multi-fold analysis.
A list of classification results which can be analyzed for accuracy, miss-classified samples and more.
ClassificationHistogram
will plot the results of ClassificationWrapper.
ClassificationHistogram(out_classific = NULL, breaks = seq(0, 1, 0.05), ...)
ClassificationHistogram(out_classific = NULL, breaks = seq(0, 1, 0.05), ...)
out_classific |
Output of ClassificationWrapper. |
breaks |
Breaks for histogram. |
... |
Passed on to |
No further details.
Returns NULL invisibly.
ClassificationWrapper
will do classification using SVM's and/or
Decision Trees including cross validation.
ClassificationWrapper( d = NULL, g = NULL, n = 100, n_rand = 1, k = 5, method = c("C50", "svm", "rpart", "ropls"), train = NULL, method.control = list(), silent = FALSE )
ClassificationWrapper( d = NULL, g = NULL, n = 100, n_rand = 1, k = 5, method = c("C50", "svm", "rpart", "ropls"), train = NULL, method.control = list(), silent = FALSE )
d |
data, matrix or data.frame !! needs row/col-names. |
g |
Group-vector, factor. |
n |
replicates of classifications, i.e. number of different split into folds. |
n_rand |
different number of randomizations, see Details. |
k |
Fold cross validation. |
method |
Currently |
train |
Either NULL (random permutations) or an index vector for a training subset out of |
method.control |
A list of parameters, forwarded to the selected methods function. |
silent |
Logical. Set TRUE to suppress progress bar and warnings. |
Parameter 'n_rand' will influence how permutation testing for robustness is conducted. If n_rand=1 than samples will be permuted exactly one time and subjected to n replications (with respect to fold splitting). If n_rand>1, samples will be permuted this many times but number of replications will be lowered to limit processing time. A good compromise is to balance both, using less replications than for observed data but on several randomizations.
#' Classification results as list.
raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam gr <- sam$Origin # establish a basic rpart model and render a fancy plot including the accuracy class_res <- ClassificationWrapper(d = raw, g = gr, method = c("rpart", "svm"), n = 3, k = 3) ClassificationHistogram(class_res)
raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam gr <- sam$Origin # establish a basic rpart model and render a fancy plot including the accuracy class_res <- ClassificationWrapper(d = raw, g = gr, method = c("rpart", "svm"), n = 3, k = 3) ClassificationHistogram(class_res)
find_boundaries
will determine peak boundaries within a BPC or mass trace.
find_boundaries( int = NULL, rt = NULL, p = which.max(int), k = 3, bl = min(int), local_min = int[p] )
find_boundaries( int = NULL, rt = NULL, p = which.max(int), k = 3, bl = min(int), local_min = int[p] )
int |
The measured intensity of the ion mass (obviously ordered according to consecutive RTs). |
rt |
The respective retention times (can be omitted as currently not used). |
p |
The anticipated peak position (as index of int) if several peaks are within the mass trace. |
k |
The smoothing window parameter (provided to runmed). |
bl |
The baseline value. Can be provided explicitly if automatic determination is insufficient. |
local_min |
This is practically the upper end of the baseline. It can be set to avoid boundary detection at local minima (e.g. for peaks suffering ion suppression). |
It is yet another peak finder or, more precisely, it is a function to identify two RT values which flank a intensity maximum which is required if one would like to integrate the peak area.
Numeric vector of length=2 specifying the start and end index of the peak.
int <- sin(seq(-0.75 * pi, 1.75 * pi, by = 0.1)) plot(int) abline(v = find_boundaries(int = int))
int <- sin(seq(-0.75 * pi, 1.75 * pi, by = 0.1)) plot(int) abline(v = find_boundaries(int = int))
MBoxplot
will generate an annotated boxplot. A unifying
function for MS-data Boxplots based on \'raw\' and \'sam\'.
MBoxplot( pk = pk, raw = NULL, sam = NULL, met = NULL, g = NULL, flt = NULL, an = NULL, plot_sample_n = FALSE, txt = NULL, cex.txt = 0.5, plot_rel_axis = NULL, ... )
MBoxplot( pk = pk, raw = NULL, sam = NULL, met = NULL, g = NULL, flt = NULL, an = NULL, plot_sample_n = FALSE, txt = NULL, cex.txt = 0.5, plot_rel_axis = NULL, ... )
pk |
Colname of raw to plot if |
raw |
Plotting data as samples (rows) x metabolites (cols). |
sam |
Sample table. |
met |
Containing at minimum columns for annotation (see parameter |
g |
Grouping vector if |
flt |
Filter to exclude certain samples (T/F) vector. |
an |
Switch to include annotation (from met) in the boxplot providing a character vector of colnames from |
plot_sample_n |
Amend each box with the number of finite values which were a basis for plotting this group. |
txt |
Character vector with information per sample to be plotted on top of the box as text. |
cex.txt |
Specify size of annotation text. |
plot_rel_axis |
Specify one level of |
... |
Further options parsed to |
Usually metabolomics experiments are conducted on multiple replicates
of a sample group. Boxplots allow to quickly access potential differences
between measurement values of several groups. MBoxplot
can be nicely
used to generate QC plots for all metabolites prior and after normalization,
in absolute or relative scale and sorted according to significance.
Nothing. Will produce a plot (or file if specified).
x <- data.frame("y" = runif(36), "GT" = gl(3, 12), "TP" = factor(rep(rep(1:3, each = 4), 3))) x <- cbind(x, AdjustSymbols(cols = x$GT, pchs = x$TP)) MBoxplot( pk = "y", raw = x, sam = x, met = data.frame("Peak" = "y", "Test" = I("info")), g = interaction(x$GT, x$TP), an = "Test", plot_n_samples = TRUE, txt = rownames(x) )
x <- data.frame("y" = runif(36), "GT" = gl(3, 12), "TP" = factor(rep(rep(1:3, each = 4), 3))) x <- cbind(x, AdjustSymbols(cols = x$GT, pchs = x$TP)) MBoxplot( pk = "y", raw = x, sam = x, met = data.frame("Peak" = "y", "Test" = I("info")), g = interaction(x$GT, x$TP), an = "Test", plot_n_samples = TRUE, txt = rownames(x) )
This data frame contains the metabolite definition of 112 metabolites according to the cols of raw.
met
met
An object of class data.frame
with 112 rows and 2 columns.
Jan Lisec [email protected]
<https://onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2011.04689.x>
MetaboliteANOVA
will perform an ANOVA on columns
of a data matrix according to a specified model.
MetaboliteANOVA( dat = NULL, sam = NULL, model = NULL, method = "none", silent = FALSE )
MetaboliteANOVA( dat = NULL, sam = NULL, model = NULL, method = "none", silent = FALSE )
dat |
Data matrix (e.g. of metabolite). |
sam |
Sample table (same number of row as 'dat' and containing all columns specified in 'model'. |
model |
ANOVA model. May include +, * and : together with column names of sam (cf. Examples). |
method |
The method to be used in column wise multiple testing adjustment, see |
silent |
Logical. Shall the function print warnings to the console? |
The function is a wrapper for lm
including some sanity checks.
It will accept a data matrix (traits in columns), sample information (data.frame)
and a potential model as input, compute an ANOVA per column and return
the respective P-values in a named matrix for further plotting or export.
A named matrix of P-values (rows=metabolites/traits; cols=ANOVA factors).
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # compute P-values according to specified ANOVA model (simple and complex) head(m1 <- MetaboliteANOVA(dat = raw, sam = sam, model = "GT")) head(m2 <- MetaboliteANOVA(dat = raw, sam = sam, model = "GT+Batch+Order+MP")) # compare P-values for one factor determined in both models hist(log10(m2[, "GT"]) - log10(m1[, "GT"]), main = "")
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # compute P-values according to specified ANOVA model (simple and complex) head(m1 <- MetaboliteANOVA(dat = raw, sam = sam, model = "GT")) head(m2 <- MetaboliteANOVA(dat = raw, sam = sam, model = "GT+Batch+Order+MP")) # compare P-values for one factor determined in both models hist(log10(m2[, "GT"]) - log10(m1[, "GT"]), main = "")
msconvert
is calling ProteoWizards MSConvert as a command line tool on Windows.
msconvert( files = NULL, msc_exe = "C:\\Program Files\\ProteoWizard\\ProteoWizard 3.0.11856\\msconvert.exe", args = c("--filter \"peakPicking cwt snr=0.01 peakSpace=0.1 msLevel=1\"", "--filter \"scanTime [0,3600]\"", "--filter \"metadataFixer\"", "--mzML", "--32", "--zlib") )
msconvert( files = NULL, msc_exe = "C:\\Program Files\\ProteoWizard\\ProteoWizard 3.0.11856\\msconvert.exe", args = c("--filter \"peakPicking cwt snr=0.01 peakSpace=0.1 msLevel=1\"", "--filter \"scanTime [0,3600]\"", "--filter \"metadataFixer\"", "--mzML", "--32", "--zlib") )
files |
A character vector of MS data files (wiff, raw, d, ...). |
msc_exe |
The path to the installed 'msconvert.exe'. |
args |
The arguments passed to msconvert on the command line (see details for documentation). |
It is a quick and dodgy function to show how to convert vendor MS data
into an open format (mzML). You will have to download/install MSConvert prior
to usage, and probably adjust the arguments according to your needs. Arguments
are documented here https://proteowizard.sourceforge.io/tools/msconvert.html.
If you don't know where the msconvert.exe is installed you can check for the correct
path using list.files(path="C:/", pattern="^msconvert.exe$", recursive = TRUE)
.
Only some informative outputs printed to the console. The specified MS data files will be converted to mzML within the same folder.
PlotMetabolitePCA
will show PC1 and PC2 of a pcaMethods
object and generate a flexible plot.
PlotMetabolitePCA( pca_res = NULL, sam = NULL, g = NULL, medsd = FALSE, text.col = "ID", legend.x = "bottomleft", comm = NULL )
PlotMetabolitePCA( pca_res = NULL, sam = NULL, g = NULL, medsd = FALSE, text.col = "ID", legend.x = "bottomleft", comm = NULL )
pca_res |
A pcaRes object from the pcaMethods package. |
sam |
Sample table including columns 'cols', 'pchs' (for data point color and shape) and 'ID' (to label data points) 'Group' (to split cols for legend) 'MP' (to adjust point size). |
g |
Can be a factor vector of length=nrow(sam) and will influence legend and 'medsd'. |
medsd |
Calculate mean and sd for groups and overlay PCA plot with this information. |
text.col |
Data points may be overlaid by textual information, e.g. sample ID and 'text.col' specifies the column name of 'sam' to use for this purpose. |
legend.x |
Position of a legend or NULL to omit it. |
comm |
Will print commentary text to the bottom right of the plot (can be a character vector). |
See examples.
A vector fo similar length as input but with various name components removed.
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # calculate pca Result using pcaMethods and plot pca_res <- pcaMethods::pca(raw, method = "rnipals", scale = c("none", "pareto", "uv")[2]) PlotMetabolitePCA(pca_res = pca_res, sam = sam, g = sam$GT) # plot without legend and Group means instead PlotMetabolitePCA( pca_res = pca_res, sam = sam, g = sam$GT, legend.x = NULL, text.col = NULL, medsd = TRUE, comm = LETTERS[1:4] ) # readjust symbols before plotting sam$Group <- interaction(sam$Origin, sam$Class, sep = "_") sam[, c("cols", "pchs")] <- AdjustSymbols(cols = sam$Group, pchs = sam$Group) PlotMetabolitePCA(pca_res = pca_res, sam = sam, g = sam$Group)
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # calculate pca Result using pcaMethods and plot pca_res <- pcaMethods::pca(raw, method = "rnipals", scale = c("none", "pareto", "uv")[2]) PlotMetabolitePCA(pca_res = pca_res, sam = sam, g = sam$GT) # plot without legend and Group means instead PlotMetabolitePCA( pca_res = pca_res, sam = sam, g = sam$GT, legend.x = NULL, text.col = NULL, medsd = TRUE, comm = LETTERS[1:4] ) # readjust symbols before plotting sam$Group <- interaction(sam$Origin, sam$Class, sep = "_") sam[, c("cols", "pchs")] <- AdjustSymbols(cols = sam$Group, pchs = sam$Group) PlotMetabolitePCA(pca_res = pca_res, sam = sam, g = sam$Group)
PlotPValueHist
will take a named matrix of P-values
(i.e. numeric between 0..1) and plot histograms for each column. In
the easiest case this matrix is generated by MetaboliteANOVA.
PlotPValueHist( out = NULL, method = "BH", xl = "ANOVA P-values", yl = "Number of metabolites", frac.col = NULL, ... )
PlotPValueHist( out = NULL, method = "BH", xl = "ANOVA P-values", yl = "Number of metabolites", frac.col = NULL, ... )
out |
matrix/data.frame; P-value table from 'MetaboliteANOVA.R' with factors in named columns and trait P-values in rows. |
method |
Multiple testing correction method applied, piped to |
xl |
xlab. |
yl |
ylab. |
frac.col |
Render histogram bars in stacked colors according to provided color vector (should be a vector of valid color names of length=nrow(out)). |
... |
Passed on to |
See examples.
NULL. Will generate a P-value histogram plot.
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # compute P-values according to specified ANOVA model (simple and complex) head(pvals <- MetaboliteANOVA(dat = raw, sam = sam, model = "GT+Batch+Order")) PlotPValueHist(out = pvals) # adjust multiple testing correction method and y lable PlotPValueHist(out = pvals, method = "none", yl = "Number of Genes") # color bars (by chance or according to a metabolite group) PlotPValueHist(out = pvals, frac.col = rep(2:3, length.out = nrow(pvals))) met <- MetabolomicsBasics::met met$Name[grep("ine$", met$Name)] PlotPValueHist(out = pvals, frac.col = 2 + 1:nrow(pvals) %in% grep("ine$", met$Name))
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # compute P-values according to specified ANOVA model (simple and complex) head(pvals <- MetaboliteANOVA(dat = raw, sam = sam, model = "GT+Batch+Order")) PlotPValueHist(out = pvals) # adjust multiple testing correction method and y lable PlotPValueHist(out = pvals, method = "none", yl = "Number of Genes") # color bars (by chance or according to a metabolite group) PlotPValueHist(out = pvals, frac.col = rep(2:3, length.out = nrow(pvals))) met <- MetabolomicsBasics::met met$Name[grep("ine$", met$Name)] PlotPValueHist(out = pvals, frac.col = 2 + 1:nrow(pvals) %in% grep("ine$", met$Name))
PolarCoordHeterPlot
will draw a plot in polar coordinates
visualizing heterosis effects according to a layout by Swanson-Wagner,
where plot radius represents log2 of fold change between lowest and highest
genotype and plot angle represents the ratio between lowest, intermediate
and highest genotype.
PolarCoordHeterPlot( x, gt = c("P1", "P1xP2", "P2"), rev_log = NULL, exp_fac = 1, thr = 1, plot_lab = c("none", "text", "graph"), col = NULL )
PolarCoordHeterPlot( x, gt = c("P1", "P1xP2", "P2"), rev_log = NULL, exp_fac = 1, thr = 1, plot_lab = c("none", "text", "graph"), col = NULL )
x |
Data matrix with measurement values (traits in rows and genotypes in columns). |
gt |
Character vector of length=3 indicating P1, F and P2. These are used to filter by column name from x. |
rev_log |
If you've log transformed your data, you might want to revert the log transformation. |
exp_fac |
Expansion factor to increase figure size. |
thr |
Alpha level used in ANOVA to filter insignificant rows. Keep thr=1 to include all matrix rows. |
plot_lab |
Show 'text' or 'graph' style labels of the polar sections (or keep 'none' to omit). |
col |
Provide a color vector of length nrow(x). |
See examples.
Will generate a plot in polar coordinates and return the x/y coordinates of the data points invisibly.
# using the provided experimental data raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam x <- t(raw) colnames(x) <- sam$GT gt <- c("B73","B73xMo17","Mo17") PolarCoordHeterPlot(x=x, gt=gt, plot_lab="graph", thr=0.01, rev_log=exp(1)) coord <- PolarCoordHeterPlot(x=x, gt=gt, thr=0.01, rev_log=exp(1)) points(x=coord$x[3], coord$y[3], pch=22, cex=4, col=2) # using random data gt <- c("P1","P1xP2","P2") set.seed(0) x <- matrix(rnorm(150), nrow = 10, dimnames = list(paste0("M",1:10), sample(rep(gt, 5)))) x[1:4,1:6] PolarCoordHeterPlot(x=x, gt=gt) # using text style labels for the sections PolarCoordHeterPlot(x=x, gt=gt, plot_lab="text", exp_fac=0.75) # reverting the order of parental genotypes PolarCoordHeterPlot(x=x, gt=c("P2","P1xP2","P1"), plot_lab="text", exp_fac=0.75) # using graph style labels for the sections PolarCoordHeterPlot(x=x, gt=c("P2","P1xP2","P1"), plot_lab="graph") # coloring data points PolarCoordHeterPlot(x=x, gt=gt, col=1:10) # applying ANOVA P value threshold to input rows PolarCoordHeterPlot(x=x, gt=gt, col=1:10, thr=0.5) PolarCoordHeterPlot(x=x, gt=gt, plot_lab="graph", col=1:10, thr=0.5)
# using the provided experimental data raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam x <- t(raw) colnames(x) <- sam$GT gt <- c("B73","B73xMo17","Mo17") PolarCoordHeterPlot(x=x, gt=gt, plot_lab="graph", thr=0.01, rev_log=exp(1)) coord <- PolarCoordHeterPlot(x=x, gt=gt, thr=0.01, rev_log=exp(1)) points(x=coord$x[3], coord$y[3], pch=22, cex=4, col=2) # using random data gt <- c("P1","P1xP2","P2") set.seed(0) x <- matrix(rnorm(150), nrow = 10, dimnames = list(paste0("M",1:10), sample(rep(gt, 5)))) x[1:4,1:6] PolarCoordHeterPlot(x=x, gt=gt) # using text style labels for the sections PolarCoordHeterPlot(x=x, gt=gt, plot_lab="text", exp_fac=0.75) # reverting the order of parental genotypes PolarCoordHeterPlot(x=x, gt=c("P2","P1xP2","P1"), plot_lab="text", exp_fac=0.75) # using graph style labels for the sections PolarCoordHeterPlot(x=x, gt=c("P2","P1xP2","P1"), plot_lab="graph") # coloring data points PolarCoordHeterPlot(x=x, gt=gt, col=1:10) # applying ANOVA P value threshold to input rows PolarCoordHeterPlot(x=x, gt=gt, col=1:10, thr=0.5) PolarCoordHeterPlot(x=x, gt=gt, plot_lab="graph", col=1:10, thr=0.5)
This data set contains a matrix of log10-transformed ion intensities from a maize root metabolomics study for in total 112 metabolites in 120 samples.
raw
raw
An object of class matrix
(inherits from array
) with 120 rows and 112 columns.
Jan Lisec [email protected]
<https://onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2011.04689.x>
RemoveFactorsByANOVA
will remove variance from data using an ANOVA model.
RemoveFactorsByANOVA( y = NULL, sam = NULL, fmod = NULL, kmod = NULL, output = c("y_norm", "y_lm", "anova_y", "anova_y_norm", "boxplot")[1], remove_outliers = 0 )
RemoveFactorsByANOVA( y = NULL, sam = NULL, fmod = NULL, kmod = NULL, output = c("y_norm", "y_lm", "anova_y", "anova_y_norm", "boxplot")[1], remove_outliers = 0 )
y |
Data vector (or data matrix) to normalize (numeric + in same order as sam). |
sam |
data.frame containing the factors or numerical vars for ANOVA model. |
fmod |
Full model describing the experimental setting (provided as character string). |
kmod |
Reduced model describing all the biological factors to keep (provided as character string). |
output |
Should be |
remove_outliers |
Should be a numeric integer x (with $x=0$ : no effect; $x>=1$ remove all values which have error e with $e > abs(mean + x * sd)$ ). |
See examples.
Depends on output
. Usually the normalized data vector (or matrix).
# set up sample information sam <- data.frame( "GT" = gl(4, 10), "TR" = rep(gl(2, 5), 4), "Batch" = sample(gl(2, 20)), "Order" = sample(seq(-1, 1, length.out = 40)) ) # set up artificial measurement data set.seed(1) # specify main effects m1 <- c(5, 6, 2, 9)[sam$GT] + c(-2, 2)[sam$TR] m2 <- c(5, -6, 2, 4)[sam$GT] + c(-2, 2)[sam$TR] # add run order bias and noise m1 <- m1 + c(-3, 3)[sam$Batch] + 3 * sam$Order + rnorm(nrow(sam), sd = 0.5) m2 <- m2 - 5 * sam$Order + rnorm(nrow(sam), sd = 0.8) dat <- data.frame(m1, m2) # apply function to remove variance # full model incorporating all relevant factors defined in sample table fmod <- "GT*TR+Batch+Order" # reduced model: factors to be kept from full model; everything elso will be removed from the data kmod <- "GT*TR" RemoveFactorsByANOVA(y = dat[, "m1"], sam = sam, fmod = fmod, kmod = kmod, output = "anova_y") RemoveFactorsByANOVA(y = dat[, "m1"], sam = sam, fmod = fmod, kmod = kmod, output = "anova_y_norm")
# set up sample information sam <- data.frame( "GT" = gl(4, 10), "TR" = rep(gl(2, 5), 4), "Batch" = sample(gl(2, 20)), "Order" = sample(seq(-1, 1, length.out = 40)) ) # set up artificial measurement data set.seed(1) # specify main effects m1 <- c(5, 6, 2, 9)[sam$GT] + c(-2, 2)[sam$TR] m2 <- c(5, -6, 2, 4)[sam$GT] + c(-2, 2)[sam$TR] # add run order bias and noise m1 <- m1 + c(-3, 3)[sam$Batch] + 3 * sam$Order + rnorm(nrow(sam), sd = 0.5) m2 <- m2 - 5 * sam$Order + rnorm(nrow(sam), sd = 0.8) dat <- data.frame(m1, m2) # apply function to remove variance # full model incorporating all relevant factors defined in sample table fmod <- "GT*TR+Batch+Order" # reduced model: factors to be kept from full model; everything elso will be removed from the data kmod <- "GT*TR" RemoveFactorsByANOVA(y = dat[, "m1"], sam = sam, fmod = fmod, kmod = kmod, output = "anova_y") RemoveFactorsByANOVA(y = dat[, "m1"], sam = sam, fmod = fmod, kmod = kmod, output = "anova_y_norm")
ReplaceMissingValues
will replace missing values within
a numeric matrix based on a principal component analysis.
ReplaceMissingValues(x, ncomp = 10, silent = FALSE)
ReplaceMissingValues(x, ncomp = 10, silent = FALSE)
x |
Numeric matrix. |
ncomp |
Number of components to be used. |
silent |
FALSE, suppress messages setting silent=TRUE. |
The 'nipals' algorithm is used to basically perform a PCA on the sparse matrix. Missing values are imputed based on the major components observed. Please check also the 'impute.nipals' function from mixOmics which should basically give the same functionality since the 04/2021 update.
A matrix of similar dimensions as x without missing values.
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam idx <- apply(raw, 2, CheckForOutliers, group = sam$GT, n_sd = 5, method = "logical") sum(idx) # 215 values would be classified as outlier using a five-sigma band old_vals <- raw[idx] # keep outlier values for comparison raw_filt <- raw raw_filt[idx] <- NA raw_means <- apply(raw, 2, function(x) { sapply(split(x, sam$GT), mean, na.rm = TRUE)[as.numeric(sam$GT)] })[idx] raw_repl <- ReplaceMissingValues(x = raw_filt) new_vals <- raw_repl[idx] par(mfrow = c(2, 1)) breaks <- seq(-0.7, 1.3, 0.05) hist(raw_means - old_vals, breaks = breaks, main = "", xlab = "Outliers", las = 1) hist(raw_means - new_vals, breaks = breaks, main = "", xlab = "Replaced values", las = 1)
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam idx <- apply(raw, 2, CheckForOutliers, group = sam$GT, n_sd = 5, method = "logical") sum(idx) # 215 values would be classified as outlier using a five-sigma band old_vals <- raw[idx] # keep outlier values for comparison raw_filt <- raw raw_filt[idx] <- NA raw_means <- apply(raw, 2, function(x) { sapply(split(x, sam$GT), mean, na.rm = TRUE)[as.numeric(sam$GT)] })[idx] raw_repl <- ReplaceMissingValues(x = raw_filt) new_vals <- raw_repl[idx] par(mfrow = c(2, 1)) breaks <- seq(-0.7, 1.3, 0.05) hist(raw_means - old_vals, breaks = breaks, main = "", xlab = "Outliers", las = 1) hist(raw_means - new_vals, breaks = breaks, main = "", xlab = "Replaced values", las = 1)
RestrictedPCA
combines an ANOVA based on 'fmod' and
restricts a PCA using the ANOVA result as a filter.
RestrictedPCA( dat = NULL, sam = NULL, use.sam = NULL, group.col = NULL, text.col = NULL, fmod = NULL, sign.col = NULL, p.adjust.method = "none", P = 0.01, pcaMethods.scale = "pareto", n.metab.min = 20, ... )
RestrictedPCA( dat = NULL, sam = NULL, use.sam = NULL, group.col = NULL, text.col = NULL, fmod = NULL, sign.col = NULL, p.adjust.method = "none", P = 0.01, pcaMethods.scale = "pareto", n.metab.min = 20, ... )
dat |
Metabolite matrix (samples x metabolites). |
sam |
Sample definition dataframe. |
use.sam |
Numeric index vector (or logical) to select specific samples to be included in the analysis or NULL to include all. |
group.col |
Column used for legend creation (column name from sam). |
text.col |
Column used for text annotation of data points (column name from sam). |
fmod |
ANOVA model to calculate before PCA. |
sign.col |
Which column(s) of the ANOVA result shall be used for P-value filtering (specify column names or leave on NULL to filter on all). |
p.adjust.method |
Method use to adjust P-values (e.g. none, BH or bonferroni). |
P |
P-value threshold used as a cutoff after P-value adjustment. |
pcaMethods.scale |
pcaMethods scale parameter (usually pareto for metabolite data). |
n.metab.min |
Minimum number of metabolites kept for PCA calculation (even if they exceed P). |
... |
Handed through to |
‘fmod' should be something like ’GT*TR+Batch' to perform an ANOVA with these factors defined as columns in sam.
Will generate a PCA plot (generated by PlotMetabolitePCA internally) restricted based on an ANOVA result based on MetaboliteANOVA.
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # standard behavior RestrictedPCA(dat = raw, sam = sam, group.col = "GT") ## Not run: # apply multiple testing using a strict P-value cutoff, # dont show a legend but plot group mean values and sd's as overlay RestrictedPCA( dat = raw, sam = sam, group.col = "GT", p.adjust.method = "BH", P = 10^-10, fmod = "GT+Batch+Order", sign.col = "GT", medsd = T, legend.x = NULL ) # limit to a subset of samples, switching the ANOVA selection of by setting P=1 # and adding text (from \code{sam}) to each data point RestrictedPCA( dat = raw, sam = sam, use.sam = which(sam$GT %in% c("Mo17", "B73")), group.col = "GT", fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch" ) ## End(Not run)
# load raw data and sample description raw <- MetabolomicsBasics::raw sam <- MetabolomicsBasics::sam # standard behavior RestrictedPCA(dat = raw, sam = sam, group.col = "GT") ## Not run: # apply multiple testing using a strict P-value cutoff, # dont show a legend but plot group mean values and sd's as overlay RestrictedPCA( dat = raw, sam = sam, group.col = "GT", p.adjust.method = "BH", P = 10^-10, fmod = "GT+Batch+Order", sign.col = "GT", medsd = T, legend.x = NULL ) # limit to a subset of samples, switching the ANOVA selection of by setting P=1 # and adding text (from \code{sam}) to each data point RestrictedPCA( dat = raw, sam = sam, use.sam = which(sam$GT %in% c("Mo17", "B73")), group.col = "GT", fmod = "GT+Batch+Order", P = 1, sign.col = "GT", legend.x = NULL, text.col = "Batch" ) ## End(Not run)
This data frame contains the sample definition of 120 samples according to the rows of raw.
sam
sam
An object of class data.frame
with 120 rows and 10 columns.
Jan Lisec [email protected]
<https://onlinelibrary.wiley.com/doi/10.1111/j.1365-313X.2011.04689.x>
spectra_format_converter
will generate a matrix with mz and int
columns out of a text representation of a spectrum.
spectra_format_converter(txt = NULL, m_prec = 3, i_prec = 0)
spectra_format_converter(txt = NULL, m_prec = 3, i_prec = 0)
txt |
Sample table. |
m_prec |
Mass precision of output spectrum. |
i_prec |
Intensity precision of output spectrum. |
See examples.
Matrix with mz and int columns.
spectra_format_converter(txt = "57.1:100 58.0001:10") spectra_format_converter(txt = "58.0001:10 57.1:100", m_prec = 4)
spectra_format_converter(txt = "57.1:100 58.0001:10") spectra_format_converter(txt = "58.0001:10 57.1:100", m_prec = 4)
sumformula_from_CAS
will try to get a chemical sum formula from a CAS ID.
sumformula_from_CAS(x = NULL)
sumformula_from_CAS(x = NULL)
x |
Vector of CAS IDs. |
tbd.
A character vector of length input vector.
## Not run: x <- readLines("C:/Users/jlisec/Documents/Francesco Russo/RECTOX/RECTOX_GC-EI-MS_CASRN") sf <- sumformula_from_CAS(x = x) ## End(Not run)
## Not run: x <- readLines("C:/Users/jlisec/Documents/Francesco Russo/RECTOX/RECTOX_GC-EI-MS_CASRN") sf <- sumformula_from_CAS(x = x) ## End(Not run)
unique_labels
will generate a dataframe with color and
plotting character specification out of a sample table definition.
unique_labels(sam = NULL, g = NULL)
unique_labels(sam = NULL, g = NULL)
sam |
Sample table. |
g |
Either column name from sam containing factor column or factor of same length as sam. |
If a color/symbol specification exists for a sample set containing replicate groups this function will help in retrieving this information per group which is useful in boxplot or legend functions (cf. examples).
Dataframe with group levels names and their color and plotting character specification.
sam <- MetabolomicsBasics::sam unique_labels(sam = sam, g = "GT")
sam <- MetabolomicsBasics::sam unique_labels(sam = sam, g = "GT")
unique_subformula_masses
will generate a numeric vector of
potential sub formula masses regarding a chemical formula as input.
unique_subformula_masses(fml, names = TRUE, check_validity = FALSE)
unique_subformula_masses(fml, names = TRUE, check_validity = FALSE)
fml |
Chemical formula. |
names |
Return respective sub formulas as vector names. |
check_validity |
Filter for chemically valid formulas. |
In mass spectrometry precursor masses are often fragmented and these
fragments are recorded as MS^2 spectra. A frequent task is then to compute
potential chemical formulas for the obtained MS^2 masses. The function
unique_subformula_masses
follows the reverse approach. It allows to
calculate all masses that could be potential breakdown products of a precursor
formula
A named numeric vector. The names are the sub formulas for the calculated exact masses given as numeric.
# specify a formula and calculate all potential combinatorial masses fml <- c("C6H12O6", "C11H16NO4PS", "C24H51O4P")[1] tmp <- unique_subformula_masses(fml = fml) length(tmp); any(duplicated(tmp)) hist(tmp, breaks=seq(floor(min(tmp))-1, ceiling(max(tmp))), main=fml) # do the same as above but check for chemical plausibility tmp2 <- unique_subformula_masses(fml = fml, check_validity=TRUE) length(tmp2) hist(tmp2, breaks=seq(floor(min(tmp2))-1, ceiling(max(tmp2))), main=fml) mz <- 147 tmp[abs(tmp-mz)<0.5] tmp2[abs(tmp2-mz)<0.5]
# specify a formula and calculate all potential combinatorial masses fml <- c("C6H12O6", "C11H16NO4PS", "C24H51O4P")[1] tmp <- unique_subformula_masses(fml = fml) length(tmp); any(duplicated(tmp)) hist(tmp, breaks=seq(floor(min(tmp))-1, ceiling(max(tmp))), main=fml) # do the same as above but check for chemical plausibility tmp2 <- unique_subformula_masses(fml = fml, check_validity=TRUE) length(tmp2) hist(tmp2, breaks=seq(floor(min(tmp2))-1, ceiling(max(tmp2))), main=fml) mz <- 147 tmp[abs(tmp-mz)<0.5] tmp2[abs(tmp2-mz)<0.5]