| Title: | Variable Selection in Partial Least Squares |
|---|---|
| Description: | Interfaces and methods for variable selection in Partial Least Squares. The methods include filter methods, wrapper methods and embedded methods. Both regression and classification are supported. |
| Authors: | Kristian Hovde Liland [aut, cre] (ORCID: <https://orcid.org/0000-0001-6468-9423>), Tahir Mehmood [ctb], Solve Sæbø [ctb] |
| Maintainer: | Kristian Hovde Liland <[email protected]> |
| License: | GPL (>=2) |
| Version: | 0.10.0 |
| Built: | 2026-05-15 09:35:26 UTC |
| Source: | https://github.com/khliland/plsvarsel |
A backward variable elimination procedure for removing non-informative variables.
    bve_pls(y, X, ncomp = 10, ratio = 0.75, VIP.threshold = 1)
y |
vector of response values ( |
X |
numeric predictor |
ncomp |
integer number of components (default = 10). |
ratio |
the proportion of the samples to use for calibration (default = 0.75). |
VIP.threshold |
thresholding to remove non-important variables (default = 1). |
Variables are first sorted with respect to some importance measure, usually one of the filter measures described above. Next, a threshold is used to eliminate a subset of the least informative variables. A model is then fitted to the remaining variables and its performance is measured. The procedure is repeated until maximum model performance is achieved.
Returns a vector of variable numbers corresponding to the model having lowest prediction error.
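The elimination loop described above can be sketched in base R. This is a minimal illustration and not the package's implementation: `pls1`, `vip` and `bve_sketch` are hypothetical helper names, the PLS fit is a small single-response NIPALS routine, and the simulated data are assumptions made only so that the example runs on its own.

```r
# Tiny PLS1 (NIPALS) fit; a stand-in for the package's internal PLS model.
pls1 <- function(X, y, ncomp) {
  xm <- colMeans(X); ym <- mean(y)
  X <- sweep(X, 2, xm); y <- y - ym
  W <- P <- matrix(0, ncol(X), ncomp)
  q <- ss <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w <- drop(crossprod(X, y)); w <- w / sqrt(sum(w^2))  # unit loading weight
    s <- drop(X %*% w)                                   # score vector
    P[, a] <- drop(crossprod(X, s)) / sum(s^2)           # X loading
    q[a] <- sum(y * s) / sum(s^2)                        # y loading
    ss[a] <- q[a]^2 * sum(s^2)                           # y-variation explained
    X <- X - outer(s, P[, a]); y <- y - s * q[a]         # deflation
    W[, a] <- w
  }
  list(coef = drop(W %*% solve(crossprod(P, W), q)),
       W = W, ss = ss, xm = xm, ym = ym)
}

# VIP from unit-norm loading weights, weighted by explained y-variation.
vip <- function(fit) {
  sqrt(nrow(fit$W) * drop(fit$W^2 %*% fit$ss) / sum(fit$ss))
}

bve_sketch <- function(y, X, ncomp = 2, ratio = 0.75, VIP.threshold = 1) {
  n <- nrow(X)
  cal <- sample(n, floor(ratio * n))       # calibration samples
  keep <- seq_len(ncol(X))                 # variables still in the model
  best <- list(vars = keep, rmsep = Inf)
  repeat {
    fit <- pls1(X[cal, keep, drop = FALSE], y[cal], min(ncomp, length(keep)))
    Xtest <- sweep(X[-cal, keep, drop = FALSE], 2, fit$xm)
    pred <- drop(Xtest %*% fit$coef) + fit$ym
    rmsep <- sqrt(mean((y[-cal] - pred)^2))
    if (rmsep < best$rmsep) best <- list(vars = keep, rmsep = rmsep)
    drop_me <- vip(fit) < VIP.threshold    # least informative variables
    if (!any(drop_me) || sum(!drop_me) < 2) break
    keep <- keep[!drop_me]
  }
  best
}

set.seed(1)
X <- matrix(rnorm(80 * 15), 80, 15)
y <- 2 * X[, 1] - X[, 2] + rnorm(80, sd = 0.3)
res <- bve_sketch(y, X)
res$vars   # variable numbers of the best reduced model found
```

The sketch keeps the variable set with the lowest hold-out error seen during elimination, which mirrors the documented return value; the stopping rule (stop when nothing falls below the threshold or fewer than two variables would remain) is an assumption of this example.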
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
I. Frank, Intermediate least squares regression method, Chemometrics and Intelligent Laboratory Systems 1 (3) (1987) 233-242.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(gasoline, package = "pls")
    with( gasoline, bve_pls(octane, NIR) )
Sequential selection of variables based on squared covariance with response and intermediate deflation (as in Partial Least Squares).
    covSel(X, Y, nvar)
X |
|
Y |
|
nvar |
maximum number of variables |
selected |
an integer vector of selected variables |
scores |
a matrix of score vectors |
loadings |
a matrix of loading vectors |
Yloadings |
a matrix of Y loadings |
J.M. Roger, B. Palagos, D. Bertrand, E. Fernandez-Ahumada. CovSel: Variable selection for highly multivariate and multi-response calibration: Application to IR spectroscopy. Chemom Intel Lab Syst. 2011;106(2):216-223.
P. Mishra, A brief note on a new faster covariate's selection (fCovSel) algorithm, Journal of Chemometrics 36(5) 2022.
    data(gasoline, package = "pls")
    sels <- with(gasoline, covSel(NIR, octane, 5))
    matplot(t(gasoline$NIR), type = "l")
    abline(v = sels$selected, col = 2)
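The sequential selection with intermediate deflation that covSel describes can be sketched in a few lines of base R. This is an illustrative reading of the description, not the package code: `covsel_sketch` is a hypothetical name, the returned element names only loosely mirror the documented value, and the simulated data are assumptions.

```r
covsel_sketch <- function(X, Y, nvar) {
  X <- scale(X, scale = FALSE)               # centre predictors
  Y <- scale(as.matrix(Y), scale = FALSE)    # centre response(s)
  selected <- integer(nvar)
  scores <- matrix(0, nrow(X), nvar)
  loadings <- matrix(0, ncol(X), nvar)
  Yloadings <- matrix(0, ncol(Y), nvar)
  for (i in seq_len(nvar)) {
    # squared covariance of each (deflated) X column with all responses
    score <- rowSums(crossprod(X, Y)^2)
    j <- which.max(score)
    selected[i] <- j
    tv <- X[, j] / sqrt(sum(X[, j]^2))       # unit-norm score vector
    scores[, i] <- tv
    loadings[, i] <- drop(crossprod(X, tv))
    Yloadings[, i] <- drop(crossprod(Y, tv))
    # deflate X and Y with respect to the selected variable
    X <- X - outer(tv, loadings[, i])
    Y <- Y - outer(tv, Yloadings[, i])
  }
  list(selected = selected, scores = scores,
       loadings = loadings, Yloadings = Yloadings)
}

set.seed(42)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- 5 * X[, 3] + X[, 7] + rnorm(100, sd = 0.1)
sels <- covsel_sketch(X, y, 3)
sels$selected
```

Because each selected column is projected out of the remaining predictors, a variable that is merely correlated with an earlier pick contributes little at later steps.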
Extract the indices of influential variables based on thresholds defined for LW (loading weights), RC (regression coefficients), JT (jackknife testing) and VIP (variable importance on projection).
    filterPLSR(y, X, ncomp = 10, ncomp.opt = c("minimum", "same"), validation = "LOO",
      LW.threshold = NULL, RC.threshold = NULL, URC.threshold = NULL, FRC.threshold = NULL,
      JT.threshold = NULL, VIP.threshold = NULL, SR.threshold = NULL, sMC.threshold = NULL,
      mRMR.threshold = NULL, WVC.threshold = NULL, ...)
y |
vector of response values ( |
X |
numeric predictor |
ncomp |
integer number of components (default = 10). |
ncomp.opt |
use the number of components corresponding to minimum error (minimum)
or |
validation |
type of validation in the PLS modelling (default = "LOO"). |
LW.threshold |
threshold for Loading Weights if applied (default = NULL). |
RC.threshold |
threshold for Regression Coefficients if applied (default = NULL). |
URC.threshold |
threshold for Unit normalized Regression Coefficients if applied (default = NULL). |
FRC.threshold |
threshold for Fitness normalized Regression Coefficients if applied (default = NULL). |
JT.threshold |
threshold for Jackknife Testing if applied (default = NULL). |
VIP.threshold |
threshold for Variable Importance on Projections if applied (default = NULL). |
SR.threshold |
threshold for Selectivity Ratio if applied (default = NULL). |
sMC.threshold |
threshold for Significance Multivariate Correlation if applied (default = NULL). |
mRMR.threshold |
threshold for minimum Redundancy Maximum Relevance if applied (default = NULL). |
WVC.threshold |
threshold for Weighted Variable Contribution if applied (default = NULL). |
... |
additional parameters for |
Filter methods are applied for variable selection with PLSR. This function can return selected variables and Root Mean Squared Error of Cross-Validation for various filter methods and determine optimum numbers of components.
Returns a list of lists containing filters (outer list), their selected variables, optimal numbers of components and prediction accuracies.
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometrics and Intelligent Laboratory Systems 118 (2012) 62-69.
VIP (SR/sMC/LW/RC/URC/FRC/mRMR), filterPLSR, spa_pls,
stpls, truncation, bve_pls, mcuve_pls,
ipw_pls, ga_pls, rep_pls, WVC_pls, T2_pls.
    data(gasoline, package = "pls")
    ## Not run:
    with( gasoline, filterPLSR(octane, NIR, ncomp = 10, "minimum", validation = "LOO", RC.threshold = c(0.1,0.5), SR.threshold = 0.5))
    ## End(Not run)
A subset search algorithm inspired by biological evolution theory and natural selection.
    ga_pls(y, X, GA.threshold = 10, iters = 5, popSize = 100)
y |
vector of response values ( |
X |
numeric predictor |
GA.threshold |
the chance of a zero for mutations and initialization (default = 10). (The ratio of non-selected variables for each chromosome.) |
iters |
the number of iterations (default = 5). |
popSize |
the population size (default = 100). |
1. Building an initial population of variable sets by setting bits for each variable randomly, where bit '1' represents selection of the corresponding variable while '0' represents non-selection. The approximate size of the variable sets must be set in advance.
2. Fitting a PLSR-model to each variable set and computing the performance by, for instance, a leave one out cross-validation procedure.
3. A collection of variable sets with higher performance are selected to survive until the next "generation".
4. Crossover and mutation: new variable sets are formed 1) by crossover of selected variables between the surviving variable sets, and 2) by changing (mutating) the bit value for each variable with a small probability.
5. The surviving and modified variable sets form the population serving as input to point 2.
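The five steps above can be sketched as a compact base R loop. This is a hedged illustration rather than the routine used by ga_pls: the fitness function uses leave-one-out error of an ordinary least squares fit as a stand-in for a cross-validated PLSR model, and `fitness`, `pop_size`, `p_mut` and the simulated data are assumptions of the example.

```r
set.seed(1)
n <- 80; p <- 12
X <- matrix(rnorm(n * p), n, p)
y <- X[, 2] - 2 * X[, 5] + rnorm(n, sd = 0.3)

# Fitness: leave-one-out RMSE of an OLS fit (stand-in for cross-validated PLSR).
fitness <- function(bits) {
  if (!any(bits)) return(Inf)                    # empty variable set
  Xs <- cbind(1, X[, bits, drop = FALSE])
  H <- Xs %*% solve(crossprod(Xs), t(Xs))        # hat matrix
  res <- (y - H %*% y) / (1 - diag(H))           # exact LOO residuals for OLS
  sqrt(mean(res^2))
}

pop_size <- 30; iters <- 10; p_mut <- 0.05

# Step 1: random initial population (rows = chromosomes, TRUE = selected).
pop <- matrix(runif(pop_size * p) < 0.3, pop_size, p)

for (g in seq_len(iters)) {
  fit <- apply(pop, 1, fitness)                  # Step 2: evaluate each set
  # Step 3: the better half survives to the next generation.
  survivors <- pop[order(fit)[seq_len(pop_size / 2)], , drop = FALSE]
  # Step 4: single-point crossover of two random survivors, then mutation.
  children <- t(replicate(pop_size / 2, {
    parents <- survivors[sample(nrow(survivors), 2), , drop = FALSE]
    cut <- sample(p - 1, 1)
    child <- c(parents[1, seq_len(cut)], parents[2, (cut + 1):p])
    xor(child, runif(p) < p_mut)                 # flip bits with small probability
  }))
  pop <- rbind(survivors, children)              # Step 5: new population
}

fit <- apply(pop, 1, fitness)
best <- which(pop[which.min(fit), ])             # variable numbers of the best set
best
```

Keeping the survivors unchanged alongside their offspring means the best variable set found so far is never lost between generations.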
Returns a vector of variable numbers corresponding to the model having lowest prediction error.
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
K. Hasegawa, Y. Miyashita, K. Funatsu, GA strategy for variable selection in QSAR studies: GA-based PLS analysis of calcium channel antagonists, Journal of Chemical Information and Computer Sciences 37 (1997) 306-310.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(gasoline, package = "pls")
    # with( gasoline, ga_pls(octane, NIR, GA.threshold = 10) ) # Time-consuming
An iterative procedure for variable elimination.
    ipw_pls(y, X, ncomp = 10, no.iter = 10, IPW.threshold = 0.01, filter = "RC", scale = TRUE)
    ipw_pls_legacy(y, X, ncomp = 10, no.iter = 10, IPW.threshold = 0.1)
y |
vector of response values ( |
X |
numeric predictor |
ncomp |
integer number of components (default = 10). |
no.iter |
the number of iterations (default = 10). |
IPW.threshold |
threshold for regression coefficients (default = 0.01 for ipw_pls, 0.1 for ipw_pls_legacy). |
filter |
which filtering method to use (among "RC", "SR", "LW", "VIP", "sMC") |
scale |
standardize data (default=TRUE, as in reference) |
This is an iterative elimination procedure where a measure of predictor importance is computed after fitting a PLSR model (with complexity chosen based on predictive performance). The importance measure is used both to re-scale the original X-variables and to eliminate the least important variables before subsequent model re-fitting.
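A minimal base R sketch of this weighting and elimination cycle is given here. It is an assumption-laden illustration, not the corrected package implementation: `pls1_coef` and `ipw_sketch` are hypothetical names, the number of components is fixed rather than chosen by predictive performance, and the importance measure is the absolute regression coefficient times the column standard deviation, normalized to sum to one.

```r
# Tiny PLS1 (NIPALS) returning regression coefficients; a stand-in PLS fit.
pls1_coef <- function(X, y, ncomp) {
  X <- scale(X, scale = FALSE); y <- y - mean(y)
  W <- P <- matrix(0, ncol(X), ncomp); q <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w <- drop(crossprod(X, y)); w <- w / sqrt(sum(w^2))
    s <- drop(X %*% w)
    P[, a] <- drop(crossprod(X, s)) / sum(s^2)
    q[a] <- sum(y * s) / sum(s^2)
    X <- X - outer(s, P[, a]); y <- y - s * q[a]
    W[, a] <- w
  }
  drop(W %*% solve(crossprod(P, W), q))
}

ipw_sketch <- function(y, X, ncomp = 2, no.iter = 10, IPW.threshold = 0.01) {
  Xs <- scale(X)                         # standardize the original predictors
  z <- rep(1, ncol(X))                   # predictor weights, all equal at start
  for (i in seq_len(no.iter)) {
    active <- which(z > 0)
    # re-scale the original (standardized) variables by the current weights
    Xw <- sweep(Xs[, active, drop = FALSE], 2, z[active], "*")
    b <- pls1_coef(Xw, y, min(ncomp, length(active)))
    imp <- abs(b) * apply(Xw, 2, sd)     # importance of each active variable
    imp <- imp / sum(imp)
    imp[imp < IPW.threshold] <- 0        # eliminate the least important ones
    z[] <- 0
    z[active] <- imp                     # importance becomes the new weight
  }
  which(z > 0)
}

set.seed(1)
X <- matrix(rnorm(80 * 15), 80, 15)
y <- 2 * X[, 1] - X[, 2] + rnorm(80, sd = 0.3)
sel <- ipw_sketch(y, X)
sel
```

Once a variable's weight reaches zero it never re-enters the model, so the set of active variables can only shrink from one iteration to the next.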
The IPW implementation was corrected in plsVarSel version 0.9.5. For backward
compatibility the old implementation is included as ipw_pls_legacy.
Returns a vector of variable numbers corresponding to the model having lowest prediction error.
Kristian Hovde Liland
M. Forina, C. Casolino, C. Pizarro Millan, Iterative predictor weighting (IPW) PLS: a technique for the elimination of useless predictors in regression problems, Journal of Chemometrics 13 (1999) 165-184.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, setDA.
    data(gasoline, package = "pls")
    with( gasoline, ipw_pls(octane, NIR) )
For each number of components LDA/QDA models are created from the scores of the supplied PLS model and classifications are performed.
    lda_from_pls(model, grouping, newdata, ncomp)
model |
|
grouping |
vector of grouping labels |
newdata |
predictors in the same format as in the |
ncomp |
maximum number of PLS components |
matrix of classifications
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(mayonnaise, package = "pls")
    mayonnaise <- within(mayonnaise, {dummy <- model.matrix(~y-1,data.frame(y=factor(oil.type)))})
    pls <- plsr(dummy ~ NIR, ncomp = 10, data = mayonnaise, subset = train)
    with(mayonnaise, {
      classes <- lda_from_pls(pls, oil.type[train], NIR[!train,], 10)
      colSums(oil.type[!train] == classes) # Number of correctly classified out of 42
    })
For each number of components LDA/QDA models are created from the scores of the supplied PLS model and classifications are performed. This use of cross-validation has limitations. Handle with care!
    lda_from_pls_cv(model, X, y, ncomp, Y.add = NULL)
model |
|
X |
predictors in the same format as in the |
y |
vector of grouping labels |
ncomp |
maximum number of PLS components |
Y.add |
additional responses |
matrix of classifications
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(mayonnaise, package = "pls")
    mayonnaise <- within(mayonnaise, {dummy <- model.matrix(~y-1,data.frame(y=factor(oil.type)))})
    pls <- plsr(dummy ~ NIR, ncomp = 8, data = mayonnaise, subset = train, validation = "CV", segments = 40, segment.type = "consecutive")
    with(mayonnaise, {
      classes <- lda_from_pls_cv(pls, NIR[train,], oil.type[train], 8)
      colSums(oil.type[train] == classes) # Number of correctly classified out of 120
    })
Artificial noise variables are added to the predictor set before the PLSR model is fitted. All the original variables having lower "importance" than the artificial noise variables are eliminated before the procedure is repeated until a stop criterion is reached.
    mcuve_pls(y, X, ncomp = 10, N = 3, ratio = 0.75, MCUVE.threshold = NA)
y |
vector of response values ( |
X |
numeric predictor |
ncomp |
integer number of components (default = 10). |
N |
number of Monte Carlo simulations (default = 3). |
ratio |
the proportion of the samples to use for calibration (default = 0.75). |
MCUVE.threshold |
threshold separating signal from noise (default = NA creates an automatic threshold from the data). |
Returns a vector of variable numbers corresponding to the model having lowest prediction error.
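The idea of comparing original variables against artificial noise variables can be sketched in base R as follows. This is a single-pass illustration under stated assumptions, not the algorithm inside mcuve_pls: `pls1_coef` and `mcuve_sketch` are hypothetical names, "importance" is taken to be the stability of the regression coefficient (mean divided by standard deviation over Monte Carlo subsamples), the cutoff is the largest absolute stability among the noise variables, and N is set much higher than the package default so that the standard deviations are meaningful.

```r
# Tiny PLS1 (NIPALS) returning regression coefficients; a stand-in PLS fit.
pls1_coef <- function(X, y, ncomp) {
  X <- scale(X, scale = FALSE); y <- y - mean(y)
  W <- P <- matrix(0, ncol(X), ncomp); q <- numeric(ncomp)
  for (a in seq_len(ncomp)) {
    w <- drop(crossprod(X, y)); w <- w / sqrt(sum(w^2))
    s <- drop(X %*% w)
    P[, a] <- drop(crossprod(X, s)) / sum(s^2)
    q[a] <- sum(y * s) / sum(s^2)
    X <- X - outer(s, P[, a]); y <- y - s * q[a]
    W[, a] <- w
  }
  drop(W %*% solve(crossprod(P, W), q))
}

mcuve_sketch <- function(y, X, ncomp = 2, N = 50, ratio = 0.75) {
  n <- nrow(X); p <- ncol(X)
  # augment the standardized predictors with p artificial noise variables
  Xa <- cbind(scale(X), matrix(rnorm(n * p), n, p))
  # coefficients from N random calibration subsets: a (2p) x N matrix
  B <- replicate(N, {
    idx <- sample(n, floor(ratio * n))
    pls1_coef(Xa[idx, , drop = FALSE], y[idx], ncomp)
  })
  stab <- rowMeans(B) / apply(B, 1, sd)          # stability per variable
  cutoff <- max(abs(stab[(p + 1):(2 * p)]))      # best-looking noise variable
  list(selected = which(abs(stab[seq_len(p)]) > cutoff),
       stability = stab[seq_len(p)], cutoff = cutoff)
}

set.seed(1)
X <- matrix(rnorm(80 * 10), 80, 10)
y <- 2 * X[, 1] - X[, 2] + rnorm(80, sd = 0.3)
res <- mcuve_sketch(y, X)
res$selected
```

Deriving the cutoff from the noise variables corresponds to the automatic threshold idea in the argument description; a fixed numeric threshold could be substituted for `cutoff` if one is supplied.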
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
V. Centner, D. Massart, O. de Noord, S. de Jong, B. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration, Analytical Chemistry 68 (1996) 3851-3858.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(gasoline, package = "pls")
    with( gasoline, mcuve_pls(octane, NIR) )
Adaptation of mvr from package pls v 2.4.3.
    mvrV(formula, ncomp, Y.add, data, subset, na.action, shrink,
      method = c("truncation", "stpls", "model.frame"), scale = FALSE,
      validation = c("none", "CV", "LOO"), model = TRUE, x = FALSE, y = FALSE, ...)
formula |
a model formula. Most of the lm formula constructs are supported. See below. |
ncomp |
the number of components to include in the model (see below). |
Y.add |
a vector or matrix of additional responses containing relevant information about the observations. Only used for cppls. |
data |
an optional data frame with the data to fit the model from. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain missing values. The default is set by the na.action setting of options, and is na.fail if that is unset. The 'factory-fresh' default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful. See na.omit for other alternatives. |
shrink |
optional shrinkage parameter for |
method |
the multivariate regression method to be used. If "model.frame", the model frame is returned. |
scale |
numeric vector, or logical. If numeric vector, X is scaled by dividing each variable with the corresponding element of scale. If scale is TRUE, X is scaled by dividing each variable by its sample standard deviation. If cross-validation is selected, scaling by the standard deviation is done for every segment. |
validation |
character. What kind of (internal) validation to use. See below. |
model |
a logical. If TRUE, the model frame is returned. |
x |
a logical. If TRUE, the model matrix is returned. |
y |
a logical. If TRUE, the response is returned. |
... |
additional arguments, passed to the underlying fit functions, and mvrCv. |
Plot a heatmap with colorbar.
    myImagePlot(x, main, ...)
x |
a |
main |
header text for the plot. |
... |
additional arguments (not implemented). |
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometrics and Intelligent Laboratory Systems 118 (2012) 62-69.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    myImagePlot(matrix(1:12,3,4), 'A header')
A large collection of variable selection methods for use with
Partial Least Squares. These include all methods in Mehmood et al. 2012
and more. All functions treat numeric responses as regression and
factor responses as classification. Default classification is PLS + LDA,
but setDA() can be used to choose PLS + QDA or PLS with response column maximization.
T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometrics and Intelligent Laboratory Systems 118 (2012) 62-69.
T. Mehmood, S. Sæbø, K.H. Liland, Comparison of variable selection methods in partial least squares regression, Journal of Chemometrics 34 (2020) e3226.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
Greedy algorithm for extracting the most dominant variables/columns with respect to simultaneous explained X-variance and squared correlation with Y.
    PVR(X, Y, nvar = 2, ncomp = NULL)
X |
numeric predictor |
Y |
numeric response |
nvar |
integer, the required number of selected variables (default = 2). |
ncomp |
integer, the number of principal components to include in the voting process (default = all PCs). |
A list containing:
ids |
The indices of the selected variables. |
betas |
The regression coefficients (including the constant term) for prediction of Y from the selected variables. |
Q |
Orthonormal scores (associated with the selected variables). |
R |
Corresponding loadings. NOTE: R[,vperm] is upper triangular. |
vperm |
Indices arranged in the order of the nvar selected and all non-selected variables. NOTE: R[,vperm] is upper triangular. |
U |
The normalized PCA-scores. |
s |
Singular values of the mean centered X. |
ssEX |
The X-variances explained by the selected variables. |
ssEY |
The Y-variances explained by the selected variables. |
ni |
The norms of the (residual) selected variables before the score-normalization (Q). |
Ulf Indahl, Kristian Hovde Liland.
Joakim Skogholt, Kristian Hovde Liland, Tormod Næs, Age K. Smilde, Ulf Geir Indahl, Selection of principal variables through a modified Gram–Schmidt process with and without supervision, Journal of Chemometrics, Volume 37, Issue 10, Pages e3510 (2023), https://doi.org/10.1002/cem.3510
PVS, VIP, filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls.
    library(pls)
    data(gasoline, package = "pls")
    # PVR: Select 10 variables using all PCs in voting
    pvr_result <- PVR(gasoline$NIR, gasoline$octane, nvar = 10)
    # Compare with PCR using all variables
    pcr_result <- pcr(octane ~ NIR, ncomp = 10, data = gasoline, validation = "CV", scale = FALSE)
    # Compare X-variance and Y-variance explained
    par(mfrow = c(1, 2))
    plot(cumsum(pvr_result$ssEX), type = "b", col = "blue",
         xlab = "Number of Variables/Components", ylab = "Cumulative % X-Variance",
         main = "X-Variance: PVR vs PCR", ylim = c(50, 100))
    pcr_xvar <- 100 * cumsum(pcr_result$Xvar) / pcr_result$Xtotvar
    lines(seq_along(pcr_xvar), pcr_xvar, type = "b", col = "red")
    legend("bottomright", legend = c("PVR (10 vars)", "PCR (10 comps)"),
           col = c("blue", "red"), lty = 1, pch = 1)
    plot(cumsum(pvr_result$ssEY), type = "b", col = "blue",
         xlab = "Number of Variables/Components", ylab = "Cumulative % Y-Variance",
         main = "Y-Variance: PVR vs PCR", ylim = c(0, 100))
    pcr_yvar <- 100 * R2(pcr_result)$val[1,1,-1]
    lines(seq_along(pcr_yvar), pcr_yvar, type = "b", col = "red")
    legend("bottomright", legend = c("PVR (10 vars)", "PCR (10 comps)"),
           col = c("blue", "red"), lty = 1, pch = 1)
    par(mfrow = c(1, 1))
    # Predict using selected variables
    X_selected <- gasoline$NIR[, pvr_result$ids]
    y_pred_pvr <- cbind(1, X_selected) %*% pvr_result$betas[, ncol(pvr_result$betas)]
    y_pred_pcr <- predict(pcr_result, ncomp = 10, newdata = gasoline)
    # Compare RMSE (training error - same data used for fitting)
    rmse_pvr <- sqrt(mean((gasoline$octane - y_pred_pvr)^2))
    rmse_pcr <- sqrt(mean((gasoline$octane - y_pred_pcr)^2))
    cat("RMSE - PVR:", round(rmse_pvr, 4), "\n")
    cat("RMSE - PCR:", round(rmse_pcr, 4), "\n")
Greedy algorithm for extracting the most dominant (principal) variables (X-columns) with respect to explained X-variance.
    PVS(X, nvar, ncomp = NULL)
X |
numeric predictor |
nvar |
integer, the required number of selected variables. |
ncomp |
integer, number of principal components included in the voting (default = all PCs). |
A list containing:
Q |
Orthonormal scores (associated with the selected variables). |
R |
Corresponding loadings. NOTE: R[,vperm] is upper triangular. |
ids |
Indices arranged in the order of the nvar selected variables. |
vperm |
Indices arranged in the order of the nvar selected and all non-selected variables. NOTE: R[,vperm] is upper triangular. |
ssEX |
The variances explained by the selected variables. |
ni |
The norms of the (residual) selected variables before the score-normalization (Q). |
U |
The normalized PCA-scores. |
s |
Singular values of the mean centered X. |
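A simplified version of greedy principal variable selection can be written in base R. This sketch is an assumption and omits the principal-component voting that PVS uses: at each step it picks the column whose direction explains the most variance of the whole residual matrix and then removes that direction by Gram-Schmidt deflation. `pv_sketch` is a hypothetical name, and only the `ids` and `ssEX` elements of the documented value are mimicked.

```r
pv_sketch <- function(X, nvar) {
  X <- scale(X, scale = FALSE)                 # mean-centre the predictors
  total <- sum(X^2)                            # total X-variation
  ids <- integer(nvar); ssEX <- numeric(nvar)
  for (i in seq_len(nvar)) {
    norms <- colSums(X^2)
    # variation of the residual X explained by projecting onto each column
    gain <- colSums(crossprod(X)^2) / norms
    gain[norms < 1e-12 * total] <- -Inf        # skip exhausted columns
    j <- which.max(gain)
    q <- X[, j] / sqrt(norms[j])               # orthonormal score
    X <- X - outer(q, drop(crossprod(X, q)))   # Gram-Schmidt deflation
    ids[i] <- j
    ssEX[i] <- 100 * gain[j] / total           # percent X-variance explained
  }
  list(ids = ids, ssEX = ssEX)
}

set.seed(1)
X <- matrix(rnorm(50 * 8), 50, 8)
X[, 5] <- X[, 2] + 0.1 * rnorm(50)             # a nearly redundant column
pv <- pv_sketch(X, 3)
pv$ids
cumsum(pv$ssEX)                                # cumulative percent explained
```

Since each selected column is orthogonalized out of the remainder, the per-step percentages add up, and selecting every column of a full-rank matrix accounts for all of the variance.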
Ulf Indahl, Kristian Hovde Liland.
Joakim Skogholt, Kristian Hovde Liland, Tormod Næs, Age K. Smilde, Ulf Geir Indahl, Selection of principal variables through a modified Gram–Schmidt process with and without supervision, Journal of Chemometrics, Volume 37, Issue 10, Pages e3510 (2023), https://doi.org/10.1002/cem.3510
PVR, VIP, filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls.
    library(pls)
    data(gasoline, package = "pls")
    # PVS: Select 10 variables using all PCs in voting
    pvs_result <- PVS(gasoline$NIR, nvar = 10)
    # Compare with PCA using pcr() (octane is unused in PCA)
    pca_result <- pcr(octane ~ NIR, ncomp = 10, data = gasoline, scale = FALSE)
    # Plot cumulative variance explained
    plot(cumsum(pvs_result$ssEX), type = "b", col = "blue",
         xlab = "Number of Variables/Components", ylab = "Cumulative % Variance Explained",
         main = "PVS vs PCA", ylim = c(0, 100))
    pca_var <- 100 * cumsum(pca_result$Xvar) / pca_result$Xtotvar
    lines(seq_along(pca_var), pca_var, type = "b", col = "red")
    legend("bottomright", legend = c("PVS (10 variables)", "PCA (10 components)"),
           col = c("blue", "red"), lty = 1, pch = 1)
A regularized variable elimination procedure for parsimonious variable selection, in which a stepwise elimination is also carried out.
    rep_pls(y, X, ncomp = 5, ratio = 0.75, VIP.threshold = 0.5, N = 3)
y |
vector of response values ( |
X |
numeric predictor |
ncomp |
integer number of components (default = 5). |
ratio |
the proportion of the samples to use for calibration (default = 0.75). |
VIP.threshold |
thresholding to remove non-important variables (default = 0.5). |
N |
number of samples in the selection matrix (default = 3). |
A stability-based variable selection procedure is adopted, where the samples are split randomly into a predefined number of training and test sets. For each split, g, the following stepwise procedure is adopted to select the variables. This implementation does not follow the original publication exactly, but it supports both regression and classification.
Returns a vector of variable numbers corresponding to the model having lowest prediction error.
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
T. Mehmood, H. Martens, S. Sæbø, J. Warringer, L. Snipen, A partial least squares based algorithm for parsimonious variable selection, Algorithms for Molecular Biology 6 (2011).
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(gasoline, package = "pls")
    ## Not run:
    with( gasoline, rep_pls(octane, NIR) )
    ## End(Not run)
The default method is LDA, but QDA and column of maximum prediction can be chosen.
    setDA(LQ = NULL)
LQ |
character argument 'lda', 'qda', 'max' or NULL |
Returns the default set method.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    ## Not run:
    setDA() # Query 'lda', 'qda' or 'max'
    setDA('qda') # Set default method to QDA
    ## End(Not run)
One of five filter methods can be chosen for repeated shaving of
a certain percentage of the worst performing variables. Performance of the
reduced models is stored and viewable through print and plot
methods.
    shaving(y, X, ncomp = 10, method = c("SR", "VIP", "sMC", "LW", "RC"), prop = 0.2,
      min.left = 2, comp.type = c("CV", "max"), validation = c("CV", 1), fixed = integer(0),
      newy = NULL, newX = NULL, segments = 10, plsType = "plsr", Y.add = NULL, ...)
    ## S3 method for class 'shaved'
    plot(x, y, what = c("error", "spectra"), index = "min", log = "x", ...)
    ## S3 method for class 'shaved'
    print(x, ...)
y |
vector of response values ( |
X |
numeric predictor |
ncomp |
integer number of components (default = 10). |
method |
filter method, i.e. SR, VIP, sMC, LW or RC given as |
prop |
proportion of variables to be removed in each iteration ( |
min.left |
minimum number of remaining variables. |
comp.type |
use number of components chosen by cross-validation, |
validation |
type of validation for |
fixed |
vector of indices for compulsory/fixed variables that should always be included in the modelling. |
newy |
validation response for RMSEP/error computations. |
newX |
validation predictors for RMSEP/error computations. |
segments |
see |
plsType |
Type of PLS model, "plsr" or "cppls". |
Y.add |
Additional response for CPPLS, see |
... |
additional arguments for |
x |
object of class |
what |
plot type. Default = "error". Alternative = "spectra". |
index |
which iteration to plot. Default = "min"; corresponding to minimum RMSEP. |
log |
logarithmic x (default) or y scale. |
Variables are first sorted with respect to some importance measure, usually one of the filter measures described above. Next, a threshold is used to eliminate a subset of the least informative variables. A model is then fitted to the remaining variables and its performance is measured. The procedure is repeated until maximum model performance is achieved.
Returns a list object of class shaved containing the method type,
the error, number of components, and number of variables per reduced model. It
also contains a list of all reduced variable sets plus the original data.
Kristian Hovde Liland
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
    data(mayonnaise, package = "pls")
    sh <- shaving(mayonnaise$design[,1], pls::msc(mayonnaise$NIR), type = "interleaved")
    pars <- par(mfrow = c(2,1), mar = c(4,4,1,1))
    plot(sh)
    plot(sh, what = "spectra")
    par(pars)
    print(sh)
Simulate multivariate normal data.
    simulate_classes(p, n1, n2)
    simulate_data(dims, n1 = 150, n2 = 50)
p |
integer number of variables. |
n1 |
integer number of samples in each of two classes in training/calibration data. |
n2 |
integer number of samples in each of two classes in test/validation data. |
dims |
a 10 element vector of group sizes. |
The class simulation is a straightforward simulation of multivariate normal data into two classes for training and test data, respectively. The data simulation uses a strictly structured multivariate normal simulation with continuous response data.
Returns a list of predictor and response data for training and testing.
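To make the two-class setup concrete, here is a small base R sketch of such a simulation. It is not the code behind simulate_classes: the function name `simulate_two_classes`, the `shift` argument, the identity covariance and the names of the returned list elements are all assumptions chosen for illustration.

```r
simulate_two_classes <- function(p, n1, n2, shift = 1) {
  draw <- function(n) {
    # n samples per class; class 2 has its mean shifted in every variable
    X <- rbind(matrix(rnorm(n * p), n, p),
               matrix(rnorm(n * p, mean = shift), n, p))
    list(X = X, y = factor(rep(1:2, each = n)))
  }
  train <- draw(n1)   # training/calibration data
  test <- draw(n2)    # test/validation data
  list(X = train$X, y = train$y, Xtest = test$X, ytest = test$y)
}

set.seed(1)
sim <- simulate_two_classes(5, 4, 4)
str(sim)
```

With n1 = 4 samples per class the training predictors form an 8 x 5 matrix, matching the reading of n1 and n2 as per-class sample counts in the argument descriptions.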
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometrics and Intelligent Laboratory Systems 118 (2012) 62-69.
T. Mehmood, S. Sæbø, K.H. Liland, Comparison of variable selection methods in partial least squares regression, Journal of Chemometrics 34 (2020) e3226.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
str(simulate_classes(5,4,4))
SwPA-PLS provides the influence of each variable without considering the influence of the rest of the variables through sub-sampling of samples and variables.
spa_pls(y, X, ncomp = 10, N = 3, ratio = 0.8, Qv = 10, SPA.threshold = 0.05)
| y | vector of response values ( |
|---|---|
| X | numeric predictor |
| ncomp | integer number of components (default = 10). |
| N | number of Monte Carlo simulations (default = 3). |
| ratio | the proportion of the samples to use for calibration (default = 0.8). |
| Qv | integer number of variables to be sampled in each iteration (default = 10). |
| SPA.threshold | thresholding to remove non-important variables (default = 0.05). |
Returns a vector of variable numbers corresponding to the model having lowest prediction error.
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
H. Li, M. Zeng, B. Tan, Y. Liang, Q. Xu, D. Cao, Recipe for revealing informative metabolites based on model population analysis, Metabolomics 6 (2010) 353-361. http://code.google.com/p/spa2010/downloads/list.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
data(gasoline, package = "pls")
with( gasoline, spa_pls(octane, NIR) )
A soft-thresholding step in PLS algorithm (ST-PLS) based on ideas from the nearest shrunken centroid method.
stpls(..., method = c("stpls", "model.frame"))
| ... | arguments for the underlying |
|---|---|
| method | choice between the default |
The ST-PLS approach is more or less identical to the Sparse-PLS presented
independently by Lê Cao et al. This implementation is an expansion of code from the
pls package. Arguments for stpls.fit include ncomp and shrink, where
the former sets the number of components and the latter is the shrinkage parameter
indicating how large a proportion of the maximum absolute value of the loadings
should be subtracted from the loadings in the nearest shrunken centroid method.
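The effect of the shrink parameter can be illustrated with a small base R sketch of the soft-thresholding step. The helper `soft_threshold` is hypothetical and is not the package's stpls.fit; the final rescaling to unit length is an assumption about how a thresholded loading weight vector would typically be used in a PLS step.

```r
# Hypothetical helper illustrating soft-thresholding of a loading weight vector
soft_threshold <- function(w, shrink) {
  stopifnot(shrink >= 0, shrink < 1, any(w != 0))
  delta <- shrink * max(abs(w))             # proportion of the largest absolute loading
  w_st  <- sign(w) * pmax(abs(w) - delta, 0)  # shrink towards zero, clip at zero
  w_st / sqrt(sum(w_st^2))                  # assumed: rescale to unit length
}

w    <- c(0.8, -0.5, 0.1, -0.05, 0.3)
w_st <- soft_threshold(w, shrink = 0.2)
w_st
```

With `shrink = 0.2` the amount subtracted is 0.16, so the two elements with absolute value below 0.16 become exactly zero while the remaining elements keep their sign. A larger `shrink` removes more variables from the component.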
Returns an object of class mvrV, similar to the mvr object of the pls package.
Solve Sæbø, Tahir Mehmood, Kristian Hovde Liland.
S. Sæbø, T. Almøy, J. Aarøe, A.H. Aastveit, ST-PLS: a multi-dimensional nearest shrunken centroid type classifier via pls, Journal of Chemometrics 20 (2007) 54-62.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
data(yarn, package = "pls")
st <- stpls(density~NIR, ncomp=5, shrink=c(0.1,0.2), validation="CV", data=yarn)
summary(st)
Adaptation of summary.mvr from the pls package v 2.4.3.
## S3 method for class 'mvrV'
summary(object, what = c("all", "validation", "training"), digits = 4, print.gap = 2, ...)
| object | an mvrV object |
|---|---|
| what | one of "all", "validation" or "training" |
| digits | integer. Minimum number of significant digits in the output. Default is 4. |
| print.gap | integer. Gap between columns of the printed tables. |
| ... | Other arguments sent to underlying methods. |
Variable selection based on the T^2 statistic. A side effect of running the selection is printing of tables and production of plots.
T2_pls(ytr, Xtr, yts, Xts, ncomp = 10, alpha = c(0.2, 0.15, 0.1, 0.05, 0.01))
| ytr | Vector of responses for model training. |
|---|---|
| Xtr | Matrix of predictors for model training. |
| yts | Vector of responses for model testing. |
| Xts | Matrix of predictors for model testing. |
| ncomp | Number of PLS components. |
| alpha | Hotelling's T^2 significance levels. |
Parameters and variables corresponding to variable selections of minimum error and minimum variable set.
T. Mehmood, Hotelling T^2 based variable selection in partial least squares regression, Chemometrics and Intelligent Laboratory Systems 154 (2016) 23-28.
data(gasoline, package = "pls")
library(pls)
if(interactive()){
  t2 <- T2_pls(gasoline$octane[1:40], gasoline$NIR[1:40,],
               gasoline$octane[-(1:40)], gasoline$NIR[-(1:40),],
               ncomp = 10, alpha = c(0.2, 0.15, 0.1, 0.05, 0.01))
  matplot(t(gasoline$NIR), type = 'l', col=1, ylab='intensity')
  points(t2$mv[[1]], colMeans(gasoline$NIR)[t2$mv[[1]]], col=2, pch='x')
  points(t2$mv[[2]], colMeans(gasoline$NIR)[t2$mv[[2]]], col=3, pch='o')
}
Distribution based truncation for variable selection in subspace methods for multivariate regression.
truncation(..., Y.add, weights, method = "truncation")
| ... | arguments passed on to |
|---|---|
| Y.add | optional additional response vector/matrix found in the input data. |
| weights | optional object weighting vector. |
| method | choice (default = |
Loading weights are truncated around their median based on confidence intervals
for modelling without replicates (Lenth et al.). The arguments passed to mvrV include
all possible arguments to cppls and the following truncation parameters
(with defaults) trunc.pow=FALSE, truncation=NULL, trunc.width=NULL, trunc.weight=0,
reorth=FALSE, symmetric=FALSE.
The default way of performing truncation involves the following parameter values: truncation="Lenth", trunc.width=0.95, indicating Lenth's confidence intervals (asymmetric), with a confidence of 95%. Setting trunc.weight to a value between 0 and 1 gives shrinkage instead of a hard threshold. An alternative truncation strategy can be used with: truncation="quantile", in which a quantile line is used for detecting outliers/inliers.
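As a rough illustration of truncating loading weights around their median, the base R sketch below uses the classic symmetric form of Lenth's pseudo standard error and margin of error. The helper `lenth_truncate` is hypothetical: the package uses an asymmetric variant and its own internals, so this shows the general idea only, with a hard threshold that sets inliers to zero.

```r
# Hypothetical helper: hard truncation of inliers using Lenth's margin of error
lenth_truncate <- function(w, conf = 0.95) {
  m   <- length(w)
  dev <- w - median(w)                                   # centre on the median
  s0  <- 1.5 * median(abs(dev))                          # initial robust scale
  pse <- 1.5 * median(abs(dev)[abs(dev) < 2.5 * s0])     # pseudo standard error
  me  <- qt(1 - (1 - conf) / 2, df = m / 3) * pse        # margin of error
  w[abs(dev) <= me] <- 0                                 # inliers are treated as noise
  w
}

set.seed(1)
# 50 small "noise" weights followed by three clearly informative ones
w       <- c(rnorm(50, sd = 0.01), 0.6, -0.4, 0.5)
w_trunc <- lenth_truncate(w)
which(w_trunc != 0)
```

Weights inside the interval are set to zero, so only variables whose loading weights stand out from the bulk of the distribution contribute to the component.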
Returns an object of class mvrV, similar to the mvr object of the pls package.
Kristian Hovde Liland.
K.H. Liland, M. Høy, H. Martens, S. Sæbø: Distribution based truncation for variable selection in subspace methods for multivariate regression, Chemometrics and Intelligent Laboratory Systems 122 (2013) 103-111.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
data(yarn, package = "pls")
tr <- truncation(density ~ NIR, ncomp=5, data=yarn, validation="CV",
  truncation="Lenth", trunc.width=0.95) # Default truncation
summary(tr)
Various filter methods extracting and using information from
mvr objects to assign importance to all included variables. Available
methods are Significance Multivariate Correlation (sMC), Selectivity Ratio (SR),
Variable Importance in Projections (VIP), Loading Weights (LW), Regression Coefficients (RC).
VIP(pls.object, opt.comp, p = dim(pls.object$coef)[1])
SR(pls.object, opt.comp, X)
sMC(pls.object, opt.comp, X, alpha_mc = 0.05)
LW(pls.object, opt.comp)
RC(pls.object, opt.comp)
URC(pls.object, opt.comp)
FRC(pls.object, opt.comp)
mRMR(pls.object, nsel, X)
| pls.object | |
|---|---|
| opt.comp | optimal number of components of PLS model. |
| p | number of variables in PLS model. |
| X | data matrix used as predictors in PLS modelling. |
| alpha_mc | quantile significance for automatic selection of variables in |
| nsel | number of variables to select. |
From plsVarSel 0.9.10, the VIP method handles multiple responses correctly, as does the LW method. All other filter methods implemented in this package assume a single response and will give their results based on the first response in multi-response cases.
A vector having the same length as the number of variables in the associated PLS model. High values are associated with high importance, explained variance or relevance to the model.
The sMC has an attribute "quantile", which is the associated quantile of the F-distribution, which can be used as a cut-off for significant variables, similar to the cut-off of 1 associated with the VIP.
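The two cut-offs mentioned above can be turned into variable selections as in the following sketch. It reuses the gasoline model from the package example; the object names `sel_vip` and `sel_smc` are introduced here for illustration only.

```r
library(pls)
library(plsVarSel)

data(gasoline, package = "pls")
pls  <- plsr(octane ~ NIR, ncomp = 10, validation = "LOO", data = gasoline)
comp <- which.min(pls$validation$PRESS)
X    <- unclass(gasoline$NIR)

vip <- VIP(pls, comp)
smc <- sMC(pls, comp, X)

# Variables above the conventional VIP cut-off of 1
sel_vip <- which(vip > 1)
# Variables above the F-distribution quantile stored with the sMC values
sel_smc <- which(smc > attr(smc, "quantile"))

length(sel_vip)
length(sel_smc)
```

The two selections need not agree, since VIP and sMC measure importance in different ways; comparing them is one way to judge how stable a selection is.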
Tahir Mehmood, Kristian Hovde Liland, Solve Sæbø.
T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometrics and Intelligent Laboratory Systems 118 (2012) 62-69.
T. Mehmood, S. Sæbø, K.H. Liland, Comparison of variable selection methods in partial least squares regression, Journal of Chemometrics 34 (2020) e3226.
VIP (SR/sMC/LW/RC), filterPLSR, shaving,
stpls, truncation,
bve_pls, ga_pls, ipw_pls, mcuve_pls,
rep_pls, spa_pls,
lda_from_pls, lda_from_pls_cv, setDA.
data(gasoline, package = "pls")
library(pls)
pls <- plsr(octane ~ NIR, ncomp = 10, validation = "LOO", data = gasoline)
comp <- which.min(pls$validation$PRESS)
X <- unclass(gasoline$NIR)
vip <- VIP(pls, comp)
sr <- SR (pls, comp, X)
smc <- sMC(pls, comp, X)
lw <- LW (pls, comp)
rc <- RC (pls, comp)
urc <- URC(pls, comp)
frc <- FRC(pls, comp)
mrm <- mRMR(pls, 401, X)$score
matplot(scale(cbind(vip, sr, smc, lw, rc, urc, frc, mrm)), type = 'l')
This implements the PLS-WVC2 component dependent version of WVC from Lin et al., i.e., using Equations 14, 16 and 19. The implementation is used in T. Mehmood, S. Sæbø, K.H. Liland, Comparison of variable selection methods in partial least squares regression, Journal of Chemometrics 34 (2020) e3226. However, there is a mistake in the notation in Mehmood et al. exchanging the denominator of Equation 19 (w'X'Xw) with (w'X'Yw).
WVC_pls(y, X, ncomp, normalize = FALSE, threshold = NULL)
| y | Vector of responses. |
|---|---|
| X | Matrix of predictors. |
| ncomp | Number of components. |
| normalize | Divide WVC vectors by maximum value. |
| threshold | Set loading weights smaller than threshold to 0 and recompute component. |
Returns loading weights, loadings, regression coefficients, scores and Y-loadings, plus the WVC weights.
W. Lin, H. Hang, Y. Zhuang, S. Zhang, Variable selection in partial least squares with the weighted variable contribution to the first singular value of the covariance matrix, Chemometrics and Intelligent Laboratory Systems 183 (2018) 113-121.
library(pls)
data(mayonnaise, package = "pls")
wvc <- WVC_pls(factor(mayonnaise$oil.type), mayonnaise$NIR, 10)
wvcNT <- WVC_pls(factor(mayonnaise$oil.type), mayonnaise$NIR, 10, TRUE, 0.5)
old.par <- par(mfrow=c(3,1), mar=c(2,4,1,1))
matplot(t(mayonnaise$NIR), type='l', col=1, ylab='intensity')
matplot(wvc$W[,1:3], type='l', ylab='W')
matplot(wvcNT$W[,1:3], type='l', ylab='W, thr.=0.5')
par(old.par)