SelvarMix: An R package for variable selection in model-based clustering and discriminant analysis with a regularization approach.
Description:
Download SelvarMix 1.2 (beta version): link.
Reference: Celeux, G., C. Maugis-Rabusseau, and M. Sedki. 2016. “SelvarMix: An R Package for Variable Selection in Model-Based Clustering and Discriminant Analysis with a Regularization Approach.” Preprint.
All the experiments presented here were run with SelvarMix 1.2.
The SelvarMix package implements a regularization approach to variable selection in the model-based clustering and classification frameworks. First, the variables are ranked with a lasso-like procedure. Second, the method of (Maugis, Celeux, and Martin-Magniette 2009; Maugis, Celeux, and Martin-Magniette 2011) is adapted to determine the role of each variable in the two frameworks. This variable ranking avoids the painfully slow stepwise forward and backward algorithms of (Maugis, Celeux, and Martin-Magniette 2009). Thus, SelvarMix provides a much faster variable selection procedure than (Maugis, Celeux, and Martin-Magniette 2009; Maugis, Celeux, and Martin-Magniette 2011), making it feasible to study high-dimensional datasets.
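To fix ideas, the SRUW framework of (Maugis, Celeux, and Martin-Magniette 2009) partitions the variables into the relevant clustering variables \(S\), the redundant variables \(U\) explained by a subset of predictors \(R \subseteq S\) through a linear regression, and the independent variables \(W\). With illustrative notation (the exact parameterization is given in the cited paper), the data density decomposes as

\[
f(y) \;=\; \underbrace{\sum_{k=1}^{K} \pi_k \,\Phi\big(y^{S} \mid \mu_k, \Sigma_k\big)}_{\text{mixture on } S}
\;\times\; \underbrace{\Phi\big(y^{U} \mid a + y^{R}\beta, \,\Omega\big)}_{\text{regression of } U \text{ on } R}
\;\times\; \underbrace{\Phi\big(y^{W} \mid \gamma, \tau\big)}_{\text{independent block } W},
\]

where \(\Phi(\cdot \mid \mu, \Sigma)\) denotes a Gaussian density. Model selection then amounts to choosing the partition \((S, R, U, W)\), the number of clusters \(K\) and the covariance forms by maximizing an information criterion such as BIC.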
The utility functions summary and print facilitate the interpretation of the results.
This section presents a complete analysis of a simulated dataset. It makes use of all the functions implemented in the SelvarMix package and may be regarded as a tutorial.
The cluster analysis is performed with an unknown number of clusters. An information criterion is used both to select the variables and to choose the number of clusters. The selected model is described with the summary function.
The synthetic dataset
The simulated dataset consists of 2000 data points in \(\mathbb{R}^{14}\). On the subset of relevant clustering variables \(S = \{1, 2\}\), the data are distributed according to a mixture of four equiprobable spherical Gaussian distributions with means \((0,0)\), \((3,0)\), \((0,3)\) and \((3,3)\). The subset of redundant variables is \(U = \{3, \ldots, 11\}\). These variables are explained by the subset of predictor variables \(R = \{1, 2\}\) through a linear regression. The last three variables, \(W = \{12, 13, 14\}\), are independent of the others. More details are given in (Maugis, Celeux, and Martin-Magniette 2009).
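In matrix form, the redundant block of the simulation below is drawn as

\[
x^{U} \;=\; a \;+\; x^{R} b \;+\; \varepsilon, \qquad \varepsilon \sim \mathcal{N}_{9}(0, \Omega),
\]

where \(a\) is a vector of column intercepts, \(b\) is the \(2 \times 9\) matrix of regression coefficients, and \(\Omega\) is a block covariance matrix (diagonal entries plus two rotated \(2 \times 2\) blocks).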
set.seed(123)
n <- 2000; p <- 14
x <- matrix(0, n, p)

## Relevant clustering variables S = {1, 2}: mixture of four equiprobable
## spherical Gaussians with means (0,0), (3,0), (0,3) and (3,3)
x[, 1] <- rnorm(n, 0, 1)
x[, 2] <- rnorm(n, 0, 1)
z <- sample(1:4, n, replace = TRUE)
x[z == 2, 1] <- x[z == 2, 1] + 3
x[z == 3, 2] <- x[z == 3, 2] + 3
x[z == 4, 1] <- x[z == 4, 1] + 3
x[z == 4, 2] <- x[z == 4, 2] + 3

## Redundant variables U = {3, ..., 11}: linear regression on R = {1, 2}
## with noise covariance omega (diagonal entries and two rotated 2x2 blocks)
omega <- matrix(0, 9, 9)
diag(omega)[1:3] <- rep(1, 3)
diag(omega)[4:5] <- rep(0.5, 2)
rtmat1 <- matrix(c(cos(pi/3), -sin(pi/3), sin(pi/3), cos(pi/3)), ncol = 2, byrow = TRUE)
rtmat2 <- matrix(c(cos(pi/6), -sin(pi/6), sin(pi/6), cos(pi/6)), ncol = 2, byrow = TRUE)
omega[6:7, 6:7] <- t(rtmat1) %*% diag(c(1, 3)) %*% rtmat1
omega[8:9, 8:9] <- t(rtmat2) %*% diag(c(2, 6)) %*% rtmat2
b <- cbind(c(0.5, 1), c(2, 0), c(0, 3), c(-1, 2), c(2, -4), c(0.5, 0), c(4, 0.5), c(3, 0), c(2, 1))
a <- rep(c(0, 0, seq(0.4, 2, len = 7)), each = n)   # per-column intercepts
x[, 3:11] <- a + x[, 1:2] %*% b + t(t(chol(omega)) %*% matrix(rnorm(n * 9), 9, n))

## Independent variables W = {12, 13, 14}
x[, 12:14] <- matrix(rnorm(3 * n), n, 3)
x[, 12] <- x[, 12] + 3.2; x[, 13] <- x[, 13] + 3.6; x[, 14] <- x[, 14] + 4
Variable selection and selection of the number of clusters in the clustering framework
# Cluster analysis with variable selection with parallel computing (8 cores)
# The last two input arguments are optional
require(SelvarMix)
Loading required package: SelvarMix
Loading required package: glasso
Loading required package: Rmixmod
Loading required package: Rcpp
Rmixmod version 2.0.3 loaded
R package of mixmodLib version 3.0.1
Condition of use
----------------
Copyright (C) MIXMOD Team - 2001-2015
MIXMOD is publicly available under the GPL license (see www.gnu.org/copyleft/gpl.html)
You can redistribute it and/or modify it under the terms of the GPL-3 license.
Please understand that there may still be bugs and errors. Use it at your own risk.
We take no responsibility for any errors or omissions in this package or for any misfortune that may befall you or others as a result of its use.
Please report bugs at: http://www.mixmod.org/article.php3?id_article=23
More information on : www.mixmod.org
Loading required package: parallel
obj <- SelvarClustLasso(x = x, nbcluster = 3:5, models = mixmodGaussianModel(family = "spherical"), nbcores = 8)
variable ranking
SRUW selection with BIC criterion
model selection with BIC criterion
Model Summary
# Summary of the selected model
summary(obj)
Criterion: BIC
Criterion value: -94401.89
Number of clusters: 4
Gaussian mixture model: Gaussian_p_L_I
Regression covariance model: LC
Independent covariance model: LI
The SRUW model:
S: 1 2
R: 1 2
U: 3 4 5 6 7 8 9 10 11
W: 14 13 12
Result print
# print clustering and regression parameters
print(obj)
****************************************
*** Cluster 1
* proportion = 0.2500
* means = 2.9666 2.8857
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
*** Cluster 2
* proportion = 0.2500
* means = 0.0404 3.0217
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
*** Cluster 3
* proportion = 0.2500
* means = 3.0350 0.0084
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
*** Cluster 4
* proportion = 0.2500
* means = 0.0219 -0.0611
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
****************************************
Regression parameters:
3 4 5 6 7
intercept 0.8955058 0.91583873 0.9756543 0.9668664 0.9979245
1 0.5086800 2.01156693 -0.0186641 -1.0231808 1.9828448
2 1.0166012 -0.02518655 3.0110172 2.0015700 -4.0198483
8 9 10 11
intercept 0.846165781 0.8630311 0.88161925 0.8946564
1 0.521229074 4.0049986 2.98323243 2.0136865
2 -0.008883697 0.5097855 0.01598624 1.0115493
Variable selection in classification
# Discriminant analysis with learning and testing data
# Variable selection with parallel computing (8 cores)
xl <- x[1:1900, ]; xt <- x[1901:2000, ]
zl <- z[1:1900]; zt <- z[1901:2000]
obj <- SelvarLearnLasso(x = xl, z = zl, models = mixmodGaussianModel(family = "spherical"), xtest = xt, ztest = zt, nbcores = 8)
variable ranking
SRUW selection with BIC criterion
model selection with BIC criterion
Model Summary
# Summary of the selected model
summary(obj)
Criterion: BIC
Criterion value: -90917.73
Number of clusters: 4
Gaussian mixture model: Gaussian_p_L_I
Prediction error: 0.14
Regression covariance model: LC
Independent covariance model: LI
The SRUW model:
S: 1 2
R: 1 2
U: 3 4 5 6 7 8 9 10 11
W: 14 13 12
Result print
# print clustering and regression parameters
print(obj)
****************************************
*** Cluster 1
* proportion = 0.2500
* means = 0.0348 -0.0533
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
*** Cluster 2
* proportion = 0.2500
* means = 2.9986 0.0349
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
*** Cluster 3
* proportion = 0.2500
* means = 0.0656 3.0287
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
*** Cluster 4
* proportion = 0.2500
* means = 3.0066 2.8858
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
****************************************
Regression parameters:
3 4 5 6 7
intercept 0.8855278 0.91073034 0.96793471 0.969374 1.005703
1 0.5118310 2.01298787 -0.01485695 -1.023251 1.979198
2 1.0239138 -0.02352075 3.01144447 1.997734 -4.019287
8 9 10 11
intercept 0.842400074 0.8593035 0.8677248 0.899352
1 0.521734834 4.0069348 2.9883470 1.997643
2 -0.004344507 0.5107485 0.0266206 1.014451
Maugis, C., G. Celeux, and M.-L. Martin-Magniette. 2009. “Variable Selection in Model-Based Clustering: A General Variable Role Modeling.” Computational Statistics and Data Analysis 53: 3872–82.
———. 2011. “Variable Selection in Model-Based Discriminant Analysis.” Journal of Multivariate Analysis 102: 1374–87.