SelvarMix: An R package for variable selection in model-based clustering and discriminant analysis with a regularization approach.
Description:
Download SelvarMix 1.2 (beta version): link.
Reference: Celeux, G., C. Maugis-Rabusseau, and M. Sedki. 2016. “SelvarMix: An R Package for Variable Selection in Model-Based Clustering and Discriminant Analysis with a Regularization Approach.” Preprint.
All the experiments presented here were run with SelvarMix 1.2.
The SelvarMix package implements a regularization approach to variable selection in the model-based clustering and classification frameworks. First, the variables are ranked with a lasso-like procedure. Second, the method of (Maugis, Celeux, and Martin-Magniette 2009; Maugis, Celeux, and Martin-Magniette 2011) is adapted to determine the role of each variable in the two frameworks. This variable ranking avoids the painfully slow stepwise forward and backward algorithms of (Maugis, Celeux, and Martin-Magniette 2009). Thus, SelvarMix provides a much faster variable selection procedure than (Maugis, Celeux, and Martin-Magniette 2009; Maugis, Celeux, and Martin-Magniette 2011), making it feasible to study high-dimensional datasets.
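To fix ideas, the SRUW framework of (Maugis, Celeux, and Martin-Magniette 2009) partitions the variables into the relevant clustering variables \(S\), the redundant variables \(U\) explained by a subset of predictors \(R \subseteq S\) through a linear regression, and the independent variables \(W\). With illustrative notation (the exact parameterization is given in the cited paper), the data density decomposes as

\[
f(y) \;=\; \underbrace{\sum_{k=1}^{K} \pi_k \,\Phi\big(y^{S} \mid \mu_k, \Sigma_k\big)}_{\text{mixture on } S}
\;\times\; \underbrace{\Phi\big(y^{U} \mid a + y^{R}\beta, \,\Omega\big)}_{\text{regression of } U \text{ on } R}
\;\times\; \underbrace{\Phi\big(y^{W} \mid \gamma, \tau\big)}_{\text{independent block } W},
\]

where \(\Phi(\cdot \mid \mu, \Sigma)\) denotes a Gaussian density. Model selection then amounts to choosing the partition \((S, R, U, W)\), the number of clusters \(K\) and the covariance forms by maximizing an information criterion such as BIC.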
The utility functions summary and print facilitate the interpretation of the results.
This section presents a complete analysis of a simulated dataset. It makes use of all the functions implemented in the SelvarMix package and may be regarded as a tutorial.
The cluster analysis is performed with an unknown number of clusters. An information criterion is used both to select the variables and to choose the number of clusters. The selected model is described with the summary function.
The synthetic dataset
The simulated dataset consists of 2000 data points in \(\mathbb{R}^{14}\). On the subset of relevant clustering variables \(S = \{1, 2\}\), the data are distributed according to a mixture of four equiprobable spherical Gaussian distributions with means \((0,0)\), \((3,0)\), \((0,3)\) and \((3,3)\). The subset of redundant variables is \(U = \{3, \ldots, 11\}\). These variables are explained by the subset of predictor variables \(R = \{1, 2\}\) through a linear regression. The last three variables, \(W = \{12, 13, 14\}\), are independent of the others. More details are given in (Maugis, Celeux, and Martin-Magniette 2009).
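In matrix form, the redundant block of the simulation below is drawn as

\[
x^{U} \;=\; a \;+\; x^{R} b \;+\; \varepsilon, \qquad \varepsilon \sim \mathcal{N}_{9}(0, \Omega),
\]

where \(a\) is a vector of column intercepts, \(b\) is the \(2 \times 9\) matrix of regression coefficients, and \(\Omega\) is a block covariance matrix (diagonal entries plus two rotated \(2 \times 2\) blocks).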
set.seed(123)
n <- 2000; p <- 14
x <- matrix(0, n, p)

## Relevant clustering variables S = {1, 2}: mixture of four equiprobable
## spherical Gaussians with means (0,0), (3,0), (0,3) and (3,3)
x[, 1] <- rnorm(n, 0, 1)
x[, 2] <- rnorm(n, 0, 1)
z <- sample(1:4, n, replace = TRUE)
x[z == 2, 1] <- x[z == 2, 1] + 3
x[z == 3, 2] <- x[z == 3, 2] + 3
x[z == 4, 1] <- x[z == 4, 1] + 3
x[z == 4, 2] <- x[z == 4, 2] + 3

## Redundant variables U = {3, ..., 11}: linear regression on R = {1, 2}
## with noise covariance omega (diagonal entries and two rotated 2x2 blocks)
omega <- matrix(0, 9, 9)
diag(omega)[1:3] <- rep(1, 3)
diag(omega)[4:5] <- rep(0.5, 2)
rtmat1 <- matrix(c(cos(pi/3), -sin(pi/3), sin(pi/3), cos(pi/3)), ncol = 2, byrow = TRUE)
rtmat2 <- matrix(c(cos(pi/6), -sin(pi/6), sin(pi/6), cos(pi/6)), ncol = 2, byrow = TRUE)
omega[6:7, 6:7] <- t(rtmat1) %*% diag(c(1, 3)) %*% rtmat1
omega[8:9, 8:9] <- t(rtmat2) %*% diag(c(2, 6)) %*% rtmat2
b <- cbind(c(0.5, 1), c(2, 0), c(0, 3), c(-1, 2), c(2, -4), c(0.5, 0), c(4, 0.5), c(3, 0), c(2, 1))
a <- rep(c(0, 0, seq(0.4, 2, len = 7)), each = n)   # per-column intercepts
x[, 3:11] <- a + x[, 1:2] %*% b + t(t(chol(omega)) %*% matrix(rnorm(n * 9), 9, n))

## Independent variables W = {12, 13, 14}
x[, 12:14] <- matrix(rnorm(3 * n), n, 3)
x[, 12] <- x[, 12] + 3.2; x[, 13] <- x[, 13] + 3.6; x[, 14] <- x[, 14] + 4
Variable selection and selection of the number of clusters in the clustering framework
# Cluster analysis with variable selection with parallel computing (8 cores)
# The last two input arguments are optional
require(SelvarMix)
Loading required package: SelvarMix
Loading required package: glasso
Loading required package: Rmixmod
Loading required package: Rcpp
Rmixmod version 2.0.3 loaded
R package of mixmodLib version 3.0.1
Condition of use
----------------
Copyright (C) MIXMOD Team - 2001-2015
MIXMOD is publicly available under the GPL license (see www.gnu.org/copyleft/gpl.html)
You can redistribute it and/or modify it under the terms of the GPL-3 license.
Please understand that there may still be bugs and errors. Use it at your own risk.
We take no responsibility for any errors or omissions in this package or for any misfortune that may befall you or others as a result of its use.
Please report bugs at: http://www.mixmod.org/article.php3?id_article=23
More information on : www.mixmod.org
Loading required package: parallel
obj <- SelvarClustLasso(x = x, nbcluster = 3:5, models = mixmodGaussianModel(family = "spherical"), nbcores = 8)
variable ranking
SRUW selection with BIC criterion
model selection with BIC criterion
Model Summary
# Summary of the selected model
summary(obj)
Criterion: BIC
Criterion value: -94401.89
Number of clusters: 4
Gaussian mixture model: Gaussian_p_L_I
Regression covariance model: LC
Independent covariance model: LI
The SRUW model:
S: 1 2
R: 1 2
U: 3 4 5 6 7 8 9 10 11
W: 14 13 12
Result print
# print clustering and regression parameters
print(obj)
****************************************
*** Cluster 1
* proportion = 0.2500
* means = 2.9666 2.8857
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
*** Cluster 2
* proportion = 0.2500
* means = 0.0404 3.0217
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
*** Cluster 3
* proportion = 0.2500
* means = 3.0350 0.0084
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
*** Cluster 4
* proportion = 0.2500
* means = 0.0219 -0.0611
* variances = | 0.9684 0.0000 |
| 0.0000 0.9684 |
****************************************
Regression parameters:
3 4 5 6 7
intercept 0.8955058 0.91583873 0.9756543 0.9668664 0.9979245
1 0.5086800 2.01156693 -0.0186641 -1.0231808 1.9828448
2 1.0166012 -0.02518655 3.0110172 2.0015700 -4.0198483
8 9 10 11
intercept 0.846165781 0.8630311 0.88161925 0.8946564
1 0.521229074 4.0049986 2.98323243 2.0136865
2 -0.008883697 0.5097855 0.01598624 1.0115493
Variable selection in classification
# Discriminant analysis with learning and testing data
# Variable selection with parallel computing (8 cores)
xl <- x[1:1900, ]; xt <- x[1901:2000, ]
zl <- z[1:1900]; zt <- z[1901:2000]
obj <- SelvarLearnLasso(x = xl, z = zl, models = mixmodGaussianModel(family = "spherical"), xtest = xt, ztest = zt, nbcores = 8)
variable ranking
SRUW selection with BIC criterion
model selection with BIC criterion
Model Summary
# Summary of the selected model
summary(obj)
Criterion: BIC
Criterion value: -90917.73
Number of clusters: 4
Gaussian mixture model: Gaussian_p_L_I
Prediction error: 0.14
Regression covariance model: LC
Independent covariance model: LI
The SRUW model:
S: 1 2
R: 1 2
U: 3 4 5 6 7 8 9 10 11
W: 14 13 12
Result print
# print clustering and regression parameters
print(obj)
****************************************
*** Cluster 1
* proportion = 0.2500
* means = 0.0348 -0.0533
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
*** Cluster 2
* proportion = 0.2500
* means = 2.9986 0.0349
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
*** Cluster 3
* proportion = 0.2500
* means = 0.0656 3.0287
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
*** Cluster 4
* proportion = 0.2500
* means = 3.0066 2.8858
* variances = | 0.9721 0.0000 |
| 0.0000 0.9721 |
****************************************
Regression parameters:
3 4 5 6 7
intercept 0.8855278 0.91073034 0.96793471 0.969374 1.005703
1 0.5118310 2.01298787 -0.01485695 -1.023251 1.979198
2 1.0239138 -0.02352075 3.01144447 1.997734 -4.019287
8 9 10 11
intercept 0.842400074 0.8593035 0.8677248 0.899352
1 0.521734834 4.0069348 2.9883470 1.997643
2 -0.004344507 0.5107485 0.0266206 1.014451
Maugis, C., G. Celeux, and M.-L. Martin-Magniette. 2009. “Variable Selection in Model-Based Clustering: A General Variable Role Modeling.” Computational Statistics and Data Analysis 53: 3872–82.
———. 2011. “Variable Selection in Model-Based Discriminant Analysis.” Journal of Multivariate Analysis 102: 1374–87.