
Some Sparse Methods for High Dimensional Data

Gilbert Saporta 1
1 MSDMA - Méthodes statistiques de data-mining et apprentissage (Statistical methods for data mining and machine learning), CEDRIC - Centre d'études et de recherche en informatique et communications
Abstract: High dimensional data means that the number of variables p is far larger than the number of observations n. This occurs in several fields, such as genomics or chemometrics. When p > n, the OLS estimator for linear regression does not exist. Since this is a case of forced multicollinearity, one may use regularized methods such as ridge regression, principal component regression, or PLS regression: these methods provide rather robust estimates through a dimension-reduction approach or through constraints on the regression coefficients. The fact that all the predictors are kept may be considered a positive point in some cases. However, if p >> n it becomes a drawback, since a combination of thousands of variables cannot be interpreted. Sparse combinations, i.e., combinations with a large number of zero coefficients, are preferred. Lasso, elastic net, and sPLS perform regularization and variable selection simultaneously, by means of non-quadratic penalties: L1, SCAD, etc. Group-lasso is a generalization suited to the case where the explanatory variables are structured in blocks. Recent works include sparse discriminant analysis and sparse canonical correlation analysis.

In PCA, the singular value decomposition shows that if we regress a principal component onto the input variables, the vector of regression coefficients is equal to the vector of factor loadings. It thus suffices to adapt sparse regression techniques to obtain sparse versions of PCA. Sparse Multiple Correspondence Analysis is derived from the group-lasso, with groups of indicator variables.

Finally, when one has a large number of observations, unobserved heterogeneity frequently occurs: there is no single model but several local models, one for each cluster of a latent variable. Clusterwise methods optimize the partition and the local models simultaneously; they have already been extended to PLS regression.
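The SVD fact mentioned in the abstract — regressing a principal component onto the (centred) input variables returns exactly the corresponding loading vector — can be checked numerically. A minimal sketch using NumPy; the data and dimensions are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # illustrative sizes (n > p, so the regression is well defined)
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)             # centre columns, as in PCA

# SVD-based PCA: X = U S V^T; scores T = X V; loadings are the columns of V
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
scores = X @ V

# Regress the first principal component onto the input variables
b, *_ = np.linalg.lstsq(X, scores[:, 0], rcond=None)

print(np.allclose(b, V[:, 0]))     # True: the coefficients equal the loadings
```

Since the first score vector is exactly X @ V[:, 0] and X has full column rank here, the least-squares solution is unique and recovers the loading vector; replacing the unpenalized regression by a sparse one (lasso, elastic net) then yields a sparse loading vector, which is the idea behind sparse PCA.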
We present here CS-PLS (Clusterwise Sparse PLS), a combination of clusterwise PLS and sPLS which is well suited to big data: large n, large p.
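The sparsity mechanism shared by the lasso, sPLS, and CS-PLS is the L1 penalty, whose proximal map (soft-thresholding) drives most coefficients exactly to zero. A minimal sketch of lasso via cyclic coordinate descent in a p >> n setting — this is a generic illustration on synthetic data, not the CS-PLS method itself, and the step counts and penalty level are arbitrary choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Lasso via cyclic coordinate descent (columns of X assumed standardized)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual excluding feature j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, alpha)     # unit denominator: columns are standardized
    return b

rng = np.random.default_rng(0)
n, p = 50, 200                       # p >> n: OLS is undefined, the lasso still is
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.zeros(p)
beta_true[:5] = 3.0                  # only 5 truly active predictors
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b = lasso_cd(X, y, alpha=0.2)
print(np.sum(b != 0))                # far fewer than p nonzero coefficients
```

Replacing the L1 penalty by SCAD, or penalizing whole blocks of coefficients at once, gives the other penalties mentioned in the abstract (SCAD, group-lasso).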
Document type: Conference papers

https://hal-cnam.archives-ouvertes.fr/hal-02500643
Contributor: Gilbert Saporta
Submitted on: Friday, March 6, 2020 - 11:22:22 AM
Last modification on: Sunday, March 8, 2020 - 1:21:11 AM

Identifiers

  • HAL Id: hal-02500643, version 1

Citation

Gilbert Saporta. Some Sparse Methods for High Dimensional Data. H2DM International Workshop on High Dimensional Data Mining, Jun 2016, Naples, Italy. ⟨hal-02500643⟩
