
Some Sparse Methods for High Dimensional Data

Gilbert Saporta 1
1 MSDMA - Méthodes statistiques de data-mining et apprentissage (Statistical methods for data mining and machine learning), CEDRIC - Centre d'études et de recherche en informatique et communications
Abstract: High dimensional data means that the number of variables p is far larger than the number of observations n. This occurs in several fields, such as genomics or chemometrics. When p > n, the OLS estimator for linear regression does not exist. Since this is a case of forced multicollinearity, one may use regularized methods such as ridge regression, principal component regression, or PLS regression: these methods provide rather robust estimates through a dimension-reduction approach or through constraints on the regression coefficients. The fact that all the predictors are kept may be considered a positive point in some cases. However, if p >> n it becomes a drawback, since a combination of thousands of variables cannot be interpreted. Sparse combinations, i.e., combinations with a large number of zero coefficients, are preferred. Lasso, elastic net, and sPLS perform regularization and variable selection simultaneously, by means of non-quadratic penalties: L1, SCAD, etc. Group-lasso is a generalization suited to the case where the explanatory variables are structured in blocks. Recent works include sparse discriminant analysis and sparse canonical correlation analysis.

In PCA, the singular value decomposition shows that if we regress a principal component onto the input variables, the vector of regression coefficients is equal to the vector of factor loadings. It thus suffices to adapt sparse regression techniques to obtain sparse versions of PCA. Sparse Multiple Correspondence Analysis is derived from the group-lasso, with groups of indicator variables.

Finally, when one has a large number of observations, unobserved heterogeneity frequently occurs: there is no single model but several local models, one for each cluster of a latent variable. Clusterwise methods optimize the partition and the local models simultaneously; they have already been extended to PLS regression.
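The SVD fact mentioned in the abstract — regressing a principal component onto the (centred) input variables returns exactly the corresponding loading vector — can be checked numerically. A minimal sketch using NumPy; the data and dimensions are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                      # illustrative sizes (n > p, so the regression is well defined)
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)             # centre columns, as in PCA

# SVD-based PCA: X = U S V^T; scores T = X V; loadings are the columns of V
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
scores = X @ V

# Regress the first principal component onto the input variables
b, *_ = np.linalg.lstsq(X, scores[:, 0], rcond=None)

print(np.allclose(b, V[:, 0]))     # True: the coefficients equal the loadings
```

Since the first score vector is exactly X @ V[:, 0] and X has full column rank here, the least-squares solution is unique and recovers the loading vector; replacing the unpenalized regression by a sparse one (lasso, elastic net) then yields a sparse loading vector, which is the idea behind sparse PCA.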
We present here CS-PLS (Clusterwise Sparse PLS), a combination of clusterwise PLS and sPLS which is well suited to big data: large n, large p.
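The sparsity mechanism shared by the lasso, sPLS, and CS-PLS is the L1 penalty, whose proximal map (soft-thresholding) drives most coefficients exactly to zero. A minimal sketch of lasso via cyclic coordinate descent in a p >> n setting — this is a generic illustration on synthetic data, not the CS-PLS method itself, and the step counts and penalty level are arbitrary choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Lasso via cyclic coordinate descent (columns of X assumed standardized)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual excluding feature j
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, alpha)     # unit denominator: columns are standardized
    return b

rng = np.random.default_rng(0)
n, p = 50, 200                       # p >> n: OLS is undefined, the lasso still is
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.zeros(p)
beta_true[:5] = 3.0                  # only 5 truly active predictors
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b = lasso_cd(X, y, alpha=0.2)
print(np.sum(b != 0))                # far fewer than p nonzero coefficients
```

Replacing the L1 penalty by SCAD, or penalizing whole blocks of coefficients at once, gives the other penalties mentioned in the abstract (SCAD, group-lasso).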
Document type: Conference papers

https://hal-cnam.archives-ouvertes.fr/hal-02500643
Contributor: Gilbert Saporta
Submitted on: Friday, March 6, 2020 - 11:22:22 AM
Last modification on: Sunday, March 8, 2020 - 1:21:11 AM

Identifiers

  • HAL Id: hal-02500643, version 1

Citation

Gilbert Saporta. Some Sparse Methods for High Dimensional Data. H2DM International Workshop on High Dimensional Data Mining, Jun 2016, Naples, Italy. ⟨hal-02500643⟩
