Sparse Methods for Unsupervised Data Analysis

Gilbert Saporta; Ruiping Liu; Ndeye Niang Keita; Huiwen Wang

Conference paper. Year: 2019

Gilbert Saporta

Role: Author
PersonId : 180161
IdHAL : gilbert-saporta
ORCID : 0000-0002-3406-5887
IdRef : 027122565

CEDRIC. Méthodes statistiques de data-mining et apprentissage

Ruiping Liu

Role: Author
PersonId : 1064492
ORCID : 0000-0001-8591-7712

Beihang University

Ndeye Niang Keita

Role: Author
PersonId : 182344
IdHAL : ndeye-niang
ORCID : 0000-0002-6109-9935
IdRef : 179489879

CEDRIC. Méthodes statistiques de data-mining et apprentissage

Huiwen Wang

Role: Author

Beihang University

Abstract

Principal Component Analysis (PCA), Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA) are among the most efficient techniques for visualizing and exploring numerical and categorical data in an unsupervised way. With high-dimensional data, however, linear combinations of hundreds or thousands of variables become very difficult to interpret. Sparse methods aim to obtain pseudo-components that are linear combinations of only a small number of variables, which facilitates interpretation by highlighting only the most important features. This simplification comes at the cost of characteristic properties such as the orthogonality of the components and of the loadings. Because these properties cannot all be preserved, more than 20 variants of sparse PCA exist. In contrast, sparsifying correspondence analysis has received little or no attention in the literature, except for MCA. After a brief survey of sparse PCA, we focus on sparse variants of correspondence analysis (CA) for large contingency tables such as document-term matrices. We use the fact that CA is both a PCA (or a weighted SVD) and a canonical analysis to develop a column-sparse (or row-sparse) CA and a doubly sparse CA for rows and columns.
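To make the "CA as a weighted SVD" idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the usual CA matrix of standardized residuals and replaces the exact SVD with a soft-thresholded alternating power iteration, one common way to obtain a sparse rank-one approximation. The function names, the toy document-term table and the penalty values `lam_cols` and `lam_rows` are all hypothetical choices made for illustration.

```python
import math

def standardized_residuals(table):
    """CA as a weighted SVD: S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}."""
    total = sum(sum(row) for row in table)
    P = [[x / total for x in row] for row in table]
    r = [sum(row) for row in P]          # row masses
    c = [sum(col) for col in zip(*P)]    # column masses
    return [[(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
             for j in range(len(c))] for i in range(len(r))]

def soft_threshold(vec, lam):
    """Shrink each entry toward zero by lam; small entries become exactly 0."""
    out = []
    for x in vec:
        if abs(x) > lam:
            out.append(x - lam if x > 0 else x + lam)
        else:
            out.append(0.0)
    return out

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def sparse_rank_one(S, lam_cols=0.0, lam_rows=0.0, n_iter=200):
    """Alternating power iteration with soft-thresholding (assumed scheme).

    lam_cols > 0 gives a column-sparse axis, lam_rows > 0 a row-sparse one,
    both > 0 a doubly sparse one; both equal to 0 recovers the ordinary
    leading singular triplet of S.
    """
    # Start from the row of S with the largest norm (it lies in the row space).
    v = normalize(max(S, key=lambda row: sum(x * x for x in row)))
    u = [0.0] * len(S)
    for _ in range(n_iter):
        Sv = [sum(s * x for s, x in zip(row, v)) for row in S]
        u = normalize(soft_threshold(Sv, lam_rows))
        Stu = [sum(s * x for s, x in zip(col, u)) for col in zip(*S)]
        v = normalize(soft_threshold(Stu, lam_cols))
    Sv = [sum(s * x for s, x in zip(row, v)) for row in S]
    d = sum(a * b for a, b in zip(u, Sv))  # pseudo singular value u^T S v
    return u, d, v

# Toy document-term table (made-up counts): 4 documents, 5 terms.
table = [
    [10, 8, 1, 0, 1],
    [9, 7, 0, 1, 2],
    [1, 0, 9, 8, 2],
    [0, 1, 8, 10, 1],
]
S = standardized_residuals(table)
inertia = sum(x * x for row in S for x in row)

u_dense, d_dense, v_dense = sparse_rank_one(S)
u_sparse, d_sparse, v_sparse = sparse_rank_one(S, lam_cols=0.15)

print("dense term loadings: ", [round(x, 3) for x in v_dense])
print("sparse term loadings:", [round(x, 3) for x in v_sparse])
```

In this toy table the last term is spread almost evenly over the documents, so the column penalty sets its loading to exactly zero while the four discriminating terms keep nonzero loadings. As the abstract notes, the price is that successive sparse axes computed this way are no longer guaranteed to be orthogonal.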

Domains

Statistics [stat]

sparseSIDM.pdf (1.66 MB)

Origin: Files produced by the author(s)

https://cnam.hal.science/hal-02471316

Submitted on: Wednesday, December 9, 2020, 11:02:24

Last modified on: Wednesday, September 28, 2022, 05:53:31

Dates and versions

hal-02471316 , version 1 (09-12-2020)

Identifiers

HAL Id : hal-02471316 , version 1

Cite

Gilbert Saporta, Ruiping Liu, Ndeye Niang Keita, Huiwen Wang. Sparse Methods for Unsupervised Data Analysis. The 4th International Symposium on Interval Data Modelling (SIDM 2019), Jun 2019, Beijing, China. ⟨hal-02471316⟩

Collections

CNAM CEDRIC-CNAM HESAM
