Current projection methods-induced biases at subgroup detection for machine-learning based data-analysis of biomedical data

  • Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is the t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in sets containing distance- or density-based subgroup structures, the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
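
The kind of check described in the abstract, projecting data without any subgroup structure and inspecting the resulting map, can be sketched as follows. This is a minimal illustration only, assuming scikit-learn's `TSNE` and uniformly distributed random points as a stand-in for homogeneous data; the function name `embed_homogeneous`, the sample size, the dimensionality, and the perplexity are hypothetical choices and are not the authors' data sets or settings.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_homogeneous(n_points=300, n_dims=10, perplexity=30.0, seed=0):
    """Project structureless data to 2-D with t-SNE.

    Assumption: "homogeneous" is modelled here as independent uniform
    samples in the unit hypercube, which contain no subgroups by design.
    """
    rng = np.random.default_rng(seed)
    data = rng.uniform(0.0, 1.0, size=(n_points, n_dims))

    # Random initialisation so the layout is not seeded by a PCA projection.
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="random",
        random_state=seed,
    )
    embedding = tsne.fit_transform(data)
    return data, embedding


if __name__ == "__main__":
    data, embedding = embed_homogeneous()
    # The input has no subgroups, so any apparent clusters in a scatter plot
    # of `embedding` would be introduced by the projection itself.
    print(data.shape, embedding.shape)
```

Plotting `embedding` for several values of `perplexity` and `seed` would show how strongly the apparent structure depends on those choices; whether visible clusters appear in a given run is not guaranteed by this sketch.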

Author:Jörn Lötsch, Alfred Ultsch
Pubmed Id:
Parent Title (English):International journal of molecular sciences
Publisher:Molecular Diversity Preservation International (MDPI)
Place of publication:Basel
Document Type:Article
Date of Publication (online):2019/12/20
Date of first Publication:2019/12/20
Publishing Institution:Universitätsbibliothek Johann Christian Senckenberg
Release Date:2020/05/16
Tag:computational techniques; data science; emergent self-organizing maps; high-dimensional data sets; immunological research; machine-learning; t-distributed stochastic neighbor embedding; flow cytometry
Page Number:13
First Page:1
Last Page:13
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited
Institutes:Medicine / Medicine
Dewey Decimal Classification:0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
5 Natural sciences and mathematics / 57 Life sciences; biology / 570 Life sciences; biology
6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health
Licence (German):Creative Commons - Attribution 4.0