Current projection methods-induced biases at subgroup detection for machine-learning based data-analysis of biomedical data
- Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is the t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if belonging to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in data sets containing distance- or density-based subgroup structures, the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
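The U-matrix idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a small planar grid (ESOM grids are often toroidal) and hand-built weight vectors rather than a trained map, and the function name `u_matrix` is hypothetical. Each neuron's U-height is taken as the mean Euclidean distance between its weight vector and those of its four grid neighbours, so homogeneous weights give a flat surface and a distance-based subgroup boundary gives a ridge.

```python
import math


def u_matrix(weights):
    """Return the U-height of every neuron on a planar grid.

    weights: list of rows, each row a list of weight vectors (tuples).
    The U-height is the mean Euclidean distance from a neuron's weight
    vector to the weight vectors of its up/down/left/right neighbours.
    Assumption: planar grid without wrap-around at the borders.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            # A 1x1 grid has no neighbours; its height stays 0.0.
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights


# Hypothetical 4x6 grid with two distance-separated groups of weights:
# columns 0-2 sit at (0, 0), columns 3-5 sit at (10, 10).
two_groups = [[(0.0, 0.0) if c < 3 else (10.0, 10.0) for c in range(6)]
              for _ in range(4)]

# Hypothetical homogeneous grid: every neuron has the same weight vector.
homogeneous = [[(1.0, 1.0) for _ in range(6)] for _ in range(4)]

ridge = u_matrix(two_groups)
flat = u_matrix(homogeneous)

for row in ridge:
    print(" ".join(f"{h:5.2f}" for h in row))
```

In the printed grid, nonzero heights appear only in the two columns on either side of the group boundary, while the homogeneous grid yields zeros everywhere.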
Author: | Jörn Lötsch, Alfred Ultsch |
---|---|
URN: | urn:nbn:de:hebis:30:3-535493 |
DOI: | https://doi.org/10.3390/ijms21010079 |
ISSN: | 1422-0067 |
ISSN: | 1661-6596 |
Pubmed Id: | https://pubmed.ncbi.nlm.nih.gov/31861946 |
Parent Title (English): | International journal of molecular sciences |
Publisher: | Molecular Diversity Preservation International (MDPI) |
Place of publication: | Basel |
Document Type: | Article |
Language: | English |
Date of Publication (online): | 2019/12/20 |
Date of first Publication: | 2019/12/20 |
Publishing Institution: | Universitätsbibliothek Johann Christian Senckenberg |
Release Date: | 2020/05/16 |
Tag: | computational techniques; data science; emergent self-organizing maps; high-dimensional data sets; immunological research; machine-learning; t-distributed stochastic neighbor embedding; flow cytometry |
Volume: | 21 |
Issue: | 79 |
Page Number: | 13 |
First Page: | 1 |
Last Page: | 13 |
Note: | This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited |
HeBIS-PPN: | 465081827 |
Institutes: | Medicine / Medicine |
Dewey Decimal Classification: | 0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science |
 | 5 Natural sciences and mathematics / 57 Life sciences; biology / 570 Life sciences; biology |
 | 6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health |
Collections: | University publications |
Open Access Publication Fund: | Medicine |
Licence (German): | Creative Commons - Namensnennung 4.0 |