University publications
Motivation: Gaussian mixture models (GMMs) are probabilistic models commonly used in biomedical research to detect subgroup structures in data sets with one-dimensional information. Reliable model parameterization requires that the number of modes, i.e., states of the generating process, is known. However, this is rarely the case for empirically measured biomedical data. Several implementations are available that estimate GMM parameters differently. This work aims to provide a comparative evaluation of automated GMM fitting methods.
Results and conclusions: The performance of commonly used algorithms for automatic parameterization and mode number determination was compared with respect to reproducing the ground truth of generated data derived from multiple normal distributions. Four main variants of Gaussian mode number detection algorithms and five variants of GMM parameter estimation methods were tested in a combinatory scenario. The combination of the best-performing mode number determination algorithms and GMM parameter estimation methods was then tested on artificial and real-life data sets known to display a GMM structure. None of the tested methods correctly determined the underlying data structure consistently. The likelihood ratio test had the best performance in identifying the mode number associated with the best GMM fit of the data distribution, while the Markov chain Monte Carlo (MCMC) algorithm was best for GMM parameter estimation. The combination of these two methods, the likelihood ratio test for mode number determination and MCMC for GMM parameter estimation, was consistently among the best and overall outperformed the available implementations.
Implementation: An automated tool for the detection of GMM-based structures in (biomedical) datasets was created based on the present results and made freely available in the R library “opGMMassessment” at https://cran.r-project.org/package=opGMMassessment.
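The idea of choosing the mode number by comparing GMM fits can be illustrated with a minimal sketch. It is not the opGMMassessment implementation and does not use its API: it fits one-dimensional GMMs with a plain expectation-maximization (EM) loop, swapped in here for brevity instead of the MCMC estimation that the abstract reports as best, and it only computes the likelihood-ratio statistic between a one-mode and a two-mode fit. Calibrating that statistic into a p-value is omitted. The function names `normal_pdf` and `fit_gmm_em`, the quantile-based initialization, and the simulated data are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_em(data, k, n_iter=100):
    """Fit a k-component 1-D GMM by EM (hypothetical helper, not a library API).

    Returns (weights, means, sds, log_likelihood).
    """
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced quantiles, shared SD.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    overall_sd = math.sqrt(sum((x - centre) ** 2 for x in data) / n)
    sds = [overall_sd / k] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and SD of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # floor avoids collapsing modes

    loglik = sum(
        math.log(max(sum(w * normal_pdf(x, m, s)
                         for w, m, s in zip(weights, means, sds)), 1e-300))
        for x in data
    )
    return weights, means, sds, loglik


# Simulated ground truth: two well-separated Gaussian modes.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]

_, _, _, loglik_one = fit_gmm_em(data, k=1)
_, means_two, _, loglik_two = fit_gmm_em(data, k=2)

# Likelihood-ratio statistic; a large value favours the two-mode model.
lr_statistic = 2.0 * (loglik_two - loglik_one)
print(sorted(means_two), lr_statistic)
```

On data like these, the two-mode fit recovers means near 0 and 6 and yields a large likelihood-ratio statistic. On empirical data the decision is far less clear-cut, which is the situation the comparative evaluation addresses.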
Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)
(2022)
Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The Euclidean metric, which is the most commonly used, is not scale invariant. It is therefore occasionally inappropriate for complex variables, e.g., those with multimodal distributions, and may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit for distances within clusters versus distances between (inter-) clusters.
Methods: The Euclidean distances within a standard normal distribution (N(0,1)) follow a N(0, √2) distribution. The EDO transformation of a variable X is proposed as EDO = X/(√2 · s), following modeling of the standard deviation s by a mixture of Gaussians and selecting the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling.
Results: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. In terms of cluster accuracy, adjusted Rand index and Dunn’s index, clustering of EDO-transformed data outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues. The proposed approach gave better results than classical methods and was compared with pooled variable scaling.
Conclusions: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at https://cran.r-project.org/package=EDOtrans.
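The scaling factor in the Methods paragraph can be checked numerically with a minimal sketch. It is not the EDOtrans package and does not use its API. The code simulates signed differences between pairs of independent N(0,1) draws, whose standard deviation should be close to √2, and then applies the division by √2 · s to a single Gaussian variable. As a simplifying assumption, s is taken to be the plain sample standard deviation. The abstract instead derives s from a Gaussian mixture model with selection of the dominant modes, and that step is not reproduced here. The names `pairwise_difference_sd` and `edo_scale` are hypothetical.

```python
import math
import random
import statistics


def pairwise_difference_sd(n_pairs=20000, seed=0):
    """SD of signed differences between pairs of independent N(0,1) draws."""
    rng = random.Random(seed)
    diffs = [rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
    return statistics.pstdev(diffs)


def edo_scale(values, s):
    """Divide each value by sqrt(2) * s (hypothetical helper, not the EDOtrans API)."""
    return [x / (math.sqrt(2.0) * s) for x in values]


# Differences within a standard normal distribution: SD should be near sqrt(2).
sd_of_differences = pairwise_difference_sd()

# Assumption: a single Gaussian mode, with s estimated as the sample SD.
rng = random.Random(1)
sample = [rng.gauss(5.0, 3.0) for _ in range(20000)]
scaled = edo_scale(sample, s=statistics.stdev(sample))

# After scaling, the variable has SD 1/sqrt(2), so differences between two
# independent points of the same mode have SD 1.
scaled_sd = statistics.stdev(scaled)
print(sd_of_differences, scaled_sd)
```

Under these assumptions, within-mode differences after scaling have a standard deviation of 1, which matches the role of the value 1 as the implicit boundary between intra-cluster and inter-cluster distances described in the Background.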