Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)

Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most widely used metric, the Euclidean distance, is not scale invariant. It is therefore occasionally inappropriate for complex variables, e.g., multimodally distributed ones, and may negatively affect the results of cluster analysis. Specifically, the Euclidean distance is defined as the square root of the sum of squared differences between data points, and the squaring function has the consequence that the value 1 implicitly defines a limit between distances within clusters and distances between clusters (inter-cluster distances).

Methods: The Euclidean distances within a standard normal distribution (N(0, 1)) follow a N(0, √2) distribution. The EDO transformation of a variable X is proposed as EDO = X / (√2 · s), following modeling of the standard deviation s by a mixture of Gaussians and selection of the dominant modes via item categorization. The method was compared, in artificial and biomedical datasets, with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling.

Results: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. In terms of cluster accuracy, adjusted Rand index and Dunn’s index, clustering of EDO-transformed data outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues. There, the proposed approach gave better results than classical methods and was also compared with pooled variable scaling.

Conclusions: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at
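The scaling step described under Methods above can be illustrated with a minimal Python sketch. It is not the EDOtrans R package and does not reproduce its API: the function names `edo_transform` and `dominant_mode_sd` are hypothetical, and the mixture modeling and item categorization are replaced by a much simpler assumption, namely that mode labels are already known and that the largest group counts as the dominant mode. The sketch first checks numerically that differences between independent N(0, 1) draws have a standard deviation close to √2, then applies the formula EDO = X / (√2 · s) to a simulated bimodal variable.

```python
import math
import random
import statistics


def edo_transform(values, s):
    """Scale values by 1 / (sqrt(2) * s), the formula stated in the abstract."""
    if s <= 0:
        raise ValueError("s must be positive")
    factor = math.sqrt(2.0) * s
    return [x / factor for x in values]


def dominant_mode_sd(values, labels):
    """Hypothetical stand-in for mixture modeling plus item categorization.

    Assumption: mode labels are known, and the largest group is treated as
    the single dominant mode. Returns that group's sample standard deviation.
    """
    groups = {}
    for x, g in zip(values, labels):
        groups.setdefault(g, []).append(x)
    largest = max(groups.values(), key=len)
    return statistics.stdev(largest)


rng = random.Random(0)

# Differences between independent N(0, 1) draws: standard deviation near sqrt(2).
diffs = [rng.gauss(0, 1) - rng.gauss(0, 1) for _ in range(20000)]
sd_diffs = statistics.stdev(diffs)

# Simulated bimodal variable: a large mode N(0, 3) and a small mode N(20, 1).
values = [rng.gauss(0, 3) for _ in range(3000)]
values += [rng.gauss(20, 1) for _ in range(1000)]
labels = [0] * 3000 + [1] * 1000

s = dominant_mode_sd(values, labels)
scaled = edo_transform(values, s)

# After scaling, the dominant mode has standard deviation 1 / sqrt(2), so
# pairwise differences inside that mode have standard deviation 1.
sd_within = statistics.stdev(scaled[:3000])

print("sd of N(0,1) differences:", round(sd_diffs, 3))
print("sd of dominant mode after scaling:", round(sd_within, 3))
```

Under these assumptions, dividing by √2 · s rather than by s alone (as z-standardization does) places the typical within-mode difference at the value 1, which is the limit the abstract describes between within-cluster and between-cluster distances.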

Author:Alfred Ultsch, Jörn Lötsch
Parent Title (English):BMC bioinformatics
Publisher:BioMed Central, Springer
Place of publication:London, Berlin; Heidelberg
Document Type:Article
Date of Publication (online):2022/06/16
Date of first Publication:2022/06/16
Publishing Institution:Universitätsbibliothek Johann Christian Senckenberg
Release Date:2023/04/27
Tag:Biomedical informatics; Data preprocessing; Data science; Machine-learning
Issue:art. 233
Article Number:233
Page Number:18
First Page:1
Last Page:18
Open Access funding enabled and organized by Projekt DEAL.
This work has been funded by the Landesoffensive zur Entwicklung wissenschaftlich-ökonomischer Exzellenz (LOEWE), LOEWE-Zentrum für Translationale Medizin und Pharmakologie (JL), in particular through the project “Reproducible cleaning of biomedical laboratory data using methods of visualization, error correction and transformation implemented as interactive R-notebooks” (JL).
The “EDOtrans” R package is freely available at It contains the data sets used in this report and not elsewhere available as referenced. Results of the proof-of-concept study using PAM clustering instead of k-means, and results of the four experiments with the data sets of Gaussian mixtures, Iris flowers, FACS data and “Lsun”, using average or complete linkage instead of Ward’s linkage, are provided as Additional file 1: Figures.
Dewey Decimal Classification:0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
5 Natural sciences and mathematics / 57 Life sciences; biology / 570 Life sciences; biology
6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health
Licence (German):Creative Commons - CC BY - Attribution 4.0 International