Refine
Year of publication
Document Type
- Article (26)
Language
- English (26)
Has Fulltext
- yes (26)
Is part of the Bibliography
- no (26)
Keywords
- Data science (3)
- Machine-learning (3)
- Data processing (2)
- Functional clustering (2)
- Olfactory system (2)
- Probability density (2)
- Probability distribution (2)
- B cells (1)
- Bioinformatics (1)
- Biomedical informatics (1)
Institute
- Medizin (23)
- Pharmazie (3)
- Biochemie und Chemie (1)
- Biochemie, Chemie und Pharmazie (1)
- Biowissenschaften (1)
Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)
(2022)
Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most used Euclidean metric is not scale invariant and therefore occasionally inappropriate for complex, e.g., multimodal distributed variables and may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit for distances within clusters versus distances between (inter-) clusters.
Methods: The Euclidean distances within a standard normal distribution (N(0,1)) follow a N(0,2–√) distribution. The EDO-transformation of a variable X is proposed as EDO=X/(2–√⋅s) following modeling of the standard deviation s by a mixture of Gaussians and selecting the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling.
Results: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. The clustering results in terms of cluster accuracy, adjusted Rand index and Dunn’s index outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues, and the proposed approach gave better results than classical methods and was compared with pooled variable scaling.
Conclusions: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at https://cran.r-project.org/package=EDOtrans.
Motivation: Calculating the magnitude of treatment effects or of differences between two groups is a common task in quantitative science. Standard effect size measures based on differences, such as the commonly used Cohen's, fail to capture the treatment-related effects on the data if the effects were not reflected by the central tendency. The present work aims at (i) developing a non-parametric alternative to Cohen’s d, which (ii) circumvents some of its numerical limitations and (iii) involves obvious changes in the data that do not affect the group means and are therefore not captured by Cohen’s d.
Results: We propose "Impact” as a novel non-parametric measure of effect size obtained as the sum of two separate components and includes (i) a difference-based effect size measure implemented as the change in the central tendency of the group-specific data normalized to pooled variability and (ii) a data distribution shape-based effect size measure implemented as the difference in probability density of the group-specific data. Results obtained on artificial and empirical data showed that “Impact”is superior to Cohen's d by its additional second component in detecting clearly visible effects not reflected in central tendencies. The proposed effect size measure is invariant to the scaling of the data, reflects changes in the central tendency in cases where differences in the shape of probability distributions between subgroups are negligible, but captures changes in probability distributions as effects and is numerically stable even if the variances of the data set or its subgroups disappear.
Conclusions: The proposed effect size measure shares the ability to observe such an effect with machine learning algorithms. Therefore, the proposed effect size measure is particularly well suited for data science and artificial intelligence-based knowledge discovery from big and heterogeneous data.
Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)
(2021)
Motivation: The size of today’s biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method. Results: By repeating the random sampling and comparing the distribution of the drawn sample with the distribution of the original data, it was possible to establish a method for obtaining subsets of data that better reflect the entire data set than taking only the first randomly selected subsample, as is the current standard. Experiments on artificial and real biomedical data sets showed that the reconstruction of the remaining data from the original data set from the downsampled data improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity was dependent on both the number of cases drawn from the original and the number of samples drawn. Conclusions: Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. By using distributional similarity as the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS). The FCPS contains 10 datasets with the names "Atom", "Chainlink", "EngyTime", "Golfball", "Hepta", "Lsun", "Target", "Tetra", "TwoDiamonds", and "WingNut". Common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters in the FCPS suite. Likewise, common data projection methods could only partially reproduce the data structure correctly on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms such as lack of linear separability, different or small inner class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS). It is designed to address specific problems of structure discovery in high-dimensional spaces.
Background: Human genetic research has implicated functional variants of more than one hundred genes in the modulation of persisting pain. Artificial intelligence and machine‐learning techniques may combine this knowledge with results of genetic research gathered in any context, which permits the identification of the key biological processes involved in chronic sensitization to pain.
Methods: Based on published evidence, a set of 110 genes carrying variants reported to be associated with modulation of the clinical phenotype of persisting pain in eight different clinical settings was submitted to unsupervised machine‐learning aimed at functional clustering. Subsequently, a mathematically supported subset of genes, comprising those most consistently involved in persisting pain, was analysed by means of computational functional genomics in the Gene Ontology knowledgebase.
Results: Clustering of genes with evidence for a modulation of persisting pain elucidated a functionally heterogeneous set. The situation cleared when the focus was narrowed to a genetic modulation consistently observed throughout several clinical settings. On this basis, two groups of biological processes, the immune system and nitric oxide signalling, emerged as major players in sensitization to persisting pain, which is biologically highly plausible and in agreement with other lines of pain research.
Conclusions: The present computational functional genomics‐based approach provided a computational systems‐biology perspective on chronic sensitization to pain. Human genetic control of persisting pain points to the immune system as a source of potential future targets for drugs directed against persisting pain. Contemporary machine‐learned methods provide innovative approaches to knowledge discovery from previous evidence.
Significance: We show that knowledge discovery in genetic databases and contemporary machine‐learned techniques can identify relevant biological processes involved in Persitent pain.
Based on increasing evidence suggesting that MS pathology involves alterations in bioactive lipid metabolism, the present analysis was aimed at generating a complex serum lipid-biomarker. Using unsupervised machine-learning, implemented as emergent self-organizing maps of neuronal networks, swarm intelligence and Minimum Curvilinear Embedding, a cluster structure was found in the input data space comprising serum concentrations of d = 43 different lipid-markers of various classes. The structure coincided largely with the clinical diagnosis, indicating that the data provide a basis for the creation of a biomarker (classifier). This was subsequently assessed using supervised machine-learning, implemented as random forests and computed ABC analysis-based feature selection. Bayesian statistics-based biomarker creation was used to map the diagnostic classes of either MS patients (n = 102) or healthy subjects (n = 301). Eight lipid-markers passed the feature selection and comprised GluCerC16, LPA20:4, HETE15S, LacCerC24:1, C16Sphinganine, biopterin and the endocannabinoids PEA and OEA. A complex classifier or biomarker was developed that predicted MS at a sensitivity, specificity and accuracy of approximately 95% in training and test data sets, respectively. The present successful application of serum lipid marker concentrations to MS data is encouraging for further efforts to establish an MS biomarker based on serum lipidomics.
The comprehensive assessment of pain-related human phenotypes requires combinations of nociceptive measures that produce complex high-dimensional data, posing challenges to bioinformatic analysis. In this study, we assessed established experimental models of heat hyperalgesia of the skin, consisting of local ultraviolet-B (UV-B) irradiation or capsaicin application, in 82 healthy subjects using a variety of noxious stimuli. We extended the original heat stimulation by applying cold and mechanical stimuli and assessing the hypersensitization effects with a clinically established quantitative sensory testing (QST) battery (German Research Network on Neuropathic Pain). This study provided a 246 × 10-sized data matrix (82 subjects assessed at baseline, following UV-B application, and following capsaicin application) with respect to 10 QST parameters, which we analyzed using machine-learning techniques. We observed statistically significant effects of the hypersensitization treatments in 9 different QST parameters. Supervised machine-learned analysis implemented as random forests followed by ABC analysis pointed to heat pain thresholds as the most relevantly affected QST parameter. However, decision tree analysis indicated that UV-B additionally modulated sensitivity to cold. Unsupervised machine-learning techniques, implemented as emergent self-organizing maps, hinted at subgroups responding to topical application of capsaicin. The distinction among subgroups was based on sensitivity to pressure pain, which could be attributed to sex differences, with women being more sensitive than men. Thus, while UV-B and capsaicin share a major component of heat pain sensitization, they differ in their effects on QST parameter patterns in healthy subjects, suggesting a lack of redundancy between these models.
Background: Prevention of persistent pain following breast cancer surgery, via early identification of patients at high risk, is a clinical need. Supervised machine-learning was used to identify parameters that predict persistence of significant pain.
Methods: Over 500 demographic, clinical and psychological parameters were acquired up to 6 months after surgery from 1,000 women (aged 28–75 years) who were treated for breast cancer. Pain was assessed using an 11-point numerical rating scale before surgery and at months 1, 6, 12, 24, and 36. The ratings at months 12, 24, and 36 were used to allocate patents to either "persisting pain" or "non-persisting pain" groups. Unsupervised machine learning was applied to map the parameters to these diagnoses.
Results: A symbolic rule-based classifier tool was created that comprised 21 single or aggregated parameters, including demographic features, psychological and pain-related parameters, forming a questionnaire with "yes/no" items (decision rules). If at least 10 of the 21 rules applied, persisting pain was predicted at a cross-validated accuracy of 86% and a negative predictive value of approximately 95%.
Conclusions: The present machine-learned analysis showed that, even with a large set of parameters acquired from a large cohort, early identification of these patients is only partly successful. This indicates that more parameters are needed for accurate prediction of persisting pain. However, with the current parameters it is possible, with a certainty of almost 95%, to exclude the possibility of persistent pain developing in a woman being treated for breast cancer.
Pain and pain chronification are incompletely understood and unresolved medical problems that continue to have a high prevalence. It has been accepted that pain is a complex phenomenon. Contemporary methods of computational science can use complex clinical and experimental data to better understand the complexity of pain. Among data science techniques, machine learning is referred to as a set of methods that can automatically detect patterns in data and then use the uncovered patterns to predict or classify future data, to observe structures such as subgroups in the data, or to extract information from the data suitable to derive new knowledge. Together with (bio)statistics, artificial intelligence and machine learning aim at learning from data. ...
Background: To prevent persistent post-surgery pain, early identification of patients at high risk is a clinical need. Supervised machine-learning techniques were used to test how accurately the patients’ performance in a preoperatively performed tonic cold pain test could predict persistent post-surgery pain.
Methods: We analysed 763 patients from a cohort of 900 women who were treated for breast cancer, of whom 61 patients had developed signs of persistent pain during three yr of follow-up. Preoperatively, all patients underwent a cold pain test (immersion of the hand into a water bath at 2–4 °C). The patients rated the pain intensity using a numerical ratings scale (NRS) from 0 to 10. Supervised machine-learning techniques were used to construct a classifier that could predict patients at risk of persistent pain.
Results: Whether or not a patient rated the pain intensity at NRS=10 within less than 45 s during the cold water immersion test provided a negative predictive value of 94.4% to assign a patient to the "persistent pain" group. If NRS=10 was never reached during the cold test, the predictive value for not developing persistent pain was almost 97%. However, a low negative predictive value of 10% implied a high false positive rate.
Conclusions: Results provide a robust exclusion of persistent pain in women with an accuracy of 94.4%. Moreover, results provide further support for the hypothesis that the endogenous pain inhibitory system may play an important role in the process of pain becoming persistent.