University Publications
Biomedical data obtained during cell experiments, laboratory animal research, or human studies often display a complex distribution. Statistical identification of subgroups in research data poses an analytical challenge. Here we introduce an interactive R-based bioinformatics tool, called “AdaptGauss”. It enables valid identification of a biologically meaningful multimodal structure in the data by fitting a Gaussian mixture model (GMM). The interface allows a supervised selection of the number of subgroups. This enables the expectation maximization (EM) algorithm to fit more complex GMMs than are usually obtained with a noninteractive approach. Interactively fitting a GMM to heat pain threshold data acquired from human volunteers revealed a distribution pattern with four Gaussian modes located at temperatures of 32.3, 37.2, 41.4, and 45.4 °C. Noninteractive fitting was unable to identify a meaningful data structure. Obtained results are compatible with known activity temperatures of different TRP ion channels, suggesting the mechanistic contribution of different heat sensors to the perception of thermal pain. Thus, sophisticated analysis of the modal structure of biomedical data provides a basis for the mechanistic interpretation of the observations. As it may reflect the involvement of different TRP thermosensory ion channels, the analysis provides a starting point for hypothesis-driven laboratory experiments.
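A rough, noninteractive approximation of such a fit can be sketched with the mclust R package (used here only as a stand-in for the interactive AdaptGauss procedure; the data below are simulated, not the study's pain thresholds):

```r
# Sketch only: approximate a 4-component Gaussian mixture fit with mclust,
# not the interactive AdaptGauss procedure described in the abstract.
library(mclust)

set.seed(1)
# Simulated "heat pain threshold" values (hypothetical, for illustration only)
x <- c(rnorm(60, 32.3, 1.0), rnorm(80, 37.2, 1.2),
       rnorm(90, 41.4, 1.1), rnorm(70, 45.4, 0.9))

fit <- Mclust(x, G = 4, modelNames = "V")   # 4 modes, unequal variances
fit$parameters$mean                          # estimated mode locations (°C)
fit$parameters$pro                           # mixture weights
```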
In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS), designed to address specific problems of structure discovery in high-dimensional spaces. The FCPS contains 10 datasets with the names "Atom", "Chainlink", "EngyTime", "Golfball", "Hepta", "Lsun", "Target", "Tetra", "TwoDiamonds", and "WingNut". Common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters in the FCPS suite. Likewise, common data projection methods could only partially reproduce the data structure correctly on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms such as lack of linear separability, different or small inner-class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact.
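To illustrate the kind of failure these benchmarks expose, the following sketch generates two interlocked rings in the spirit of the "Chainlink" dataset (built ad hoc here rather than loaded from FCPS) and shows that k-means, which assumes compact convex clusters, mixes the two classes:

```r
# Two interlocked rings, similar in spirit to the "Chainlink" benchmark
# (generated ad hoc here; the original FCPS data are not loaded).
set.seed(42)
n  <- 300
t1 <- runif(n, 0, 2 * pi)
t2 <- runif(n, 0, 2 * pi)
ring1 <- cbind(cos(t1), sin(t1), 0)          # ring in the xy-plane
ring2 <- cbind(1 + cos(t2), 0, sin(t2))      # ring in the xz-plane, shifted
X   <- rbind(ring1, ring2) + matrix(rnorm(2 * n * 3, sd = 0.05), ncol = 3)
cls <- rep(1:2, each = n)

km <- kmeans(X, centers = 2, nstart = 25)
table(true = cls, kmeans = km$cluster)       # off-diagonal counts reveal misassignments
```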
BACKGROUND: Micro-RNAs (miRNAs) are attributed the systems-biology role of a regulatory mechanism of the expression of protein-coding genes. Research has identified miRNA dysregulations in several distinct pathophysiological processes, which hints at distinct systems-biology functions of miRNAs. The present analysis approached the role of miRNAs from a genomics perspective and assessed the biological roles of 2954 genes and 788 human miRNAs that can be considered to interact, based on empirical evidence and computational predictions of miRNA versus gene interactions.
RESULTS: From a genomics perspective, the biological processes in which the genes influenced by miRNAs are involved fall into six major topics: biological regulation, cellular metabolism, information processing, development, gene expression, and tissue homeostasis. The use of this knowledge as guidance for further research is sketched for two genetically defined functional areas: cell death and gene expression. Results suggest that the latter points to a fundamental role of miRNAs consisting of hyper-regulation of gene expression, i.e., the control of the expression of genes that themselves specifically control the expression of other genes.
CONCLUSIONS: Laboratory research has identified contributions of miRNA regulation to several distinct biological processes. The present analysis transferred this knowledge to a systems-biology level. A comprehensible and precise description of the biological processes in which the genes influenced by miRNAs are notably involved could be made. This knowledge can be employed to guide future research concerning the biological role of miRNA (dys-)regulations. The analysis also suggests that miRNAs especially control the expression of genes that themselves control the expression of other genes.
Computed ABC analysis for rational selection of most informative variables in multivariate data
(2015)
Objective: Multivariate data sets often differ in several factors or derived statistical parameters, which have to be selected for a valid interpretation. Basing this selection on traditional statistical limits leads occasionally to the perception of losing information from a data set. This paper proposes a novel method for calculating precise limits for the selection of parameter sets.
Methods: The algorithm is based on an ABC analysis and calculates these limits on the basis of the mathematical properties of the distribution of the analyzed items. The limits implement the aim of any ABC analysis, i.e., comparing the increase in yield to the required additional effort. In particular, the limit for set A, the "important few", is optimized in a way that both the effort and the yield for the other sets (B and C) are minimized while the additional gain is optimized.
Results: As a typical example from biomedical research, the feasibility of the ABC analysis as an objective replacement for classical subjective limits to select highly relevant variance components of pain thresholds is presented. The proposed method improved the biological interpretation of the results and increased the fraction of valid information that was obtained from the experimental data.
Conclusions: The method is applicable to many further biomedical problems including the creation of diagnostic complex biomarkers or short screening tests from comprehensive test batteries. Thus, the ABC analysis can be proposed as a mathematically valid replacement for traditional limits to maximize the information obtained from multivariate research data.
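The exact limit computation is not reproduced here; the sketch below only illustrates the underlying ABC-curve idea (cumulative yield versus cumulative effort) with an ad-hoc cutoff for set A and simulated item values:

```r
# Simplified ABC-curve sketch: cumulative share of total "yield" versus
# cumulative fraction of items, with an ad-hoc set-A cutoff.
# This is NOT the exact limit optimization described in the paper.
abc_sketch <- function(x) {
  x <- sort(x[x >= 0], decreasing = TRUE)        # non-negative items, largest first
  effort <- seq_along(x) / length(x)             # cumulative fraction of items
  yield  <- cumsum(x) / sum(x)                   # cumulative fraction of total value
  # ad-hoc cutoff: point of the ABC curve closest to the ideal point (0, 1)
  a_limit <- which.min(effort^2 + (1 - yield)^2)
  list(set_A = seq_len(a_limit), effort = effort, yield = yield)
}

set.seed(3)
importance <- rexp(50, rate = 1)                 # hypothetical variance components
abc_sketch(importance)$set_A                     # indices of the "important few"
```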
The Gini index is a measure of the inequality of a distribution that can be derived from Lorenz curves. While commonly used in, e.g., economic research, it suffers from ambiguity owing to a lack of Lorenz dominance preservation. Here, investigation of large sets of empirical distributions of incomes of the World’s countries over several years indicated, firstly, that the Gini indices are centered on a value of 33.33%, corresponding to the Gini index of the uniform distribution, and, secondly, that the Lorenz curves of these distributions are consistent with Lorenz curves of log-normal distributions. This can be employed to provide a Lorenz dominance preserving equivalent of the Gini index. Therefore, a modified measure based on log-normal approximation and standardization of Lorenz curves is proposed. The so-called UGini index provides a meaningful and intuitive standardization on the uniform distribution, as this characterizes societies that provide equal chances. The novel UGini index preserves Lorenz dominance. Analysis of the probability density distributions of the UGini index of the World’s countries’ income data indicated multimodality in two independent data sets. Applying Bayesian statistics provided a data-based classification of the World’s countries’ income distributions. The UGini index can be re-transferred into the classical index to preserve comparability with previous research.
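For orientation, the following sketch computes the classical Gini index from an empirical Lorenz curve and the Gini index implied by a log-normal approximation; the UGini standardization itself is not reproduced, and the income data are simulated:

```r
# Sketch: classical Gini index from an empirical Lorenz curve, and the Gini
# index implied by a log-normal approximation of the income distribution.
# (The UGini standardization itself is not reproduced here.)
gini_empirical <- function(x) {
  x <- sort(x)
  n <- length(x)
  lorenz <- cumsum(x) / sum(x)
  # Gini = 1 - 2 * area under the Lorenz curve (trapezoidal approximation)
  1 - 2 * sum((lorenz + c(0, lorenz[-n])) / 2) / n
}

gini_lognormal <- function(x) {
  sigma <- sd(log(x))                  # log-normal shape parameter estimate
  2 * pnorm(sigma / sqrt(2)) - 1       # closed-form Gini of a log-normal
}

set.seed(7)
income <- rlnorm(1e4, meanlog = 10, sdlog = 0.8)   # hypothetical incomes
c(empirical = gini_empirical(income), lognormal = gini_lognormal(income))
```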
Computational analyses of functions of gene sets obtained in microarray analyses or by topical database searches are increasingly important in biology. To understand their functions, the sets are usually mapped to Gene Ontology (GO) knowledge bases by means of over-representation analysis (ORA). Its result represents the specific knowledge of the functionality of the gene set. However, the specific ontology typically consists of many terms and relationships, hindering the understanding of the ‘main story’. We developed a methodology to identify a comprehensibly small number of GO terms as “headlines” of the specific ontology, allowing all central aspects of the roles of the involved genes to be understood. The Functional Abstraction method finds a set of headlines that is specific enough to cover all details of a specific ontology and abstract enough for human comprehension. This method exceeds classical approaches to ORA abstraction and, by focusing on information rather than decorrelation of GO terms, directly targets human comprehension. Functional Abstraction provides, with a maximum of certainty, information value, coverage, and conciseness, a representation of the biological functions in which a gene set plays a role. This is the necessary means to interpret complex Gene Ontology results and thus strengthens the role of functional genomics in biomarker and drug discovery.
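The over-representation step underlying such analyses is, per GO term, essentially a hypergeometric test; the headline-finding Functional Abstraction itself is not reproduced in this minimal sketch with hypothetical counts:

```r
# Minimal over-representation test for a single GO term (hypergeometric test).
# The Functional Abstraction "headline" selection itself is not reproduced here.
ora_p <- function(hits, set_size, term_size, universe_size) {
  # P(X >= hits) for X ~ Hypergeometric(term_size, universe_size - term_size, set_size)
  phyper(hits - 1, term_size, universe_size - term_size, set_size,
         lower.tail = FALSE)
}

# Hypothetical numbers: 40 of 200 submitted genes annotated to a term that
# covers 500 of 20000 genes in the annotation universe.
ora_p(hits = 40, set_size = 200, term_size = 500, universe_size = 20000)
```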
Background: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the cluster algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).
Methods: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means.
Results: Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. Moreover, ESOM/U-matrix correctly identified clusters in biomedical data that truly contained subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.
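The effect reported for Ward clustering can be reproduced in miniature with base R on generic uniform random data (a stand-in for the study's golf-ball or cuboid sets):

```r
# Sketch: Ward clustering readily "finds" clusters in structureless data.
set.seed(11)
X <- matrix(runif(500 * 3), ncol = 3)          # homogeneous random data, no clusters

hc  <- hclust(dist(X), method = "ward.D2")     # Ward's minimum-variance clustering
cls <- cutree(hc, k = 3)                       # forcing three "clusters"
table(cls)                                     # three well-filled groups are returned
                                               # although the data contain none
```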
Conclusions: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
Graphical abstract: 3-D representation of high dimensional data following ESOM projection and visualization of group (cluster) structures using the U-matrix, which employs a geographical map analogy of valleys where members of the same cluster are located, separated by mountain ranges marking cluster borders.
Bayesian inference is ubiquitous in science and widely used in biomedical research such as cell sorting or “omics” approaches, as well as in machine learning (ML), artificial neural networks, and “big data” applications. However, the calculation is not robust in regions of low evidence. In cases where one group has a lower mean but a higher variance than another group, new cases with larger values are implausibly assigned to the group with typically smaller values. An approach for a robust extension of Bayesian inference is proposed that proceeds in two main steps starting from the Bayesian posterior probabilities. First, cases with low evidence are labeled as “uncertain” class membership. The boundary for low probabilities of class assignment (threshold ε) is calculated using a computed ABC analysis as a data-based technique for item categorization. This leaves a number of cases with uncertain classification (p < ε). Second, cases with uncertain class membership are relabeled based on the distance to neighboring classified cases, based on Voronoi cells. The approach is demonstrated on biomedical data typically analyzed with Bayesian statistics, such as flow cytometric data sets or biomarkers used in medical diagnostics, where it increased the class assignment accuracy by 1–10% depending on the data set. The proposed extension of the Bayesian inference of class membership can be used to obtain robust and plausible class assignments even for data at the extremes of the distribution and/or for which evidence is weak.
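A minimal sketch of the two-step idea is given below; it uses a fixed certainty threshold instead of the cABC-derived ε and a one-dimensional nearest-neighbor rule as a stand-in for the Voronoi-cell relabeling:

```r
# Sketch of the two-step idea: (1) flag cases whose highest posterior class
# probability falls below an assumed certainty threshold, (2) relabel them from
# the nearest confidently classified case (a 1-D stand-in for the Voronoi-cell
# rule). The threshold is fixed here instead of being derived by computed ABC analysis.
set.seed(5)
x   <- c(rnorm(200, 2.5, 1), rnorm(200, 0, 2))   # class B: lower mean but higher variance
cls <- rep(c("A", "B"), each = 200)

# Posterior probability of class A under known Gaussian class models, equal priors
post_A <- dnorm(x, 2.5, 1) / (dnorm(x, 2.5, 1) + dnorm(x, 0, 2))
bayes  <- ifelse(post_A > 0.5, "A", "B")

eps       <- 0.75                                # assumed threshold (not cABC-derived)
certain   <- pmax(post_A, 1 - post_A) >= eps
relabeled <- bayes
for (i in which(!certain)) {
  j            <- which(certain)[which.min(abs(x[certain] - x[i]))]
  relabeled[i] <- bayes[j]                       # inherit label of nearest certain case
}
c(bayes = mean(bayes == cls), relabeled = mean(relabeled == cls))
```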
Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)
(2022)
Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most frequently used Euclidean metric is not scale invariant and is therefore occasionally inappropriate for complex, e.g., multimodally distributed variables, which may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit for distances within clusters versus distances between clusters.
Methods: The Euclidean distances within a standard normal distribution N(0,1) follow an N(0, √2) distribution. The EDO transformation of a variable X is proposed as EDO = X / (√2 · s), following modeling of the standard deviation s by a mixture of Gaussians and selection of the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling.
Results: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. The clustering results in terms of cluster accuracy, adjusted Rand index and Dunn’s index outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues, and the proposed approach gave better results than classical methods and was compared with pooled variable scaling.
Conclusions: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at https://cran.r-project.org/package=EDOtrans.
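The core of the transformation can be sketched as follows, using mclust for the Gaussian mixture fit and simply taking the dominant mode's standard deviation; the published EDOtrans procedure selects the modes differently via item categorization:

```r
# Simplified sketch of the EDO idea from the Methods section: scale a variable by
# sqrt(2) * s, with s taken here as the standard deviation of the dominant Gaussian
# mode fitted with mclust. The mode/item selection differs from the published
# EDOtrans procedure; the authors' implementation is the CRAN package "EDOtrans".
library(mclust)

edo_sketch <- function(x, G = 2) {
  fit      <- Mclust(x, G = G, modelNames = "V")     # univariate Gaussian mixture
  dominant <- which.max(fit$parameters$pro)          # mode with the largest weight
  s        <- sqrt(fit$parameters$variance$sigmasq[dominant])
  x / (sqrt(2) * s)
}

set.seed(9)
x <- c(rnorm(300, 0, 1), rnorm(100, 6, 2))           # bimodal example variable
summary(edo_sketch(x))
```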
Background: To prevent persistent post-surgery pain, early identification of patients at high risk is a clinical need. Supervised machine-learning techniques were used to test how accurately the patients’ performance in a preoperatively performed tonic cold pain test could predict persistent post-surgery pain.
Methods: We analysed 763 patients from a cohort of 900 women who were treated for breast cancer, of whom 61 patients had developed signs of persistent pain during three years of follow-up. Preoperatively, all patients underwent a cold pain test (immersion of the hand into a water bath at 2–4 °C). The patients rated the pain intensity on a numerical rating scale (NRS) from 0 to 10. Supervised machine-learning techniques were used to construct a classifier that could predict patients at risk of persistent pain.
Results: Whether or not a patient rated the pain intensity at NRS=10 within less than 45 s during the cold water immersion test provided a negative predictive value of 94.4% for assigning a patient to the "persistent pain" group. If NRS=10 was never reached during the cold test, the predictive value for not developing persistent pain was almost 97%. However, a low positive predictive value of 10% implied a high false positive rate.
Conclusions: Results provide a robust exclusion of persistent pain in women with an accuracy of 94.4%. Moreover, results provide further support for the hypothesis that the endogenous pain inhibitory system may play an important role in the process of pain becoming persistent.
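For reference, the predictive values discussed above follow from a standard 2×2 confusion table; the counts in this sketch are hypothetical and not the study's data:

```r
# Predictive values from a 2x2 table (counts below are hypothetical, not the study data).
tp <- 6; fp <- 54; fn <- 4; tn <- 120                # "positive" = reached NRS 10 in < 45 s
ppv  <- tp / (tp + fp)                               # positive predictive value
npv  <- tn / (tn + fn)                               # negative predictive value
sens <- tp / (tp + fn); spec <- tn / (tn + fp)
round(c(PPV = ppv, NPV = npv, sensitivity = sens, specificity = spec), 3)
```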
Diminished sense of smell impairs the quality of life, but olfactorily disabled people are hardly considered in measures of disability inclusion. We aimed to stratify perceptual characteristics and odors according to the extent to which they are perceived differently with reduced sense of smell, as a possible basis for creating olfactory experiences that are enjoyed in a similar way by subjects with normal or impaired olfactory function. In 146 subjects with normal or reduced olfactory function, perceptual characteristics (edibility, intensity, irritation, temperature, familiarity, hedonics, painfulness) were tested for four sets of 10 different odors each. Data were analyzed with (i) a projection based on principal component analysis and (ii) the training of a machine-learning algorithm in a 1000-fold cross-validated setting to distinguish between olfactory diagnoses based on odor property ratings. Both analytical approaches identified perceived intensity and familiarity with the odor as characteristics discriminating between olfactory diagnoses, followed by edibility, while evoked pain sensation and perceived temperature were not discriminating. Two disjoint sets of odors were identified, i.e., d = 4 “discriminating odors” with respect to olfactory diagnosis, including cis-3-hexenol, methyl salicylate, 1-butanol and cineole, and d = 7 “non-discriminating odors”, including benzyl acetate, heptanal, 4-ethyl-octanoic acid, methional, isobutyric acid, 4-decanolide and p-cresol. The different weightings of the perceptual properties of odors with normal or reduced sense of smell indicate possibilities to create sensory experiences, such as foods, meals or scents, that by emphasizing trigeminal perceptions can be enjoyed by both normosmic and hyposmic individuals.
The presence of cerebral lesions in patients with neurosensory alterations provides a unique window into brain function. Using a fuzzy logic based combination of morphological information about 27 olfactory-eloquent brain regions acquired with four different brain imaging techniques, patterns of brain damage were analyzed in 127 patients who displayed anosmia, i.e., complete loss of the sense of smell (n = 81), or other and mechanistically still incompletely understood olfactory dysfunctions including parosmia, i.e., distorted perceptions of olfactory stimuli (n = 50), or phantosmia, i.e., olfactory hallucinations (n = 22). A higher prevalence of parosmia, and as a tendency also phantosmia, was observed in subjects with medium overall brain damage. Further analysis showed a lower frequency of lesions in the right temporal lobe in patients with parosmia than in patients without parosmia. This negative direction of the differences was unique for parosmia. In anosmia, and also in phantosmia, lesions were more frequent in patients displaying the respective symptoms than in those without these dysfunctions. In anosmic patients, lesions in the right olfactory bulb region were much more frequent than in patients with a preserved sense of smell, whereas a higher frequency of carriers of lesions in the left frontal lobe was observed for phantosmia. We conclude that anosmia and phantosmia are the result of lost function in relevant brain areas, whereas parosmia is more complex, requiring damaged and intact brain regions at the same time.
Motivation: Calculating the magnitude of treatment effects or of differences between two groups is a common task in quantitative science. Standard effect size measures based on differences, such as the commonly used Cohen's d, fail to capture treatment-related effects on the data if the effects are not reflected by the central tendency. The present work aims at (i) developing a non-parametric alternative to Cohen’s d, which (ii) circumvents some of its numerical limitations and (iii) captures obvious changes in the data that do not affect the group means and are therefore missed by Cohen’s d.
Results: We propose "Impact” as a novel non-parametric measure of effect size obtained as the sum of two separate components: (i) a difference-based effect size measure implemented as the change in the central tendency of the group-specific data normalized to pooled variability, and (ii) a data distribution shape-based effect size measure implemented as the difference in the probability density of the group-specific data. Results obtained on artificial and empirical data showed that “Impact” is superior to Cohen's d by its additional second component in detecting clearly visible effects not reflected in central tendencies. The proposed effect size measure is invariant to the scaling of the data, reflects changes in the central tendency in cases where differences in the shape of probability distributions between subgroups are negligible, but captures changes in probability distributions as effects and is numerically stable even if the variances of the data set or its subgroups vanish.
Conclusions: The proposed effect size measure shares the ability to observe such an effect with machine learning algorithms. Therefore, the proposed effect size measure is particularly well suited for data science and artificial intelligence-based knowledge discovery from big and heterogeneous data.
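The motivating limitation of difference-based measures can be illustrated with a base-R sketch (this is not the authors' "Impact" measure): two groups with identical means but clearly different distributions yield a Cohen's d near zero:

```r
# Illustration of the motivation: two groups with identical means but clearly
# different distributions yield a Cohen's d near zero.
cohens_d <- function(a, b) {
  sp <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
             (length(a) + length(b) - 2))            # pooled standard deviation
  (mean(a) - mean(b)) / sp
}

set.seed(2)
a <- rnorm(500, 0, 1)                                # unimodal group
b <- c(rnorm(250, -3, 0.5), rnorm(250, 3, 0.5))      # bimodal group, mean also ~ 0
cohens_d(a, b)                                       # close to 0 despite an obvious effect
ks.test(a, b)$p.value                                # yet the distributions differ strongly
```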
Pain and pain chronification are incompletely understood and unresolved medical problems that continue to have a high prevalence. It has been accepted that pain is a complex phenomenon. Contemporary methods of computational science can use complex clinical and experimental data to better understand the complexity of pain. Among data science techniques, machine learning is referred to as a set of methods that can automatically detect patterns in data and then use the uncovered patterns to predict or classify future data, to observe structures such as subgroups in the data, or to extract information from the data suitable to derive new knowledge. Together with (bio)statistics, artificial intelligence and machine learning aim at learning from data. ...
Process pharmacology: a pharmacological data science approach to drug development and therapy
(2016)
A novel functional-genomics-based concept of pharmacology is proposed that uses artificial intelligence techniques for mining and knowledge discovery in "big data" providing comprehensive information about drugs’ targets and their functional genomics. In “process pharmacology”, drugs are associated with biological processes. This puts the disease, regarded as alterations in the activity of one or several cellular processes, in the focus of drug therapy. In this setting, the molecular drug targets are merely intermediates. The identification of drugs for therapeutic use or repurposing is based on similarities in the high-dimensional space of the biological processes that a drug influences. Applying this principle to data associated with lymphoblastic leukemia identified a short list of candidate drugs, including one that was recently proposed as a novel rescue medication for lymphocytic leukemia. The pharmacological data science approach provides successful selections of drug candidates in drug development and repurposing tasks.
Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is the t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in data sets containing distance- or density-based subgroup structures, the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
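The first of these observations can be checked with a small sketch, assuming the Rtsne package: t-SNE applied to structureless uniform data still produces an embedding in which apparent groupings may be read as clusters:

```r
# Sketch: t-SNE applied to homogeneously distributed data with no subgroup structure.
# Any apparent grouping in the embedding is an artifact of the projection.
# Assumes the Rtsne package.
library(Rtsne)

set.seed(21)
X   <- matrix(runif(1000 * 10), ncol = 10)      # 1000 points, 10 dimensions, no clusters
emb <- Rtsne(X, perplexity = 30, verbose = FALSE)$Y
plot(emb, pch = 16, cex = 0.5,
     xlab = "t-SNE 1", ylab = "t-SNE 2",
     main = "t-SNE of structureless data")
```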
Feature selection is a common step in data preprocessing that precedes machine learning to reduce the data space and the computational cost of processing or obtaining the data. Filtering out uninformative variables is also important for knowledge discovery. By reducing the data space to only those components that are informative to the class structure, feature selection can simplify models so that they can be more easily interpreted by researchers in the field, reminiscent of explainable artificial intelligence. Knowledge discovery in complex data thus benefits from feature selection that aims to understand feature sets in the thematic context from which the data set originates. However, a single variable selected from a very small number of variables that are technically sufficient for AI training may make little immediate thematic sense, whereas the additional consideration of a variable discarded during feature selection could make a scientific discovery very explicit. In this report, we propose an approach to explainable feature selection (XFS) based on a systematic reconsideration of unselected features. The difference between the respective classifications when training the algorithms with the selected features or with the unselected features provides a valid estimate of whether the relevant features in a data set have been selected and whether uninformative or trivial information has been filtered out. It is shown that revisiting originally unselected variables in multivariate data sets allows the detection of pathologies and errors in the feature selection that occasionally result in a failure to identify the most appropriate variables.
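A minimal sketch of this comparison, using a simple t-test filter and k-nearest neighbors as stand-ins for the feature selection and classifier of the report:

```r
# Minimal sketch of the XFS idea: train the same classifier on (i) the selected
# features and (ii) the discarded features, then compare accuracies. A t-test
# filter and k-NN serve as stand-ins for the methods used in the report.
library(class)                                   # for knn()

set.seed(13)
n <- 200
y <- factor(rep(c("a", "b"), each = n / 2))
X <- matrix(rnorm(n * 20), ncol = 20)            # 20 variables, only the first 3 informative
X[y == "b", 1:3] <- X[y == "b", 1:3] + 1.5

pvals    <- apply(X, 2, function(v) t.test(v ~ y)$p.value)
selected <- order(pvals)[1:3]                    # keep the k = 3 "best" features
train    <- sample(n, n / 2)

acc <- function(cols) {
  pred <- knn(X[train, cols, drop = FALSE], X[-train, cols, drop = FALSE],
              y[train], k = 5)
  mean(pred == y[-train])
}
c(selected = acc(selected), unselected = acc(setdiff(1:20, selected)))
```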
In a recent discussion on how to deal with data analysis issues initiated by reviewers of pain-related scientific manuscripts in the European Journal of Pain, a seemingly simple statistical issue was raised: two subsets of data in a paper had the same mean and standard deviation. A reviewer asked for a statistical test for or against the identity of the subset distributions. The authors insisted that if the mean and standard deviation were the same, this was sufficient evidence that the subsets of data were not significantly different.
This prompted a discussion among pain researchers, who are not necessarily primarily from the field of data science. A discussion of the importance of carefully examining the distribution of pain-related data therefore seems warranted in a journal whose primary audience is pain researchers...
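The statistical point is easy to demonstrate: two samples can share mean and standard deviation and still follow clearly different distributions, as in this small sketch:

```r
# Two samples with (numerically) identical mean and standard deviation can still
# come from clearly different distributions.
set.seed(17)
a <- rnorm(1000)                                  # unimodal
b <- c(rnorm(500, -2, 0.3), rnorm(500, 2, 0.3))   # bimodal
a <- (a - mean(a)) / sd(a)                        # force mean 0, sd 1 in both samples
b <- (b - mean(b)) / sd(b)
c(mean(a), sd(a), mean(b), sd(b))                 # identical summary statistics
ks.test(a, b)$p.value                             # yet the distributions differ markedly
```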
Recent advances in mathematical modelling and artificial intelligence have challenged the use of traditional regression analysis in biomedical research. This study examined artificial and cancer research data using binomial and multinomial logistic regression and compared its performance with other machine learning models such as random forests, support vector machines, Bayesian classifiers, k-nearest neighbours and repeated incremental pruning (RIPPER). The alternative models often outperformed regression in accurately classifying new cases. Logistic regression had a structural problem similar to early single-layer neural networks, which limited its ability to identify variables with high statistical significance for reliable class assignment. Therefore, regression is not always the best model for class prediction in biomedical datasets. The study emphasises the importance of validating selected models and suggests that a mixture-of-experts approach may be a more advanced and effective strategy for analysing biomedical datasets.
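The "structural problem" of a linear logistic model can be illustrated with XOR-type class structure, where a non-linear learner (k-nearest neighbors as a stand-in for the models compared in the study) succeeds while plain logistic regression stays near chance level:

```r
# XOR-type data: a plain (linear) logistic regression cannot separate the classes,
# while a non-linear learner (k-NN as a stand-in) can.
library(class)

set.seed(19)
n  <- 400
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
y  <- factor(ifelse(x1 * x2 > 0, "a", "b"))       # XOR-like class structure
d  <- data.frame(x1, x2, y)
train <- sample(n, n / 2)

glm_fit  <- glm(y ~ x1 + x2, data = d[train, ], family = binomial)
glm_pred <- ifelse(predict(glm_fit, d[-train, ], type = "response") > 0.5, "b", "a")
knn_pred <- knn(d[train, 1:2], d[-train, 1:2], d$y[train], k = 7)

c(logistic = mean(glm_pred == d$y[-train]),       # near chance level
  knn      = mean(knn_pred == d$y[-train]))       # clearly better
```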
Selecting the k best features is a common task in machine learning. Typically, a few features have high importance, while many have low importance (a right-skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution in order to reduce a feature set to the informative minimum of items. Computed ABC analysis (cABC) is an item categorization method that aims to identify the most important items by partitioning a set of non-negative numerical items into subsets "A", "B", and "C" such that subset "A" contains the "few important" items, based on specific properties of ABC curves defined by their relationship to Lorenz curves. In its recursive form, the cABC analysis can be applied again to subset "A". A generic image dataset and three biomedical datasets (a lipidomics and two genomics datasets) with large numbers of variables were used to perform the experiments. The experimental results show that the recursive cABC analysis limits the dimensions of the data projection to a minimum at which the relevant information is still preserved and directs the feature selection in machine learning to the most important class-relevant information, including the filtering of nonsense variables from feature sets. Feature sets were reduced to 10% or less of the original variables and still provided accurate classification in data not used for feature selection. cABC analysis, in its recursive variant, provides a computationally precise means of reducing information to a minimum. The minimum is the result of a computation of the number k of most relevant items, rather than a decision to select the k best items from a list. In addition, there are precise criteria for stopping the reduction process. The reduction to the most important features can improve the human understanding of the properties of the data set. The cABC method is implemented in the Python package "cABCanalysis" available at https://pypi.org/project/cABCanalysis/.
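The recursion principle can be sketched as follows with an ad-hoc set-A cutoff (the closest point of the ABC curve to the ideal point (0, 1)); this is not the exact limit computation of the published cABC analysis:

```r
# Recursive ABC sketch with an ad-hoc set-A cutoff; NOT the exact limit computation
# of the cABC analysis, whose reference implementation is the Python package
# "cABCanalysis".
abc_set_A <- function(x) {
  ord    <- order(x, decreasing = TRUE)
  effort <- seq_along(x) / length(x)
  yield  <- cumsum(x[ord]) / sum(x)
  ord[seq_len(which.min(effort^2 + (1 - yield)^2))]   # indices of the "important few"
}

recursive_abc <- function(x, max_depth = 3) {
  idx <- seq_along(x)
  for (i in seq_len(max_depth)) {
    keep <- abc_set_A(x[idx])
    if (length(keep) == length(idx)) break             # stop when no further reduction
    idx <- idx[keep]
  }
  idx
}

set.seed(23)
importance <- rexp(200)                                # right-skewed feature importances
length(recursive_abc(importance))                      # reduced to a small informative core
```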