Background: The quantification of global DNA methylation has been established in epigenetic screening. As more practicable alternatives to the HPLC-based gold standard, the methylation analysis of CpG islands in repetitive elements (LINE-1) and the luminometric methylation assay (LUMA) of the overall 5-methylcytosine content in “CCGG” recognition sites are the most widely used. Both methods are applied as virtually equivalent, despite hints that their results only partly agree. This triggered the present agreement assessments.
Results: Three different human cell types (cultured MCF7 and SH-SY5Y cell lines treated with different chemical modulators of DNA methylation, and whole blood drawn from pain patients and healthy volunteers) were submitted to the global DNA methylation assays employing LINE-1 or LUMA-based pyrosequencing measurements. The agreement between the two bioassays was assessed using generally accepted statistical approaches for laboratory method comparison studies. Although global DNA methylation levels measured by the two methods correlated, five different lines of statistical evidence consistently rejected the assumption of complete agreement. Specifically, a bias was observed between the two methods. In addition, both the magnitude and the direction of the bias were tissue-dependent. Interassay differences could be grouped based on Bayesian statistics, and these groups in turn allowed the originating tissue to be re-identified.
Conclusions: Although the two methods provide partly correlated measurements of DNA methylation, the interchangeability of the quantitative results obtained with LINE-1 and LUMA is compromised by a consistent bias between them. Moreover, the present analyses strongly indicate a tissue specificity of the differences between the two methods.
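The kind of method-comparison statistics described above can be illustrated with a minimal Bland-Altman-style sketch. This is not the study's analysis; the paired measurement values and the function name `bland_altman_bias` are hypothetical.

```python
import statistics

def bland_altman_bias(method_a, method_b):
    """Mean bias and 95% limits of agreement between two paired
    measurement series, as in a Bland-Altman agreement analysis."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    # 95% limits of agreement: bias +/- 1.96 * SD of the differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired global-methylation readings (percent 5mC)
line1 = [72.1, 70.8, 73.5, 71.9, 72.6]
luma = [68.4, 67.9, 70.1, 68.8, 69.5]
bias, limits = bland_altman_bias(line1, luma)
```

A non-zero `bias` with narrow limits of agreement is the pattern that argues against interchangeability even when the two series correlate.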
Advances in flow cytometry enable the acquisition of large, high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, ultimately, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure into homogeneously distributed data that contained no subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach correctly identified homogeneous data, while in data sets containing distance- or density-based subgroup structures, the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
Computational analyses of the functions of gene sets obtained in microarray analyses or by topical database searches are increasingly important in biology. To understand their functions, the sets are usually mapped to Gene Ontology knowledge bases by means of over-representation analysis (ORA). Its result represents the specific knowledge of the functionality of the gene set. However, the specific ontology typically consists of many terms and relationships, hindering the understanding of the ‘main story’. We developed a methodology to identify a comprehensibly small number of GO terms as “headlines” of the specific ontology, allowing all central aspects of the roles of the involved genes to be understood. The Functional Abstraction method finds a set of headlines that is specific enough to cover all details of a specific ontology and abstract enough for human comprehension. This method goes beyond classical approaches to ORA abstraction and, by focusing on information content rather than decorrelation of GO terms, directly targets human comprehension. Functional Abstraction provides, with a maximum of certainty, information value, coverage and conciseness, a representation of the biological functions in which a gene set plays a role. This is the necessary means to interpret complex Gene Ontology results, thus strengthening the role of functional genomics in biomarker and drug discovery.
The presence of cerebral lesions in patients with neurosensory alterations provides a unique window into brain function. Using a fuzzy-logic-based combination of morphological information about 27 olfactory-eloquent brain regions acquired with four different brain imaging techniques, patterns of brain damage were analyzed in 127 patients who displayed anosmia, i.e., complete loss of the sense of smell (n = 81), or other, mechanistically still incompletely understood olfactory dysfunctions, including parosmia, i.e., distorted perceptions of olfactory stimuli (n = 50), or phantosmia, i.e., olfactory hallucinations (n = 22). A higher prevalence of parosmia, and as a tendency also of phantosmia, was observed in subjects with medium overall brain damage. Further analysis showed a lower frequency of lesions in the right temporal lobe in patients with parosmia than in patients without parosmia. This negative direction of the differences was unique to parosmia. In anosmia, and also in phantosmia, lesions were more frequent in patients displaying the respective symptoms than in those without these dysfunctions. In anosmic patients, lesions in the right olfactory bulb region were much more frequent than in patients with a preserved sense of smell, whereas a higher frequency of carriers of lesions in the left frontal lobe was observed for phantosmia. We conclude that anosmia and phantosmia are the result of lost function in relevant brain areas, whereas parosmia is more complex, requiring damaged and intact brain regions at the same time.
Background: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the used cluster algorithm works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).
Methods: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means.
Results: Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. At the same time, ESOM/U-matrix correctly identified clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.
Conclusions: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
Graphical abstract: 3-D representation of high dimensional data following ESOM projection and visualization of group (cluster) structures using the U-matrix, which employs a geographical map analogy of valleys where members of the same cluster are located, separated by mountain ranges marking cluster borders.
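The pitfall described in this abstract can be demonstrated with a minimal stdlib-only sketch (not the ESOM/U-matrix method, and not the original R tool): a plain Lloyd's k-means always partitions its input into k groups, even when the data are homogeneously distributed and contain no cluster structure at all. All data and names here are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on 2-D points. It always returns k
    partitions, whether or not the data contain actual clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = []
        for i, cl in enumerate(clusters):
            if cl:  # recompute centroid; keep old center if cluster emptied
                new_centers.append((sum(x for x, _ in cl) / len(cl),
                                    sum(y for _, y in cl) / len(cl)))
            else:
                new_centers.append(centers[i])
        centers = new_centers
    return clusters

# Homogeneously distributed data with no subgroup structure at all
rng = random.Random(42)
uniform = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(300)]
parts = kmeans(uniform, k=3)
# k-means nevertheless reports exactly three "clusters"
```

The point of the sketch is that the algorithm's output alone cannot distinguish real subgroups from an imposed partition, which is why a structure-revealing visualization such as the U-matrix is advocated above.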
Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)
(2021)
Motivation: The size of today’s biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data sets by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method.
Results: By repeating the random sampling and comparing the distribution of the drawn sample with the distribution of the original data, it was possible to establish a method for obtaining subsets that reflect the entire data set better than taking only the first randomly selected subsample, as is the current standard. Experiments on artificial and real biomedical data sets showed that the reconstruction of the remaining data of the original data set from the downsampled subset improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data set and the number of repeated samples drawn.
Conclusions: Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data set better than those obtained with the standard method. Because it uses distributional similarity as the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
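The idea of repeated sampling with a distributional-similarity criterion can be sketched in one dimension using a two-sample Kolmogorov-Smirnov statistic. This is an assumption-laden simplification, not the opdisDownsampling implementation; the function names and data are hypothetical.

```python
import random

def ks_statistic(sample, full):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    distance between the two empirical distribution functions."""
    s, f = sorted(sample), sorted(full)
    d, i, j = 0.0, 0, 0
    for v in sorted(set(s) | set(f)):
        while i < len(s) and s[i] <= v:
            i += 1
        while j < len(f) and f[j] <= v:
            j += 1
        d = max(d, abs(i / len(s) - j / len(f)))
    return d

def distribution_preserving_sample(data, size, trials=50, seed=1):
    """Draw `trials` random subsamples and keep the one whose
    distribution is closest to the full data (smallest KS statistic),
    instead of keeping only the first random draw."""
    rng = random.Random(seed)
    candidates = [rng.sample(data, size) for _ in range(trials)]
    return min(candidates, key=lambda s: ks_statistic(s, data))

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
best = distribution_preserving_sample(data, size=100)
```

Because only distributional similarity decides which draw is kept, the selection itself introduces no bias with respect to any downstream analysis, mirroring the abstract's conclusion.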
Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)
(2022)
Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most widely used Euclidean metric is not scale-invariant and is therefore occasionally inappropriate for complex, e.g., multimodally distributed, variables, and may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit separating within-cluster from between-cluster distances.
Methods: The Euclidean distances within a standard normal distribution (N(0,1)) follow a N(0, √2) distribution. The EDO transformation of a variable X is proposed as EDO = X/(√2·s), following modeling of the standard deviation s by a mixture of Gaussians and selecting the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling.
Results: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. The clustering results in terms of cluster accuracy, adjusted Rand index and Dunn’s index outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues, where the proposed approach gave better results than the classical methods, including pooled variable scaling.
Conclusions: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at https://cran.r-project.org/package=EDOtrans.
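The simplest case of the transformation stated in the abstract, EDO = X/(√2·s), can be written directly. In this sketch, s defaults to the plain sample standard deviation, which matches the published formula only for unimodal data; the full method's Gaussian-mixture modeling of s is omitted, and the example values are hypothetical.

```python
import math
import statistics

def edo_transform(x, s=None):
    """EDO scaling: divide by sqrt(2) times the standard deviation s.
    Here s defaults to the plain sample SD (a unimodal-case
    simplification of the mixture-of-Gaussians estimate)."""
    if s is None:
        s = statistics.stdev(x)
    return [v / (math.sqrt(2) * s) for v in x]

values = [4.1, 5.0, 5.9, 4.6, 5.4]
scaled = edo_transform(values)
# After EDO scaling, the sample SD equals 1/sqrt(2), so a distance
# of 1 separates typical within-cluster from between-cluster spacing
```

Compared with the z-transformation, which rescales to SD = 1, the extra factor √2 aligns the distance scale with the N(0, √2) distribution of Euclidean distances mentioned in the Methods.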
Next-generation sequencing (NGS) provides unrestricted access to the genome, but it produces ‘big data’ exceeding in amount and complexity the classical analytical approaches. We introduce a bioinformatics-based classifying biomarker that uses emergent properties in genetics to separate pain patients requiring extremely high opioid doses from controls. Following a precisely calculated selection of the 34 most informative markers in the OPRM1, OPRK1, OPRD1 and SIGMAR1 genes, patterns of genotypes belonging to either patient group could be derived using a k-nearest-neighbor (kNN) classifier that provided a diagnostic accuracy of 80.6 ± 4%. This outperformed alternative classifiers such as reportedly functional opioid receptor gene variants or complex biomarkers obtained via multiple regression or decision tree analysis. The accumulation of several genetic variants with only minor functional influences may result in a qualitative consequence affecting complex phenotypes, pointing at emergent properties in genetics.
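A kNN classifier over genotype patterns can be sketched as follows. The genotype coding (0/1/2 minor-allele counts), the mismatch distance, and all training data are hypothetical illustrations, not the study's 34-marker classifier.

```python
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """k-nearest-neighbour majority vote over genotype vectors coded
    as 0/1/2 allele counts, using a simple mismatch distance."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    nearest = sorted(range(len(train)),
                     key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical genotype patterns (one value per marker) for two groups
train = [
    (0, 0, 1, 0), (0, 1, 0, 0), (1, 0, 0, 0),  # controls
    (2, 1, 2, 1), (1, 2, 2, 2), (2, 2, 1, 2),  # high-dose patients
]
labels = ["control"] * 3 + ["high-dose"] * 3
```

The classifier needs no model of each variant's individual effect; it votes on whole genotype patterns, which is how an accumulation of minor variants can still yield a separable signal.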
Background: Human genetic research has implicated functional variants of more than one hundred genes in the modulation of persisting pain. Artificial intelligence and machine‐learning techniques may combine this knowledge with results of genetic research gathered in any context, which permits the identification of the key biological processes involved in chronic sensitization to pain.
Methods: Based on published evidence, a set of 110 genes carrying variants reported to be associated with modulation of the clinical phenotype of persisting pain in eight different clinical settings was submitted to unsupervised machine‐learning aimed at functional clustering. Subsequently, a mathematically supported subset of genes, comprising those most consistently involved in persisting pain, was analysed by means of computational functional genomics in the Gene Ontology knowledgebase.
Results: Clustering of genes with evidence for a modulation of persisting pain elucidated a functionally heterogeneous set. The situation cleared when the focus was narrowed to a genetic modulation consistently observed throughout several clinical settings. On this basis, two groups of biological processes, the immune system and nitric oxide signalling, emerged as major players in sensitization to persisting pain, which is biologically highly plausible and in agreement with other lines of pain research.
Conclusions: The present computational functional genomics‐based approach provided a computational systems‐biology perspective on chronic sensitization to pain. Human genetic control of persisting pain points to the immune system as a source of potential future targets for drugs directed against persisting pain. Contemporary machine‐learned methods provide innovative approaches to knowledge discovery from previous evidence.
Significance: We show that knowledge discovery in genetic databases and contemporary machine‐learned techniques can identify relevant biological processes involved in persistent pain.
Background: To prevent persistent post-surgery pain, early identification of patients at high risk is a clinical need. Supervised machine-learning techniques were used to test how accurately the patients’ performance in a preoperatively performed tonic cold pain test could predict persistent post-surgery pain.
Methods: We analysed 763 patients from a cohort of 900 women who were treated for breast cancer, of whom 61 patients had developed signs of persistent pain during three yr of follow-up. Preoperatively, all patients underwent a cold pain test (immersion of the hand into a water bath at 2–4 °C). The patients rated the pain intensity using a numerical rating scale (NRS) from 0 to 10. Supervised machine-learning techniques were used to construct a classifier that could predict patients at risk of persistent pain.
Results: Whether or not a patient rated the pain intensity at NRS=10 within less than 45 s during the cold water immersion test provided a negative predictive value of 94.4% to assign a patient to the "persistent pain" group. If NRS=10 was never reached during the cold test, the predictive value for not developing persistent pain was almost 97%. However, a low positive predictive value of 10% implied a high false positive rate.
Conclusions: Results provide a robust exclusion of persistent pain in women with an accuracy of 94.4%. Moreover, results provide further support for the hypothesis that the endogenous pain inhibitory system may play an important role in the process of pain becoming persistent.
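The decision rule reported above (NRS = 10 reached in under 45 s) and the negative predictive value it is judged by can be written out directly. The cohort data below are hypothetical toy values, not the study's patients.

```python
def predicts_persistent_pain(time_to_nrs10_s):
    """Rule from the abstract: flag risk of persistent pain when the
    patient reaches NRS = 10 in under 45 s of cold-water immersion.
    None means NRS = 10 was never reached during the test."""
    return time_to_nrs10_s is not None and time_to_nrs10_s < 45

def negative_predictive_value(predictions, outcomes):
    """NPV: fraction of rule-negative patients who indeed did not
    develop persistent pain."""
    negatives = [o for p, o in zip(predictions, outcomes) if not p]
    return sum(1 for o in negatives if not o) / len(negatives)

# Hypothetical cohort: seconds to NRS = 10 (None = never reached)
# and whether persistent pain later developed
times = [30, None, 120, 40, 90, None, 50, 20]
persistent = [True, False, False, False, False, False, True, True]
preds = [predicts_persistent_pain(t) for t in times]
npv = negative_predictive_value(preds, persistent)
```

A high NPV with a low positive predictive value, as reported in the Results, means the test is useful for ruling persistent pain out, not for ruling it in.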