Publicly available compound and bioactivity databases provide an essential basis for data-driven applications in life-science research and drug design. By analyzing several bioactivity repositories, we discovered differences in compound and target coverage advocating the combined use of data from multiple sources. Using data from ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs, we assembled a consensus dataset focusing on small molecules with bioactivity on human macromolecular targets. This allowed an improved coverage of compound space and targets, and an automated comparison and curation of structural and bioactivity data to reveal potentially erroneous entries and increase confidence. The consensus dataset comprised more than 1.1 million compounds with over 10.9 million bioactivity data points, annotated with assay type and bioactivity confidence, providing a useful ensemble for computational applications in drug design and chemogenomics.
Based on accumulating evidence of a role of lipid signaling in many physiological and pathophysiological processes, including psychiatric diseases, the present data-driven analysis was designed to gather information needed to develop a prospective biomarker, using a targeted lipidomics approach covering different lipid mediators. Using unsupervised methods of data structure detection, implemented as hierarchical clustering, emergent self-organizing maps of neuronal networks, and principal component analysis, a cluster structure was found in the input data space comprising plasma concentrations of d = 35 different lipid markers of various classes acquired in n = 94 subjects with the clinical diagnoses depression, bipolar disorder, ADHD, or dementia, or in healthy controls. The structure separated patients with dementia from the other clinical groups, indicating that dementia is associated with a distinct lipid mediator plasma concentration pattern, possibly providing a basis for a future biomarker. This hypothesis was subsequently assessed using supervised machine-learning methods, implemented as random forests or principal component analysis followed by computed ABC analysis for feature selection, and as random forests, k-nearest neighbors, support vector machines, multilayer perceptrons, and naïve Bayesian classifiers, to estimate whether the selected lipid mediators provide sufficient information to establish the diagnosis of dementia at a higher accuracy than by guessing. This succeeded using a set of d = 7 markers comprising GluCerC16:0, Cer24:0, Cer20:0, Cer16:0, Cer24:1, C16 sphinganine, and LacCerC16:0, at an accuracy of 77%.
By contrast, using random lipid markers reduced the diagnostic accuracy to values of 65% or less, whereas training the algorithms with randomly permuted data was followed by a complete failure to diagnose dementia, emphasizing that the selected lipid mediators display a particular pattern in this disease, possibly qualifying as biomarkers.
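The permutation control described above — retraining on randomly shuffled diagnoses and expecting accuracy to collapse toward chance — can be sketched as follows. The nearest-mean rule and all marker values below are invented stand-ins for the study's random forests and lipidomic data:

```python
import random

random.seed(0)

def nearest_mean_accuracy(values, labels):
    """Fit a nearest-mean classifier and report resubstitution accuracy."""
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(l, []).append(v)
    means = {l: sum(vs) / len(vs) for l, vs in groups.items()}
    correct = sum(1 for v, l in zip(values, labels)
                  if min(means, key=lambda m: abs(v - means[m])) == l)
    return correct / len(values)

# Hypothetical lipid-marker concentrations: group 0 low, group 1 high.
values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

acc_real = nearest_mean_accuracy(values, labels)

# Permuting the labels destroys the marker-diagnosis link, so accuracy
# should fall toward chance -- mirroring the abstract's negative control.
permuted = labels[:]
random.shuffle(permuted)
acc_permuted = nearest_mean_accuracy(values, permuted)
```

Whatever the permuted accuracy turns out to be, it cannot exceed the accuracy obtained with the true grouping here, which is the essence of the control.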
Bacteria that are capable of organizing themselves as biofilms are an important public health issue. Knowledge discovery focusing on the ability to swarm and conquer the surroundings to form persistent colonies is therefore very important for microbiological research communities with a clinical focus. Here, we demonstrate how a machine learning workflow can be used to create useful models capable of discriminating growth behaviors associated with distinct phenotypes. Based on basic gray-scale images, we provide a processing pipeline for binary image generation, making the workflow accessible for imaging data from a wide range of devices and conditions. The workflow includes a locally estimated regression model that easily applies to growth-related data and a shape analysis using identified principal components. Finally, we apply density-based spatial clustering of applications with noise (DBSCAN) to extract and analyze characteristic, general features of colony shapes and areas to discriminate distinct Bacillus subtilis phenotypes. Our results suggest that differences in the strains' ability to swarm and subsequently conquer the surrounding medium result in characteristic features. Differences in the time scales of the distinct latencies of colony formation give insights into the ability to invade the surroundings and could therefore serve as a useful monitoring tool.
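The clustering step can be illustrated with a minimal, pure-Python sketch of DBSCAN: dense groups of colony feature vectors become clusters, while isolated measurements are flagged as noise. The (area, shape-factor) values are invented; a production workflow would use a library implementation such as scikit-learn's.

```python
def region_query(points, i, eps):
    """Indices of all points within eps of point i (Euclidean distance)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if ((px - qx) ** 2 + (py - qy) ** 2) ** 0.5 <= eps]

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:                 # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                seeds.extend(j_neighbours)
    return labels

# Hypothetical (colony area, shape factor) pairs: two phenotype groups
# plus one outlier that DBSCAN should flag as noise.
features = [(1.0, 1.1), (1.1, 0.9), (0.9, 1.0),
            (5.0, 5.1), (5.1, 4.9), (4.9, 5.0),
            (10.0, 0.0)]
labels = dbscan(features, eps=0.5, min_pts=3)
```

The noise label is what makes DBSCAN attractive here: atypical colonies do not distort the phenotype clusters but are reported separately.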
Advanced machine learning has achieved extraordinary success in recent years. For “active” operational risk management that goes beyond ex post analysis of measured data, machine learning could provide help beyond the regime of traditional statistical analysis when it comes to the “known unknown” or even the “unknown unknown.” While machine learning has been tested successfully in the regime of the “known,” heuristics typically provide better results for active operational risk management (in the sense of forecasting). However, precursors in existing data can give machine learning a chance to provide early warnings even for the regime of the “unknown unknown.”
In this study, a portable electronic nose (E-nose) prototype is developed using metal oxide semiconductor (MOS) sensors to detect odors of different wines. Odor detection facilitates the distinction of wines with different properties, including areas of production, vintage years, fermentation processes, and varietals. Four popular machine learning algorithms—extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and backpropagation neural network (BPNN)—were used to build identification models for different classification tasks. Experimental results show that BPNN achieved the best performance, with accuracies of 94% and 92.5% in identifying production areas and varietals, respectively; and SVM achieved the best performance in identifying vintages and fermentation processes, with accuracies of 67.3% and 60.5%, respectively. Results demonstrate the effectiveness of the developed E-nose, which could be used to distinguish different wines based on their properties following selection of an optimal algorithm.
The state-of-the-art pattern recognition method in machine learning (deep convolutional neural network) is used to identify the equation of state (EoS) employed in the relativistic hydrodynamic simulations of heavy ion collisions. High-level correlations of particle spectra in transverse momentum and azimuthal angle learned by the network act as an effective EoS-meter in deciphering the nature of the phase transition in QCD. The EoS-meter is model independent and insensitive to other simulation inputs including the initial conditions and shear viscosity for hydrodynamic simulations. Through this study we demonstrate that there is a traceable encoder of the dynamical information from the phase structure that survives the evolution and exists in the final snapshot of heavy ion collisions, and that one can exclusively and effectively decode this information from the highly complex final output with machine learning when traditional methods fail. Besides the deep neural network, the performance of traditional machine learning classifiers is also provided.
The most basic behavioural states of animals can be described as active or passive. While high-resolution observations of activity patterns can provide insights into the ecology of animal species, few methods are able to measure the activity of individuals of small taxa in their natural environment. We present a novel approach in which a combination of automatic radiotracking and machine learning is used to distinguish between active and passive behaviour in small vertebrates fitted with lightweight transmitters (<0.4 g).
We used a dataset containing >3 million signals from very-high-frequency (VHF) telemetry from two forest-dwelling bat species (Myotis bechsteinii [n = 52] and Nyctalus leisleri [n = 20]) to train and test a random forest model in assigning either active or passive behaviour to VHF-tagged individuals. The generalisability of the model was demonstrated by recording and classifying the behaviour of tagged birds and by simulating the effect of different activity levels with the help of humans carrying transmitters. The model successfully classified the activity states of bats as well as those of birds and humans, although the latter were not included in model training (F1 0.96–0.98).
We provide an ecological case-study demonstrating the potential of this automated monitoring tool. We used the trained models to compare differences in the daily activity patterns of two bat species. The analysis showed a pronounced bimodal activity distribution of N. leisleri over the course of the night while the night-time activity of M. bechsteinii was relatively constant. These results show that subtle differences in the timing of species' activity can be distinguished using our method.
Our approach can classify VHF-signal patterns into fundamental behavioural states with high precision and is applicable to different terrestrial and flying vertebrates. To encourage the broader use of our radiotracking method, we provide the trained random forest models together with an R package that includes all necessary data processing functionalities. In combination with state-of-the-art open-source automated radiotracking, this toolset can be used by the scientific community to investigate the activity patterns of small vertebrates with high temporal resolution, even in dense vegetation.
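The core idea — movement of a tagged animal modulates the received VHF signal strength, so the variability of successive signal-strength values within a window is informative about activity — can be sketched with a toy threshold rule standing in for the published random forest. All signal values and the threshold below are invented for illustration:

```python
import statistics

def activity_feature(signal_strengths):
    """Standard deviation of received signal strength within a window."""
    return statistics.pstdev(signal_strengths)

def classify(signal_strengths, threshold=2.0):
    """Toy stand-in for the random forest: one feature, one threshold."""
    return "active" if activity_feature(signal_strengths) > threshold else "passive"

passive_window = [50.0, 50.5, 49.8, 50.2, 50.1]   # steady signal: resting animal
active_window = [48.0, 55.0, 43.0, 57.0, 45.0]    # fluctuating signal: moving animal

label_passive = classify(passive_window)
label_active = classify(active_window)
```

The published model replaces the single threshold with a forest trained on many such window-level features, which is what allows it to generalise across species.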
Comprehensive analysis of tumour sub-volumes for radiomic risk modelling in locally advanced HNSCC
(2020)
Simple Summary: Radiomic risk models are usually based on imaging features, which are extracted from the entire gross tumour volume (GTVentire). This approach does not explicitly consider the complex biological structure of the tumours. Therefore, in this retrospective study, we investigated the prognostic value of radiomic analyses based on different tumour sub-volumes using computed tomography imaging of patients with locally advanced head and neck squamous cell carcinoma who were treated with primary radio-chemotherapy. The GTVentire was cropped by different margins to define the rim and corresponding core sub-volumes of the tumour. Furthermore, the best performing tumour rim sub-volume was extended into surrounding tissue with different margins. As a result, the models based on the 5 mm tumour rim and on the 3 mm extended rim sub-volume showed an improved performance compared to models based on the corresponding tumour core. This indicates that the consideration of tumour sub-volumes may help to improve radiomic risk models.
Abstract: Imaging features for radiomic analyses are commonly calculated from the entire gross tumour volume (GTVentire). However, tumours are biologically complex and the consideration of different tumour regions in radiomic models may lead to an improved outcome prediction. Therefore, we investigated the prognostic value of radiomic analyses based on different tumour sub-volumes using computed tomography imaging of patients with locally advanced head and neck squamous cell carcinoma. The GTVentire was cropped by different margins to define the rim and the corresponding core sub-volumes of the tumour. Subsequently, the best performing tumour rim sub-volume was extended into surrounding tissue with different margins. Radiomic risk models were developed and validated using a retrospective cohort consisting of 291 patients treated between 2005 and 2013 in one of the six Partner Sites of the German Cancer Consortium Radiation Oncology Group. The validation concordance index (C-index) averaged over all applied learning algorithms and feature selection methods using the GTVentire achieved a moderate prognostic performance for loco-regional tumour control (C-index: 0.61 ± 0.04 (mean ± std)). The models based on the 5 mm tumour rim and on the 3 mm extended rim sub-volume showed higher median performances (C-index: 0.65 ± 0.02 and 0.64 ± 0.05, respectively), while models based on the corresponding tumour core volumes performed worse (C-index: 0.59 ± 0.01). The difference in C-index between the 5 mm tumour rim and the corresponding core volume showed a statistical trend (p = 0.10). After additional prospective validation, the consideration of tumour sub-volumes may be a promising way to improve prognostic radiomic risk models.
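The concordance index used for validation can be computed as the fraction of comparable patient pairs in which the model's risk ordering matches the observed event-time ordering (Harrell's C-index). The risks, follow-up times, and event indicators below are invented toy data, not values from the study:

```python
def concordance_index(risks, times, events):
    """Harrell's C-index: a pair (i, j) is comparable when patient i has the
    earlier follow-up time and experienced the event; ties in risk count half."""
    concordant, comparable = 0.0, 0
    n = len(risks)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical model risks, follow-up times (months), and event indicators.
risks = [0.9, 0.3, 0.4, 0.2]
times = [6, 12, 30, 40]
events = [1, 1, 0, 0]
cindex = concordance_index(risks, times, events)
```

A C-index of 0.5 corresponds to random risk ordering, 1.0 to perfect concordance, which is why values around 0.6 count only as moderate prognostic performance.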
Scores to identify patients at high risk of progression of coronavirus disease (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), may become instrumental for clinical decision-making and patient management. We used patient data from the multicentre Lean European Open Survey on SARS-CoV-2-Infected Patients (LEOSS) and applied variable selection to develop a simplified scoring system to identify patients at increased risk of critical illness or death. A total of 1946 patients who tested positive for SARS-CoV-2 were included in the initial analysis and assigned to derivation and validation cohorts (n = 1297 and n = 649, respectively). Stability selection from over 100 baseline predictors for the combined endpoint of progression to the critical phase or COVID-19-related death enabled the development of a simplified score consisting of five predictors: C-reactive protein (CRP), age, clinical disease phase (uncomplicated vs. complicated), serum urea, and D-dimer (abbreviated as CAPS-D score). This score yielded an area under the curve (AUC) of 0.81 (95% confidence interval [CI]: 0.77–0.85) in the validation cohort for predicting the combined endpoint within 7 days of diagnosis and 0.81 (95% CI: 0.77–0.85) during full follow-up. We used an additional prospective cohort of 682 patients, diagnosed largely after the “first wave” of the pandemic, to validate the predictive accuracy of the score and observed similar results (AUC for the event within 7 days: 0.83 [95% CI: 0.78–0.87]; for full follow-up: 0.82 [95% CI: 0.78–0.86]). An easily applicable score to calculate the risk of COVID-19 progression to critical illness or death was thus established and validated.
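A simplified additive score of this kind, and its AUC-based validation, can be sketched as follows. The five CAPS-D predictors are taken from the abstract, but the one-point-per-abnormal-predictor rule, the cut-offs, and the patient records are invented for illustration; the published score is not necessarily additive in this way.

```python
def caps_d_like_score(patient):
    """Toy additive score: one point per abnormal predictor (rule invented)."""
    return sum([
        patient["crp_elevated"],
        patient["age"] >= 65,          # hypothetical age cut-off
        patient["complicated_phase"],
        patient["urea_elevated"],
        patient["d_dimer_elevated"],
    ])

def auc(scores, outcomes):
    """Probability that a random event case scores higher than a random
    non-event case (ties count half) -- the Mann-Whitney form of the AUC."""
    pos = [s for s, y in zip(scores, outcomes) if y]
    neg = [s for s, y in zip(scores, outcomes) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented patient records paired with the outcome indicator (1 = progression).
patients = [
    ({"crp_elevated": 1, "age": 80, "complicated_phase": 1,
      "urea_elevated": 1, "d_dimer_elevated": 1}, 1),
    ({"crp_elevated": 1, "age": 70, "complicated_phase": 0,
      "urea_elevated": 1, "d_dimer_elevated": 0}, 1),
    ({"crp_elevated": 0, "age": 40, "complicated_phase": 0,
      "urea_elevated": 0, "d_dimer_elevated": 0}, 0),
    ({"crp_elevated": 1, "age": 50, "complicated_phase": 0,
      "urea_elevated": 0, "d_dimer_elevated": 0}, 0),
]
scores = [caps_d_like_score(p) for p, _ in patients]
outcomes = [y for _, y in patients]
toy_auc = auc(scores, outcomes)
```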
The use of artificial intelligence (AI) systems in biomedical and clinical settings can disrupt the traditional doctor–patient relationship, which is based on trust and transparency in medical advice and therapeutic decisions. When the diagnosis or selection of a therapy is no longer made solely by the physician, but to a significant extent by a machine using algorithms, decisions become nontransparent. Skill learning is the most common application of machine learning algorithms in clinical decision making. These are a class of very general algorithms (artificial neural networks, classifiers, etc.), which are tuned based on examples to optimize the classification of new, unseen cases. For such algorithms, it is pointless to ask for an explanation of an individual decision. A detailed understanding of the mathematical details of an AI algorithm may be possible for experts in statistics or computer science. However, when it comes to the fate of human beings, this “developer’s explanation” is not sufficient. The concept of explainable AI (XAI) as a solution to this problem is attracting increasing scientific and regulatory interest. This review focuses on the requirement that XAIs must be able to explain in detail the decisions made by the AI to the experts in the field.
When requesting a web-based service, users often fail to set the website’s privacy settings according to their own privacy preferences. Being overwhelmed by the choice of options, lacking knowledge of the related technologies, or being unaware of their own privacy preferences are just some of the reasons why users tend to struggle. Privacy-setting prediction tools are particularly well suited to address these problems. Such tools aim to lower the burden of setting privacy preferences in line with the owner’s actual preferences. To meet the increased demand for explainability and interpretability arising from regulatory obligations – such as the General Data Protection Regulation (GDPR) in Europe – in this paper an explainable model for default privacy-setting prediction is introduced. Compared to previous work, we present improved feature selection, increased interpretability of each step in the model design, and enhanced evaluation metrics to better identify weaknesses in the model’s design before it goes into production. As a result, we aim to provide an explainable and transparent tool for default privacy-setting prediction which users easily understand and are therefore more likely to use.
Phenotypical screening is a widely used approach in drug discovery for the identification of small molecules with cellular activities. However, functional annotation of identified hits often poses a challenge. The development of small molecules with narrow or exclusive target selectivity, such as chemical probes and chemogenomic (CG) libraries, greatly diminishes this challenge, but non-specific effects caused by compound toxicity or interference with basic cellular functions still make it difficult to associate phenotypic readouts with molecular targets. Hence, each compound should ideally be comprehensively characterized regarding its effects on general cell functions. Here, we report an optimized live-cell multiplexed assay that classifies cells based on nuclear morphology, presenting an excellent indicator for cellular responses such as early apoptosis and necrosis. This basic readout, in combination with the detection of other general cell-damaging activities of small molecules such as changes in cytoskeletal morphology, cell cycle and mitochondrial health, provides a comprehensive time-dependent characterization of the effect of small molecules on cellular health in a single experiment. The developed high-content assay offers multi-dimensional comprehensive characterization that can be used to delineate generic effects regarding cell functions and cell viability, allowing an assessment of compound suitability for subsequent detailed phenotypic and mechanistic studies.
Genetic association studies have shown their usefulness in assessing the role of ion channels in human thermal pain perception. We used machine learning to construct a complex phenotype from pain thresholds to thermal stimuli and associate it with the genetic information derived from the next-generation sequencing (NGS) of 15 ion channel genes which are involved in thermal perception, including ASIC1, ASIC2, ASIC3, ASIC4, TRPA1, TRPC1, TRPM2, TRPM3, TRPM4, TRPM5, TRPM8, TRPV1, TRPV2, TRPV3, and TRPV4. Phenotypic information was complete in 82 subjects and NGS genotypes were available in 67 subjects. A network of artificial neurons, implemented as emergent self-organizing maps, discovered two clusters characterized by high or low pain thresholds for heat and cold pain. A total of 1071 variants were discovered in the 15 ion channel genes. After feature selection, 80 genetic variants were retained for an association analysis based on machine learning. The measured performance of machine learning-mediated phenotype assignment based on this genetic information resulted in an area under the receiver operating characteristic curve of 77.2%, justifying a phenotype classification based on the genetic information. A further item categorization finally resulted in 38 genetic variants that contributed most to the phenotype assignment. Most of them (10) belonged to the TRPV3 gene, followed by TRPM3 (6). Therefore, the analysis successfully identified the particular importance of TRPV3 and TRPM3 for an average pain phenotype defined by the sensitivity to moderate thermal stimuli.
Music listening has become a highly individualized activity with smartphones and music streaming services providing listeners with absolute freedom to listen to any kind of music in any situation. Until now, little has been written about the processes underlying the selection of music in daily life. The present study aimed to disentangle some of the complex processes among the listener, situation, and functions of music listening involved in music selection. Utilizing the experience sampling method, data were collected from 119 participants using a smartphone application. For 10 consecutive days, participants received 14 prompts using stratified-random sampling throughout the day and reported on their music-listening behavior. Statistical learning procedures on multilevel regression models and multilevel structural equation modeling were used to determine the most important predictors and analyze mediation processes between person, situation, functions of listening, and music selection. Results revealed that the features of music selected in daily life were predominantly determined by situational characteristics, whereas consistent individual differences were of minor importance. Functions of music listening were found to act as a mediator between characteristics of the situation and music-selection behavior. We further observed several significant random effects, which indicated that individuals differed in how situational variables affected their music selection behavior. Our findings suggest a need to shift the focus of music-listening research from individual differences to situational influences, including potential person-situation interactions.
In Lower Saxony, about 50% of forest sites are mapped at a scale of 1:25,000 using a relatively complex procedure. Each mapped unit consists of classes for the terrain water regime (WHZ; 43 classes), the nutrient supply (NZ; 16 classes), and the substrate and deposition conditions (SLZ; 105 classes). The aim of this work was to predict the WHZ and NZ classes of the Lower Saxony forest site mapping for unmapped areas. Two random forest models were calibrated on stratified random samples of the WHZ and NZ classes drawn from the mapping. The model classified about 77% of the test sample correctly for the WHZ. The F1 scores of the individual classes ranged from 50–95%. Incorrect predictions accumulated at transitions between neighbouring WHZ classes (e.g., the transition from valleys to slopes) and for WHZ classes with similar terrain properties but different gradations of water supply. Some model errors, however, apparently also stem from fuzziness within the underlying mapping. In addition, compared to the field mapping, the model predicts much finer-grained spatial patterns, which appear plausible given the underlying terrain but are not mapped at this level of detail in the field. About 66% of the NZ test dataset was classified correctly. Here, incorrect predictions occurred mainly between directly neighbouring nutrient-supply classes. The uncertainties point partly to less suitable covariates, but are possibly also to be expected due to temporal changes in the soil properties themselves and due to inaccuracies in the mapping, which prescribes few rules for assigning the nutrient class. Overall, we judge the models to be well suited for state-wide application. However, local calibration of the models for individual growth regions can be expected to increase model quality considerably.
The same may be achieved by aggregating similar classes into silviculturally relevant top-level groups.
The KMT2A (MLL) gene rearrangements (KMT2A-r) are associated with a diverse spectrum of acute leukemias. Although most KMT2A-r are restricted to nine partner genes, we have recently revealed that KMT2A-USP2 fusions are often missed during FISH screening of these genetic alterations. Therefore, complementary methods are important for appropriate detection of any KMT2A-r. Here we use a machine learning model to unravel the most appropriate markers for prediction of KMT2A-r in various types of acute leukemia. Random Forest and LightGBM classifiers were trained to predict KMT2A-r in patients with acute leukemia. Our results revealed a set of 20 genes capable of accurately estimating KMT2A-r. Overexpression of SKIDA1 (AUC: 0.839; CI: 0.799–0.879) and LAMP5 (AUC: 0.746; CI: 0.685–0.806) were better markers of KMT2A-r than CSPG4 (also named NG2; AUC: 0.722; CI: 0.659–0.784), regardless of the type of acute leukemia. Of importance, high expression levels of LAMP5 predicted the occurrence of all KMT2A-USP2 fusions. Also, we performed drug sensitivity analysis using IC50 data from 345 drugs available in the GDSC database to identify which ones could be used to treat KMT2A-r leukemia. We observed that KMT2A-r cell lines were more sensitive to 5-Fluorouracil (5FU), Gemcitabine (both antimetabolite chemotherapy drugs), WHI-P97 (JAK-3 inhibitor), Foretinib (MET/VEGFR inhibitor), SNX-2112 (Hsp90 inhibitor), AZD6482 (PI3Kβ inhibitor), KU-60019 (ATM kinase inhibitor), and Pevonedistat (NEDD8-activating enzyme (NAE) inhibitor). Moreover, IC50 data from analyses of ex-vivo drug sensitivity to small-molecule inhibitors revealed that Foretinib is a promising drug option for AML patients carrying FLT3 activating mutations. Thus, we provide novel and accurate options for the diagnostic screening and therapy of KMT2A-r leukemia, regardless of leukemia subtype.
Introduction: Affective disorders are a major global burden, with approximately 15% of people worldwide suffering from some form of affective disorder. In patients experiencing their first depressive episode, in most cases it cannot be distinguished whether this is due to bipolar disorder (BD) or major depressive disorder (MDD). Valid fluid biomarkers able to discriminate between the two disorders in a clinical setting are not yet available.
Material and Methods: Seventy depressed patients suffering from BD (bipolar I and II subtypes) and 42 patients with MDD were recruited, and blood samples were taken for proteomic analyses after 8 h of fasting. Proteomic profiles were analyzed using the Multiplex Immunoassay platform from Myriad Rules Based Medicine (Myriad RBM; Austin, Texas, USA). Human DiscoveryMAPTM was used to measure the concentration of various proteins, peptides, and small molecules. A multivariate predictive model was subsequently constructed to differentiate between BD and MDD.
Results: Based on the various proteomic profiles, the algorithm could discriminate depressed BD patients from MDD patients with an accuracy of 67%.
Discussion: The results of this preliminary study suggest that future discrimination between bipolar and unipolar depression in a single case could be possible, using predictive biomarker models based on blood proteomic profiling.
Bayesian inference is ubiquitous in science and widely used in biomedical research such as cell sorting or “omics” approaches, as well as in machine learning (ML), artificial neural networks, and “big data” applications. However, the calculation is not robust in regions of low evidence. In cases where one group has a lower mean but a higher variance than another group, new cases with larger values are implausibly assigned to the group with typically smaller values. An approach for a robust extension of Bayesian inference is proposed that proceeds in two main steps starting from the Bayesian posterior probabilities. First, cases with low evidence are labeled as of “uncertain” class membership. The boundary for low probabilities of class assignment (threshold ε) is calculated using a computed ABC analysis as a data-based technique for item categorization. This leaves a number of cases with uncertain classification (p < ε). Second, cases with uncertain class membership are relabeled based on the distance to neighboring classified cases, determined via Voronoi cells. The approach is demonstrated on biomedical data typically analyzed with Bayesian statistics, such as flow cytometric data sets or biomarkers used in medical diagnostics, where it increased the class assignment accuracy by 1–10% depending on the data set. The proposed extension of the Bayesian inference of class membership can be used to obtain robust and plausible class assignments even for data at the extremes of the distribution and/or for which evidence is weak.
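A minimal one-dimensional sketch of the two-step procedure follows, with invented class parameters and an arbitrary certainty threshold; in the paper, ε is derived by computed ABC analysis and relabeling uses Voronoi cells, for which a simple nearest-neighbour rule stands in here.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Gaussian probability density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior(x, params):
    """Posterior P(class | x) for equal priors over the given class params."""
    likes = {c: gauss_pdf(x, mu, sd) for c, (mu, sd) in params.items()}
    total = sum(likes.values())
    return {c: l / total for c, l in likes.items()}

# Class A: lower mean, higher variance; class B: higher mean, lower variance --
# the configuration in which plain Bayes can produce implausible assignments.
params = {"A": (0.0, 3.0), "B": (2.0, 0.5)}
epsilon = 0.8  # invented certainty threshold (the paper derives it by ABC analysis)

def classify(xs):
    labels = []
    for x in xs:
        post = posterior(x, params)
        best = max(post, key=post.get)
        # Step 1: flag cases whose best posterior falls below epsilon.
        labels.append(best if post[best] >= epsilon else "uncertain")
    # Step 2: relabel uncertain cases from the nearest confidently labeled case.
    certain = [(x, l) for x, l in zip(xs, labels) if l != "uncertain"]
    return [l if l != "uncertain"
            else min(certain, key=lambda cl: abs(cl[0] - x))[1]
            for x, l in zip(xs, labels)]

# x = 1.0 lies near the class boundary (posterior ~0.54) and gets relabeled
# from its nearest certain neighbour at x = 2.0.
result = classify([-4.0, 1.0, 2.0])
```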
Background: The categorization of individuals as normosmic, hyposmic, or anosmic from test results of odor threshold, discrimination, and identification may provide a limited view of the sense of smell. The purpose of this study was to expand the clinical diagnostic repertoire by including additional tests. Methods: A random cohort of n = 135 individuals (83 women and 52 men, aged 21 to 94 years) was tested for odor threshold, discrimination, and identification, plus a distance test, in which the odor of peanut butter is perceived, a sorting task of odor dilutions for phenylethyl alcohol and eugenol, a discrimination test for odorant enantiomers, a lateralization test with eucalyptol, a threshold assessment after 10 min of exposure to phenylethyl alcohol, and a questionnaire on the importance of olfaction. Unsupervised methods were used to detect structure in the olfaction-related data, followed by supervised feature selection methods from statistics and machine learning to identify relevant variables. Results: The structure in the olfaction-related data divided the cohort into two distinct clusters with n = 80 and 55 subjects. Odor threshold, discrimination, and identification did not play a relevant role for cluster assignment, which, on the other hand, depended on performance in the two odor dilution sorting tasks, from which cluster assignment was possible with a median 100-fold cross-validated balanced accuracy of 77–88%. Conclusions: The addition of an odor sorting task with the two proposed odor dilutions to the odor test battery expands the phenotype of olfaction and fits seamlessly into the sensory focus of standard test batteries.
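Balanced accuracy, the performance measure reported above, is the mean of per-class recall and is therefore robust to unequal cluster sizes such as n = 80 vs. 55. A minimal sketch with invented labels:

```python
def balanced_accuracy(true_labels, predicted):
    """Mean of per-class recall: each class contributes equally regardless of size."""
    classes = set(true_labels)
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(true_labels) if t == c]
        recalls.append(sum(predicted[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Invented toy labels with imbalanced classes; one class-2 case is missed.
true_labels = [1, 1, 1, 1, 2, 2]
predicted   = [1, 1, 1, 1, 2, 1]
bacc = balanced_accuracy(true_labels, predicted)
```

Plain accuracy on this toy example would be 5/6 ≈ 0.83, flattered by the large class; balanced accuracy averages the recalls of 1.0 and 0.5 instead.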