Refine
Document Type
- Article (19)
- Working Paper (4)
- Bachelor Thesis (1)
- Conference Proceeding (1)
- Master's Thesis (1)
- Preprint (1)
- Report (1)
Has Fulltext
- yes (28)
Is part of the Bibliography
- no (28)
Keywords
- machine learning (28) (remove)
Institute
- Medizin (9)
- Wirtschaftswissenschaften (6)
- Center for Financial Studies (CFS) (4)
- Biochemie, Chemie und Pharmazie (3)
- Buchmann Institut für Molekulare Lebenswissenschaften (BMLS) (2)
- Frankfurt Institute for Advanced Studies (FIAS) (2)
- Psychologie (2)
- Biochemie und Chemie (1)
- Biodiversität und Klima Forschungszentrum (BiK-F) (1)
- Biowissenschaften (1)
Publicly available compound and bioactivity databases provide an essential basis for data-driven applications in life-science research and drug design. By analyzing several bioactivity repositories, we discovered differences in compound and target coverage advocating the combined use of data from multiple sources. Using data from ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs, we assembled a consensus dataset focusing on small molecules with bioactivity on human macromolecular targets. This allowed an improved coverage of compound space and targets, and an automated comparison and curation of structural and bioactivity data to reveal potentially erroneous entries and increase confidence. The consensus dataset comprised of more than 1.1 million compounds with over 10.9 million bioactivity data points with annotations on assay type and bioactivity confidence, providing a useful ensemble for computational applications in drug design and chemogenomics.
Based on accumulating evidence of a role of lipid signaling in many physiological and pathophysiological processes including psychiatric diseases, the present data driven analysis was designed to gather information needed to develop a prospective biomarker, using a targeted lipidomics approach covering different lipid mediators. Using unsupervised methods of data structure detection, implemented as hierarchal clustering, emergent self-organizing maps of neuronal networks, and principal component analysis, a cluster structure was found in the input data space comprising plasma concentrations of d = 35 different lipid-markers of various classes acquired in n = 94 subjects with the clinical diagnoses depression, bipolar disorder, ADHD, dementia, or in healthy controls. The structure separated patients with dementia from the other clinical groups, indicating that dementia is associated with a distinct lipid mediator plasma concentrations pattern possibly providing a basis for a future biomarker. This hypothesis was subsequently assessed using supervised machine-learning methods, implemented as random forests or principal component analysis followed by computed ABC analysis used for feature selection, and as random forests, k-nearest neighbors, support vector machines, multilayer perceptron, and naïve Bayesian classifiers to estimate whether the selected lipid mediators provide sufficient information that the diagnosis of dementia can be established at a higher accuracy than by guessing. This succeeded using a set of d = 7 markers comprising GluCerC16:0, Cer24:0, Cer20:0, Cer16:0, Cer24:1, C16 sphinganine, and LacCerC16:0, at an accuracy of 77%. By contrast, using random lipid markers reduced the diagnostic accuracy to values of 65% or less, whereas training the algorithms with randomly permuted data was followed by complete failure to diagnose dementia, emphasizing that the selected lipid mediators were display a particular pattern in this disease possibly qualifying as biomarkers.
Bacteria that are capable of organizing themselves as biofilms are an important public health issue. Knowledge discovery focusing on the ability to swarm and conquer the surroundings to form persistent colonies is therefore very important for microbiological research communities that focus on a clinical perspective. Here, we demonstrate how a machine learning workflow can be used to create useful models that are capable of discriminating distinct associated growth behaviors along distinct phenotypes. Based on basic gray-scale images, we provide a processing pipeline for binary image generation, making the workflow accessible for imaging data from a wide range of devices and conditions. The workflow includes a locally estimated regression model that easily applies to growth-related data and a shape analysis using identified principal components. Finally, we apply a density-based clustering application with noise (DBSCAN) to extract and analyze characteristic, general features explained by colony shapes and areas to discriminate distinct Bacillus subtilis phenotypes. Our results suggest that the differences regarding their ability to swarm and subsequently conquer the medium that surrounds them result in characteristic features. The differences along the time scales of the distinct latency for the colony formation give insights into the ability to invade the surroundings and therefore could serve as a useful monitoring tool.
Advanced machine learning has achieved extraordinary success in recent years. “Active” operational risk beyond ex post analysis of measured-data machine learning could provide help beyond the regime of traditional statistical analysis when it comes to the “known unknown” or even the “unknown unknown.” While machine learning has been tested successfully in the regime of the “known,” heuristics typically provide better results for an active operational risk management (in the sense of forecasting). However, precursors in existing data can open a chance for machine learning to provide early warnings even for the regime of the “unknown unknown.”
Analysis of machine learning prediction quality for automated subgroups within the MIMIC III dataset
(2023)
The motivation for this master’s thesis is to explore the potential of predictive data analytics in the field of medicine. For this, the MIMIC-III dataset offers an extensive foundation for the construction of prediction models, including Random Forest, XGBOOST, and deep learning networks. These models were implemented to forecast the mortality of 2,655 stroke patients.
The first part of the thesis involved conducting a comprehensive data analysis of the filtered MIMIC-III dataset.
Subsequently, the effectiveness and fairness of the predictive models were evaluated. Although the performance levels of the developed models did not match those reported in related research, their potential became evident. The results obtained demonstrated promising capabilities and highlighted the effectiveness of the applied methodologies. Moreover, the feature relevance within the XGBOOST model was examined to increase model explainability.
Finally, relevant subgroups were identified to perform a comparative analysis of the prediction performance across these subgroups. While this approach can be regarded as a valuable methodology, it was not possible to investigate underlying reasons for potential unfairness across clusters. Inside the test data, not enough instances remained per subgroup for further fairness or feature relevance analysis.
In conclusion, the implementation of an alternative use case with a higher patient count is recommended.
The code for this analysis is made available via a GitHub repository and includes a frontend to visualize the results.
Biased auctioneers
(2022)
We construct a neural network algorithm that generates price predictions for art at auction, relying on both visual and non-visual object characteristics. We find that higher automated valuations relative to auction house pre-sale estimates are associated with substantially higher price-to-estimate ratios and lower buy-in rates, pointing to estimates’ informational inefficiency. The relative contribution of machine learning is higher for artists with less dispersed and lower average prices. Furthermore, we show that auctioneers’ prediction errors are persistent both at the artist and at the auction house level, and hence directly predictable themselves using information on past errors.
With the ongoing loss of global biodiversity, long-term recordings of species distribution patterns are increasingly becoming important to investigate the causes and consequences for their change. Therefore, the digitization of scientific literature, both modern and historical, has been attracting growing attention in recent years. To meet this growing demand the Specialised Information Service for Biodiversity Research (BIOfid) was launched in 2017 with the aim of increasing the availability and accessibility of biodiversity information. Closely tied to the research community the interdisciplinary BIOfid team is digitizing data sources of biodiversity related research and provides a modern and professional infrastructure for hosting and sharing them. As a pilot project, German publications on the distribution and ecology of vascular plants, birds, moths and butterflies covering the past 250 years are prioritized. Large parts of the text corpus defined in accordance with the needs of the relevant German research community have already been transferred to a machine-readable format and will be publicly accessible soon. Software tools for text mining, semantic annotation and analysis with respect to the current trends in machine learning are developed to maximize bioscientific data output through user-specific queries that can be created via the BIOfid web portal (https://www.biofid.de/). To boost knowledge discovery, specific ontologies focusing on morphological traits and taxonomy are being prepared and will continuously be extended to keep up with an ever-expanding volume of literature sources.
In this study, a portable electronic nose (E-nose) prototype is developed using metal oxide semiconductor (MOS) sensors to detect odors of different wines. Odor detection facilitates the distinction of wines with different properties, including areas of production, vintage years, fermentation processes, and varietals. Four popular machine learning algorithms—extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and backpropagation neural network (BPNN)—were used to build identification models for different classification tasks. Experimental results show that BPNN achieved the best performance, with accuracies of 94% and 92.5% in identifying production areas and varietals, respectively; and SVM achieved the best performance in identifying vintages and fermentation processes, with accuracies of 67.3% and 60.5%, respectively. Results demonstrate the effectiveness of the developed E-nose, which could be used to distinguish different wines based on their properties following selection of an optimal algorithm.
The state-of-the-art pattern recognition method in machine learning (deep convolution neural network) is used to identify the equation of state (EoS) employed in the relativistic hydrodynamic simulations of heavy ion collisions. High-level correlations of particle spectra in transverse momentum and azimuthal angle learned by the network act as an effective EoS-meter in deciphering the nature of the phase transition in QCD. The EoS-meter is model independent and insensitive to other simulation inputs including the initial conditions and shear viscosity for hydrodynamic simulations. Through this study we demonstrate that there is a traceable encoder of the dynamical information from the phase structure that survives the evolution and exists in the final snapshot of heavy ion collisions and one can exclusively and effectively decode these information from the highly complex final output with machine learning when traditional methods fail. Besides the deep neural network, the performance of traditional machine learning classifiers are also provided.
he most basic behavioural states of animals can be described as active or passive. While high-resolution observations of activity patterns can provide insights into the ecology of animal species, few methods are able to measure the activity of individuals of small taxa in their natural environment. We present a novel approach in which a combination of automatic radiotracking and machine learning is used to distinguish between active and passive behaviour in small vertebrates fitted with lightweight transmitters (<0.4 g).
We used a dataset containing >3 million signals from very-high-frequency (VHF) telemetry from two forest-dwelling bat species (Myotis bechsteinii [n = 52] and Nyctalus leisleri [n = 20]) to train and test a random forest model in assigning either active or passive behaviour to VHF-tagged individuals. The generalisability of the model was demonstrated by recording and classifying the behaviour of tagged birds and by simulating the effect of different activity levels with the help of humans carrying transmitters. The model successfully classified the activity states of bats as well as those of birds and humans, although the latter were not included in model training (F1 0.96–0.98).
We provide an ecological case-study demonstrating the potential of this automated monitoring tool. We used the trained models to compare differences in the daily activity patterns of two bat species. The analysis showed a pronounced bimodal activity distribution of N. leisleri over the course of the night while the night-time activity of M. bechsteinii was relatively constant. These results show that subtle differences in the timing of species' activity can be distinguished using our method.
Our approach can classify VHF-signal patterns into fundamental behavioural states with high precision and is applicable to different terrestrial and flying vertebrates. To encourage the broader use of our radiotracking method, we provide the trained random forest models together with an R package that includes all necessary data processing functionalities. In combination with state-of-the-art open-source automated radiotracking, this toolset can be used by the scientific community to investigate the activity patterns of small vertebrates with high temporal resolution, even in dense vegetation.