004 Data Processing; Computer Science
When requesting a web-based service, users often fail to configure the website’s privacy settings according to their own privacy preferences. Being overwhelmed by the number of options, lacking knowledge of the underlying technologies, or being unaware of their own privacy preferences are just some of the reasons why users struggle. Privacy setting prediction tools are particularly well suited to address these problems: they aim to lower the burden of configuring privacy settings in line with the owner’s actual preferences. To be in line with the increased demand for explainability and interpretability arising from regulatory obligations, such as the General Data Protection Regulation (GDPR) in Europe, this paper introduces an explainable model for default privacy setting prediction. Compared to previous work, we present improved feature selection, increased interpretability of each step in the model design, and enhanced evaluation metrics to better identify weaknesses in the model’s design before it goes into production. As a result, we aim to provide an explainable and transparent tool for default privacy setting prediction that users easily understand and are therefore more likely to use.
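The abstract does not specify the model class, so as a hedged sketch, a shallow decision tree illustrates how a default-setting predictor can stay interpretable end to end; all feature names and data below are hypothetical placeholders, not the paper’s.

```python
# Minimal sketch of an interpretable default-privacy-setting predictor.
# Features, labels, and data are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical user features: age, privacy-concern score, tech literacy.
X = rng.random((200, 3))
# Hypothetical default setting: 0 = restrictive, 1 = permissive.
y = (X[:, 1] < 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# A shallow tree keeps every prediction step human-readable.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
# Print the decision rules so users (and auditors) can inspect them.
print(export_text(clf, feature_names=["age", "concern", "literacy"]))
```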
Background: The categorization of individuals as normosmic, hyposmic, or anosmic from test results of odor threshold, discrimination, and identification may provide a limited view of the sense of smell. The purpose of this study was to expand the clinical diagnostic repertoire by including additional tests. Methods: A random cohort of n = 135 individuals (83 women and 52 men, aged 21 to 94 years) was tested for odor threshold, discrimination, and identification, plus a distance test in which the odor of peanut butter is perceived, a sorting task of odor dilutions for phenylethyl alcohol and eugenol, a discrimination test for odorant enantiomers, a lateralization test with eucalyptol, a threshold assessment after 10 min of exposure to phenylethyl alcohol, and a questionnaire on the importance of olfaction. Unsupervised methods were used to detect structure in the olfaction-related data, followed by supervised feature selection methods from statistics and machine learning to identify relevant variables. Results: The structure in the olfaction-related data divided the cohort into two distinct clusters of n = 80 and n = 55 subjects. Odor threshold, discrimination, and identification did not play a relevant role in cluster assignment, which instead depended on performance in the two odor dilution sorting tasks, from which cluster assignment was possible with a median 100-fold cross-validated balanced accuracy of 77–88%. Conclusions: The addition of an odor sorting task with the two proposed odor dilutions to the odor test battery expands the phenotype of olfaction and fits seamlessly into the sensory focus of standard test batteries.
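As a hedged illustration of this analysis pattern (cluster first, then ask whether a feature subset recovers the clusters), the sketch below uses simulated data, k-means standing in for whichever unsupervised method the study used, and 10-fold rather than 100-fold cross-validation for brevity.

```python
# Sketch: unsupervised clustering, then supervised classification of
# cluster membership from a feature subset, scored by cross-validated
# balanced accuracy. Data are simulated; the study's real variables
# and results are not reproduced.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(135, 6))        # 135 subjects, 6 olfactory scores
X[:80, 4:] += 1.5                    # two sorting-task scores separate groups

clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
# Can the two sorting-task scores alone recover the cluster assignment?
scores = cross_val_score(
    RandomForestClassifier(random_state=1), X[:, 4:], clusters,
    cv=10, scoring="balanced_accuracy",
)
print(f"median balanced accuracy: {np.median(scores):.2f}")
```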
The use of artificial intelligence (AI) systems in biomedical and clinical settings can disrupt the traditional doctor–patient relationship, which is based on trust and transparency in medical advice and therapeutic decisions. When the diagnosis or selection of a therapy is no longer made solely by the physician, but to a significant extent by a machine using algorithms, decisions become nontransparent. Skill learning is the most common application of machine learning algorithms in clinical decision making. These are a class of very general algorithms (artificial neural networks, classifiers, etc.) that are tuned on examples to optimize the classification of new, unseen cases. For such systems, it is pointless to ask for an explanation of an individual decision. A detailed understanding of the mathematical details of an AI algorithm may be possible for experts in statistics or computer science. However, when it comes to the fate of human beings, this “developer’s explanation” is not sufficient. The concept of explainable AI (XAI) as a solution to this problem is attracting increasing scientific and regulatory interest. This review focuses on the requirement that XAI must be able to explain in detail the decisions made by the AI to experts in the field.
Bacteria that are capable of organizing themselves as biofilms are an important public health issue. Knowledge discovery focusing on the ability to swarm and conquer the surroundings to form persistent colonies is therefore very important for microbiological research communities with a clinical perspective. Here, we demonstrate how a machine learning workflow can be used to create useful models that discriminate growth behaviors associated with distinct phenotypes. Based on basic gray-scale images, we provide a processing pipeline for binary image generation, making the workflow accessible for imaging data from a wide range of devices and conditions. The workflow includes a locally estimated regression model that applies readily to growth-related data and a shape analysis using identified principal components. Finally, we apply density-based spatial clustering of applications with noise (DBSCAN) to extract and analyze characteristic general features of colony shapes and areas to discriminate distinct Bacillus subtilis phenotypes. Our results suggest that differences in the ability to swarm and subsequently conquer the surrounding medium result in characteristic features. Differences in the time scales of colony-formation latency give insights into the ability to invade the surroundings and could therefore serve as a useful monitoring tool.
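As a hedged sketch of the clustering step, DBSCAN is applied below to two simple shape features (area and elongation) of simulated colonies; the paper’s actual features, images, and parameters are not reproduced.

```python
# Sketch: DBSCAN on shape features extracted from binary colony masks.
# Feature values are simulated for two hypothetical phenotypes.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical colonies: [area_mm2, elongation] per phenotype.
swarming = rng.normal([40.0, 3.0], [5.0, 0.4], size=(30, 2))
sessile = rng.normal([12.0, 1.1], [2.0, 0.1], size=(30, 2))
features = np.vstack([swarming, sessile])

# Standardize so both features contribute comparably to distances.
Xs = StandardScaler().fit_transform(features)
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(Xs)  # -1 marks noise
print("cluster labels:", np.unique(labels, return_counts=True))
```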
Advanced machine learning has achieved extraordinary success in recent years. For “active” operational risk management that goes beyond ex post analysis of measured data, machine learning could provide help beyond the regime of traditional statistical analysis when it comes to the “known unknown” or even the “unknown unknown.” While machine learning has been tested successfully in the regime of the “known,” heuristics typically provide better results for active operational risk management (in the sense of forecasting). However, precursors in existing data can give machine learning a chance to provide early warnings even for the regime of the “unknown unknown.”
The state-of-the-art pattern recognition method in machine learning (deep convolutional neural networks) is used to identify the equation of state (EoS) employed in relativistic hydrodynamic simulations of heavy ion collisions. High-level correlations of particle spectra in transverse momentum and azimuthal angle learned by the network act as an effective EoS-meter in deciphering the nature of the phase transition in QCD. The EoS-meter is model independent and insensitive to other simulation inputs, including the initial conditions and shear viscosity of the hydrodynamic simulations. Through this study we demonstrate that there is a traceable encoding of the dynamical information from the phase structure that survives the evolution and exists in the final snapshot of heavy ion collisions, and that one can exclusively and effectively decode this information from the highly complex final output with machine learning when traditional methods fail. Besides the deep neural network, the performance of traditional machine learning classifiers is also provided.
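As a hedged sketch of such an EoS-meter, the PyTorch model below classifies a 2D spectrum binned in transverse momentum and azimuthal angle into two EoS classes; the architecture and input size are illustrative, not those of the paper.

```python
# Minimal sketch of a CNN "EoS-meter": a binary classifier over 2D
# particle spectra (p_T bins x phi bins). Illustrative only.
import torch
import torch.nn as nn

class EoSMeter(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)  # crossover vs. first-order EoS

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One hypothetical event: a 24 (p_T bins) x 24 (phi bins) spectrum.
spectrum = torch.rand(1, 1, 24, 24)
logits = EoSMeter()(spectrum)
print(logits.shape)  # torch.Size([1, 2])
```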
When we browse via WiFi on our laptop or mobile phone, we receive data over a noisy channel. The received message may differ from the one that was originally sent. Luckily, it is often possible to reconstruct the original message, but it may take a lot of time: decoding the received message is a complex problem, NP-hard to be exact. As we continue browsing, new information is sent to us at high frequency, so if lags are to be avoided and since memory is finite, there is not much time left for decoding. Coding theory tackles this problem by creating models of the channels we use to communicate and tailoring codes to the channel properties. A well-known family of codes are low-density parity-check (LDPC) codes, which are widely used in standards like WiFi and DVB-T2. In practical settings, the complexity of decoding a received message can be heavily reduced by using LDPC codes and approximative decoding algorithms. This thesis lays out the basic construction of LDPC codes and their decoding using the sum-product algorithm. On this basis, a neural network to improve decoding is introduced: the sum-product algorithm is transformed into a neural network decoder. This approach was first presented by Nachmani et al. and treated in detail by Navneet Agrawal in 2017. To find out how machine learning can improve the codes, the bit error rates of the trained neural network decoder are compared with those of the classic sum-product algorithm. Experiments with static and dynamic training datasets of diverse sizes, various signal-to-noise ratios, and both a feed-forward and a recurrent architecture show how to tune the neural network decoder even further. Results of the experiments are used to verify statements made in Agrawal’s work. In addition, corrections and improvements in the area of metrics are presented. An implementation of the neural network will be made publicly available to facilitate access for others.
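As a hedged sketch of the classical baseline, the snippet below runs sum-product (belief propagation) decoding on a toy parity-check matrix over a BPSK/AWGN channel; real LDPC matrices from WiFi or DVB-T2 are far larger, and the thesis’s neural decoder is not reproduced here.

```python
# Sum-product decoding sketch for a tiny LDPC-style code.
# The toy parity-check matrix H is illustrative only.
import numpy as np

H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]])  # 3 checks x 6 bits

def sum_product_decode(llr, H, iters=20):
    m, n = H.shape
    msg = np.tile(llr, (m, 1)) * H               # variable-to-check messages
    for _ in range(iters):
        # Check-node update (tanh rule).
        t = np.tanh(np.clip(msg, -20, 20) / 2)
        c2v = np.zeros_like(msg)
        for i in range(m):
            idx = np.flatnonzero(H[i])
            for j in idx:
                prod = np.prod(t[i, idx[idx != j]])
                c2v[i, j] = 2 * np.arctanh(np.clip(prod, -0.999999, 0.999999))
        # Variable-node update and tentative hard decision.
        total = llr + c2v.sum(axis=0)
        hard = (total < 0).astype(int)
        if not np.any(H @ hard % 2):             # all parity checks satisfied
            return hard
        msg = (np.tile(total, (m, 1)) - c2v) * H # extrinsic messages only
    return hard

# Transmit the all-zeros codeword over AWGN (BPSK: bit 0 -> +1).
rng = np.random.default_rng(3)
sigma = 0.5
received = 1 + rng.normal(0, sigma, size=6)
llr = 2 * received / sigma**2                    # channel LLRs
print(sum_product_decode(llr, H))
```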
Bayesian inference is ubiquitous in science and widely used in biomedical research such as cell sorting or “omics” approaches, as well as in machine learning (ML), artificial neural networks, and “big data” applications. However, the calculation is not robust in regions of low evidence. In cases where one group has a lower mean but a higher variance than another group, new cases with larger values are implausibly assigned to the group with typically smaller values. An approach for a robust extension of Bayesian inference is proposed that proceeds in two main steps starting from the Bayesian posterior probabilities. First, cases with low evidence are labeled as “uncertain” class membership. The boundary for low probabilities of class assignment (threshold ε) is calculated using a computed ABC analysis as a data-based technique for item categorization. This leaves a number of cases with uncertain classification (p < ε). Second, cases with uncertain class membership are relabeled based on the distance to neighboring classified cases based on Voronoi cells. The approach is demonstrated on biomedical data typically analyzed with Bayesian statistics, such as flow cytometric data sets or biomarkers used in medical diagnostics, where it increased the class assignment accuracy by 1–10% depending on the data set. The proposed extension of the Bayesian inference of class membership can be used to obtain robust and plausible class assignments even for data at the extremes of the distribution and/or for which evidence is weak.
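A hedged sketch of the two-step scheme follows: a naive Bayes posterior, a fixed threshold ε standing in for the paper’s computed ABC analysis, and nearest-neighbor relabeling as a simple stand-in for the Voronoi-cell step (data are simulated to mimic the lower-mean/higher-variance situation described above).

```python
# Sketch: (1) flag cases whose posterior class probability falls below a
# threshold eps as "uncertain"; (2) relabel them by the nearest
# confidently classified case. eps is fixed here; the paper derives it
# from a computed ABC analysis.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
# Group 0: lower mean, higher variance; group 1: higher mean, lower variance.
X = np.concatenate([rng.normal(0, 3, 200), rng.normal(4, 1, 200)])[:, None]
y = np.array([0] * 200 + [1] * 200)

nb = GaussianNB().fit(X, y)
post = nb.predict_proba(X).max(axis=1)   # probability of the assigned class
labels = nb.predict(X)

eps = 0.95                               # fixed threshold (assumption)
uncertain = post < eps
knn = KNeighborsClassifier(n_neighbors=1).fit(X[~uncertain], labels[~uncertain])
labels[uncertain] = knn.predict(X[uncertain])
print(f"{uncertain.sum()} uncertain cases relabeled")
```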
Analysis of machine learning prediction quality for automated subgroups within the MIMIC III dataset
(2023)
The motivation for this master’s thesis is to explore the potential of predictive data analytics in the field of medicine. To this end, the MIMIC-III dataset offers an extensive foundation for the construction of prediction models, including Random Forest, XGBoost, and deep learning networks. These models were implemented to forecast the mortality of 2,655 stroke patients.
The first part of the thesis involved conducting a comprehensive data analysis of the filtered MIMIC-III dataset.
Subsequently, the effectiveness and fairness of the predictive models were evaluated. Although the performance of the developed models did not match that reported in related research, their potential became evident: the results demonstrated promising capabilities and highlighted the effectiveness of the applied methodologies. Moreover, feature relevance within the XGBoost model was examined to increase model explainability.
Finally, relevant subgroups were identified to perform a comparative analysis of prediction performance across these subgroups. While this approach can be regarded as a valuable methodology, it was not possible to investigate the underlying reasons for potential unfairness across clusters: within the test data, not enough instances remained per subgroup for further fairness or feature-relevance analysis.
In conclusion, the implementation of an alternative use case with a higher patient count is recommended.
The code for this analysis is made available via a GitHub repository and includes a frontend to visualize the results.
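As a hedged sketch of the prediction setup, the snippet below trains a gradient-boosted classifier (scikit-learn’s implementation standing in for XGBoost) and prints feature importances for explainability; MIMIC-III itself requires credentialed access, so the data and all feature names here are simulated placeholders.

```python
# Sketch: gradient-boosted mortality classifier with feature importances.
# Simulated data; feature names are hypothetical, not MIMIC-III columns.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 2655                                   # stroke-patient count from the thesis
feature_names = ["age", "heart_rate", "glucose", "gcs"]  # hypothetical
X = rng.normal(size=(n, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 1, n) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)
model = GradientBoostingClassifier(random_state=5).fit(X_tr, y_tr)
print(f"AUC: {roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]):.2f}")
for name, imp in zip(feature_names, model.feature_importances_):
    print(f"{name}: {imp:.2f}")           # crude explainability signal
```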