Cluster regularization via a hierarchical feature regression (2024)

The hierarchical feature regression (HFR) is a novel graph-based regularized regression estimator, which mobilizes insights from the domains of machine learning and graph theory to estimate robust parameters for a linear regression. The estimator constructs a supervised feature graph that decomposes parameters along its edges, adjusting first for common variation and successively incorporating idiosyncratic patterns into the fitting process. The graph structure has the effect of shrinking parameters towards group targets, where the extent of shrinkage is governed by a hyperparameter, and group compositions as well as shrinkage targets are determined endogenously. The method offers rich resources for the visual exploration of the latent effect structure in the data, and demonstrates good predictive accuracy and versatility when compared to a panel of commonly used regularization techniques across a range of empirical and simulated regression tasks.

Learning from experts: energy efficiency in residential buildings (2023)

Billio, Monica ; Casarin, Roberto ; Costola, Michele ; Veggente, Veronica

Measuring and reducing energy consumption constitutes a crucial concern in public policies aimed at mitigating global warming. The real estate sector faces the challenge of enhancing building efficiency, where insights from experts play a pivotal role in the evaluation process. This research employs a machine learning approach to analyze expert opinions, seeking to extract the key determinants influencing potential residential building efficiency and establishing an efficient prediction framework. The study leverages open Energy Performance Certificate databases from two countries with distinct latitudes, namely the UK and Italy, to investigate whether enhancing energy efficiency necessitates different intervention approaches. The findings reveal the existence of non-linear relationships between efficiency and building characteristics, which cannot be captured by conventional linear modeling frameworks. By offering insights into the determinants of residential building efficiency, this study provides guidance to policymakers and stakeholders in formulating effective and sustainable strategies for energy efficiency improvement.

Entity matching with similarity encoding: a supervised learning recommendation framework for linking (big) data (2023)

Karapanagiotis, Pantelis ; Liebald, Marius

In this study, we introduce a novel entity matching (EM) framework. It combines state-of-the-art EM approaches based on Artificial Neural Networks (ANN) with a new similarity encoding derived from matching techniques that are prevalent in finance and economics. Our framework is on par with or outperforms alternative end-to-end frameworks in standard benchmark cases. Because similarity encoding is constructed using (edit) distances instead of semantic similarities, it avoids out-of-vocabulary problems when matching dirty data. We highlight this property by applying an EM application to dirty financial firm-level data extracted from historical archives.
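As a rough illustration of why distance-based similarities sidestep out-of-vocabulary problems, the sketch below compares two strings character by character, so a misspelled firm name needs no entry in any vocabulary. It is not the authors' encoding: the function names `levenshtein` and `similarity`, the rescaling to [0, 1], and the sample strings are assumptions introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings.

    The rescaling by the longer string's length is an illustrative choice.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Hypothetical dirty record versus its clean counterpart.
    print(similarity("Deutsche Bnak AG", "Deutsche Bank AG"))
```

A score close to 1 flags the two records as likely referring to the same entity even though "Bnak" is not a word any semantic model would know.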

Few temporally distributed brain connectivity states predict human cognitive abilities (2023)

Wehrheim, Maren H. ; Faskowitz, Joshua ; Sporns, Olaf ; Fiebach, Christian ; Kaschube, Matthias ; Hilger, Kirsten

Highlights:
• Brain connectivity states identified by cofluctuation strength.
• CMEP as new method to robustly predict human traits from brain imaging data.
• Network-identifying connectivity ‘events’ are not predictive of cognitive ability.
• Sixteen temporally independent fMRI time frames allow for significant prediction.
• Neuroimaging-based assessment of cognitive ability requires sufficient scan lengths.

Abstract: Human functional brain connectivity can be temporally decomposed into states of high and low cofluctuation, defined as coactivation of brain regions over time. Rare states of particularly high cofluctuation have been shown to reflect fundamentals of intrinsic functional network architecture and to be highly subject-specific. However, it is unclear whether such network-defining states also contribute to individual variations in cognitive abilities – which strongly rely on the interactions among distributed brain regions. By introducing CMEP, a new eigenvector-based prediction framework, we show that as few as 16 temporally separated time frames (< 1.5% of 10 min resting-state fMRI) can significantly predict individual differences in intelligence (N = 263, p < .001). Against previous expectations, individuals' network-defining time frames of particularly high cofluctuation do not predict intelligence. Multiple functional brain networks contribute to the prediction, and all results replicate in an independent sample (N = 831). Our results suggest that although fundamentals of person-specific functional connectomes can be derived from few time frames of highest connectivity, temporally distributed information is necessary to extract information about cognitive abilities. This information is not restricted to specific connectivity states, like network-defining high-cofluctuation states, but rather reflected across the entire length of the brain connectivity time series.
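To make "cofluctuation strength" more concrete, the following sketch scores each time frame by how strongly pairs of regions deviate from their means together. It assumes one common formulation (z-score each regional time series, multiply pairs frame by frame, take the root sum of squares over all pairs); the abstract does not spell out the exact computation, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def zscore(series):
    """Standardize one regional time series (assumes it is not constant)."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]


def cofluctuation_amplitude(regions):
    """Return one amplitude per time frame.

    regions: list of equally long time series, one per brain region.
    """
    z = [zscore(r) for r in regions]
    pairs = list(combinations(range(len(z)), 2))
    amplitude = []
    for t in range(len(z[0])):
        # Frame-wise product of two z-scored series: large when both
        # regions deviate strongly from their means at the same time.
        products = [z[i][t] * z[j][t] for i, j in pairs]
        amplitude.append(sqrt(sum(p * p for p in products)))
    return amplitude


def top_frames(regions, k):
    """Indices of the k frames with the highest cofluctuation amplitude."""
    amp = cofluctuation_amplitude(regions)
    return sorted(range(len(amp)), key=lambda t: amp[t], reverse=True)[:k]


if __name__ == "__main__":
    toy = [[1, 2, 9, 2, 1], [2, 1, 8, 1, 2], [1, 1, 7, 2, 2]]
    print(top_frames(toy, 1))
```

In the toy data all three regions peak in the third frame, so that frame is ranked as the high-cofluctuation state; selecting frames from across the whole ranking rather than only the top would correspond to the temporally distributed sampling the abstract argues for.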

Comparative assessment of automated algorithms for the separation of one-dimensional Gaussian mixtures (2022)

Lötsch, Jörn ; Malkusch, Sebastian ; Ultsch, Alfred

Motivation: Gaussian mixture models (GMMs) are probabilistic models commonly used in biomedical research to detect subgroup structures in data sets with one-dimensional information. Reliable model parameterization requires that the number of modes, i.e., states of the generating process, is known. However, this is rarely the case for empirically measured biomedical data. Several implementations are available that estimate GMM parameters differently. This work aims to provide a comparative evaluation of automated GMM fitting methods. Results and conclusions: The performance of commonly used algorithms for automatic parameterization and mode number determination was compared with respect to reproducing the ground truth of generated data derived from multiple normal distributions. Four main variants of Gaussian mode number detection algorithms and five variants of GMM parameter estimation methods were tested in a combinatory scenario. The combination of best performing mode number determination algorithms and GMM parameter estimation methods was then tested on artificial and real-life data sets known to display a GMM structure. None of the tested methods correctly determined the underlying data structure consistently. The likelihood ratio test had the best performance in identifying the mode number associated with the best GMM fit of the data distribution, while the Markov chain Monte Carlo (MCMC) algorithm was best for GMM parameter estimation. The combination of these two methods for mode number determination and GMM parameter estimation was consistently among the best and overall outperformed the available implementations. Implementation: An automated tool for the detection of GMM based structures in (biomedical) datasets was created based on the present results and made freely available in the R library “opGMMassessment” at https://cran.r-project.org/package=opGMMassessment.

Financing sustainable entrepreneurship: ESG measurement, valuation, and performance (2022)

Mansouri, Sasan ; Momtaz, Paul P.

Sustainability orientation has a positive effect on startups' initial valuation and a negative effect on their post-funding financial performance. All else equal, improving sustainability orientation by one standard deviation increases startups' funding amount by 28 % and decreases investors' abnormal returns per post-funding year by 16 %. The results hold in a large sample of blockchain-based crowdfunding campaigns, also known as Initial Coin Offerings (ICOs) or token offerings. A key contribution is a machine-learning approach to assess startups' Environment, Society and Governance (ESG) properties from textual data, which we make readily available at www.SustainableEntrepreneurship.org.

The cost of fairness in AI: evidence from e-commerce (2021)

Zahn, Moritz von ; Feuerriegel, Stefan ; Kühl, Niklas

Contemporary information systems make widespread use of artificial intelligence (AI). While AI offers various benefits, it can also be subject to systematic errors, whereby people from certain groups (defined by gender, age, or other sensitive attributes) experience disparate outcomes. In many AI applications, disparate outcomes confront businesses and organizations with legal and reputational risks. To address these, technologies for so-called “AI fairness” have been developed, by which AI is adapted such that mathematical constraints for fairness are fulfilled. However, the financial costs of AI fairness are unclear. Therefore, the authors develop AI fairness for a real-world use case from e-commerce, where coupons are allocated according to clickstream sessions. In their setting, the authors find that AI fairness successfully adheres to fairness requirements while reducing the overall prediction performance only slightly. However, they find that AI fairness also results in an increase in financial cost. In this way, the paper’s findings contribute to designing information systems on the basis of AI fairness.

Prediction of COVID-19 deterioration in high-risk patients at diagnosis: an early warning score for advanced COVID-19 developed by machine learning (2021)

Purpose: While more advanced COVID-19 necessitates medical interventions and hospitalization, patients with mild COVID-19 do not require this. Identifying patients at risk of progressing to advanced COVID-19 might guide treatment decisions, particularly for better prioritizing patients in need of hospitalization. Methods: We developed a machine learning-based predictor for deriving a clinical score identifying patients with asymptomatic/mild COVID-19 at risk of progressing to advanced COVID-19. Clinical data from SARS-CoV-2 positive patients from the multicenter Lean European Open Survey on SARS-CoV-2 Infected Patients (LEOSS) were used for discovery (2020-03-16 to 2020-07-14) and validation (2020-07-15 to 2021-02-16). Results: The LEOSS dataset contains 473 baseline patient parameters measured at the first patient contact. After training the predictor model on a training dataset comprising 1233 patients, 20 of the 473 parameters were selected for the predictor model. From the predictor model, we delineated a composite predictive score (SACOV-19, Score for the prediction of an Advanced stage of COVID-19) with eleven variables. In the validation cohort (n = 2264 patients), we observed good prediction performance with an area under the curve (AUC) of 0.73 ± 0.01. Besides temperature, age, body mass index and smoking habit, variables indicating pulmonary involvement (respiration rate, oxygen saturation, dyspnea), inflammation (CRP, LDH, lymphocyte counts), and acute kidney injury at diagnosis were identified. For better interpretability, the predictor was translated into a web interface. Conclusion: We present a machine learning-based predictor model and a clinical score for identifying patients at risk of developing advanced COVID-19.
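For readers unfamiliar with the AUC figure, it can be read as the probability that a randomly chosen patient who deteriorated received a higher score than a randomly chosen patient who did not. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy scores and labels are hypothetical and are not taken from the LEOSS data.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    scores: predicted risk, higher meaning more likely to deteriorate.
    labels: 1 for patients who progressed, 0 for those who did not.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: four patients, two of whom progressed.
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which places the reported 0.73 between the two.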

Generation and validation of a formula to calculate hemoglobin loss on a cohort of healthy adults subjected to controlled blood loss (2021)

Hahn-Klimroth, Maximilian Grischa ; Loick, Philipp ; Kim-Wanner, Soo-Zin ; Seifried, Erhard ; Bönig, Halvard-Björn

Background: The ability to approximate intra-operative hemoglobin loss with reasonable precision and linearity is a prerequisite for determination of a relevant surgical outcome parameter: This information enables comparison of surgical procedures between different techniques, surgeons or hospitals, and supports anticipation of transfusion needs. Different formulas have been proposed, but none of them were validated for accuracy, precision and linearity against a cohort with precisely measured hemoglobin loss and, possibly for that reason, none has established itself as the gold standard. We sought to identify the minimal dataset needed to generate reasonably precise and accurate hemoglobin loss prediction tools and to derive and validate an estimation formula. Methods: Routinely available clinical and laboratory data from a cohort of 401 healthy individuals with controlled hemoglobin loss between 29 and 233 g were extracted from medical charts. Supervised learning algorithms were applied to identify a minimal data set and to generate and validate a formula for calculation of hemoglobin loss. Results: Of the classical supervised learning algorithms applied, the linear and Ridge regression models performed at least as well as the more complex models. Because linear regression is the most straightforward to analyze and check for robustness, we proceeded with it. Weight, height, sex and hemoglobin concentration before and on the morning after the intervention were sufficient to generate a formula for estimation of hemoglobin loss. The resulting model yields an outstanding R² of 53.2% with similar precision throughout the entire range of volumes or donor sizes, thereby meaningfully outperforming previously proposed medical models. Conclusions: The resulting formula will allow objective benchmarking of surgical blood loss, enabling informed decision making as to the need for pre-operative type-and-cross only vs. reservation of packed red cell units, depending on a patient’s anemia tolerance, and thus contributing to resource management.

Machine learning in information systems - a bibliographic review and open research issues (2021)

Abdel-Karim, Benjamin M. ; Pfeuffer, Nicolas ; Hinz, Oliver

Artificial Intelligence (AI) and Machine Learning (ML) are currently hot topics in industry and business practice, while management-oriented research disciplines seem reluctant to adopt these sophisticated data analytics methods as research instruments. Even the Information Systems (IS) discipline with its close connections to Computer Science seems to be conservative when conducting empirical research endeavors. To assess the magnitude of the problem and to understand its causes, we conducted a bibliographic review on publications in high-level IS journals. We reviewed 1,838 articles that matched corresponding keyword-queries in journals from the AIS senior scholar basket, Electronic Markets and Decision Support Systems (Ranked B). In addition, we conducted a survey among IS researchers (N = 110). Based on the findings from our sample we evaluate different potential causes that could explain why ML methods are rather underrepresented in top-tier journals and discuss how the IS discipline could successfully incorporate ML methods in research undertakings.


23 search hits