A biomedical case study showing that tuning random forests can fundamentally change the interpretation of supervised data structure exploration aimed at knowledge discovery
- Knowledge discovery in biomedical data using supervised methods assumes that the data contain structure relevant to the class structure if a classifier can be trained to assign cases to the correct class better than by guessing. In this setting, acceptance or rejection of a scientific hypothesis may depend critically on the ability to classify cases better than chance, even though high classification performance is not the primary goal. Random forests are often chosen for knowledge-discovery tasks because they are considered a powerful classifier that does not require sophisticated data transformation or hyperparameter tuning and can be regarded as a reference classifier for tabular numerical data. Here, we report a case in which random forests with the default hyperparameter settings of the standard R and Python implementations failed, which would have led to rejection of the hypothesis that the data contained structure relevant to the class structure. After tuning the hyperparameters, classification performance increased from 56% to 65% balanced accuracy in R, and from 55% to 67% balanced accuracy in Python. More importantly, the 95% confidence intervals of the tuned versions lay entirely above the value of 50% that characterizes guessing-level classification. Thus, tuning provided the desired evidence that the data structure supported the class structure of the data set. In this case, tuning made more than a quantitative difference in the form of slightly better classification accuracy; it fundamentally changed the interpretation of the data set. Such a change is especially likely when classification performance is low and a small improvement lifts the balanced accuracy above the 50% expected from guessing.
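The decision rule described in the abstract (is the 95% confidence interval of the balanced accuracy entirely above 50%?) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the percentile bootstrap over test-set predictions, the function names `balanced_accuracy`, `bootstrap_ci` and `exceeds_chance`, and the defaults `n_boot=2000` and `seed=0` are all assumptions made for this example, since the abstract does not state how the intervals were obtained.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; 0.5 is guessing level for two classes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for balanced accuracy (assumed scheme)."""
    rng = random.Random(seed)
    n = len(y_true)
    n_classes = len(set(y_true))
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        if len(set(t)) < n_classes:
            continue  # skip resamples in which a class is missing
        p = [y_pred[i] for i in idx]
        stats.append(balanced_accuracy(t, p))
    stats.sort()
    lower = stats[int(round(alpha / 2 * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


def exceeds_chance(ci, chance=0.5):
    """True only if the whole interval lies above the guessing level."""
    return ci[0] > chance


if __name__ == "__main__":
    # Purely synthetic labels and predictions, for illustration only.
    y_true = [0] * 40 + [1] * 40
    y_pred = [0] * 28 + [1] * 12 + [1] * 26 + [0] * 14
    ci = bootstrap_ci(y_true, y_pred)
    print(balanced_accuracy(y_true, y_pred), ci, exceeds_chance(ci))
```

Under this rule, a point estimate above 50% is not sufficient on its own: the hypothesis of class-relevant structure is supported only when the lower bound of the interval also exceeds 50%, which is the distinction the abstract draws between the default and the tuned forests.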
Authors: | Jörn Lötsch, Benjamin Mayer |
---|---|
URN: | urn:nbn:de:hebis:30:3-755644 |
DOI: | https://doi.org/10.3390/biomedinformatics2040034 |
ISSN: | 2673-7426 |
Parent title (English): | BioMedInformatics |
Publisher: | MDPI |
Place of publication: | Basel |
Document type: | Scientific article |
Language: | English |
Date of publication (online): | 2022-10-18 |
Date of first publication: | 2022-10-18 |
Release date: | 2023-09-11 |
Keywords / Tags: | artificial intelligence; data science; digital medicine; machine-learning |
Volume: | 2.2022 |
Issue: | 4 |
Number of pages: | 9 |
First page: | 544 |
Last page: | 552 |
HeBIS-PPN: | 51312019X |
Institutes: | Medicine |
DDC classification: | 0 Computer science, information and general works / 00 Computer science, knowledge and systems / 004 Data processing; computer science |
5 Science and mathematics / 57 Life sciences; biology / 570 Life sciences; biology | |
6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health | |
Collections: | University publications |
License: | Creative Commons - Attribution 4.0 |