A biomedical case study showing that tuning random forests can fundamentally change the interpretation of supervised data structure exploration aimed at knowledge discovery

Lötsch, Jörn; Mayer, Benjamin

doi:10.3390/biomedinformatics2040034


Knowledge discovery in biomedical data using supervised methods assumes that the data contain structure relevant to the class structure if a classifier can be trained to assign cases to the correct class better than by guessing. In this setting, acceptance or rejection of a scientific hypothesis may depend critically on the ability to classify cases better than chance, without high classification performance being the primary goal. Random forests are often chosen for knowledge-discovery tasks because they are considered a powerful classifier that requires neither sophisticated data transformation nor hyperparameter tuning and can be regarded as a reference classifier for tabular numerical data. Here, we report a case where the failure of random forests using the default hyperparameter settings in the standard implementations of R and Python would have led to the rejection of the hypothesis that the data contained structure relevant to the class structure. After tuning the hyperparameters, classification performance increased from 56% to 65% balanced accuracy in R, and from 55% to 67% balanced accuracy in Python. More importantly, the 95% confidence intervals of the tuned versions lay entirely above 50%, the value that characterizes guessing-level classification. Thus, tuning provided the desired evidence that the data structure supported the class structure of the data set. In this case, tuning did more than yield a quantitative gain in the form of slightly better classification accuracy; it fundamentally changed the interpretation of the data set. Such a change is especially likely when classification performance is low and a small improvement lifts the balanced accuracy above the 50% expected from guessing.
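
The decision rule described in the abstract, namely that class-relevant structure is accepted only if the 95% confidence interval of the balanced accuracy lies entirely above 50%, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the percentile bootstrap used to obtain the interval, and the synthetic labels and predictions are not taken from the article, whose own procedure for estimating the confidence intervals is not described on this page.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; 0.5 is guessing level for two classes."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method, not the article's)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in sample]
        if len(set(yt)) < 2:
            continue  # skip resamples in which one class is missing
        yp = [y_pred[i] for i in sample]
        scores.append(balanced_accuracy(yt, yp))
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[int((1 - alpha / 2) * len(scores)) - 1]
    return lower, upper


def supports_class_structure(y_true, y_pred, chance=0.5, **kwargs):
    """True only if the whole confidence interval lies above chance level."""
    lower, _ = bootstrap_ci(y_true, y_pred, **kwargs)
    return lower > chance


# Hypothetical data: 100 cases per class, not the article's data set.
y_true = [0] * 100 + [1] * 100
# Weak classifier: 52 of 100 correct in each class (balanced accuracy 0.52).
weak_pred = [0] * 52 + [1] * 48 + [1] * 52 + [0] * 48
# Stronger classifier: 70 of 100 correct in each class (balanced accuracy 0.70).
strong_pred = [0] * 70 + [1] * 30 + [1] * 70 + [0] * 30

print(supports_class_structure(y_true, weak_pred))
print(supports_class_structure(y_true, strong_pred))
```

In this sketch both classifiers score above 50% as a point estimate, but only the stronger one has an interval that excludes guessing, which is the distinction the abstract draws between the default and the tuned random forests.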

Metadata
Authors:	Jörn Lötsch, Benjamin Mayer
URN:	urn:nbn:de:hebis:30:3-755644
DOI:	https://doi.org/10.3390/biomedinformatics2040034
ISSN:	2673-7426
Parent title (English):	BioMedInformatics
Publisher:	MDPI
Place of publication:	Basel
Document type:	Scientific article
Language:	English
Date of publication (online):	18 October 2022
Date of first publication:	18 October 2022
Release date:	11 September 2023
Free keyword / tag:	artificial intelligence; data science; digital medicine; machine-learning
Volume:	2.2022
Issue:	4
Number of pages:	9
First page:	544
Last page:	552
HeBIS-PPN:	51312019X
Institutes:	Medicine
DDC classification:	0 Computer science, information and general works / 00 Computer science, knowledge and systems / 004 Data processing; computer science
	5 Science and mathematics / 57 Life sciences; biology / 570 Life sciences; biology
	6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health
Collections:	University publications
License:	Creative Commons - Attribution 4.0

Open Access
