Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling)

Lötsch, Jörn; Malkusch, Sebastian; Ultsch, Alfred

doi:10.1371/journal.pone.0255838

Motivation: The size of today’s biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this step can be optimized to obtain samples that reflect the entire data set better than those obtained with the current standard method.

Results: Repeating the random sampling and comparing the distribution of each drawn sample with the distribution of the original data yielded a method for obtaining data subsets that reflect the entire data set better than the first randomly selected subsample, which is what the current standard method uses. Experiments on artificial and real biomedical data sets showed that reconstruction of the remaining data of the original data set from the downsampled data improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data and the number of samples drawn.

Conclusions: Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. Because distributional similarity is the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
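The selection idea summarized in the abstract (draw several class-proportional random subsamples and keep the one whose distribution is closest to the full data) can be illustrated with a minimal Python sketch. This is not the opdisDownsampling package or its interface. The function names, the parameters `fraction` and `n_trials`, and the use of the two-sample Kolmogorov-Smirnov statistic (taking the worst value over all variables) as the similarity measure are assumptions made for illustration only.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def ks_statistic(sample, reference):
    """Two-sample Kolmogorov-Smirnov statistic (assumed similarity measure):
    the largest gap between the two empirical distribution functions."""
    a, b = sorted(sample), sorted(reference)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def class_proportional_draw(labels, fraction, rng):
    """One uniform random draw of row indices that keeps class proportions."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    chosen = []
    for indices in by_class.values():
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def best_of_n_downsample(rows, labels, fraction, n_trials=50, seed=0):
    """Repeat the draw n_trials times and keep the draw whose worst
    per-variable KS distance to the full data is smallest."""
    rng = random.Random(seed)
    columns = list(zip(*rows))  # one tuple of values per variable
    best_idx, best_score = None, float("inf")
    for _ in range(n_trials):
        idx = class_proportional_draw(labels, fraction, rng)
        score = max(ks_statistic([col[i] for i in idx], col) for col in columns)
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Hypothetical toy data: two variables, two classes of unequal size.
data_rng = random.Random(1)
rows = [(data_rng.gauss(0, 1), data_rng.expovariate(1.0)) for _ in range(300)]
labels = ["a"] * 200 + ["b"] * 100

# n_trials=1 corresponds to taking only the first random subsample.
first_idx, first_score = best_of_n_downsample(rows, labels, 0.1, n_trials=1)
best_idx, best_score = best_of_n_downsample(rows, labels, 0.1, n_trials=50)
print(f"single draw: {first_score:.3f}, best of 50: {best_score:.3f}")
```

Because both calls use the same seed, the first of the 50 draws is identical to the single draw, so the selected subsample can never score worse than the single draw under this measure. The class labels are used only to keep proportions and the score uses only distributional distance, so no downstream analysis result enters the selection.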

Metadata
Author:	Jörn Lötsch, Sebastian Malkusch, Alfred Ultsch
URN:	urn:nbn:de:hebis:30:3-626903
DOI:	https://doi.org/10.1371/journal.pone.0255838
ISSN:	1932-6203
Parent Title (English):	PLOS ONE
Publisher:	PLOS
Place of publication:	San Francisco, California, US
Document Type:	Article
Language:	English
Date of Publication (online):	2021/08/05
Date of first Publication:	2021/08/05
Publishing Institution:	Universitätsbibliothek Johann Christian Senckenberg
Release Date:	2021/10/15
Tag:	Computer hardware; Computer software; Data processing; Data reduction; Mathematical functions; Principal component analysis; Probability density; Probability distribution
Volume:	16
Issue:	8, art. e0255838
Page Number:	16
First Page:	1
Last Page:	16
Note:	This work has been funded by the Landesoffensive zur Entwicklung wissenschaftlich-ökonomischer Exzellenz (LOEWE), LOEWE-Zentrum für Translationale Medizin und Pharmakologie (JL), in particular through the project “Reproducible cleaning of biomedical laboratory data using methods of visualization, error correction and transformation implemented as interactive R-notebooks” (JL). The funders had no role in the decision to publish or in the preparation of the manuscript.
HeBIS-PPN:	488131464
Institutes:	Medicine
Dewey Decimal Classification:	6 Technology, medicine, applied sciences / 61 Medicine and health / 610 Medicine and health
Collections:	University publications
Licence:	Creative Commons - Attribution 4.0

Open Access
