A Large Ion Collider Experiment (ALICE) is a high-energy physics experiment designed to study heavy-ion collisions at the European Organization for Nuclear Research (CERN) Large Hadron Collider (LHC). ALICE is built to study the fundamental properties of matter as it existed shortly after the Big Bang. This requires reading out millions of sensors at high frequency to obtain high statistics for physics analysis, resulting in considerable computing demands in network throughput and processing power. With the ALICE Run 3 upgrade [14], the requirements for a High Throughput Computing (HTC) online processing cluster increased significantly: more than an order of magnitude more data than in Run 2 results in a processing input rate of up to 900 GB/s. Online (real-time) event reconstruction compresses the data stream to 130 GB/s, which is stored on disk for physics analysis.
This thesis presents the implementation of the ALICE Event Processing Node (EPN) compute farm, built to cope with the Run 3 online computing challenges. It covers building a data centre tailored to ALICE requirements for the Run 3 and Run 4 EPN farm, and providing the operational conditions for a dynamic High Performance Computing (HPC) environment that undergoes significant load changes within a short time span when a data-taking run starts or stops. EPN servers provide the computing resources required for online reconstruction and data compression. The farm includes network connectivity towards the First Level Processors (FLPs), requiring a reliable throughput of 900 GB/s between FLPs and EPNs, and connectivity from the internal InfiniBand network to the CERN Exabyte Object Storage (EOS) Ethernet network at more than 100 GB/s.
The results of operating the EPN computing infrastructure during the first year of Run 3 LHC collisions are described in the context of the ALICE experiment. The EPN farm delivered the expected performance for ALICE data-taking. Data centre environmental conditions have remained stable for more than two years, in particular when runs start or stop, which entails significant changes in IT load. Several unforeseen external circumstances led to increasing demands on the Online-Offline System (O2). Higher data rates than anticipated required the network throughput between FLPs and EPNs to exceed the initial design specifications. In particular, the high throughput from the internal EPN InfiniBand network towards the storage Ethernet network was one of the challenges to overcome.
This bachelor thesis developed HospLetExtractor, a pipeline for the automatic processing of scanned hospital letters. Hospital letters can contain valuable information about potential adverse drug reactions and useful case information relevant to pharmacovigilance. To make this data accessible, the pipeline consists of image pre-processing, optical character recognition, and post-processing. Pre-processing deskews the images, removes lines and rectangles, reduces noise, and applies super-resolution. For post-processing, a spell-checking system was set up, including a newly built word frequency dictionary for German medical terms based on a corpus of German medical texts created for this purpose. Furthermore, classical and deep learning models for the classification of hospital letters were compared, with the transformer-based models performing best. A new gold standard was created to train and test the models. By making these medical documents accessible for automatic analysis, this work aims to contribute to expanding the scope of pharmacovigilance.
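The frequency-dictionary spell correction described above can be sketched as a minimal Norvig-style corrector. This is a hedged illustration only: the thesis' actual dictionary, corpus, and correction model are not reproduced here, and the names `edits1` and `correct` are hypothetical.

```python
from collections import Counter
import string

# Alphabet including German umlauts, since the dictionary targets German medical terms
LETTERS = string.ascii_lowercase + "äöüß"

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def correct(word, freq):
    """Return the most frequent dictionary word within one edit of `word`,
    or `word` itself if it is known or no candidate exists."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get) if candidates else word

# Toy frequency dictionary standing in for the German medical corpus
freq = Counter({"diagnose": 50, "therapie": 40, "dosis": 30})
print(correct("diagnse", freq))  # OCR dropped a letter -> "diagnose"
```

A real OCR post-processing stage would apply this per token, but the candidate-by-frequency lookup shown here is the core mechanism of such a dictionary-based corrector.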
Natural Language Processing (NLP) for big data requires an efficient and sophisticated infrastructure to complete tasks both fast and correctly. Providing an intuitive and lightweight interaction with a framework that abstracts and simplifies complex tasks assists in reaching this goal. This bachelor thesis extends the NLP framework Docker Unified UIMA Interface (DUUI) with an API and a web-based graphical user interface to control and manage pipelines for the automated analysis of large quantities of natural language. The extension aims to reduce the entry barrier into the field and to accelerate the creation and management of pipelines according to UIMA standards. Pipelines can be executed in the browser or via the web API directly and then monitored at the document level. The evaluation of usability and user experience indicates that the implementation benefits the framework by making its usage more user-friendly, lightweight, and intuitive, while also making the management of pipelines more efficient.
Graph4Med: a web application and a graph database for visualizing and analyzing medical databases
(2022)
Background: Medical databases normally contain large amounts of data in a variety of forms. Although they grant significant insights into diagnosis and treatment, implementing data exploration in current medical databases is challenging, since these are often based on a relational schema and cannot be used to easily extract information for cohort analysis and visualization. As a consequence, valuable information regarding cohort distribution or patient similarity may be missed. With the rapid advancement of biomedical technologies, new forms of data from methods such as Next Generation Sequencing (NGS) or chromosome microarray (array CGH) are constantly being generated; hence it can be expected that the amount and complexity of medical data will rise and bring relational database systems to their limits.
Description: We present Graph4Med, a web application that relies on a graph database obtained by transforming a relational database. Graph4Med provides a straightforward visualization and analysis of a selected patient cohort. Our use case is a database of pediatric Acute Lymphoblastic Leukemia (ALL). Alongside routine patient health records, it also contains results from the latest technologies, such as NGS data. We developed a suitable graph data schema to convert the relational data into a graph data structure and store it in Neo4j. We used NeoDash to build a dashboard for querying and displaying patient cohort analyses. This way our tool (1) quickly displays an overview of patient cohort information, such as distributions of gender, age, mutations (fusions), and diagnoses; (2) provides mutation (fusion) based similarity search and display in a maneuverable graph; (3) generates an interactive graph of any selected patient and facilitates the identification of interesting patterns among patients.
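The relational-to-graph conversion at the heart of such a tool can be illustrated with a minimal sketch: each row becomes a node, each foreign key becomes a relationship. All table, column, and relationship names here are hypothetical stand-ins; Graph4Med's actual schema and its Neo4j storage layer are not reproduced.

```python
def rows_to_graph(patients, diagnoses):
    """Map relational rows to a property-graph structure:
    rows become labeled nodes, foreign keys become typed edges."""
    nodes, edges = [], []
    for p in patients:
        nodes.append({"id": f"patient-{p['id']}", "label": "Patient",
                      "props": {"sex": p["sex"], "age": p["age"]}})
    for d in diagnoses:
        nodes.append({"id": f"diagnosis-{d['id']}", "label": "Diagnosis",
                      "props": {"name": d["name"]}})
        # The patient_id foreign key turns into an explicit relationship
        edges.append({"from": f"patient-{d['patient_id']}",
                      "to": f"diagnosis-{d['id']}",
                      "type": "HAS_DIAGNOSIS"})
    return nodes, edges

patients = [{"id": 1, "sex": "f", "age": 7}]
diagnoses = [{"id": 10, "patient_id": 1, "name": "ALL"}]
nodes, edges = rows_to_graph(patients, diagnoses)
```

In the graph form, cohort questions ("which patients share this fusion?") become short path traversals instead of multi-table joins, which is the advantage the description above points to.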
Conclusion: We demonstrate the feasibility and advantages of a graph database for storing and querying medical databases. Our dashboard allows fast and interactive analysis and visualization of complex medical data. It is especially useful for patient similarity search based on mutations (fusions), of which vast amounts of data have been generated by NGS in recent years. It can uncover relationships and patterns in patient cohorts that are normally hard to grasp. Expanding Graph4Med to more medical databases will bring novel insights into diagnostics and research.
Interacting with the environment to process sensory information, generate perceptions, and shape behavior engages neural networks in brain areas with highly varied representations, ranging from unimodal sensory cortices to higher-order association areas. Recent work suggests a much greater degree of commonality across areas, with distributed and modular networks present in both sensory and non-sensory areas during early development. However, it is currently unknown whether this initially common modular structure undergoes an equally common developmental trajectory, or whether such a modular functional organization persists in some areas—such as primary visual cortex—but not others. Here we examine the development of network organization across diverse cortical regions in ferrets of both sexes using in vivo widefield calcium imaging of spontaneous activity. We find that all regions examined, including both primary sensory cortices (visual, auditory, and somatosensory—V1, A1, and S1, respectively) and higher-order association areas (prefrontal and posterior parietal cortices), exhibit a largely similar pattern of changes over an approximately three-week developmental period spanning eye opening and the transition to predominantly externally-driven sensory activity. We find that both a modular functional organization and millimeter-scale correlated networks remain present across all cortical areas examined. These networks weakened over development in most cortical areas, but strengthened in V1. Overall, the conserved maintenance of modular organization across different cortical areas suggests a common pathway of network refinement, and suggests that a modular organization—known to encode functional representations in visual areas—may be similarly engaged in highly diverse brain areas.
Significance: Different areas of the mature brain encode vastly different representations of the world. This study shows that a modular functional organization, in which nearby neurons participate in similar functional networks, is shared across different brain areas not only during early development but also as the brain matures, where it remains a shared feature that shapes neural activity. The largely conserved trajectory of developmental changes across brain areas suggests that similar circuit mechanisms may drive this maturation. This implies that the large literature on developing cortical circuits, which is largely focused on sensory areas, may also apply more broadly, and that perturbations during development that impinge on any such shared mechanisms may produce deficits that extend across multiple brain systems.
Background: Prostate cancer is a major health concern in aging men. Paralleling an aging society, prostate cancer prevalence increases, emphasizing the need for efficient diagnostic algorithms.
Methods: Retrospectively, 106 prostate tissue samples from 48 patients (mean age, 66 ± 6.6 years) were included in the study. Patients suffered from prostate cancer (n = 38) or benign prostatic hyperplasia (n = 10) and were treated with radical prostatectomy or Holmium laser enucleation of the prostate, respectively. We constructed tissue microarrays (TMAs) comprising representative malignant (n = 38) and benign (n = 68) tissue cores. TMAs were processed to histological slides, stained, digitized, and assessed for the applicability of machine learning strategies and open-source tools in the diagnosis of prostate cancer. We applied the software QuPath to extract features for shape, stain intensity, and texture of TMA cores for three stainings: H&E, ERG, and PIN-4. Three machine learning algorithms, neural networks (NN), support vector machines (SVM), and random forests (RF), were trained and cross-validated with 100 Monte Carlo random splits into a 70% training set and a 30% test set. We determined AUC values for single color channels, with and without optimization of hyperparameters by exhaustive grid search. We applied recursive feature elimination to feature sets of multiple color transforms.
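The Monte Carlo cross-validation scheme described in the methods (100 independent random 70/30 splits) can be sketched as follows. This is a minimal stand-in for the study's setup; the function and parameter names are illustrative, and the actual feature extraction and classifiers are not reproduced.

```python
import random

def monte_carlo_splits(n_samples, n_splits=100, train_frac=0.7, seed=0):
    """Yield (train_idx, test_idx) pairs for Monte Carlo cross-validation:
    each split is an independent random partition of the sample indices,
    so (unlike k-fold CV) test sets may overlap across splits."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(train_frac * n_samples)
        # Slices copy, so the yielded lists are safe to keep
        yield idx[:cut], idx[cut:]

# 106 tissue cores, as in the study: 74 training / 32 test samples per split
splits = list(monte_carlo_splits(106))
```

A per-split AUC would then be computed on each test partition and averaged over the 100 splits, which is how the ± ranges in the results below would arise under this scheme.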
Results: Mean AUC was above 0.80. PIN-4 stainings yielded a higher AUC than H&E and ERG. For PIN-4 with the color transform saturation, NN, RF, and SVM revealed AUCs of 0.93 ± 0.04, 0.91 ± 0.06, and 0.92 ± 0.05, respectively. Optimization of hyperparameters improved the AUC only slightly, by 0.01. Feature selection resulted in no AUC increase for H&E but in an increase of 0.02–0.06 for ERG and PIN-4.
Conclusions: Automated pipelines may be able to discriminate with high accuracy between malignant and benign tissue. We found PIN-4 staining best suited for classification. Further bioinformatic analysis of larger data sets would be crucial to evaluate the reliability of automated classification methods for clinical practice and to assess the potential discrimination of cancer aggressiveness, paving the way to automated precision medicine.
Unified probabilistic deep continual learning through generative replay and open set recognition
(2022)
Modern deep neural networks are well known to be brittle in the face of unknown data instances and recognition of the latter remains a challenge. Although it is inevitable for continual-learning systems to encounter such unseen concepts, the corresponding literature appears to nonetheless focus primarily on alleviating catastrophic interference with learned representations. In this work, we introduce a probabilistic approach that connects these perspectives based on variational inference in a single deep autoencoder model. Specifically, we propose to bound the approximate posterior by fitting regions of high density on the basis of correctly classified data points. These bounds are shown to serve a dual purpose: unseen unknown out-of-distribution data can be distinguished from already trained known tasks towards robust application. Simultaneously, to retain already acquired knowledge, a generative replay process can be narrowed to strictly in-distribution samples, in order to significantly alleviate catastrophic interference.
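The idea of narrowing generative replay to strictly in-distribution samples can be sketched abstractly. This is a hedged illustration only: the work bounds the approximate posterior of a deep variational autoencoder, which is represented here by a generic acceptance callback, and all names (`replay_batch`, `is_in_distribution`) are hypothetical.

```python
import random

def replay_batch(generate, is_in_distribution, n, max_tries=1000):
    """Draw up to n replay samples from a generative model, keeping only
    those the open-set check accepts as in-distribution. Rejected samples
    are discarded, so replay never rehearses out-of-distribution data."""
    batch, tries = [], 0
    while len(batch) < n and tries < max_tries:
        x = generate()
        if is_in_distribution(x):
            batch.append(x)
        tries += 1
    return batch

# Toy stand-ins: a 1-D Gaussian "generator" and a density-region check
random.seed(0)
generate = lambda: random.gauss(0.0, 1.0)
in_dist = lambda x: abs(x) < 2.0  # stand-in for the posterior-bound test
batch = replay_batch(generate, in_dist, 10)
```

The accepted batch would then be mixed with the new task's data during training, so that knowledge retention relies only on samples the model itself recognizes as known.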
Assessing communicative accommodation in the context of large language models : a semiotic approach
(2023)
Recently, significant strides have been made in the ability of transformer-based chatbots to hold natural conversations. However, despite a growing societal and scientific relevance, there are few frameworks that systematically derive what it means for a chatbot conversation to be natural. The present work approaches this question through the phenomenon of communicative accommodation/interactive alignment. While existing research suggests that humans adapt communicatively to technologies, the aim of this work is to explore the accommodation of AI chatbots to an interlocutor. Its research interest is twofold: Firstly, the structural ability of the transformer architecture to support accommodative behavior is assessed using a frame constructed in accordance with existing accommodation theories. This results in hypotheses to be tested empirically. Secondly, since effective accommodation produces the same outcomes regardless of technical implementation, a behavioral experiment is proposed. Existing quantifications of accommodation are reconciled, extended, and modified to apply them to nonhuman interlocutors. Thus, a measurement scheme is suggested which evaluates textual data from text-only, double-blind interactions between chatbots and humans, chatbots and chatbots, and humans and humans. Using the generated human-to-human convergence data as a reference, the degree of artificial accommodation can be evaluated. Accommodation, as a central facet of artificial interactivity, can thus be evaluated directly against its theoretical paradigm, i.e. human interaction. If subsequent examinations show that chatbots effectively do not accommodate, a new form of algorithmic bias may emerge from the aggregate accommodation towards chatbots but not towards humans. Thus, existing hegemonic semantics could be cemented through chatbot learning. Meanwhile, the ability to effectively accommodate would render chatbots vastly more susceptible to misuse.
Detailed feedback on exercises helps learners become proficient but is time-consuming for educators and, thus, hardly scalable. This manuscript evaluates how well Generative Artificial Intelligence (AI) provides automated feedback on complex multimodal exercises requiring coding, statistics, and economic reasoning. Besides providing this technology through an easily accessible web application, this article evaluates the technology’s performance by comparing the quantitative feedback (i.e., points achieved) from Generative AI models with human expert feedback for 4,349 solutions to marketing analytics exercises. The results show that automated feedback produced by Generative AI (GPT-4) provides almost unbiased evaluations while correlating highly with (r = 0.94) and deviating only 6 % from human evaluations. GPT-4 performs best among seven Generative AI models, albeit at the highest cost. Comparing the models’ performance with costs shows that GPT-4, Mistral Large, Claude 3 Opus, and Gemini 1.0 Pro dominate three other Generative AI models (Claude 3 Sonnet, GPT-3.5, and Gemini 1.5 Pro). Expert assessment of the qualitative feedback (i.e., the AI’s textual response) indicates that it is mostly correct, sufficient, and appropriate for learners. A survey of marketing analytics learners shows that they highly recommend the app and its Generative AI feedback. An advantage of the app is its subject-agnosticism—it does not require any subject- or exercise-specific training. Thus, it is immediately usable for new exercises in marketing analytics and other subjects.