High throughput computing infrastructure for the ALICE EPN online processing
- A Large Ion Collider Experiment (ALICE) is a high-energy physics experiment, designed to study heavy ion collisions at the European Organization for Nuclear Research (CERN)Large Hadron Collider (LHC). ALICE is built to study the fundamental properties of matter as it existed shortly after the big bang. This requires reading out millions of sensors with high frequency, enabling high statistics for physics analysis, resulting in a considerable computing demand concerning network throughput and processing power. With the ALICE Run 3 upgrade [14], requirements for a High Throughput Computing (HTC) online processing cluster increased significantly, due to more than an order of magnitude more data than in Run 2, resulting in a processing input rate of up to 900 GB/s. Online (real-time) event reconstruction allows for the compression of the data stream to 130 GB/s, which is stored on disk for physics analysis. This thesis presents the implementation of the ALICE Event Processing Node (EPN) compute farm, to cope with the Run 3 online computing challenges. Building a Data Centre tailored to ALICE requirements for the Run 3 and Run 4 EPN farm. Providing the operational conditions for a dynamic compute environment of a High Performance Computing (HPC) cluster, with significant load changes in a short time span, when starting or stopping a data-taking run. EPN servers provide the required computing resources for online reconstruction and data compression. The farm includes network connectivity towards First Level Processors (FLPs), requiring reliable throughput of 900 GB/s between FLPs and EPNs and connectivity from the internal InfiniBand network to the CERN Exabyte Object Storage (EOS) Ethernet network, with more than 100 GB/s. The results of operating the EPN computing infrastructure during the first year of Run 3 LHC collisions are described in the context of the ALICE experiment. The EPN farm was delivering the expected performance for ALICE data-taking. Data Centre environmental conditions remained stable during the last more than two years, in particular during starting and stopping runs, which include significant changes in IT load. Several unforeseen external circumstances lead to increasing demands for the Online Offline System (O2). Higher data rates than anticipated required network performance to exceed the initial design specifications, for the throughput between FLPs and EPNs. In particular, the high throughput from an internal EPN InfiniBand network towards the storage Ethernet network was one of the challenges to overcome.
Author: | Johannes LehrbachORCiDGND |
---|---|
URN: | urn:nbn:de:hebis:30:3-864477 |
DOI: | https://doi.org/10.21248/gups.86447 |
Place of publication: | Frankfurt am Main |
Referee: | Volker LindenstruthORCiD, Udo KebschullORCiDGND |
Document Type: | Doctoral Thesis |
Language: | English |
Date of Publication (online): | 2024/07/29 |
Year of first Publication: | 2023 |
Publishing Institution: | Universitätsbibliothek Johann Christian Senckenberg |
Granting Institution: | Johann Wolfgang Goethe-Universität |
Date of final exam: | 2024/07/05 |
Release Date: | 2024/07/29 |
Tag: | ALICE; CERN; High Throughput Computing / HTC |
Page Number: | 141 |
HeBIS-PPN: | 520202562 |
Institutes: | Informatik und Mathematik / Informatik |
Dewey Decimal Classification: | 0 Informatik, Informationswissenschaft, allgemeine Werke / 00 Informatik, Wissen, Systeme / 004 Datenverarbeitung; Informatik |
Sammlungen: | Universitätspublikationen |
Licence (German): | ![]() |