Refine
Document Type
- Doctoral Thesis (9) (remove)
Language
- English (9)
Has Fulltext
- yes (9)
Is part of the Bibliography
- no (9)
Keywords
- ALICE (9) (remove)
Institute
- Physik (4)
- Informatik und Mathematik (3)
- Informatik (2)
The ALICE High-Level-Trigger (HLT) is a large scale computing farm designed and constructed for the purpose of the realtime reconstruction of particle interactions (events) inside the ALICE detector. The reconstruction of such events is based on the raw data produced in collisions inside the ALICE at the Large Hadron Collider. The online reconstruction in the HLT allows the triggering on certain event topologies and a significant data reduction by applying compression algorithms. Moreover, it enables a real-time verification of the quality of the data.
To receive the raw data from the various sub-detectors of ALICE, the HLT is equipped with 226 custom built FPGA-based PCI-X cards, the H-RORCs. The H-RORC interfaces the detector readout electronics to the nodes of the HLT farm. In addition to the transfer of raw data, 108 H-RORCs host 216 Fast-Cluster-Finder (FCF) processors for the Time-Projection-Chamber (TPC). The TPC is the main tracking detector of ALICE and contributes with up to 16 GB/s to over 90% of the overall data volume. The FCF processor implements the first of two steps in the data reconstruction of the TPC. It calculates the space points and their properties from charge clouds (clusters) created by charged particles traversing the TPCs gas volume. Those space points are not only the base for the tracking algorithm, but also allow for a Huffman-based data compression, which reduces the data volume by a factor of 4 to 6.
The FCF processor is designed to cope with any incoming data rate up to the maximum bandwidth of the incoming optical link (160 MB/s) without creating back-pressure to the detectors readout electronics. A performance comparison with the software implementation of the algorithm shows a speedup factor of about 20 compared with one AMD Opteron 6172 Core @ 2.1 GHz, the CPU type used in the HLT during the LHC Run1 campaign. Comparison with an Intel E5-2690 Core @ 3.0 GHz, the CPU type used by the HLT for the LHC Run2 campaign, results in a speedup factor of 8.5. In total numbers, the 216 FCF processors provide the computing performance of 4255 AMD Opteron cores or 2203 Intel cores of the previously mentioned type. The performance of the reconstruction with respect to the physics analysis is equivalent or better than the official ALICE Offline clusterizer. Therefore, ALICE data taking was switched in 2011 to FCF cluster recording and compression only, discarding the raw data from the TPC. Due to the capability to compress the clusters, the recorded data volume could be increased by a factor of 4 to 6.
For the LHC Run3 campaign, starting in 2020, the FCF builds the foundation of the ALICE data taking and processing strategy. The raw data volume (before processing) of the upgraded TPC will exceed 3 TB/s. As a consequence, online processing of the raw data and compression of the results before it enters the online computing farms is an essential and crucial part of the computing model.
Within the scope of this thesis, the H-RORC card and the FCF processor were developed and built from scratch. It covers the conceptual design, the optimisation and implementation, as well as the verification. It is completed by performance benchmarks and experiences from real data taking.
Die vorliegende Arbeit beschäftigt sich mit der Charakterisierung des ALTRO Chips (ALICE TPC Readout), der ein integraler und wichtiger Bestandteil der Auslesekette des TPC (Time Projection Chamber) Detektors von ALICE (A Large Ion Collider Experiment) ist. ALICE ist ein Experiment am noch im Bau befindlichen LHC (Large Hadron Collider) am CERN mit der zentralen Ausrichtung, Schwerionenkollisionen zu untersuchen. Diese sind von besonderem Interesse, da durch sie ein experimenteller Zugriff zu dem QGP (Quark Gluon Plasma) existiert, dem einzigen vom Standardmodell vorhergesagten Phasenübergang, der unter Laborbedingungen erreichbar ist. Im Jahr 2004 wurden Messungen an einem Teststrahl am CERN PS (Proton Synchrotron) durchgeführt. Der Prototyp wurde voll mit FECs bestückt, was 5400 Kanälen entspricht und einer anderen Gasmixtur (Ne/N2/CO2 90%/5%/5%) befüllt. Für das optimale Leistungsverhalten der ALICE TPC muß der Digitalprozessor im ALTRO, bestehend aus vier Berechnungseinheiten, mit den passenden Werten konfiguriert werden. Der Datenfluss beginnt mit dem BCS1 (Baseline Correction and Subtraction 1) Modul, das systematische Störungen und die Grundlinie entfernt. Da der ALTRO kontinuierlich das anliegende Signal abtastet, entfernt es automatisch langsame Grundlinienveränderungen, die Beispielsweise durch Temperaturänderungen auftreten können. Gefolgt von dem TCF (Tail Cancellation Filter), der den Schweif des langsam fallenden, vom PASA generierten Signals entfernt. Um die nichtsystematischen Störungen der Grundlinie zu entfernen, folgt die BCS2 (Baseline Correction and Subtraction 2), die auf einer gleitenden Mittelwertsberechnung mit Ausschluß von Detektorsignalen über einen doppelten Schwellenwert basiert. Die finale Einheit für die Signalverarbeitung ist die ZSU (Zero Suppression Unit), die Meßpunkte unterhalb eines definierten Schwellwertes entfernt. Hier wird der weg beschrieben die TCF und BCS1 Parameter aus vorhandenen Detektordaten zu extrahieren. Während der Analyse der Daten von kosmischen Teilchen fiel bei Signalen mit hoher Amplitude (>700 ADC) eine zusätzliche Struktur in dem Schweif auf. Der Monitor wurde deswegen mit einem gleitenden Mittelwertfilter erweitert, worauf sich diese Struktur auch in kleineren Signalen (> 200 ADC) zeigte. Dieses Signal wird von Ionen erzeugt, die zur Kathode oder zu den Pads driften, bisher ist jedoch weder die Streuung der Elektronenlawine an der Anode, noch die Variationsbreite in den erzeugten Elektronlawinen verstanden oder gemessen worden. Eine erfolgreiche Messung, sowie Charakterisierung wird in dieser Arbeit beschrieben. Im Jahr 2005 im Sommer beginnt der Einbau der Gaskammern der TPC in ALICE, die Elektronik folgt am Ende dieses Jahres. Parallel hierzu wurde der Prototyp der TPC wieder in Betrieb genommen und im Frühling wird ein kompletter Sektor mit der Detektorelektronik ausgestattet. An diesen zwei Aufbauten wird die ALTRO Charakterisierung fortgeführt, verfeinert und komplettiert.
A new era in experimental nuclear physics has begun with the start-up of the Large Hadron Collider at CERN and its dedicated heavy-ion detector system ALICE. Measuring the highest energy density ever produced in nucleus-nucleus collisions, the detector has been designed to study the properties of the created hot and dense medium, assumed to be a Quark-Gluon Plasma.
Comprised of 18 high granularity sub-detectors, ALICE delivers data from a few million electronic channels of proton-proton and heavy-ion collisions.
The produced data volume can reach up to 26 GByte/s for central Pb–Pb
collisions at design luminosity of L = 1027 cm−2 s−1 , challenging not only the data storage, but also the physics analysis. A High-Level Trigger (HLT) has been built and commissioned to reduce that amount of data to a storable value prior to archiving with the means of data filtering and compression without the loss of physics information. Implemented as a large high performance compute cluster, the HLT is able to perform a full reconstruction of all events at the time of data-taking, which allows to trigger, based on the information of a complete event. Rare physics probes, with high transverse momentum, can be identified and selected to enhance the overall physics reach of the experiment.
The commissioning of the HLT is at the center of this thesis. Being deeply embedded in the ALICE data path and, therefore, interfacing all other ALICE subsystems, this commissioning imposed not only a major challenge, but also a massive coordination effort, which was completed with the first proton-proton collisions reconstructed by the HLT. Furthermore, this thesis is completed with the study and implementation of on-line high transverse momentum triggers.
Programmable hardware in the form of FPGAs found its place in various high energy physics experiments over the past few decades. These devices provide highly parallel and fully configurable data transport, data formatting, and data processing capabilities with custom interfaces, even in rigid or constrained environments. Additionally, FPGA functionalities and the number of their logic resources have grown exponentially in the last few years, making FPGAs more and more suitable for complex data processing tasks. ALICE is one of the four main experiments at the LHC and specialized in the study of heavy-ion collisions. The readout chain of the ALICE detectors makes use of FPGAs at various places. The Read-Out Receiver Cards (RORCs) are one example of FPGA-based readout hardware, building the interface between the custom detector electronics and the commercial server nodes in the data processing clusters of the Data Acquisition (DAQ) system as well as the High Level Trigger (HLT). These boards are implemented as server plug-in cards with serial optical links towards the detectors. Experimental data is received via more than 500 optical links, already partly pre-processed in the FPGAs, and pushed towards the host machines. Computer clusters consisting of a few hundred nodes collect, aggregate, compress, reconstruct, and prepare the experimental data for permanent storage and later analysis. With the end of the first LHC run period in 2012 and the start of Run 2 in 2015, the DAQ and HLT systems were renewed and several detector components were upgraded for higher data rates and event rates. Increased detector link rates and obsolete host interfaces rendered it impossible to reuse the previous RORCs in Run 2.
This thesis describes the development, integration, and maintenance of the next generation of RORCs for ALICE in Run 2. A custom hardware platform, initially developed as a joint effort between the ALICE DAQ and HLT groups in the course of this work, found its place in the Run 2 readout systems of the ALICE and ATLAS experiments. The hardware fulfills all experiment requirements, matches its target performance, and has been running stable in the production systems since the start of Run 2. Firmware and software developments for the hardware evaluation, the design of the board, the mass production hardware tests, as well as the operation of the final board in the HLT, were carried out as part of this work. 74 boards were integrated into the HLT hardware and software infrastructure, with various firmware and software developments, to provide the main experimental data input and output interface of the HLT for Run 2. The hardware cluster finder, an FPGA-based data pre-processing core from the previous generation of RORCs, was ported to the new hardware. It has been improved and extended to meet the experimental requirements throughout Run 2. The throughput of this firmware component could be doubled and the algorithm extended, providing an improved noise rejection and an increased overall mean data compression ratio compared to its previous implementation. The hardware cluster finder forms a crucial component in the HLT data reconstruction and compression scheme with a processing performance of one board equivalent to around ten server nodes for comparable processing steps in software.
The work on the firmware development, especially on the hardware cluster finder, once more demonstrated that developing and maintaining data processing algorithms with the common low-level hardware description methods is tedious and time-consuming. Therefore, a high-level synthesis (HLS) hardware description method applying dataflow computing at an algorithmic level to FPGAs was evaluated in this context. The hardware cluster finder served as an example of a typical data processing algorithm in a high energy physics readout application. The existing and highly optimized low-level implementation provided a reference for comparisons in terms of throughput and resource usage. The cluster finder algorithm could be implemented in the dataflow description with comparably little effort, providing fast development cycles, compact code and at, the same time, simplified extension and maintenance options. The performance results in terms of throughput and resource usage are comparable to the manual implementation. The dataflow environment proved to be highly valuable for design space explorations. An integration of the dataflow description into the HLT firmware and software infrastructure could be demonstrated as a proof of concept. A high-level hardware description could ease both the design space exploration, the initial development, the maintenance, and the extension of hardware algorithms for high energy physics readout applications.
Der Urknall vor ungefähr 13.8 Milliarden Jahren markiert die Entstehung des Universums. Die gesamte Energie und Materie war in einem Punkt konzentriert und expandiert seitdem kontinuierlich. Wenige Sekundenbruchteile nach dem Urknall war die Temperatur und Dichte dieser Materie extrem hoch und die erschaffenen Elementarteilchen, speziell Quarks und Gluonen, durchliefen einen Zustand den man als Quark-Gluon-Plasma (QGP) bezeichnet und innerhalb dessen die starke Wechselwirkung dominiert. Innerhalb dieses Plasmas können Quarks und Gluonen, welche sonst in Hadronen gebunden sind, sich frei bewegen. Die direkte Beobachtung des frühzeitlichen QGPs ist mit heutigen Mitteln nicht möglich. Allerdings ist es möglich die Dynamik und Kinematik innerhalb eines künstlich erzeugten QGPs zu erforschen und damit Rückschlüsse auf die Vorgänge während des Urknalls zu machen.
Um künstliche QGPs unter kontrollierten Bedingungen zu erzeugen, werden heutzutage ultrarelativistische Schwerionen zur Kollision gebracht. Der stärkste je gebaute Schwerionenbeschleuniger LHC befindet sich am Kernforschungzentrum CERN in der Nähe von Genf. Das ALICE Experiment, als eines der vier großen Experimente am LHC, wurde speziell gebaut um das QGP näher zu untersuchen. Vollständig ionisierte Bleikerne werden mit nahezu Lichtgeschwindigkeit in den Experimenten zur Kollision gebracht. Die deponierte Energie lässt die Temperatur der Quarks und Gluonen innerhalb der kollidierenden Nukleonen ansteigen bis eine kritische Temperatur überschritten wird und ein Phasenübergang in das QGP erfolgt. Im Laufe der Kollision kühlt das Medium ab und gelangt unter die kritische Temperatur. Nun werden aus den ehemals freien Quarks Hadronen gebildet. Diese Hadronen oder Zerfallsprodukte dieser Hadronen können daraufhin in die Detektoren des Experiments fliegen und werden dann dort gemessen.
Es gibt mehrere mögliche Observablen des QGP, die messbar mit dem ALICE Experiment sind. Die Observablen, die in dieser Arbeit detailliert untersucht werden, sind die invariante Masse und der Paartransversalimpuls eines Dielektrons. Ein Dielektron besteht aus einem Elektron und einem Positron, welche miteinander korreliert sind. Dielektronen sind ideale Sonden zur Vermessung des QGPs. Sie werden durch verschiedene Prozesse während allen Kollisionsphasen produziert, wie beispielsweise bei den initialen, harten Stößen der kollidierenden Nukleonen oder durch den elektromagnetischen Zerfall verschiedener Hadronen wie π0 und J/ψ. Zusätzlich strahlt das QGP Dielektronen abhängig von seiner Temperatur ab. Theoretisch erlaubt dies die direkte Temperaturmessung des QGPs. Ein weiterer Vorteil der Dielektronenmessung gegenüber der Messung von Hadronen liegt darin, dass Elektronen und Positronen keine Farbladungen tragen und somit auch nicht mit der dominierenden starken Wechselwirkung innerhalb des QGPs interagieren und somit unbeeinflusst Informationen über seine Dynamik liefern können.
In dieser vorliegenden Arbeit werden Dielektronenspektren als Funktion der invarianten Masse und des Paartransversalimpulses in Blei-Blei-Kollisionen mit einer Schwerpunktsenergie von √sNN = 5.02 TeV gemessen. Das erste Mal in Schwerionenkollisionen konnte an einem der großen LHC Experimente der minimale Transversalimpuls der gemessenen Elektronen und Positronen auf peT > 0.2 GeV/c minimiert werden. Dies gibt im Vergleich zu der publizierten Messung mit peT > 0.4 GeV/c die Möglichkeit auch sogenannte weiche Prozesse zu messen, erhöht aber auch den Komplexit ätsgrad der Messung durch massiv gesteigerten Untergrund. Zusätzlich ist die Messung zentralitäsabhängig durchgeführt. Zentralität ist ein Maß für den Abstand der beiden Bleikerne zum Zeitpunkt der Kollision. Je zentraler eine Kollision, desto größer ist die deponierte Energie und desto größer und heißer ist das erzeugte QGP und die daraus resultierenden Effekte.
Die gemessenen Dielektronenverteilungen werden mit dem erwarteten Beiträgen aus hadronischen Zerfällen verglichen. Die Messung ergibt, dass der Beitrag aus semileptonischen Zerfällen von Charmquarks gemessen im Vakuum, welcher mit der Anzahl der binären Nukleon-Nukleon-Kollisionen in Blei-Blei-Ereignissen hochskaliert ist, nicht das Dielektronenspektrum beschreibt. Eine Modifizierung des Beitrag gemäß des unabhängig gemessenen nuklearen Modifikationsfaktors für einzelne Elektronen aus Charm- und Beautyquarks verbessert die Beschreibung des Dielektronenspektrums. Zusätzlich wurde der Beitrag virtueller direkter Photonen abgeschätzt. Die gemessenen Werte sind vergleichbar mit vorangegangenen Messungen bei einer niedrigeren Schwerpunktsenergie. Ebenso ist es möglich in periphären Kollisionen einen Beitrag durch eine Quelle zu vermessen, die Dielektronen bei niedrigem Transversalimpuls pT,ee < 0.15 GeV/c aussendet.
A Large Ion Collider Experiment (ALICE) is a high-energy physics experiment, designed to study heavy ion collisions at the European Organization for Nuclear Research (CERN)Large Hadron Collider (LHC). ALICE is built to study the fundamental properties of matter as it existed shortly after the big bang. This requires reading out millions of sensors with high frequency, enabling high statistics for physics analysis, resulting in a considerable computing demand concerning network throughput and processing power. With the ALICE Run 3 upgrade [14], requirements for a High Throughput Computing
(HTC) online processing cluster increased significantly, due to more than an order of magnitude more data than in Run 2, resulting in a processing input rate of up to 900 GB/s. Online (real-time) event reconstruction allows for the compression of the data stream to 130 GB/s, which is stored on disk for physics analysis.
This thesis presents the implementation of the ALICE Event Processing Node (EPN) compute farm, to cope with the Run 3 online computing challenges. Building a Data Centre tailored to ALICE requirements for the Run 3 and Run 4 EPN farm. Providing the operational conditions for a dynamic compute environment of a High Performance Computing (HPC) cluster, with significant load changes in a short time span, when starting or stopping a data-taking run. EPN servers provide the required computing resources for online reconstruction and data compression. The farm includes network connectivity towards First Level Processors (FLPs), requiring reliable throughput of 900 GB/s between FLPs and EPNs and connectivity from the internal InfiniBand network to the CERN Exabyte Object Storage (EOS) Ethernet network, with more than 100 GB/s.
The results of operating the EPN computing infrastructure during the first year of Run 3 LHC collisions are described in the context of the ALICE experiment. The EPN farm was delivering the expected performance for ALICE data-taking. Data Centre environmental conditions remained stable during the last more than two years, in particular during starting and stopping runs, which include significant changes in IT load. Several unforeseen external circumstances lead to increasing demands for the Online Offline System (O2). Higher data rates than anticipated required network performance to exceed the initial design specifications, for the throughput between FLPs and EPNs. In particular, the high throughput from an internal EPN InfiniBand network towards the storage Ethernet network was one of the challenges to overcome.
The production of quarkonia, the bound state of an heavy quark with its anti-particle, has for a long time been seen as a key process to understand the properties of nuclear matter in a relativistic heavy-ion collision. This thesis presents studies on the production of quarkonia in heavy-ion collisions at the new Large Hadron collider (LHC). The focus is set on the decay of J/Psi and Upsilon-states into their di-electronic decay channel, measured within the central detectors of the ALICE detector.
As an integral part of ALICE, the dedicated heavy ion experiment at CERN’s Large Hadron Collider, the Transition Radiation Detector (TRD) contributes to the experiment’s tracking, triggering and particle identification. Central element in the TRD’s processing chain is its trigger and readout processor, the Global Tracking Unit (GTU). The GTU implements fast triggers on various signatures, which rely on the reconstruction of up to 20 000 particle track segments to global tracks, and performs the buffering and processing of event raw data as part of a complex detector readout tree.
The high data rates the system has to handle and its dual use as trigger and readout processor with shared resources and interwoven processing paths require the GTU to be a unique, high-performance parallel processing system. To achieve high data taking efficiency, all elements of the GTU are optimized for high running stability and low dead time.
The solutions presented in this thesis for the handling of readout data in the GTU, from the initial reception to the final assembly and transmission to the High-Level Trigger computer farm, address all these aspects. The presented concepts employ multi-event buffering, in-stream data processing, extensive embedded diagnostics, and advanced features of modern FPGAs to build a robust high-performance system that can conduct the high- bandwidth readout of the TRD with maximum stability and minimized dead time. The work summarized here not only includes the complete process from the conceptual layout of the multi-event data handling and segment control, but also its implementation, simulation, verification, operation and commissioning. It also covers the system upgrade for the second data taking period and presents an analysis of the actual system performance.
The presented design of the GTU’s input stage, which is comprised of 90 FPGA-based nodes, is built to support multi-event buffering for the data received from the 18 TRD supermodules on 1080 optical links at the full sender aggregate net bandwidth of 2.16 Tbit/s. With careful design of the control logic and the overall data path, the readout on the 18 concentrator nodes of the supermodule stage can utilize an effective aggregate output bandwidth of initially 3.33 GiB/s, and, after the successful readout bandwidth upgrade, 6.50 GiB/s via 18 optical links. The high possible readout link utilization of more than 99 % and the intermediate buffering of events on the GTU helps to keep the dead time associated with the local event building and readout typically below 10%. The GTU has been used for production data taking since start-up of the experiment and ever since performs the event buffering, local event building and readout for the TRD in a correct, efficient and highly dependable fashion.
On development, feasibility, and limits of highly efficient CPU and GPU programs in several fields
(2013)
With processor clock speeds having stagnated, parallel computing architectures have achieved a breakthrough in recent years. Emerging many-core processors like graphics cards run hundreds of threads in parallel and vector instructions are experiencing a revival. Parallel processors with many independent but simple arithmetical logical units fail executing serial tasks efficiently. However, their sheer parallel processing power makes them predestined for parallel applications while the simple construction of their cores makes them unbeatably power efficient. Unfortunately, old programs cannot profit by simple recompilation. Adaptation often requires rethinking and modifying algorithms to make use of parallel execution. Many applications have some serial subroutines which are very hard to parallelize, hence contemporary compute clusters are often homogeneous, offering fast processors for serial tasks and parallel processors for parallel tasks. In order not to waste the available compute power, highly efficient programs are mandatory.
This thesis is about the development of fast algorithms and their implementations on modern CPUs and GPUs, about the maximum achievable efficiency with respect to peak performance and to power consumption respectively, and about feasibility and limits of programs for CPUs, GPUs, and heterogeneous systems. Three totally different applications from distinct fields, which were developed in the extent of this thesis, are presented.
The ALICE experiment at the LHC particle collider at CERN studies heavy-ion collisions at high rates of several hundred Hz, while every collision produces thousands of particles, whose trajectories must be reconstructed. For this purpose, ALICE track reconstruction and ALICE track merging have been adapted for GPUs and deployed on 64 GPU-enabled compute-nodes at CERN.
After a testing phase, the tracker ran in nonstop operation during 2012 providing full real-time track reconstruction. The tracker employs a multithreaded pipeline as well as asynchronous data transfer to ensure continuous GPU utilization and outperforms the fastest available CPUs by about a factor three.
The Linpack benchmark is the standard tool for ranking compute clusters. It solves a dense system of linear equations using primarily matrix multiplication facilitated by a routine called DGEMM. A heterogeneous GPU-enabled version of DGEMM and Linpack has been developed, which can utilize the CAL, CUDA, and OpenCL APIs as backend. Employing this implementation, the LOEWE-CSC cluster ranked place 22 in the November 2010 Top500 list of the fastest supercomputers, and the Sanam cluster achieved the second place in the November 2012 Green500 list of the most power efficient supercomputers. An elaborate lookahead algorithm, a pipeline, and asynchronous data transfer hide the serial CPU-bound tasks of Linpack behind DGEMM execution on the GPU reaching the highest efficiency on GPU-accelerated clusters.
Failure erasure codes enable failure tolerant storage of data and real-time failover, ensuring that in case of a hardware defect servers and even complete data centers remain operational. It is an absolute necessity for present-day computer infrastructure. The mathematical theory behind the codes involves matrix-computations in finite fields, which are not natively supported by modern processors and hence computationally very expensive. This thesis presents a novel scheme for fast encoding matrix generation and demonstrates a fast implementation for the encoding itself, which uses exclusively either integer or logical vector instructions. Depending on the scenario, it is always hitting different hard limits of the hardware: either the maximum attainable memory bandwidth, or the peak instruction throughput, or the PCI Express bandwidth limit when GPUs or FPGAs are used.
The thesis demonstrates that in most cases with respect to the available peak performance, GPU implementations can be as efficient as their CPU counterparts.
With respect to costs or power consumption, they are much more efficient. For this purpose, complex tasks must be split in serial as well as parallel parts and the execution must be pipelined such that the CPU bound tasks are hidden behind GPU execution. Few cases are identified where this is not possible due to PCI Express limitations or not reasonable because practical GPU languages are missing.