- ALICE (2) (remove)
- Commissioning of the ALICE High-Level Trigger (2012)
- A new era in experimental nuclear physics has begun with the start-up of the Large Hadron Collider at CERN and its dedicated heavy-ion detector system ALICE. Measuring the highest energy density ever produced in nucleus-nucleus collisions, the detector has been designed to study the properties of the created hot and dense medium, assumed to be a Quark-Gluon Plasma. Comprised of 18 high granularity sub-detectors, ALICE delivers data from a few million electronic channels of proton-proton and heavy-ion collisions. The produced data volume can reach up to 26 GByte/s for central Pb–Pb collisions at design luminosity of L = 1027 cm−2 s−1 , challenging not only the data storage, but also the physics analysis. A High-Level Trigger (HLT) has been built and commissioned to reduce that amount of data to a storable value prior to archiving with the means of data filtering and compression without the loss of physics information. Implemented as a large high performance compute cluster, the HLT is able to perform a full reconstruction of all events at the time of data-taking, which allows to trigger, based on the information of a complete event. Rare physics probes, with high transverse momentum, can be identified and selected to enhance the overall physics reach of the experiment. The commissioning of the HLT is at the center of this thesis. Being deeply embedded in the ALICE data path and, therefore, interfacing all other ALICE subsystems, this commissioning imposed not only a major challenge, but also a massive coordination effort, which was completed with the first proton-proton collisions reconstructed by the HLT. Furthermore, this thesis is completed with the study and implementation of on-line high transverse momentum triggers.
- On development, feasibility, and limits of highly efficient CPU and GPU programs in several fields (2013)
- With processor clock speeds having stagnated, parallel computing architectures have achieved a breakthrough in recent years. Emerging many-core processors like graphics cards run hundreds of threads in parallel and vector instructions are experiencing a revival. Parallel processors with many independent but simple arithmetical logical units fail executing serial tasks efficiently. However, their sheer parallel processing power makes them predestined for parallel applications while the simple construction of their cores makes them unbeatably power efficient. Unfortunately, old programs cannot profit by simple recompilation. Adaptation often requires rethinking and modifying algorithms to make use of parallel execution. Many applications have some serial subroutines which are very hard to parallelize, hence contemporary compute clusters are often homogeneous, offering fast processors for serial tasks and parallel processors for parallel tasks. In order not to waste the available compute power, highly efficient programs are mandatory. This thesis is about the development of fast algorithms and their implementations on modern CPUs and GPUs, about the maximum achievable efficiency with respect to peak performance and to power consumption respectively, and about feasibility and limits of programs for CPUs, GPUs, and heterogeneous systems. Three totally different applications from distinct fields, which were developed in the extent of this thesis, are presented. The ALICE experiment at the LHC particle collider at CERN studies heavy-ion collisions at high rates of several hundred Hz, while every collision produces thousands of particles, whose trajectories must be reconstructed. For this purpose, ALICE track reconstruction and ALICE track merging have been adapted for GPUs and deployed on 64 GPU-enabled compute-nodes at CERN. After a testing phase, the tracker ran in nonstop operation during 2012 providing full real-time track reconstruction. The tracker employs a multithreaded pipeline as well as asynchronous data transfer to ensure continuous GPU utilization and outperforms the fastest available CPUs by about a factor three. The Linpack benchmark is the standard tool for ranking compute clusters. It solves a dense system of linear equations using primarily matrix multiplication facilitated by a routine called DGEMM. A heterogeneous GPU-enabled version of DGEMM and Linpack has been developed, which can utilize the CAL, CUDA, and OpenCL APIs as backend. Employing this implementation, the LOEWE-CSC cluster ranked place 22 in the November 2010 Top500 list of the fastest supercomputers, and the Sanam cluster achieved the second place in the November 2012 Green500 list of the most power efficient supercomputers. An elaborate lookahead algorithm, a pipeline, and asynchronous data transfer hide the serial CPU-bound tasks of Linpack behind DGEMM execution on the GPU reaching the highest efficiency on GPU-accelerated clusters. Failure erasure codes enable failure tolerant storage of data and real-time failover, ensuring that in case of a hardware defect servers and even complete data centers remain operational. It is an absolute necessity for present-day computer infrastructure. The mathematical theory behind the codes involves matrix-computations in finite fields, which are not natively supported by modern processors and hence computationally very expensive. This thesis presents a novel scheme for fast encoding matrix generation and demonstrates a fast implementation for the encoding itself, which uses exclusively either integer or logical vector instructions. Depending on the scenario, it is always hitting different hard limits of the hardware: either the maximum attainable memory bandwidth, or the peak instruction throughput, or the PCI Express bandwidth limit when GPUs or FPGAs are used. The thesis demonstrates that in most cases with respect to the available peak performance, GPU implementations can be as efficient as their CPU counterparts. With respect to costs or power consumption, they are much more efficient. For this purpose, complex tasks must be split in serial as well as parallel parts and the execution must be pipelined such that the CPU bound tasks are hidden behind GPU execution. Few cases are identified where this is not possible due to PCI Express limitations or not reasonable because practical GPU languages are missing.