Refine
Year of publication
Document Type
- Doctoral Thesis (18)
Language
- English (18)
Has Fulltext
- yes (18)
Is part of the Bibliography
- no (18)
Keywords
- ALICE (4)
- CBM experiment (2)
- Cellular Automaton (2)
- FPGA (2)
- GPGPU (2)
- HLT (2)
- HPC (2)
- High energy physics (2)
- Many-core computer architectures (2)
- Parallel and SIMD calculations (2)
Institute
- Informatik und Mathematik (12)
- Informatik (5)
- Physik (1)
The main task of modern large experiments with heavy ions, such as CBM (FAIR), STAR (BNL) and ALICE (CERN) is a detailed study of the phase diagram of quantum chromodynamics (QCD) in the quark-gluon plasma (QGP), the equation of state of matter at extremely high baryonic densities, and the transition from the hadronic phase of matter to the quark-gluon phase.
In the thesis, the missing mass method is developed for the reconstruction of short-lived particles with neutral particles in their decay products, as well as its implementation in the form of fast algorithms and a set of software for prac- tical application in heavy ion physics experiments. Mathematical procedures implementing the method were developed and implemented within the KF Par- ticle Finder package for the future CBM (FAIR) experiment and subsequently adapted and applied for processing and analysis of real data in the STAR (BNL) experiment.
The KF Particle Finder package is designed to reconstruct most signal particles from the physics program of the CBM experiment, including strange particles, strange resonances, hypernuclei, light vector mesons, charm particles and char- monium. The package includes searches for over a hundred decays of short-lived particles. This makes the KF Particle Finder a universal platform for short-lived particle reconstruction and physics analysis both online and offline.
The missing mass method has been proposed to reconstruct decays of short-lived charged particles when one of the daughter particles is neutral and is not regis- tered in the detector system. The implementation of the missing mass method was integrated into the KF Particle Finder package to search for 18 decays with a neutral daughter particle.
Like all other algorithms of the KF Particle Finder package, the missing mass method is implemented with extensive use of vector (SIMD) instructions and is optimized for parallel operation on modern many-core high performance com- puter clusters, which can include both processors and coprocessors. A set of algorithms implementing the method was tested on computers with tens of cores and showed high speed and practically linear scalability with respect to the num- ber of cores involved.
It is extremely important, especially for the initial stage of the CBM experiment, which is planned for 2025, to demonstrate already now on real data the reliability of the developed approach, as well as the high efficiency of the current implemen- tation of both the entire KF Particle Finder package, and its integral part, the missing mass method. Such an opportunity was provided by the FAIR Phase-0 program, motivating the use in the STAR experiment of software packages orig- inally developed for the CBM experiment.
Application of the method to real data of the STAR experiment shows very good results with a high signal-to-background ratio and a large significance value. The results demonstrate the reliability and high efficiency of the missing mass method in the reconstruction of both charged mother particles and their neutral daughter particles. Being an integral part of the KF Particle Finder package, now the main approach for reconstruction and analysis of short-lived particles in the STAR experiment, the missing mass method will continue to be used for the physics analysis in online and offline modes.
The high quality of the results of the express data analysis has led to their status as preliminary physics results with the right to present them at international physics conferences and meetings on behalf of the STAR Collaboration.
As an integral part of ALICE, the dedicated heavy ion experiment at CERN’s Large Hadron Collider, the Transition Radiation Detector (TRD) contributes to the experiment’s tracking, triggering and particle identification. Central element in the TRD’s processing chain is its trigger and readout processor, the Global Tracking Unit (GTU). The GTU implements fast triggers on various signatures, which rely on the reconstruction of up to 20 000 particle track segments to global tracks, and performs the buffering and processing of event raw data as part of a complex detector readout tree.
The high data rates the system has to handle and its dual use as trigger and readout processor with shared resources and interwoven processing paths require the GTU to be a unique, high-performance parallel processing system. To achieve high data taking efficiency, all elements of the GTU are optimized for high running stability and low dead time.
The solutions presented in this thesis for the handling of readout data in the GTU, from the initial reception to the final assembly and transmission to the High-Level Trigger computer farm, address all these aspects. The presented concepts employ multi-event buffering, in-stream data processing, extensive embedded diagnostics, and advanced features of modern FPGAs to build a robust high-performance system that can conduct the high- bandwidth readout of the TRD with maximum stability and minimized dead time. The work summarized here not only includes the complete process from the conceptual layout of the multi-event data handling and segment control, but also its implementation, simulation, verification, operation and commissioning. It also covers the system upgrade for the second data taking period and presents an analysis of the actual system performance.
The presented design of the GTU’s input stage, which is comprised of 90 FPGA-based nodes, is built to support multi-event buffering for the data received from the 18 TRD supermodules on 1080 optical links at the full sender aggregate net bandwidth of 2.16 Tbit/s. With careful design of the control logic and the overall data path, the readout on the 18 concentrator nodes of the supermodule stage can utilize an effective aggregate output bandwidth of initially 3.33 GiB/s, and, after the successful readout bandwidth upgrade, 6.50 GiB/s via 18 optical links. The high possible readout link utilization of more than 99 % and the intermediate buffering of events on the GTU helps to keep the dead time associated with the local event building and readout typically below 10%. The GTU has been used for production data taking since start-up of the experiment and ever since performs the event buffering, local event building and readout for the TRD in a correct, efficient and highly dependable fashion.
Driven by rapid technological advancements, the amount of data that is created, captured, communicated, and stored worldwide has grown exponentially over the past decades. Along with this development it has become critical for many disciplines of science and business to being able to gather and analyze large amounts of data. The sheer volume of the data often exceeds the capabilities of classical storage systems, with the result that current large-scale storage systems are highly distributed and are comprised of a high number of individual storage components. As with any other electronic device, the reliability of storage hardware is governed by certain probability distributions, which in turn are influenced by the physical processes utilized to store the information. The traditional way to deal with the inherent unreliability of combined storage systems is to replicate the data several times. Another popular approach to achieve failure tolerance is to calculate the block-wise parity in one or more dimensions. With better understanding of the different failure modes of storage components, it has become evident that sophisticated high-level error detection and correction techniques are indispensable for the ever-growing distributed systems. The utilization of powerful cyclic error-correcting codes, however, comes with a high computational penalty, since the required operations over finite fields do not map very well onto current commodity processors. This thesis introduces a versatile coding scheme with fully adjustable fault-tolerance that is tailored specifically to modern processor architectures. To reduce stress on the memory subsystem the conventional table-based algorithm for multiplication over finite fields has been replaced with a polynomial version. This arithmetically intense algorithm is better suited to the wide SIMD units of the currently available general purpose processors, but also displays significant benefits when used with modern many-core accelerator devices (for instance the popular general purpose graphics processing units). A CPU implementation using SSE and a GPU version using CUDA are presented. The performance of the multiplication depends on the distribution of the polynomial coefficients in the finite field elements. This property has been used to create suitable matrices that generate a linear systematic erasure-correcting code which shows a significantly increased multiplication performance for the relevant matrix elements. Several approaches to obtain the optimized generator matrices are elaborated and their implications are discussed. A Monte-Carlo-based construction method allows it to influence the specific shape of the generator matrices and thus to adapt them to special storage and archiving workloads. Extensive benchmarks on CPU and GPU demonstrate the superior performance and the future application scenarios of this novel erasure-resilient coding scheme.
Das Projekt anan ist ein Werkzeug zur Fehlersuche in verteilten Hochleistungsrechnern. Die Neuheit des Beitrags besteht darin, dass die bekannten Methoden, die bereits erfolgreich zum Debuggen von Soft- und Hardware eingesetzt werden, auf Hochleistungs-Rechnen übertragen worden sind. Im Rahmen der vorliegenden Arbeit wurde ein Werkzeug namens anan implementiert, das bei der Fehlersuche hilft. Außerdem kann es als dynamischeres Monitoring eingesetzt werden. Beide Einsatzzwecke sind
getestet worden.
Das Werkzeug besteht aus zwei Teilen:
1. aus einem Teil namens anan, der interaktiv vom Nutzer bedient wird
2. und aus einem Teil namens anand, der automatisiert die verlangten Messwerte erhebt und nötigenfalls Befehle ausführt.
Der Teil anan führt Sensoren aus — kleine mustergesteuerte Algorithmen —, deren Ergebnisse per anan zusammengeführt werden. In erster Näherung lässt anan sich als Monitoring beschreiben, welches (1) schnell umkonfiguriert werden (2) komplexere Werte messen kann, die über Korrelationen einfacher Zeitreihen hinausgehen.
Modern experiments in heavy ion collisions operate with huge data rates that can not be fully stored on the currently available storage devices. Therefore the data flow should be reduced by selecting those collisions that potentially carry the information of the physics interest. The future CBM experiment will have no simple criteria for selecting such collisions and requires the full online reconstruction of the collision topology including reconstruction of short-lived particles.
In this work the KF Particle Finder package for online reconstruction and selection of short-lived particles is proposed and developed. It reconstructs more than 70 decays, covering signals from all the physics cases of the CBM experiment: strange particles, strange resonances, hypernuclei, low mass vector mesons, charmonium, and open-charm particles.
The package is based on the Kalman filter method providing a full set of the particle parameters together with their errors including position, momentum, mass, energy, lifetime, etc. It shows a high quality of the reconstructed particles, high efficiencies, and high signal to background ratios.
The KF Particle Finder is extremely fast for achieving the reconstruction speed of 1.5 ms per minimum-bias AuAu collision at 25 AGeV beam energy on single CPU core. It is fully vectorized and parallelized and shows a strong linear scalability on the many-core architectures of up to 80 cores. It also scales within the First Level Event Selection package on the many-core clusters up to 3200 cores.
The developed KF Particle Finder package is a universal platform for short- lived particle reconstruction, physics analysis and online selection.
On development, feasibility, and limits of highly efficient CPU and GPU programs in several fields
(2013)
With processor clock speeds having stagnated, parallel computing architectures have achieved a breakthrough in recent years. Emerging many-core processors like graphics cards run hundreds of threads in parallel and vector instructions are experiencing a revival. Parallel processors with many independent but simple arithmetical logical units fail executing serial tasks efficiently. However, their sheer parallel processing power makes them predestined for parallel applications while the simple construction of their cores makes them unbeatably power efficient. Unfortunately, old programs cannot profit by simple recompilation. Adaptation often requires rethinking and modifying algorithms to make use of parallel execution. Many applications have some serial subroutines which are very hard to parallelize, hence contemporary compute clusters are often homogeneous, offering fast processors for serial tasks and parallel processors for parallel tasks. In order not to waste the available compute power, highly efficient programs are mandatory.
This thesis is about the development of fast algorithms and their implementations on modern CPUs and GPUs, about the maximum achievable efficiency with respect to peak performance and to power consumption respectively, and about feasibility and limits of programs for CPUs, GPUs, and heterogeneous systems. Three totally different applications from distinct fields, which were developed in the extent of this thesis, are presented.
The ALICE experiment at the LHC particle collider at CERN studies heavy-ion collisions at high rates of several hundred Hz, while every collision produces thousands of particles, whose trajectories must be reconstructed. For this purpose, ALICE track reconstruction and ALICE track merging have been adapted for GPUs and deployed on 64 GPU-enabled compute-nodes at CERN.
After a testing phase, the tracker ran in nonstop operation during 2012 providing full real-time track reconstruction. The tracker employs a multithreaded pipeline as well as asynchronous data transfer to ensure continuous GPU utilization and outperforms the fastest available CPUs by about a factor three.
The Linpack benchmark is the standard tool for ranking compute clusters. It solves a dense system of linear equations using primarily matrix multiplication facilitated by a routine called DGEMM. A heterogeneous GPU-enabled version of DGEMM and Linpack has been developed, which can utilize the CAL, CUDA, and OpenCL APIs as backend. Employing this implementation, the LOEWE-CSC cluster ranked place 22 in the November 2010 Top500 list of the fastest supercomputers, and the Sanam cluster achieved the second place in the November 2012 Green500 list of the most power efficient supercomputers. An elaborate lookahead algorithm, a pipeline, and asynchronous data transfer hide the serial CPU-bound tasks of Linpack behind DGEMM execution on the GPU reaching the highest efficiency on GPU-accelerated clusters.
Failure erasure codes enable failure tolerant storage of data and real-time failover, ensuring that in case of a hardware defect servers and even complete data centers remain operational. It is an absolute necessity for present-day computer infrastructure. The mathematical theory behind the codes involves matrix-computations in finite fields, which are not natively supported by modern processors and hence computationally very expensive. This thesis presents a novel scheme for fast encoding matrix generation and demonstrates a fast implementation for the encoding itself, which uses exclusively either integer or logical vector instructions. Depending on the scenario, it is always hitting different hard limits of the hardware: either the maximum attainable memory bandwidth, or the peak instruction throughput, or the PCI Express bandwidth limit when GPUs or FPGAs are used.
The thesis demonstrates that in most cases with respect to the available peak performance, GPU implementations can be as efficient as their CPU counterparts.
With respect to costs or power consumption, they are much more efficient. For this purpose, complex tasks must be split in serial as well as parallel parts and the execution must be pipelined such that the CPU bound tasks are hidden behind GPU execution. Few cases are identified where this is not possible due to PCI Express limitations or not reasonable because practical GPU languages are missing.
The ALICE High-Level-Trigger (HLT) is a large scale computing farm designed and constructed for the purpose of the realtime reconstruction of particle interactions (events) inside the ALICE detector. The reconstruction of such events is based on the raw data produced in collisions inside the ALICE at the Large Hadron Collider. The online reconstruction in the HLT allows the triggering on certain event topologies and a significant data reduction by applying compression algorithms. Moreover, it enables a real-time verification of the quality of the data.
To receive the raw data from the various sub-detectors of ALICE, the HLT is equipped with 226 custom built FPGA-based PCI-X cards, the H-RORCs. The H-RORC interfaces the detector readout electronics to the nodes of the HLT farm. In addition to the transfer of raw data, 108 H-RORCs host 216 Fast-Cluster-Finder (FCF) processors for the Time-Projection-Chamber (TPC). The TPC is the main tracking detector of ALICE and contributes with up to 16 GB/s to over 90% of the overall data volume. The FCF processor implements the first of two steps in the data reconstruction of the TPC. It calculates the space points and their properties from charge clouds (clusters) created by charged particles traversing the TPCs gas volume. Those space points are not only the base for the tracking algorithm, but also allow for a Huffman-based data compression, which reduces the data volume by a factor of 4 to 6.
The FCF processor is designed to cope with any incoming data rate up to the maximum bandwidth of the incoming optical link (160 MB/s) without creating back-pressure to the detectors readout electronics. A performance comparison with the software implementation of the algorithm shows a speedup factor of about 20 compared with one AMD Opteron 6172 Core @ 2.1 GHz, the CPU type used in the HLT during the LHC Run1 campaign. Comparison with an Intel E5-2690 Core @ 3.0 GHz, the CPU type used by the HLT for the LHC Run2 campaign, results in a speedup factor of 8.5. In total numbers, the 216 FCF processors provide the computing performance of 4255 AMD Opteron cores or 2203 Intel cores of the previously mentioned type. The performance of the reconstruction with respect to the physics analysis is equivalent or better than the official ALICE Offline clusterizer. Therefore, ALICE data taking was switched in 2011 to FCF cluster recording and compression only, discarding the raw data from the TPC. Due to the capability to compress the clusters, the recorded data volume could be increased by a factor of 4 to 6.
For the LHC Run3 campaign, starting in 2020, the FCF builds the foundation of the ALICE data taking and processing strategy. The raw data volume (before processing) of the upgraded TPC will exceed 3 TB/s. As a consequence, online processing of the raw data and compression of the results before it enters the online computing farms is an essential and crucial part of the computing model.
Within the scope of this thesis, the H-RORC card and the FCF processor were developed and built from scratch. It covers the conceptual design, the optimisation and implementation, as well as the verification. It is completed by performance benchmarks and experiences from real data taking.
A new era in experimental nuclear physics has begun with the start-up of the Large Hadron Collider at CERN and its dedicated heavy-ion detector system ALICE. Measuring the highest energy density ever produced in nucleus-nucleus collisions, the detector has been designed to study the properties of the created hot and dense medium, assumed to be a Quark-Gluon Plasma.
Comprised of 18 high granularity sub-detectors, ALICE delivers data from a few million electronic channels of proton-proton and heavy-ion collisions.
The produced data volume can reach up to 26 GByte/s for central Pb–Pb
collisions at design luminosity of L = 1027 cm−2 s−1 , challenging not only the data storage, but also the physics analysis. A High-Level Trigger (HLT) has been built and commissioned to reduce that amount of data to a storable value prior to archiving with the means of data filtering and compression without the loss of physics information. Implemented as a large high performance compute cluster, the HLT is able to perform a full reconstruction of all events at the time of data-taking, which allows to trigger, based on the information of a complete event. Rare physics probes, with high transverse momentum, can be identified and selected to enhance the overall physics reach of the experiment.
The commissioning of the HLT is at the center of this thesis. Being deeply embedded in the ALICE data path and, therefore, interfacing all other ALICE subsystems, this commissioning imposed not only a major challenge, but also a massive coordination effort, which was completed with the first proton-proton collisions reconstructed by the HLT. Furthermore, this thesis is completed with the study and implementation of on-line high transverse momentum triggers.
Ultrarelativistic Quantum Molecular Dynamics is a physics model to describe the transport, collision, scattering, and decay of nuclear particles. The UrQMD framework has been in use for nearly 20 years since its first development. In this period computing aspects, the design of code, and the efficiency of computation have been minor points of interest. Nowadays an additional issue arises due to the fact that the run time of the framework does not diminish any more with new hardware generations.
The current development in computing hardware is mainly focused on parallelism. Especially in scientific applications a high order of parallelisation can be achieved due to the superposition principle. In this thesis it is shown how modern design criteria and algorithm redesign are applied to physics frameworks. The redesign with a special emphasise on many-core architectures allows for significant improvements of the execution speed.
The most time consuming part of UrQMD is a newly introduced relativistic hydrodynamic phase. The algorithm used to simulate the hydrodynamic evolution is the SHASTA. As the sequential form of SHASTA is successfully applied in various simulation frameworks for heavy ion collisions its possible parallelisation is analysed. Two different implementations of SHASTA are presented.
The first one is an improved sequential implementation. By applying a more concise design and evading unnecessary memory copies, the execution time could be reduced to the half of the FORTRAN version’s execution time. The usage of memory could be reduced by 80% compared to the memory needed in the original version.
The second implementation concentrates fully on the usage of many-core architectures and deviates significantly from the classical implementation. Contrary to the sequential implementation, it follows the recalculate instead of memory look-up paradigm. By this means the execution speed could be accelerated up to a factor of 460 on GPUs.
Additionally a stability analysis of the UrQMD model is presented. Applying metapro- gramming UrQMD is compiled and executed in a massively parallel setup. The resulting simulation data of all parallel UrQMD instances were hereafter gathered and analysed. Hence UrQMD could be proven of high stability to the uncertainty of experimental data.
As a further application of modern programming paradigms a prototypical implementa- tion of the worldline formalism is presented. This formalism allows for a direct calculation of Feynman integrals and constitutes therefore an interesting enhancement for the UrQMD model. Its massively parallel implementation on GPUs is examined.
High-energy physics experiments aim to deepen our understanding of the fundamental structure of matter and the governing forces. One of the most challenging aspects of the design of new experiments is data management and event selection. The search for increasingly rare and intricate physics events asks for high-statistics measurements and sophisticated event analysis. With progressively complex event signatures, traditional hardware-based trigger systems reach the limits of realizable latency and complexity. The Compressed Baryonic Matter experiment (CBM) employs a novel approach for data readout and event selection to address these challenges. Self-triggered, free-streaming detectors push all data to a central compute cluster, called First-level Event Selector (FLES), for software-based event analysis and selection. While this concept solves many issues present in classical architectures, it also sets new challenges for the design of the detector readout systems and online event selection.
This thesis presents an efficient solution to the data management challenges presented by self-triggered, free-streaming particle detectors. The FLES must receive asynchronously streamed data from a heterogeneous detector setup at rates of up to 1 TB/s. The real-time processing environment implies that all components have to deliver high performance and reliability to record as much valuable data as possible. The thesis introduces a time-based data model to partition the input streams into containers of fixed length in experiment time for efficient data management. These containers provide all necessary metadata to enable generic, detector-subsystem-agnostic data distribution across the entire cluster. An analysis shows that the introduced data overhead is well below 1 % for a wide range of system parameters.
Furthermore, a concept and the implementation of a detector data input interface for the CBM FLES, optimized for resource-efficient data transport, are presented. The central element of the architecture is an FPGA-based PCIe extension card for the FLES entry nodes. The hardware designs developed in the thesis enable interfacing with a diverse set of detector systems. A custom, high-throughput DMA design structures data in a way that enables low-overhead access and efficient software processing. The ability to share the host DMA buffers with other devices, such as an InfiniBand HCA, allows for true zero-copy data distribution between the cluster nodes. The discussed FLES input interface is fully implemented and has already proven its reliability in production operation in various physics experiments.
The relevant field of interest in High Energy Physics experiments is shifting to searching and studying extremely rare particles and phenomena. The search for rare probes requires an increase in the number of available statistics by increasing the particle interaction rate. The structure of the events also becomes more complicated, the multiplicity of particles in each event increases, and a pileup appears. Due to technical limitations, such data flow becomes impossible to store fully on available storage devices. The solution to the problem is the correct triggering of events and real-time data processing.
In this work, the issue of accelerating and improving the algorithms for reconstruction of the charged particles' trajectories based on the Cellular Automaton in the STAR experiment is considered to implement them for track reconstruction in real-time within the High-Level Trigger. This is an important step in the preparation of the CBM experiment as part of the FAIR Phase-0 program. The study of online data processing methods in real conditions at similar interaction energies allows us to study this process and determine the possible weaknesses of the approach.
Two versions of the Cellular Automaton based track reconstruction are discussed, which are used, depending on the detecting systems' features. HFT~CA Track Finder, similar to the tracking algorithm of the CBM experiment, has been accelerated by several hundred times, using both algorithm optimization and data-level parallelism. TPC~CA Track Finder has been upgraded to improve the reconstruction quality while maintaining high calculation speed. The algorithm was tuned to work with the new iTPC geometry and provided an additional module for very low momentum track reconstruction.
The improved track reconstruction algorithm for the TPC detector in the STAR experiment was included in the HLT reconstruction chain and successfully tested in the express production for the online real data analysis. This made it possible to obtain important physical results during the experiment runtime without the full offline data processing. The tracker is also being prepared for integration into a standard offline data processing chain, after which it will become the basic track search algorithm in the STAR experiment.
This thesis presents various algorithms which have been developed for on-line event reconstruction in the CBM experiment at GSI, Darmstadt and the ALICE experiment at CERN, Geneve. Despite the fact that the experiments are different — CBM is a fixed target experiment with forward geometry, while ALICE has a typical collider geometry — they share common aspects when reconstruction is concerned.
The thesis describes:
— general modifications to the Kalman filter method, which allows one to accelerate, to improve, and to simplify existing fit algorithms;
— developed algorithms for track fit in CBM and ALICE experiment, including a new method for track extrapolation in non-homogeneous magnetic field.
— developed algorithms for primary and secondary vertex fit in the both experiments. In particular, a new method of reconstruction of decayed particles is presented.
— developed parallel algorithm for the on-line tracking in the CBM experiment.
— developed parallel algorithm for the on-line tracking in High Level Trigger of the ALICE experiment.
— the realisation of the track finders on modern hardware, such as SIMD CPU registers and GPU accelerators.
All the presented methods have been developed by or with the direct participation of the author.
Conceptual design of an ALICE Tier-2 centre integrated into a multi-purpose computing facility
(2012)
This thesis discusses the issues and challenges associated with the design and operation of a data analysis facility for a high-energy physics experiment at a multi-purpose computing centre. At the spotlight is a Tier-2 centre of the distributed computing model of the ALICE experiment at the Large Hadron Collider at CERN in Geneva, Switzerland. The design steps, examined in the thesis, include analysis and optimization of the I/O access patterns of the user workload, integration of the storage resources, and development of the techniques for effective system administration and operation of the facility in a shared computing environment. A number of I/O access performance issues on multiple levels of the I/O subsystem, introduced by utilization of hard disks for data storage, have been addressed by the means of exhaustive benchmarking and thorough analysis of the I/O of the user applications in the ALICE software framework. Defining the set of requirements to the storage system, describing the potential performance bottlenecks and single points of failure and examining possible ways to avoid them allows one to develop guidelines for selecting the way how to integrate the storage resources. The solution, how to preserve a specific software stack for the experiment in a shared environment, is presented along with its effects on the user workload performance. The proposal for a flexible model to deploy and operate the ALICE Tier-2 infrastructure and applications in a virtual environment through adoption of the cloud computing technology and the 'Infrastructure as Code' concept completes the thesis. Scientific software applications can be efficiently computed in a virtual environment, and there is an urgent need to adapt the infrastructure for effective usage of cloud resources.
Quarks and gluons are the building blocks of all hadronic matter, like protons and neutrons. Their interaction is described by Quantum Chromodynamics (QCD), a theory under test by large scale experiments like the Large Hadron Collider (LHC) at CERN and in the future at the Facility for Antiproton and Ion Research (FAIR) at GSI. However, perturbative methods can only be applied to QCD for high energies. Studies from first principles are possible via a discretization onto an Euclidean space-time grid. This discretization of QCD is called Lattice QCD (LQCD) and is the only ab-initio option outside of the high-energy regime. LQCD is extremely compute and memory intensive. In particular, it is by definition always bandwidth limited. Thus—despite the complexity of LQCD applications—it led to the development of several specialized compute platforms and influenced the development of others. However, in recent years General-Purpose computation on Graphics Processing Units (GPGPU) came up as a new means for parallel computing. Contrary to machines traditionally used for LQCD, graphics processing units (GPUs) are a massmarket product. This promises advantages in both the pace at which higher-performing hardware becomes available and its price. CL2QCD is an OpenCL based implementation of LQCD using Wilson fermions that was developed within this thesis. It operates on GPUs by all major vendors as well as on central processing units (CPUs). On the AMD Radeon HD 7970 it provides the fastest double-precision D= kernel for a single GPU, achieving 120GFLOPS. D=—the most compute intensive kernel in LQCD simulations—is commonly used to compare LQCD platforms. This performance is enabled by an in-depth analysis of optimization techniques for bandwidth-limited codes on GPUs. Further, analysis of the communication between GPU and CPU, as well as between multiple GPUs, enables high-performance Krylov space solvers and linear scaling to multiple GPUs within a single system. LQCD calculations require a sampling of the phase space. The hybrid Monte Carlo (HMC) algorithm performs this. For this task, a single AMD Radeon HD 7970 GPU provides four times the performance of two AMD Opteron 6220 running an optimized reference code. The same advantage is achieved in terms of energy-efficiency. In terms of normalized total cost of acquisition (TCA), GPU-based clusters match conventional large-scale LQCD systems. Contrary to those, however, they can be scaled up from a single node. Examples of large GPU-based systems are LOEWE-CSC and SANAM. On both, CL2QCD has already been used in production for LQCD studies.
Blockchains in public administration : a RADIUS on blockchain framework for public administration
(2023)
The emergence of blockchain technology has generated a great deal of attention, as reflected in numerous scientific and journalistic articles. However, the implementation of blockchain for public administrations in Germany has encountered a setback owing to unsuccessful initiatives. Initial enthusiasm was followed by disillusionment. Nevertheless, technology continues to evolve. This paper examines whether the use of a blockchain can still optimize the processes of public administrations. Not only the failed projects are analysed, but also more current applications of the technology and their potential relevance for the administration, especially in the state of Hesse.
To answer if blockchains are promising to administrations, a Design Science Research (DSR) research approach is chosen. The DSR method is a research-based approach that aims to create new and innovative solutions to real-world problems through the development and evaluation of artefacts such as models, methods, or prototypes. For this work, the implementation of a framework to realize an Authentication, Authorization, and Accounting (AAA) system on the blockchain was identified as profitable. The framework aims to implement the aforementioned AAA tasks using a blockchain. The Remote Authentication Dial-In User Service (RADIUS) protocol has been identified as a potential protocol of the AAA system. The goal is to create a way to implement the system either entirely on a blockchain or as a hybrid system. Various blockchain technologies will be considered. Suitable for development, the framework AAA-me is named.
The development of AAA-me has shown that the desired framework for implementing RADIUS on the blockchain is possible in various degrees of implementation. Previous work mostly relied on full development. Additionally, it has been shown that AAA-me can be used to perform hybrid integration at different implementation levels. This makes AAA-me stand out from the few hybrid previous approaches. Furthermore, AAA-me was investigated in different laboratory environments. This was to determine the expected resilience against Single Point of Failure (SPOF). The results of the lab investigation indicated that a RADIUS system on top of a blockchain can provide benefits in terms of security and performance. In the lab environment, times were measured within which a series of authorization requests were processed. In addition, it was illustrated how a RADIUS system implemented using blockchain can protect itself against Man-in-the-Middle (MITM) attacks.
Finally, in collaboration with the Hessian Central Office for Data Processing (German: Hessische Zentrale für Datenverarbeitung) (HZD), another test lab demonstrated how a RADIUS system on the blockchain can integrate with the existing IT systems of the German state of Hesse. Based on these findings, this work reevaluated the applicability of blockchain technology for public administration processes.
The work has thus shown that the use of a blockchain can still be purposeful. However, it has also been shown that an implementation can bring many problems with it. The small number of blockchain developers and engineers also poses the risk of finding people to develop and maintain a system. In addition, one faces the problem of determining an architecture now that will be applied to many projects in the future. However, each project can, in turn, have an impact on the choice of architecture. Once one has solved this problem and a blockchain infrastructure is available, it can be established quickly and be more SPOF resistant, for example, for Public Key Infrastructure (PKI) systems.
AAA-me was only applied in lab and test environments. As a result, no real data ran over its own infrastructure. This allowed the necessary flexibility for development. However, system-related properties could appear in real situations that are not detectable here in this way. Furthermore, the initial stage of AAA-me’s development is still in its infancy. Many manual adjustments need to be made in order for this to integrate with an existing RADIUS system. Also, no system security effort in and of itself has been carried out in the lab environments. Thus, vulnerabilities can quickly open up on web servers due to misconfigurations and missing updates. For the above reasons, productive use should be discouraged unless major developments are carried out.
Data-parallel programming is more important than ever since serial performance is stagnating. All mainstream computing architectures have been and are still enhancing their support for general purpose computing with explicitly data-parallel execution. For CPUs, data-parallel execution is implemented via SIMD instructions and registers. GPU hardware works very similar allowing very efficient parallel processing of wide data streams with a common instruction stream.
These advances in parallel hardware have not been accompanied by the necessary advances in established programming languages. Developers have thus not been enabled to explicitly state the data-parallelism inherent in their algorithms. Some approaches of GPU and CPU vendors have introduced new programming languages, language extensions, or dialects enabling explicit data-parallel programming. However, it is arguable whether the programming models introduced by these approaches deliver the best solution. In addition, some of these approaches have shortcomings from a hardware-specific focus of the language design. There are several programming problems for which the aforementioned language approaches are not expressive and flexible enough.
This thesis presents a solution tailored to the C++ programming language. The concepts and interfaces are presented specifically for C++ but as abstract as possible facilitating adoption by other programming languages as well. The approach builds upon the observation that C++ is very expressive in terms of types. Types communicate intention and semantics to developers as well as compilers. It allows developers to clearly state their intentions and allows compilers to optimize via explicitly defined semantics of the type system.
Since data-parallelism affects data structures and algorithms, it is not sufficient to enhance the language's expressivity in only one area. The definition of types whose operators express data-parallel execution automatically enhances the possibilities for building data structures. This thesis therefore defines low-level, but fully portable, arithmetic and mask types required to build a flexible and portable abstraction for data-parallel programming. On top of these, it presents higher-level abstractions such as fixed-width vectors and masks, abstractions for interfacing with containers of scalar types, and an approach for automated vectorization of structured types.
The Vc library is an implementation of these types. I developed the Vc library for researching data-parallel types and as a solution for explicitly data-parallel programming. This thesis discusses a few example applications using the Vc library showing the real-world relevance of the library. The Vc types enable parallelization of search algorithms and data structures in a way unique to this solution. It shows the importance of using the type system for expressing data-parallelism. Vc has also become an important building block in the high energy physics community. Their reliance on Vc shows that the library and its interfaces were developed to production quality.
We live in age of data ubiquity. Even the most conservative estimates predict exponential growth in produced, transmitted and stored data. Big data is used to power business analytics as well as to foster scientific discoveries. In many cases, explosion of produced data exceeds capabilities of digital storage systems. Scientific high-performance computing environments cope with this problem by utilizing large, distributed, storage systems. These complex systems can only provide a high degree of reliability and durability by means of data redundancy. The most straight-forward way of doing that is by replicating the data over different physical devices. However, more elaborate approaches, such as erasure coding, can provide similar data protection while utilizing less storage. Recently, software-defined reliability methods began to replace traditional, hardware- based, solutions. Complicated failure modes of storage system components also warrant checksums to guaranty long-term data integrity. To cope with ever increasing data volumes, flexible and efficient software implementation of error correction codes is of great importance. This thesis introduces a method for realizing a flexible Reed-Solomon erasure code using the “Just-In-Time” compilation technique. By exploiting intrinsic arithmetic redundancy in the algorithm, and by relying on modern optimizing compilers, we obtain a throughput-efficient erasure code implementation. Additionally, exploitation of data parallelism is achieved effortlessly by instructing the compiler to produce SIMD code for desired execution platform. We show results of codes implemented using SSE and AVX2 SIMD instruction sets for x86, and NEON instruction set for ARM platforms. Next, we introduce a framework for efficient vectorized RAID-Z redundancy operations of ZFS file system. Traditional, table-based Galois field multiplication algorithms are replaced with custom SSE and AVX2 parallel methods, providing significantly faster and more efficient parity operations. The implementation of this framework was made publicly available as a part of ZFS on Linux project, since version 0.7. Finally, we propose a new erasure scheme for use with existing, high performance, parallel filesystems. Described reliability middleware (ECCFS) allows definition of flexible, file-based, reliability policies, adapting to customized user needs. By utilizing the block erasure code, the ECCFS achieves optimal storage, computation, and network resource utilization, while providing a high level of reliability. The distributed nature of the middleware allows greater scalability and more efficient utilization of storage and network resources, in order to improve availability of the system.
The future heavy-ion experiment CBM (FAIR/GSI, Darmstadt, Germany) will focus on the measurements of very rare probes, which require the experiment to operate under extreme interaction rates of up to 10 MHz. Due to high multiplicity of charged particles in heavy-ion collisions, this will lead to the data rates of up to 1 TB/s. In order to meet the modern achievable archival rate, this data ow has to be reduced online by more than two orders of magnitude.
The rare observables are featured with complicated trigger signatures and require full event topology reconstruction to be performed online. The huge data rates together with the absence of simple hardware triggers make traditional latency limited trigger architectures typical for conventional experiments inapplicable for the case of CBM. Instead, CBM will employ a novel data acquisition concept with autonomous, self-triggered front-end electronics.
While in conventional experiments with event-by-event processing the association of detector hits with corresponding physical event is known a priori, it is not true for the CBM experiment, where the reconstruction algorithms should be modified in order to process non-event-associated data. At the highest interaction rates the time difference between hits belonging to the same collision will be larger than the average time difference between two consecutive collisions. Thus, events will overlap in time. Due to a possible overlap of events one needs to analyze time-slices rather than isolated events.
The time-stamped data will be shipped and collected into a readout buffer in a form of a time-slice of a certain length. The time-slice data will be delivered to a large computer farm, where the archival decision will be obtained after performing online reconstruction. In this case association of hit information with physical events must be performed in software and requires full online event reconstruction not only in space, but also in time, so-called 4-dimensional (4D) track reconstruction.
Within the scope of this work the 4D track finder algorithm for online reconstruction has been developed. The 4D CA track finder is able to reproduce performance and speed of the traditional event-based algorithm. The 4D CA track finder is both vectorized (using SIMD instructions) and parallelized (between CPU cores). The algorithm shows strong scalability on many-core systems. The speed-up factor of 10.1 has been achieved on a CPU with 10 hyper-threaded physical cores.
The 4D CA track finder algorithm is ready for the time-slice-based reconstruction in the CBM experiment.